Appendix B: Item Graveyard

Bikes and Scooters 1

Bikes and Scooters 2

Bikes and Scooters 3

Births per Day

Movie Budgets 3

Model Comparison

Training or Validation

TV Show Wrangling

Shopping Wrangling

Park Wrangling

Data Cleaning

As a way to help the environment, some cities in the U.S. are adding bike and e-scooter share stations, which allow people to rent a bike or e-scooter for commuting or pleasure. The bikes and scooters are often kept at electronic docking stations at multiple locations around the cities. The following graphs were created using data from the Department of Transportation Statistics¹ about public use of these shared dock systems in four U.S. cities—Chicago, Minneapolis, Portland, and Topeka.

Which of the four cities had the most bike docks in 2020? Explain how you determined this. Or if you cannot answer it from the visualization, explain why not.

The bar chart (a) shows the percentage of docks in each city that were bike docks for each year from 2015–2020.

The pie chart (b) was created by plotting the percentage of bike docks for each year in one particular city. Unfortunately, the data scientist has forgotten which city this is. Using the information in the bar chart, identify the city. Explain how you determined this or if you cannot answer it from the visualization, explain why not.

Which of the four cities had the biggest increase in the number of bike docks from 2018 to 2020? Explain how you determined this or if you cannot answer it from the following line graph visualization, explain why not.

The map below shows the number of cities in each region of the United States that have docked bikes, dockless bikes, or e-Scooters in both 2018 and 2020. Use that information to answer each of the following questions. For each question, explain how you determined your answer, or if you cannot answer it from the visualization, explain why not.

How many cities in the Southeast had e-Scooters in 2018?

In 2020, the Southwest region has more docked bike stations than dockless bike stations.

The majority of regions decreased the number of e-scooter stations from 2018 to 2020.

Across the majority of regions, the trend is that over time, cities tend to be adopting dockless bikes rather than docked bikes.

Across the majority of regions, the trend is that over time, there are fewer cities that are making docked bikes, dockless bikes, and e-scooters available. Explain how you determined this or if you cannot answer it from the visualization, explain why not.

An electric bike, also known as an e-bike, is a bicycle with a battery-powered “assist” that comes via pedaling. An online product recommendation service that tests and reviews products has gathered a representative sample of 15 e-bikes from a single manufacturer and measured their ranges (how far they can go on a full battery without recharging). Based on this sample, they calculated an average range of 60 kilometers, plus or minus 10 kilometers. Suppose you’re in the market for an e-bike and during your research you come across the following two items:

An e-bike with a range of 85 kilometers.
A report from a different product recommendation service that has also gathered data from a different, but also representative sample of 15 e-bikes from this same manufacturer, with a mean range of 85 kilometers.

Which one of these makes you doubt the original report more?
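The comparison in this item turns on the difference between the spread of individual e-bikes and the spread of sample means. The arithmetic below is a rough sketch of that difference, not part of the original item. It assumes, purely for illustration, that the reported "plus or minus 10 kilometers" is about two standard errors of the mean; the item itself does not say how the margin was computed. All variable names are hypothetical.

```python
import math

n = 15                    # e-bikes in the original sample (from the item)
reported_mean_km = 60.0   # reported average range (from the item)
margin_km = 10.0          # reported "plus or minus" value (from the item)

# ASSUMPTION (not stated in the item): the margin is roughly 2 standard errors.
standard_error_km = margin_km / 2

# Standard error of a mean = SD / sqrt(n), so the implied SD of individual bikes is:
bike_sd_km = standard_error_km * math.sqrt(n)

gap_km = 85.0 - reported_mean_km  # both new findings sit 25 km above the reported mean

# How unusual is a single bike at 85 km, measured in individual-bike SDs?
single_bike_z = gap_km / bike_sd_km
# How unusual is another sample of 15 with mean 85 km, measured in standard errors?
sample_mean_z = gap_km / standard_error_km

print(f"implied SD of individual bikes: {bike_sd_km:.1f} km")
print(f"single bike:  {single_bike_z:.2f} SDs from the reported mean")
print(f"sample mean:  {sample_mean_z:.2f} SEs from the reported mean")
```

Under a different reading of the margin the numbers change, but the sample mean is always sqrt(15) times as many standard units from 60 as the single bike is.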

A data scientist for a large urban hospital examined a sample of data to estimate the mean number of births that took place on Fridays and Saturdays. The plots below show the number of births that took place on either a Friday or Saturday for that sample.

To estimate the mean number of births that took place on Fridays and Saturdays, the data scientist computed confidence intervals (mean $\pm$ margin of error) for both days. Unfortunately they forgot which mean and margin of error was associated with each day.

Which mean is associated with Friday? Explain.

8,350
11,800

Which margin of error is associated with Friday? Explain.

The data scientist was asked to use the fitted regression model to predict the revenue for a horror movie using two different potential budgets: a budget of $\$25\mathrm{M}$ and a budget of $\$50\mathrm{M}$. They were also asked to compute a prediction interval for each of these two predictions to estimate its uncertainty. The scatterplot and fitted regression line for horror movies are displayed below.

Which of the predictions would have a greater predicted revenue associated with it? Explain.

$\mathrm{budget} = \$25\mathrm{M}$
$\mathrm{budget} = \$50\mathrm{M}$
They are the same.
Not enough information to determine this.

A data scientist has trained four different classification models (null model, Naive Bayes model, k Nearest Neighbors (kNN) model, and random forest model) on a set of data. The observed responses and the model predictions for a set of 10 observations from a validation set of data are shown in the table below.

	Prediction
Observed Responses	Null	Naive Bayes	kNN	Random Forest
No	No	No	No	No
No	No	Yes	Yes	No
No	No	No	No	No
No	No	No	No	No
Yes	No	No	No	No
No	No	Yes	Yes	No
No	No	No	No	No
No	No	No	No	No
Yes	No	Yes	Yes	Yes
Yes	No	Yes	Yes	Yes

Are the predictions from the kNN model more, less, or equally as accurate as the results from the null model? Explain.
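Accuracy here means the proportion of validation observations whose predicted class matches the observed response. The sketch below transcribes the table above and computes that proportion for each model. The function and variable names are illustrative and not part of the original item.

```python
# Validation-set responses and predictions, transcribed from the table above.
observed = ["No", "No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes"]
predictions = {
    "Null":          ["No"] * 10,
    "Naive Bayes":   ["No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
    "kNN":           ["No", "Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Random Forest": ["No", "No", "No", "No", "No", "No", "No", "No", "Yes", "Yes"],
}


def accuracy(truth, predicted):
    """Proportion of observations whose prediction equals the observed response."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches / len(truth)


accuracies = {name: accuracy(observed, preds) for name, preds in predictions.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.1f}")
```

Two models with the same accuracy can still make different kinds of errors, so it can be worth comparing which rows each model gets wrong as well as the overall proportion.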

The figure shows a plot of prediction error as a function of model complexity for a training and validation sample. Which sample (training or validation) is associated with the orange, solid line? Explain.

The following three items were reworked into one context, as the current Movie Wrangling (pseudocode; column-wise operations; joins).

The two tables below provide data about several TV shows.

Creator Table
Creator	TV_Show	TV_Show_ID
Aguirre-Sacasa, Roberto	Riverdale	I
Blair, April	All-American	A
Dunham, Lena	Girls	L
Glover, Donald	Atlanta	B
Levitan, Steven	Modern Family	F
Lloyd, Christopher	Modern Family	F
Murphy, Kevin	Hellcats	E
Scheuring, Paul T.	Prison Break	H
Sherman-Palladino, Amy	Bunheads	C
Sherman-Palladino, Amy	Gilmore Girls	D
Star, Darren	Sex and the City	M
Star, Darren	Younger	K
Watson, Sarah	The Bold Type	J

TV Show Table
TV_Show	TV_Show_ID	Network	Seasons
All-American	01	CW	4
Atlanta	02	FX	4
Bunheads	03	ABC Family	2
Gilmore Girls	04	WB	7
Hellcats	05	CW	1
Modern Family	06	ABC	11
Ozark	07	Netflix	4
Prison Break	08	Fox	6
Riverdale	09	CW	6
The Bold Type	10	Freeform	5
Younger	11	TV Land	7

Consider the following two sets of pseudocode (i.e., code recipes). Would they produce the same results? Explain.

start_with(the Creator table) and_then

add_columns_from(the TV Show table matching_by the TV Show column) and_then

count_of(the number of rows for the CW network)

start_with(the TV Show table) and_then

add_columns_from(the Creator table matching_by the TV Show column) and_then

count_of(the number of rows for the CW network)
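The two recipes above can be tried out in plain Python. This is a minimal sketch that assumes `add_columns_from` behaves like a left join: every row of the starting table is kept, and a row is repeated once for each match in the other table. The pseudocode does not define this behavior, so `left_join` and `count_network` are hypothetical stand-ins rather than the item's intended implementation.

```python
# Rows transcribed from the Creator and TV Show tables above.
creator_rows = [
    ("Aguirre-Sacasa, Roberto", "Riverdale", "I"), ("Blair, April", "All-American", "A"),
    ("Dunham, Lena", "Girls", "L"), ("Glover, Donald", "Atlanta", "B"),
    ("Levitan, Steven", "Modern Family", "F"), ("Lloyd, Christopher", "Modern Family", "F"),
    ("Murphy, Kevin", "Hellcats", "E"), ("Scheuring, Paul T.", "Prison Break", "H"),
    ("Sherman-Palladino, Amy", "Bunheads", "C"), ("Sherman-Palladino, Amy", "Gilmore Girls", "D"),
    ("Star, Darren", "Sex and the City", "M"), ("Star, Darren", "Younger", "K"),
    ("Watson, Sarah", "The Bold Type", "J"),
]
show_rows = [
    ("All-American", "01", "CW", 4), ("Atlanta", "02", "FX", 4),
    ("Bunheads", "03", "ABC Family", 2), ("Gilmore Girls", "04", "WB", 7),
    ("Hellcats", "05", "CW", 1), ("Modern Family", "06", "ABC", 11),
    ("Ozark", "07", "Netflix", 4), ("Prison Break", "08", "Fox", 6),
    ("Riverdale", "09", "CW", 6), ("The Bold Type", "10", "Freeform", 5),
    ("Younger", "11", "TV Land", 7),
]
creators = [dict(zip(("Creator", "TV_Show", "TV_Show_ID"), r)) for r in creator_rows]
shows = [dict(zip(("TV_Show", "TV_Show_ID", "Network", "Seasons"), r)) for r in show_rows]


def left_join(left, right, key):
    """ASSUMED semantics of add_columns_from: keep every left row, repeat it once
    per matching right row, and leave right-hand columns absent when no row matches.
    On a column-name clash, the left table's value wins."""
    joined = []
    for left_row in left:
        matches = [r for r in right if r[key] == left_row[key]]
        if matches:
            joined.extend({**r, **left_row} for r in matches)
        else:
            joined.append(dict(left_row))
    return joined


def count_network(rows, network):
    """Count rows whose Network column equals the given network."""
    return sum(row.get("Network") == network for row in rows)


recipe_a = left_join(creators, shows, "TV_Show")   # start with the Creator table
recipe_b = left_join(shows, creators, "TV_Show")   # start with the TV Show table
count_a = count_network(recipe_a, "CW")
count_b = count_network(recipe_b, "CW")
print(len(recipe_a), count_a)
print(len(recipe_b), count_b)
```

Printing the row counts alongside the CW counts shows whether the intermediate tables differ even when the final counts agree. A different join rule, such as keeping only matched rows, could change both.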

Consider the following two sets of pseudocode (i.e., code recipes). Would they produce the same results? Explain.

start_with(the Creator table) and_then

add_columns_from(the TV Show table matching_by the TV Show column) and_then

count_of(the number of rows for the CW network)

start_with(the TV Show table) and_then

add_columns_from(the Creator table matching_by the TV Show ID column) and_then

count_of(the number of rows for the CW network)
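Whether matching on a column can add anything depends on the two tables sharing values in that column. A quick check, sketched here with values transcribed from the tables above, is to intersect the key values on each side. The variable names are illustrative only.

```python
# Key values transcribed from the Creator table and the TV Show table above.
creator_ids = {"I", "A", "L", "B", "F", "E", "H", "C", "D", "M", "K", "J"}
show_ids = {f"{i:02d}" for i in range(1, 12)}  # "01" through "11"

creator_titles = {
    "Riverdale", "All-American", "Girls", "Atlanta", "Modern Family", "Hellcats",
    "Prison Break", "Bunheads", "Gilmore Girls", "Sex and the City", "Younger",
    "The Bold Type",
}
show_titles = {
    "All-American", "Atlanta", "Bunheads", "Gilmore Girls", "Hellcats",
    "Modern Family", "Ozark", "Prison Break", "Riverdale", "The Bold Type", "Younger",
}

# Values that appear in both tables for each candidate matching column.
shared_ids = creator_ids & show_ids
shared_titles = creator_titles & show_titles

print("shared TV_Show_ID values:", sorted(shared_ids))
print("shared TV_Show values:", sorted(shared_titles))
```

If the intersection for a column is empty, a join on that column finds no matches. What happens to the unmatched rows then depends on the join rule, which the pseudocode leaves unspecified.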

The dataset below contains information on 8 people; we know their names and how many items they purchased online today.

name	number_of_items	visited_online_retailer
Miriam	10
Marcel	2
Ayesha	0
Rebecca	3
Lola	0
Laurence	1
Tomos	9
Abdul	0

You have been tasked with adding a new column called visited_online_retailer which indicates whether or not each person visited the website of an online retailer (“yes” if they did, “no” if they did not). Is there sufficient information in this dataset to generate this new column? Explain.

The data set park_visits contains the number of annual visitors to 376 national park sites in the United States from 1904–2016. The data were originally collected from the National Park Service. There are 20,920 total records in the data set, since not all of the parks were open for the entire date range. A few rows of park_visits data are shown below.

year	state	park_site	visitors
1904	AR	Hot Springs National Park	101000
1904	CA	Kings Canyon National Park	1000
1904	OR	Crater Lake National Park	1500
$\cdots$	$\cdots$	$\cdots$	$\cdots$
2016	WY	Devils Tower National Monument	496210
2016	WY	Fort Laramie National Historic Site	57444

A data scientist would like to find the most popular park in each state in 2016. To do so, they decided to create a new data table named most_visited_2016 that includes the national park site in each state with the most visitors in 2016. The final table will include 51 rows (one for each state and Washington D.C.) and the columns year, state, park_site and visitors. Four of the 51 rows of the table are shown below.

year	state	park_site	visitors
2016	AK	Klondike Gold Rush National Historical Park	912351
2016	AL	Little River Canyon National Preserve	462700
2016	AR	Buffalo National River	1785359
$\cdots$	$\cdots$	$\cdots$	$\cdots$
2016	WY	Yellowstone National Park	4257177

Arrange the steps to get from the original data set park_visits to the final table most_visited_2016.

Steps:

Start with park_visits
FILTER(year == 2016): Filter for observations in 2016
GROUP_BY(state): Group by state/Perform subsequent lines of code within each state
ARRANGE(DESC(visitors)): Sort the number of visits in descending order
SLICE(1): Take the first observation.
End with most_visited_2016
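The steps above can be sketched in plain Python. The rows below are only those shown in the two excerpts above, not the full 20,920-record data set, and the Yellowstone row is taken from the most_visited_2016 excerpt. The verbs FILTER, GROUP_BY, ARRANGE, and SLICE are mapped onto ordinary list and dictionary operations as one possible reading of the recipe.

```python
# A few rows transcribed from the excerpts above (not the full data set).
park_visits = [
    {"year": 1904, "state": "AR", "park_site": "Hot Springs National Park", "visitors": 101000},
    {"year": 1904, "state": "CA", "park_site": "Kings Canyon National Park", "visitors": 1000},
    {"year": 1904, "state": "OR", "park_site": "Crater Lake National Park", "visitors": 1500},
    {"year": 2016, "state": "WY", "park_site": "Devils Tower National Monument", "visitors": 496210},
    {"year": 2016, "state": "WY", "park_site": "Fort Laramie National Historic Site", "visitors": 57444},
    {"year": 2016, "state": "WY", "park_site": "Yellowstone National Park", "visitors": 4257177},
]

# FILTER(year == 2016): keep only the 2016 observations.
rows_2016 = [row for row in park_visits if row["year"] == 2016]

# GROUP_BY(state): collect the 2016 rows for each state.
by_state = {}
for row in rows_2016:
    by_state.setdefault(row["state"], []).append(row)

# ARRANGE(DESC(visitors)) and SLICE(1), applied within each state:
# sort that state's rows from most to fewest visitors and keep the first one.
most_visited_2016 = [
    sorted(rows, key=lambda row: row["visitors"], reverse=True)[0]
    for _, rows in sorted(by_state.items())
]

for row in most_visited_2016:
    print(row["year"], row["state"], row["park_site"], row["visitors"])
```

With the full data set, the same steps would yield one row per state. Ties for the top spot within a state would need an explicit rule, which the recipe does not specify.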

[Figure: regional map panels labeled Docked Bikes, Dockless Bikes, and e-Scooters; the three panel labels appear twice.]

A researcher randomly selects 10 students in a school and collects data about their age and number of siblings. They enter the data into a spreadsheet and are interested in calculating descriptive statistics. The software that they are using displays the data as shown below.

row	X1	X10	X3
1	2	11	1
2	3	12	3
$\cdots$	$\cdots$	$\cdots$	$\cdots$
8	9	12	2
9	10	8	0

10	Total	113	21

Would it be safe to assume that the average age of children in the sample is 11.3? Explain.

If you could change the way that the data are displayed in this software, would you change anything? If yes, then list the thing(s) that you would change.


¹ Data from: https://data.bts.gov/d/7m5x-ubud/visualization
