Appendix B: Item Graveyard

Bikes and Scooters 1

As a way to help the environment some cities in the U.S. are adding bike and e-scooter share stations which allow people to rent a bike or e-scooter for commuting or pleasure. The bikes and scooters are often kept at electronic docking stations at multiple locations around the cities. The following graphs were created using data from the Department of Transportation Statistics1 about public use of these shared dock systems in four U.S. cities—Chicago, Minneapolis, Portland, and Topeka.

Which of the four cities had the most bike docks in 2020? Explain how you determined this. Or if you cannot answer it from the visualization, explain why not.

The bar chart (a) shows the the percentage of docks in each city that were bike docks for each year from 2015–2020.

The pie chart (b) was created by plotting the percentage of bike docks for each year in one particular city. Unfortunately, the data scientist has forgotten which city this is. Using the information in the bar chart, identify the city. Explain how you determined this or if you cannot answer it from the visualization, explain why not.

Which of the four cities had the biggest increase in the number of bike docks from 2018 to 2020? Explain how you determined this or if you cannot answer it from the following line graph visualization, explain why not.

Bikes and Scooters 2

The map below shows the number of cities in each region of the United States that have docked bikes, dockless bikes, or e-Scooters in both 2018 and 2020. Use that information to answer each of the following questions. For each question, explain how you determined your answer, or if you cannot answer it from the visualization, explain why not.

2018 2020
Docked Bikes Dockless Bikes e-Scooters Docked Bikes Dockless Bikes e-Scooters
West 17 14 7 13 7 3
Southwest 11 7 5 8 6 4
Midwest 27 5 8 16 6 6
Northeast 17 18 5 10 3 1
Southeast 29 13 9 14 11 11


How many cities in the Southeast had e-Scooters in 2018?

In 2020, the Southwest region has more docked bike stations than dockless bike stations.

The majority of regions decreased the number of e-scooter stations from 2018 to 2020.

Across the majority of regions, the trend is that over time, cities tend to be adopting dockless bikes rather than docked bikes.

Across the majority of regions, the trend is that over time, there are fewer cities that are making docked bikes, dockless bikes, and e-scooters available. Explain how you determined this or if you cannot answer it from the visualization, explain why not.

Bikes and Scooters 3

An electric bike, also known as an e-bike, is a bicycle with a battery-powered “assist” that comes via pedaling. An online product recommendation service that tests and reviews products has gathered a representative sample of 15 e-bikes from a single manufacturer and measured their ranges (how far they can go on a full battery without recharging). Based on this sample, they calculated an average range of 60 kilometers, plus or minus 10 kilometers. Suppose you’re in the market for an e-bike and during your research you come across the following two items:

  1. An e-bike with a range of 85 kilometers.

  2. A report from a different product recommendation service that has also gathered data from a different, but also representative sample of 15 e-bikes from this same manufacturer, with a mean range of 85 kilometers.

Which one of these make you doubt the original report more?

Births per Day

A data scientist for a large urban hospital examined a sample of data to estimate the mean number of births that took place on Fridays and Saturdays. The plots below show the number of births that took place on either a Friday or Saturday for that sample.

To estimate the mean number of births that took place on Fridays and Saturdays, the data scientist computed confidence intervals (mean \(\pm\) margin of error) for both days. Unfortunately they forgot which mean and margin of error was associated with each day.

Which mean is associated with Friday? Explain.

  1. 8350
  2. 11,800

Which margin of error is associated with Friday? Explain.

  1. 100
  2. 280

Movie Budgets 3

The data scientist was asked to use the fitted regression model to make a prediction for the revenue for a horror movie using two different potential budgets; a budget of \(\$25\mathrm{M}\) and a budget of \(\$50\mathrm{M}\) They were also asked to compute a prediction interval for these two predictions to estimate the uncertainty in the prediction. The scatterplot and fitted regression line for Horror movies is displayed below.

Which of the predictions would have a greater predicted revenue associated with it? Explain

  1. \(\mathrm{budget} = \$25\mathrm{M}\)
  2. \(\mathrm{budget} = \$50\mathrm{M}\)
  3. They are the same.
  4. Not enough information to determine this.

Model Comparison

A data scientist has trained four different classification models (null model, Naive Bayes model, k Nearest Neighbors (kNN) model, and random forest model) on a set of data. The observed responses and the model predictions for a set of 10 observations from a validation set of data are shown in the table below.

Prediction
Observed Responses Null Naive Bayes kNN Random Forest
No No No No No
No No Yes Yes No
No No No No No
No No No No No
Yes No No No No
No No Yes Yes No
No No No No No
No No No No No
Yes No Yes Yes Yes
Yes No Yes Yes Yes

Are the predictions from the kNN model more, less, or equally as accurate as the results from the null model? Explain.

Training or Validation

The figure shows a plot of prediction error as a function of model complexity for a training and validation sample. Which sample (training or validation) is associated with the orange, solid line? Explain.

The following three items were reworked into one context, as the current Movie Wrangling (pseudocode; column-wise operations; joins).

TV Show Wrangling

The two tables below provide data about several TV shows.

Creator Table
Creator TV_Show TV_Show_ID
Aguirre-Sacasa, Roberto Riverdale I
Blair, April All-American A
Dunham, Lena Girls L
Glover, Donald Atlanta B
Levitan, Steven Modern Family F
Lloyd, Christopher Modern Family F
Murphy, Kevin Hellcats E
Scheuring, Paul T. Prison Break H
Sherman-Palladino, Amy Bunheads C
Sherman-Palladino, Amy Gilmore Girls D
Star, Darren Sex and the City M
Star, Darren Younger K
Watson, Sarah The Bold Type J
TV Show Table
TV_Show TV_Show_ID Network Seasons
All-American 01 CW 4
Atlanta 02 FX 4
Bunheads 03 ABC Family 2
Gilmore Girls 04 WB 7
Hellcats 05 CW 1
Modern Family 06 ABC 11
Ozark 07 Netflix 4
Prison Break 08 Fox 6
Riverdale 09 CW 6
The Bold Type 10 Freeform 5
Younger 11 TV Land 7
  1. Consider the two following sets of pseudocode (ie. code recipe). Would they produce the same results? Explain.

A.

start_with(the Creator table) and_then

    add_columns_from(the TV Show table matching_by the TV Show column) and_then

    count_of(the number of rows for the CW network)

B.

start_with(the TV Show table) and_then

    add_columns_from(the Creator table matching_by the TV Show column) and_then

    count_of(the number of rows for the CW network)

  1. Consider the two following sets of pseudocode (ie. code recipe). Would they produce the same results? Explain.

A.

start_with(the Creator table) and_then

    add_columns_from(the TV Show table matching_by the TV Show column) and_then

    count_of(the number of rows for the CW network)

B.

start_with(the TV Show table) and_then

    add_columns_from(the Creator table matching_by the TV Show ID column) and_then

    count_of(the number of rows for the CW network)

Shopping Wrangling

The dataset below contains information on 8 people; we know their names and how many items they purchased online today.

name number_of_items visited_online_retailer
Miriam 10
Marcel 2
Ayesha 0
Rebecca 3
Lola 0
Laurence 1
Tomos 9
Abdul 0

You have been tasked with adding a new column called visited_online_retailer which indicates whether or not each person visited the website of an online retailer (“yes” if they did, “no” if they did not). Is there sufficient information in this dataset to generate this new column? Explain.

Park Wrangling

The data set park_visits contains the number of annual visitors to 376 national park sites in the United States from 1904–2016. The data were originally collected from the National Park Service. There are 20,920 total records in the data set, since the parks were open for the entire date range. A few rows of park_visits data are shown below.

year state park_site visitors
1904 AR Hot Springs National Park 101000
1904 CA Kings Canyon National Park 1000
1904 OR Crater Lake National Park 1500
\(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\)
2016 WY Devils Tower National Monument 496210
2016 WY Fort Laramie National Historic Site 57444

A data scientist would like to find the most popular park in each state in 2016. To do so, they decided to create a new data table named most_visited_2016 that includes the national park site in each state with the most visitors in 2016. The final table will include 51 rows (one for each state and Washington D.C.) and the columns year, state, park_site and visitors. Four of the 51 rows of the table are shown below.

year state park_site visitors
2016 AK Klondike Gold Rush National Historical Park 912351
2016 AL Little River Canyon National Preserve 462700
2016 AR Buffalo National River 1785359
\(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\)
2016 WY Yellowstone National Park 4257177

Arrange the steps to get from the original data set park_visits to the final table most_visited_2016.

Steps:

  • Start with park_visits
  • FILTER(year == 2016): Filter for observations in 2016
  • GROUP_BY(state): Group by state/Perform subsequent lines of code within each state
  • ARRANGE(DESC((visitors)): Sort the number of visits in descending order
  • SLICE(1) : Take the first observation.
  • End with most_visited_2016

Data Cleaning

A researcher randomly selects 10 students in a school and collects data about their age and number of siblings. They enter the data into a spreadsheet and are interested in calculating descriptive statistics. The software that they are using displays the data as shown below.

row X1 X10 X3
1 2 11 1
2 3 12 3
\(\cdots\) \(\cdots\) \(\cdots\) \(\cdots\)
8 9 12 2
9 10 8 0
10 Total 113 21

Would it be safe to assume that the average age of children in the sample is 11.3? Explain.

If you could change the way that the data are displayed in this software, would you change anything? If yes, then list the thing(s) that you would change.


  1. Data from: https://data.bts.gov/d/7m5x-ubud/visualization↩︎