Assessment Development

Phase 0: Initial Cleaning and Feedback

In January 2022, I inherited a GitHub repository with several documents: most provided background information on the topics to be included in the assessment, and one contained all currently written questions at various stages of completion. These questions were not yet organized into specific groups of stem and items, many sections were commented out or overwritten, and the document had clearly been written piecewise by a group of people. My first true task was to run through the current questions myself, answer them as I would in an assessment-like context, provide feedback on clarity and wording, and reflect on the topics covered.

From this initial feedback, I created three pull requests attempting to address some of these concerns. These first fixes were modestly substantive edits: changing “most” to “the majority of” to make a question less ambiguous, adding a “fill-in-the-blank” slot to make it clearer what a question was asking, and adding background information on confidence intervals to Births per Day (interpret visualization; summary statistics). The first two were approved by the rest of the team, but they rejected the third, asserting that the question stem already provided sufficient motivation for confidence intervals.

I also cleaned up any “obvious” fixes from my initial review. These “no-brainer” edits mainly addressed typos, distorted or obscured plots, and items phrased in a way that didn’t actually pose a question. It was also time to clean up the current document by converting it to a Quarto book, which would allow for easy webpage-like navigation. This marked my first exposure to Quarto, and I quickly grew to love its improvements to the RMarkdown workflow, such as the ease of rendering to different formats and the intuitive structure of the index and YAML files. This was also my first moment of creative liberty with the project, as I was tasked with dividing the ungrouped set of questions into discrete passages, each consisting of a stem and one or more corresponding items. Each of these passages—26 at the time—was given a title to identify it in the Quarto book sidebar.
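For readers unfamiliar with Quarto books, the structure is driven by a single YAML configuration file that lists the chapters, which is what makes the sidebar navigation essentially free. Below is a minimal sketch of such a file; the title and chapter file names are illustrative, not the project’s actual files.

```yaml
# _quarto.yml -- illustrative sketch; file names and title are hypothetical
project:
  type: book

book:
  title: "Introductory Data Science Assessment"
  chapters:
    - index.qmd           # landing page
    - storm-paths.qmd     # one .qmd file per passage, titled for the sidebar
    - movie-budgets.qmd
    - he-said-she-said.qmd

format:
  html:
    theme: cosmo          # rendered as a navigable website
```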

The final step before presenting the assessment externally was to improve reproducibility: many figures were not being re-rendered with each update, but rather embedded as static images. Using the existing images as guides, I recreated a map of the US colored by region, movie-themed data tables, and pseudocode chunks. I had some slight exposure to HTML and CSS from previous classes, but this new challenge of matching an existing format allowed me to expand my knowledge of web layout and styles.
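To give a sense of this reproducibility work, here is a minimal sketch of how a US map colored by region can be regenerated in code rather than embedded as a static image. It assumes ggplot2, dplyr, and an installed maps package, and uses base R’s Census region assignments; the original figure’s palette and labeling are not reproduced exactly.

```r
library(ggplot2)
library(dplyr)

# State outlines from the maps package; `region` here is the lowercase state name
states <- map_data("state")

# Census region for each state, from base R's state.name / state.region
regions <- tibble(
  region    = tolower(state.name),
  us_region = as.character(state.region)
)

states |>
  left_join(regions, by = "region") |>
  ggplot(aes(x = long, y = lat, group = group, fill = us_region)) +
  geom_polygon(color = "white", linewidth = 0.2) +
  coord_map() +
  labs(fill = "Region") +
  theme_void()
```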

By April 2022, at the conclusion of my first semester on the project, we had a polished, reproducible, website-hosted prototype of the assessment ready to present externally for feedback.

Phase 1: Faculty Interviews

The remainder of my time working on the assessment—April 2022 to February 2023—was spent gathering feedback via interviews, iteratively updating items, and removing weaker items identified in group discussions. We conducted a series of three think-aloud informational interviews with faculty from around the country who teach or have taught introductory data science courses. Participants were recruited from a shortlist of Dr. Çetinkaya-Rundel’s data science education contacts and were specifically chosen to represent a breadth of data science curricula. The three interviewees, in order of interview, were:

| Name | Department Affiliation | Institution Type | Field of Ph.D. Dissertation |
|------|------------------------|------------------|-----------------------------|
| Professor X | Statistics | Liberal arts college | Biostatistics |
| Professor Y | Computer Science | R1 research university | Computer science education |
| Professor Z | Computer and Information Science | Liberal arts school within university | Statistics |

Each interview was scheduled for two hours and consisted of three sections: open-ended introductory and concluding discussions sandwiching an item-by-item run-through of the assessment. In the initial discussion, we asked participants two big-picture questions: What topics must be in an introductory data science course? And what topics are nice to have in an introductory data science course? For each item in the run-through, we asked interviewees to narrate their thinking process aloud from start to finish: any initial reactions, their process for arriving at an answer, and what that answer would be. We then asked for additional comments or suggestions, ranging from small-scale (formatting changes) to large-scale (removing the question entirely). We concluded with three big-picture questions: What are the strengths of the current assessment? What topics are missing from the current assessment? And what is in the current assessment but doesn’t belong?

While I scheduled and directed the flow of the interviews myself, I was joined by Dr. Çetinkaya-Rundel for all three, Dr. Legacy for the second and third, and Dr. Beckman for the third. These team members observed silently and took notes in real time while I conversed with the interviewee, supplementing the transcript. Following each interview, I spot-checked the accuracy of the automatic transcription, focusing on moments where I remembered the interviewee offering a particularly powerful insight. I then augmented the interview notes with such quotes and clarified what I could while the conversation was still fresh in my mind. Ultimately, I hoped to leave a clear record of the information most crucial for our team discussions: what the interviewees selected as a response to each item, whether their reasoning for that response differed from our intention, and whether they would include or omit each item.

Discipline-Specific Perspectives

After each of the three interviews, we met to discuss the new feedback and, when appropriate, made modifications and deletions to the current assessment. While the interviewees came from distinct backgrounds, some salient themes about their perspectives on introductory data science emerged through patterns in their responses.

Professor X (Statistics) seemed to share a broadly similar perspective on “what is data science” with the research team—chiefly, visualization and wrangling. He responded positively when items featured multivariate data, particularly when interpreting visualizations. He also brought a wealth of pedagogical experience with real-life data to our consideration: for example, that movie budgets and revenue display a less compelling relationship than one would expect, that movie-themed data in general are less relevant to younger generations, and that county-level data are generally too heterogeneous to be useful in most contexts. He also grounded our conception of what a pre-test student would know by noting that residuals are now part of the Common Core in K-12 education. At the same time, he felt that residual analysis and other related modeling topics, like supervised learning, didn’t align with his experience teaching introductory data science. Still, as in the case of Realty Tree (regression tree), he acknowledged that some topics falling outside the scope of an introductory class could nonetheless be reasoned through by an introductory student and motivate them to learn more.

Professor Z (Statistics and Information Science) held opinions on the focus of data science similar to Professor X’s—visualization and wrangling. The main takeaway from her feedback was that we needed to consider carefully the scenarios posed in question stems. Hurricane data, as in Storm Paths (simulated data; interpret uncertainty), she claimed, may provoke discomfort in students from hurricane-prone areas. She noted that the former wording of He Said She Said (interpret visualization), in which the verbs in the items were in the present tense while those on the plot itself were in the past tense, may present a burden to non-native English speakers. The SAT, referenced in Banana Conclusions (causation; statistical communication), may need more explanation for international students who are less familiar with the US admissions process. She also, like Professor X, thought that some of our more technical questions breached the scope of introductory data science, such as residual analysis and the distinction between training and validation sets. She was similarly a fan of Realty Tree (regression tree). However, while Professor X was in favor of summary statistic questions and mapping them to graphs, Professor Z called out language like “margin of error” in Births per Day (interpret visualization; summary statistics) as too statistical and less appropriate here.

Because Professor Y was our computer science education representative, we knew her interview would offer a unique perspective. This was evident from the start, when her first response to my question of “what is essential” was to ask whether the course assumed prior coding experience. After I clarified that we were assuming no coding experience, she answered that learning to code “in the order that matters for data science”, e.g. functions first, is a primary topic, as is some understanding of transforming and cleaning data. But, she qualified, not too much, since much of the cleaning can be done in advance for students in an introductory class. It wasn’t until the “what is nice to have” question that she mentioned data visualization. This CS-focused perspective continued when she pointed out that our pseudocode didn’t specify which type of join would be performed and that basic English vocabulary like “filter” and “select” might not be known in its data wrangling context, and when she generally dissuaded us from using pseudocode on something that could be a pre-test. Another notable pattern was that some of her suggested edits went against data visualization best practices: she suggested we change from a rainbow color scheme to a gradient one for the categorical data in Bikes and Scooters 1, and that we reorder the y-axis in He Said She Said (interpret visualization) to be in alphabetical rather than sorted order. Finally, she led us to remove the logarithmic transformation present in Movie Budgets 1 (compare summary statistics visually) and all related questions, explaining that since the transformation itself isn’t relevant to the desired learning outcome, students seeing those words continually repeated may start to lose sight of the item’s objective.
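Professor Y’s point about unspecified joins is easy to illustrate. In the sketch below, the tables and values are hypothetical (loosely echoing the assessment’s movie theme) and dplyr is assumed; an inner join and a left join applied to the same two tables return different results, so pseudocode that just says “join” is ambiguous.

```r
library(dplyr)

# Hypothetical tables; not the assessment's actual data
movies  <- tibble(title  = c("Movie A", "Movie B", "Movie C"),
                  budget = c(10, 25, 5))
ratings <- tibble(title  = c("Movie A", "Movie B"),
                  rating = c(7.8, 6.4))

inner_join(movies, ratings, by = "title")  # 2 rows: only movies with a rating
left_join(movies, ratings, by = "title")   # 3 rows: Movie C kept, with rating NA
```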

Regrouping to synthesize feedback

While the team met once between each faculty interview to make minor changes before the next one, the bulk of the revision took place after all three were completed, in the Fall 2022 semester. A central theme from all interviews was that our questions designed to test tricky, nuanced concepts (e.g. reading data into statistical software) in a multiple choice format weren’t landing as we hoped. While we thought we had written a former item, Data Cleaning (computing with data; numerical reasoning), to be sufficiently language-agnostic, it ended up being too R-specific and baffling to faculty. For this item, and several others, Professor Z summed up the crux of our dilemma well (paraphrased): “Hmm... I see what concept you’re getting at, but this question doesn’t really get there... Okay, well, this is something that’s hard to write to be auto-gradable but also actually measure. I don’t know how I would fix this, but I really like the underlying idea.”

Encouraged to keep these concepts on the assessment with modifications, we found it difficult to cut any of the ~50 pilot questions we had at this point. One notable large-scale fix was combining the concepts tested in three former wrangling passages (Park Wrangling (pseudocode; data wrangling), Shopping Wrangling (data wrangling; column-wise operations), and TV Show Wrangling (data wrangling; joins)) into a single context, hoping to further reduce students’ cognitive load. We knew we needed to cut the item count down to the ~30 range, but found it very difficult to identify the items contributing the least to the current assessment. Interestingly, this involved almost no large disagreements between team members—there were no particularly polarizing items like those we had observed during the faculty interviews—rather, we all genuinely couldn’t identify any items we wanted to cut.

We did eventually make rounds of sacrificial cuts, prioritizing the culling of items that even slightly overlapped one another and of those that were the most straightforward. We acknowledged the trade-off that this might leave students taking the assessment in a pre-test context unable to confidently answer a majority of the questions. However, we were constrained by the need for an instrument that could reasonably be given in an hour, and cutting several entire passages was the only way forward. Nevertheless, there were a handful of similar items that we decided to keep for the student interviews, such as two pseudocode chunks (in Movie Wrangling (pseudocode; column-wise operations; joins)) that varied by a single statement. Here, we were ambivalent about which would be more effective and hoped to let student feedback dictate whether one “stuck” better.

Phase 2: Student Interviews

The chief purpose of Phase 2 was to see whether the topics covered were at the appropriate level for students with data science exposure, as well as to continue to refine wording and pacing. Having previously discussed “what is data science” with faculty, we used a similar series of interviews with Duke Statistical Science student teaching assistants (TAs) to start drilling down and seeing whether items landed the way we thought they would.

To reflect this difference in the type of feedback being solicited, we omitted the initial broader-scope questions when interviewing TAs. Combined with the fact that the assessment was now about half its original length, these interviews were only one hour long. The final portion of big-picture questions was kept, though, with slightly modified prompts: Are the pacing and length appropriate? Based on what you remember learning in intro data science, what topics are missing from the current assessment? And based on what you remember learning in intro data science, what is in the current assessment but doesn’t belong?

We first reached out in November 2022 to all then-current TAs for STA 199 or STA 198, its health-themed analog. However, we were unable to recruit any students so close to finals, and we reached out to the same group again in January. The interviews were conducted in early February, so the participants were students who had been TAs for STA 199 at least as recently as the previous Fall semester. The three TAs recruited, in order of interview, were:

| Name | Degree Year and Level | Program |
|------|-----------------------|---------|
| TA A | 2nd year Master's | Statistical Science |
| TA B | 4th year Undergraduate | Economics major, Statistical Science minor |
| TA C | 1st year Master's | Statistical Science |

In general, while still engaging well and demonstrating a strong command of data science skills, the TAs tended to be much less verbose in their feedback. Across all interviews, a former item in Realty Tree (regression tree) that asked students to trace a non-trivial regression tree four separate times stood out as an interruption to the otherwise smooth flow of the assessment. Another common trend across TA interviews was the suggestion to modify plots’ themes for clarity. This surprised me, as I had attempted to match the cosmetic options of Duke Statistical Science courses wherever I could.
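For context, the kind of cosmetic standardization I was aiming for can be set once and applied to every figure in a document. The sketch below is purely illustrative; the theme and palette shown are not the courses’ exact settings.

```r
library(ggplot2)

# Set one theme globally so every subsequent figure shares the same look
theme_set(theme_minimal(base_size = 14))

# Example plot picks up the global theme; a colorblind-friendly discrete palette
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  scale_colour_viridis_d()
```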

Another issue revealed in the TA interviews was the question order. We had originally arranged items near others testing a similar topic to facilitate our iterative editing. However, there was a clear case of cognitive priming in TA A’s interview: having just correctly answered Image Recognition (ethics; representativeness of data), he immediately jumped to examining ethical implications in the following item, Application Screening (ethics; proxy variable), while the faculty members, who had answered the questions in a different order, required more consideration there. Additionally, TA B was the first interviewee to answer incorrectly on the item I consider both the single trickiest and the best written in terms of getting students to think critically about what exactly plots are displaying (#3 in He Said She Said (interpret visualization)). This stood out to me because she was the only undergraduate interviewed; we observed a stratification of comfort with data science even among TAs. Finally, TA C was the quickest interviewee to run through the questions, answering all of them correctly and citing the logic we were looking for. He was the newest of the three to the Duke Statistical Science program and had graduated from an undergraduate data science program that used Python; we interpreted his notable ease in getting through all questions in much less than an hour as a good sign that the current length might be ideal for students, who would be less experienced with the material but would also not have to think aloud or give feedback.

Regrouping to synthesize feedback/final pilot assessment

Not many changes were made following the student interviews in Phase 2, a sign that we were converging on a viable prototype for distribution. To remedy the priming issue encountered in TA A’s interview, we shuffled the question order between the TA B and TA C interviews and landed on a solid layout: one that distributes topics well throughout and gradually ramps up our intended difficulty, to prevent less-knowledgeable students from getting frustrated and giving up early on. As part of this pacing reform, we improved the flow of the assessment by reformatting Movie Budgets 2 (\(R^2\); compare trends visually) and Disease Screening (compare classification diagnostics visually) from groups of several nearly-identical items into single matrix questions.

Final assessment themes

The following table summarizes the learning objectives for each item in the most current version of the assessment.

| Passage | Learning Objective(s) |
|---------|-----------------------|
| Storm Paths | modeling; simulated data; interpret uncertainty |
| Movie Budgets 1 | compare summary statistics visually |
| Movie Budgets 2 | modeling; \(R^2\); compare trends visually |
| Application Screening | ethics; modeling; proxy variable |
| Banana Conclusions | causation; statistical communication |
| COVID Map | interpret complex visualization; spatial data; time series data |
| | interpret complex visualization; sophisticated scales |
| He Said She Said | interpret basic visualization |
| | interpret basic visualization |
| | interpret basic visualization; sophisticated scales |
| Build-a-Plot | data to visualization process |
| Disease Screening | compare classification diagnostics visually |
| Realty Tree | modeling; regression tree |
| | modeling; regression tree; variable selection |
| Website Testing | interpret trends visually; visualize uncertainty; modeling; time series data |
| | interpret trends visually; visualize uncertainty; modeling; time series data; extrapolation |
| | interpret trends visually; visualize uncertainty; modeling; time series data; extrapolation |
| Image Recognition | ethics; modeling; representativeness of training data |
| Data Confidentiality | ethics; data deidentification; statistical communication |
| Activity Journal | structure data; store data |
| Movie Wrangling | data cleaning; column-wise operations; string operations |
| | data cleaning; column-wise operations; string operations |
| | data cleaning; extrapolation |
| | data wrangling; pseudocode; joins |
| | data wrangling; pseudocode; joins |
| | data wrangling; pseudocode; joins |

Phase 3: Large-scale Student Pilot

Looking ahead, the next step after individual item tweaking and refinement will be real-world measurement of pacing, length, and feasibility for introductory data science students. In collaboration with the professors of Duke’s introductory data science course, STA 199, we will invite students across three sections to take the assessment as part of a typical homework assignment for the course. We will not solicit general feedback as in the previous interviews; instead, since students will be completing the assessment on their own time, we will ask just a single question at the end to get a rough estimate of how long it took to complete. The score students receive will not reflect their performance on the assessment, but will simply be awarded in proportion to how much they complete. To ensure students actively engage with the questions, two attention checks will be added that must also be passed to earn credit.

In order to distribute the assessment at large and record students’ responses, the Quarto book has been converted into a Qualtrics survey. Once data collection is complete, a large-scale analysis will be conducted, both to ensure that all questions are indeed appropriate for introductory-level students and to explore potential instrument subscales.
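As one illustration of what the subscale exploration could look like (the team has not committed to a particular method; the `responses` object and the package choice below are assumptions), exploratory factor analysis on a scored 0/1 item matrix is a common starting point:

```r
library(psych)

# `responses`: hypothetical data frame with one row per student and one 0/1 column per item.
# Tetrachoric correlations ("tet") are appropriate for dichotomously scored items.
fa.parallel(responses, cor = "tet")                      # suggests a number of factors
fa_fit <- fa(responses, nfactors = 2, cor = "tet")       # fit with a candidate factor count

fa_fit$loadings   # items that load together point to candidate subscales
```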