Assessment Background

An essential component of educational research is a validated, relevant instrument for measuring students’ learning outcomes. Whether used to award college credit, as with the College Board’s AP and CLEP exams, to gauge students’ background knowledge at the start of a course, or simply to evaluate educational interventions, such assessments have been developed and statistically analyzed for a wide range of subjects, including Spanish, Psychology, Chemistry, and Calculus (Godfrey and Jagesic 2016; Solomon et al. 2021; Mulford and Robinson 2002; Epstein 2013).

In the field of statistics, previous work on measuring students’ reasoning skills led to the development of the Comprehensive Assessment of Outcomes in Statistics (CAOS). The revised CAOS 4, comprising 40 multiple-choice items on commonly taught first-semester introductory concepts, was first administered in 2005 and allowed instructors to measure whether their courses were achieving their desired learning outcomes. However, many instructors found that their students’ understanding was much lower than expected, particularly on the topics of data visualization and data collection (Delmas et al. 2007). Notably, a key feature of the CAOS was that it required neither heavy computation nor recall of specific formulas or definitions, making it accessible for a variety of statistics-adjacent uses. In fact, while initially motivated as a research instrument, early pilots of the CAOS found that instructors used the assessment results “for a variety of purposes, namely, to assign a grade in the course, for review before a course exam, or to assign extra credit” (Delmas et al. 2007).

In 2023, as more high schools and universities continue to support the emerging field of data science through dedicated courses, concentrations, or even majors, there is a need to measure students’ learning outcomes in these introductory classes analogously to the subjects named above (Swanstrom, n.d.; Schanzer et al. 2022). Although the field lacks a clearly defined scope, empirical studies of so-called “data science” curricula suggest that it can be thought of as an augmentation of traditional statistical modeling concepts, with added emphases on computing, data visualization and manipulation, and a consideration of ethics and the role that data plays in society (Zhang and Zhang 2021).

Specifically, a review of five introductory data science courses found that, while the choice of language varied, all curricula involved some amount of computing or pseudocode (Çetinkaya-Rundel and Ellison 2021). The next most frequent topics across curricula were inference and modeling, closely followed by data visualization and data wrangling, with most courses also including smaller components on communication and ethics. This empirical set of topics corroborates theoretical results from an earlier, larger conference of 25 data science-adjacent faculty, which identified six key competencies for undergraduate data science majors: computational thinking, mathematical foundations, model building and assessment, algorithms and software foundation, data curation, and communication (De Veaux et al. 2017).

This motivates the need for a language-agnostic, broad-scope data science assessment that can be tailored further to meet the needs of specific programs. Given the diversity of the five curricula outlined in Çetinkaya-Rundel and Ellison (2021), collaborating with a group of data science faculty to write such an assessment allows a wide array of subjects to be covered while still letting each member develop questions based on material they personally teach. While this does not guarantee that every possible concept from an introductory data science course is covered, the goal becomes achieving saturation within themes from faculty feedback: no single topic should be identified across faculty as missing from a comprehensive assessment, even though individual perceptions of coverage may vary (Delmas et al. 2007). For items to best measure students’ thinking processes, think-aloud interviews with students are essential, not only to clarify potentially confusing wording, but also to ensure that students respond to each item via the thought process intended by the researchers (Reinhart et al. 2022). Previous work has found that concept inventory-style assessments, while successful at measuring students’ overall mastery of a topic, fail to specifically measure students’ inaccurate perceptions. Thus, a sound assessment should result from observing misconceptions during think-aloud interviews and then writing questions that target broader learning objectives rather than specific misconceptions (Jorion, Gane, James, et al. 2015).

In addition to measuring students’ overall data science mastery, the breadth of topics covered lends itself naturally to the development of subscales, or subsets of the overall scale’s items from which a student’s subscore on a particular topic can be calculated. Common methods for evaluating subscales include “subscale alphas, exploratory factor analysis, and confirmatory factor analysis” (Jorion, Gane, DiBello, et al. 2015). While the present focus of this work is the early stages of item creation, iteration, and think-aloud interviewing, opportunities to pilot the assessment with a full class must be tracked carefully, given how few such opportunities arise each semester (Study et al. 2018).
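As a rough illustration of the “subscale alphas” approach, the sketch below computes per-student subscores and Cronbach’s alpha for a small block of items. The four-item “visualization” subscale, the response matrix, and all values shown are hypothetical assumptions for demonstration only, not data from the assessment described here.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (students x items) matrix of 0/1 item scores."""
    k = item_scores.shape[1]                          # number of items in the subscale
    item_vars = item_scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = item_scores.sum(axis=1).var(ddof=1)   # variance of subscale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 students x 4 visualization items (1 = correct, 0 = incorrect)
viz_items = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
])

subscores = viz_items.mean(axis=1)   # each student's visualization subscore
alpha = cronbach_alpha(viz_items)    # internal consistency of the subscale
print(f"subscores: {subscores}, alpha: {alpha:.2f}")
```

In practice, such a check would be run only after a large-scale pilot administration, and would complement, rather than replace, the exploratory and confirmatory factor analyses named above.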