Package Development

In addition to developing the introductory data science assessment, my work focused on another angle of data science education: the instructional material itself. Because the assessment is motivated by the growing number of introductory data science learners, studying and developing tutorials for those learners contributes to a more comprehensive understanding of introductory data science education. Thus, my thesis project also includes my work on the dsbox R package. According to the package’s website, “The goal of dsbox is to supplement the Data Science Course in a Box project. The package contains the datasets that are used in the materials in Data Science Course in a Box as well as the learnr tutorials.” The larger Data Science in a Box project seeks to make data science education more accessible by providing open source resources in the form of a fully fledged, university-level introductory data science course. The website contains lecture slides, assignments, weekly course topic outlines, and more, all available for free download. The dsbox package aims to further facilitate access to this curriculum by condensing the material into 10 freestanding, interactive, automatically graded learnr tutorials. learnr is a package that combines the design of HTML and CSS and the interactivity of Shiny with R code to turn R Markdown documents into self-guided, lesson-style tutorials capable of grading themselves. Most of the scalable autograding features, such as the option to provide a code chunk as a correct solution, are provided by the gradethis package.

Phase 1: Initial tidying work

During my first semester on the project (Spring 2022), the goal was for me to get oriented with learnr and the process of developing a package on GitHub, and to create all content for publication. At this time, the package already had nine tutorials, along with their associated datasets and documentation.

My first edits and associated pull requests during this period mostly concerned the wording of answer choices. Given the introductory nature of the tutorials, I added more descriptive language for explaining statistical concepts (e.g. “multiple peaks” vs “bimodal”) in several places. For multiple choice questions, the learnr/gradethis tooling lets you attach a separate feedback message to each possible answer choice. In several places, I took advantage of these messages to give more scaffolded feedback on what was incorrect about the chosen answer, rather than just “try again.”
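As a rough illustration of answer-specific feedback, the sketch below uses learnr's `question()` and `answer()` functions, where each `answer()` can carry its own `message`. The question text, answer choices, and feedback wording are invented for this example and are not taken from the dsbox tutorials; whether dsbox wires its feedback in exactly this way is an assumption.

```r
library(learnr)

# Hypothetical question; the wording is illustrative, not from dsbox.
shape_question <- question(
  "Which description best fits a histogram with two distinct peaks?",
  answer(
    "Symmetric with a single peak",
    message = "Look again at how many separate peaks the histogram has."
  ),
  answer(
    "Skewed to the right",
    message = "Skew describes a long tail on one side, not the number of peaks."
  ),
  # Plain-language description first, technical term in parentheses.
  answer("It has multiple peaks (bimodal)", correct = TRUE),
  allow_retry = TRUE
)
```

In a tutorial, a call like this would sit inside an R Markdown code chunk, and a learner who picks a wrong option would see the message attached to that option instead of a generic prompt to retry.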

The most substantial change I made was in the second tutorial, which explores UK traffic accident data. One question asked learners to filter for a particular level of the binary urban/rural variable and report how many rows remained. The data dictionary did not explain which level (“1” or “2”) corresponded to urban and which to rural, so I went looking in the data’s source link. That link was broken, but the working link I replaced it with gave me the necessary information. It turned out that the levels were reversed relative to the answer recorded as correct in gradethis, so I edited the grading logic to match.
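To make the effect of the reversed levels concrete, here is a minimal sketch in base R. The data frame, its column names, and the coding (1 = urban, 2 = rural) are all assumptions made up for this example; they do not reproduce the real accident data, and the tutorial itself may use dplyr rather than `subset()`.

```r
# Hypothetical stand-in for the accidents data. The column name and the
# level coding (1 = urban, 2 = rural) are assumptions for illustration.
accidents_demo <- data.frame(
  accident_id = 1:6,
  urban_rural = c(1, 2, 1, 1, 2, 1)
)

# Keep only the rows coded as urban and count what remains.
n_urban <- nrow(subset(accidents_demo, urban_rural == 1))

# If the coding were read the other way around, the same question
# would have a different "correct" row count.
n_if_reversed <- nrow(subset(accidents_demo, urban_rural == 2))

n_urban
n_if_reversed
```

Because the two counts differ, an answer key built on the wrong reading of the levels would mark a correct submission as wrong, which is why the grading logic had to change along with the data dictionary.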

While updating the data dictionary, I became well acquainted with the process of writing data documentation for an R package. This involved creating a document in a specific LaTeX-style format, filling in the necessary background, variable descriptions, and example code, and letting the roxygen2 package generate the documentation files. The result is the familiar documentation shown in the RStudio “Help” tab for documented R functions and datasets.
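For readers unfamiliar with this format, the following is a minimal sketch of what roxygen2 documentation for a dataset can look like. The dataset name, variable names, and descriptions are placeholders invented for this example, not the dsbox data dictionary.

```r
#' UK road traffic accidents (illustrative documentation skeleton)
#'
#' Hypothetical example of roxygen2 dataset documentation. The variable
#' names and descriptions are placeholders, not the dsbox data dictionary.
#'
#' @format A data frame with one row per accident:
#' \describe{
#'   \item{accident_id}{Unique identifier for the accident.}
#'   \item{urban_rural}{Area type, coded as an integer level.}
#' }
#' @source Placeholder for the link to the original data source.
#' @examples
#' head(accidents_demo)
"accidents_demo"
```

The special `#'` comments hold the documentation, and the quoted dataset name on the last line tells roxygen2 which object they describe. From comments like these, roxygen2 generates the corresponding .Rd file.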

Phase 2: Creating a new tutorial

Once I was familiar with the package’s goals, structure, and build process, it was time to round out the set of tutorials. With nine tutorials already in place, the only homework assignment from the Data Science in a Box curriculum that had yet to be converted dealt with data on the locations of Denny’s restaurants and La Quinta inns. Skeleton .Rd files for these datasets already existed in the dsbox files, so I inferred that the original plan for dsbox was to have a tenth tutorial working with these data.

While the assignment existed in a typical format on the Data Science in a Box website, I was tasked with thinking critically about how best to translate its learning objectives into a learnr tutorial. The key was to alternate background information and coding exercises early on, standing in for the lecture component that these self-guided online tutorials lack. Breaking single homework questions into a series of code steps was essential for flow, and it ensured that the final tasks would be achievable even if the intermediate steps had not been entirely mastered. I also intentionally left the final set of exercises less scaffolded and somewhat open-ended, so that they could serve as a more summative assessment of students’ data science skills across all ten tutorials.

Phase 3: CRAN submission

In the final semester of the project, the goal was to complete all necessary checks for publication to the Comprehensive R Archive Network (CRAN). CRAN, among other functions, hosts an archive of packages published for the R programming language; because developers must meet rigorous specifications when uploading packages, installation from CRAN is straightforward. However, we encountered a major roadblock when trying to release dsbox in spring 2023. One of the packages that dsbox depends on, gradethis, has not yet been released on CRAN. This prevents us from releasing a package that would require its installation.

My learning this semester mostly concerned higher-level software development practices, such as package checks and automation with GitHub Actions. Before a package can be considered ready for publication, it must pass R CMD check, a series of checks that essentially ensure the source files can be properly built and installed on a variety of machines. GitHub Actions facilitates this by automatically running R CMD check every time we push new commits to GitHub.

The first task turned out to be updating these Actions workflows: it had been so long since they were run that the version they used, version one, had been deprecated. This took some trial and error, but I eventually figured out how to replace the necessary pieces of each workflow while keeping the dsbox-specific components. Inspired to modernize other elements, I replaced all instances of the magrittr pipe (%>%) with the dependency-free base R pipe (|>). With all tutorials using the base pipe, the package now requires R version 4.1 or above. Other project-wide edits made at this stage were standardizing all tutorials to use American English, updating all broken links, and reducing the sizes of the tutorials’ cover photos to meet the 5 MB maximum allowed for a package on CRAN.
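The pipe replacement can be illustrated with a small sketch. The numbers below are made up; the point is only that the base pipe expresses the same computation as nested calls without loading any package, at the cost of requiring R 4.1.0 or later.

```r
# The native pipe (|>) is only available from R 4.1.0 onward.
stopifnot(getRversion() >= "4.1.0")

# Toy data; values are made up for illustration.
speeds <- c(30, 45, 60, 20, 70)

# Nested calls, read from the inside out.
nested <- round(mean(sqrt(speeds)), 2)

# The same computation with the base pipe, read left to right.
piped <- speeds |> sqrt() |> mean() |> round(2)

identical(nested, piped)
```

One practical difference from magrittr is that the right-hand side of the base pipe must be a function call written with parentheses, and magrittr's `.` placeholder is not supported, so a few expressions may need restructuring rather than a plain find-and-replace.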

While updating the R version requirement, we took the time to ensure the rest of the DESCRIPTION file contained the necessary elements. Every R package has a DESCRIPTION file, which records basics such as the package’s title, description, and version number, as well as the packages it depends on and that must be installed alongside it. Because we use particular features of the gradethis package, we needed it not only to be installed but also to be at least a specific version. This is where we ran into the roadblock described above: gradethis has not yet been released on CRAN.
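As a sketch of how such requirements are expressed, the snippet below parses a hypothetical DESCRIPTION excerpt with base R. Only the R version requirement comes from the text above; the choice of fields, the listing of learnr, and especially the gradethis version number are placeholders for illustration, not the contents of the actual dsbox DESCRIPTION file.

```r
# Hypothetical DESCRIPTION excerpt. The gradethis version shown is a
# placeholder, not the minimum version dsbox actually requires.
description_lines <- c(
  "Package: dsbox",
  "Depends: R (>= 4.1.0)",
  "Imports: learnr, gradethis (>= 0.0.0)"
)

# DESCRIPTION files use the Debian control file (DCF) format,
# which base R can parse directly with read.dcf().
con <- textConnection(description_lines)
fields <- read.dcf(con)
close(con)

fields[1, "Depends"]
fields[1, "Imports"]
```

A version constraint is written in parentheses after the package name, so a minimum R version and a minimum version of a dependency use the same syntax.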

We have submitted a feature request to the gradethis developers to release a CRAN version. According to the package’s website, the current version passes all necessary R CMD checks and has a healthy level of interest and engagement from the community. If a release does not occur in the near future, we still hope to publicize dsbox to the data science education community.