
Increasing interest in data literacy: The quantitative public health data literacy training program

Accepted author version posted online: 07 May 2024

Abstract

Due to the COVID-19 pandemic, the presentation of public health data to lay audiences has increased, yet most people lack the knowledge to understand what these statistics mean. Recognizing that minoritized populations are deeply impacted by the pandemic, and wanting to improve racial representation in biostatistics, we developed a training program aimed at increasing the data literacy of high school and college students from minoritized groups. The program introduced the basics of public health, data literacy, statistical software, descriptive statistics, and data ethics. The instructors taught eight synchronous sessions consisting of lectures and experiential group exercises. Five of the sessions were also offered asynchronously.

Of the 209 students, 76% were college students; 90% identified as Black, Asian, or Latino/a/x; and the average age was 21 years. Among students in the synchronous track, 56% attended all sessions. Every course session was rated as good or excellent by most (≥70%) students.

The program recruited, engaged, and retained a large cohort (N > 100) of underrepresented students in biostatistics/data science for a virtual data literacy training. The program demonstrates the feasibility of developing and implementing public health training programs designed to increase racial and gender diversity in the field.


1. INTRODUCTION

Recent evidence demonstrates that U.S. residents lag behind their global counterparts in quantitative literacy. The Social Progress Index for 2020 ranked the United States first in the world in quality of universities but 91st in access to quality basic education (Kristof 2020). The United States also ranked 37th in math (out of 79 countries and economies) in a global assessment of 15-year-old students performed by the Programme for International Student Assessment (Organisation for Economic Co-operation and Development [OECD] 2019). Racial-ethnic and gender disparities in access to math and STEM classes may be contributing to this low proficiency.

The U.S. Department of Education (2018) reported that predominately Black or Latino/a/x schools tend to offer fewer advanced math courses (e.g., Algebra II and Calculus) than predominately White schools. As a result, White students represented nearly 60% of students enrolled in these courses, whereas Black and Hispanic students represented only 8% and 16%, respectively (U.S. Department of Education 2018). Dasgupta (2022) proposes that this “lack of math opportunity” continues through college, which may explain why Black and Hispanic people are underrepresented in quantitative public health fields (Goodman et al. 2022).

The lack of quantitative literacy in the United States became particularly concerning during the COVID-19 pandemic. Unlike in previous pandemics, an abundance of data was publicly available—generated by multiple outlets (e.g., news, social media) with the goal of informing individual, institutional, and societal decisions. The World Health Organization (2020) cautioned that this was an “overabundance of information—some accurate and some not” (p. 2). To address this, the Johns Hopkins COVID-19 Dashboard (Dong et al. 2020) and the WHO COVID-19 Dashboard (World Health Organization 2021) were developed to provide credible sources of information, data visualizations, and data-driven tracking of the COVID-19 pandemic. As a result, interest in understanding these metrics grew among the general public. People wanted to know more about testing, hospitalizations, deaths, and vaccine distribution, as well as the implications for their communities. For some communities, the implications were considerable. For instance, mounting evidence showed that historically marginalized people were experiencing higher age-adjusted rates of hospitalization and death related to COVID-19 (Acosta et al. 2021; Rubin-Miller et al. 2020). Many Americans may not have grasped the depth of these inequities or understood the technical jargon commonly associated with these data (e.g., risk factor, confounder, flattening the curve). A lack of understanding is likely to stymie one’s ability to empathize, to advocate for structural changes, and to make informed decisions. Moreover, the growth of quantitative public health data presented without context may further contribute to hesitancy in trusting data, evidence, and institutions (Misra & Schmidt 2020).

The virtual landscape of work and school necessitated by physical distancing during the pandemic created an ideal opportunity to teach minoritized people from around the world about quantitative public health data. Those whose communities experienced the most painful impact were eager to gain the skills to understand the data and develop solutions. Previous research (Dorner et al. 2007; Orellana et al. 2003) has shown that Black and Brown students often “translate” information for their parents; therefore, equipping them with quantitative skills could create a multiplier effect. Further, they could gain this knowledge in real time and be part of the solution for the future. We also wanted to introduce them to careers (e.g., biostatistician, data scientist, and academic researcher) where they could make an impact on public health. In sum, our program supported multiple needs by addressing the skills, translational knowledge, and pipeline gaps that exist in quantitative literacy and in public health. The GrassROOTS Community Foundation (Dr. Janice Johnson Dias), in partnership with the New York University (NYU) School of Global Public Health (Dr. Melody Goodman), co-designed an innovative, virtual, 4-week (8-session) training program on quantitative public health data literacy (QPHDL) in summer 2020. This pilot study was designed to examine the feasibility of the program’s implementation (e.g., recruitment, engagement, retention, and satisfaction). The goals of the course were the following: (1) increase data literacy among underrepresented students and provide them with an introduction to public health and research methodology, (2) provide them with an introduction to programming and data visualization using R/Stata, (3) increase participants’ self-reported comfort with numeric data, and (4) spark interest in a career involving the collection and analysis of data.

2. BACKGROUND AND PREVIOUS STUDIES

2.1 Defining and Measuring Data Literacy, Research Literacy, and Research Knowledge

Data literacy is a complex concept that refers to competency in many dimensions of interacting with data. Carlson and colleagues (2011) defined data literacy as “understanding what data mean, including how to read charts appropriately, draw correct conclusions from data, and recognize when data are being used in misleading or inappropriate ways” (p. 5). Ridsdale et al. (2015) defined it more concisely, stating that “data literacy is the ability to collect, manage, evaluate, and apply data, in a critical manner” (p. 3). The consensus in the academic literature is that being “data literate” should not be equated with being a data specialist but instead should be thought of as the ability of nonspecialists to make use of data (Frank et al. 2016). We adapted this definition for the QPHDL program since we targeted a population of high school and undergraduate students.

Various data literacy competency frameworks define and categorize data literacy into conceptual, core, or advanced competencies. Bonikowska et al. (2019) discussed various competency frameworks and existing approaches to measuring data literacy, which include self-assessment and objective measures for individual and organizational assessments. Ridsdale et al. (2015) categorized competencies as conceptual, core, or advanced. Wolff et al. (2016) designed competencies based on a problem, plan, data, analysis, and conclusion inquiry process. Grillenberger & Romeike (2018) developed their competency framework around data management and data science. Since our training program is focused more on conceptual and core competencies, we present results based on the Ridsdale et al. (2015) framework (online Appendix A). In particular, we focused on the following conceptual competencies: introduction to data, critical thinking, data culture, and data ethics. Among the core competencies that Ridsdale et al. (2015) mentioned, the following were the focus: data interpretation (understanding data), identifying problems using data, data visualization, and data sharing. Some online data literacy tools, provided by private companies, measure individual-level data literacy—for instance, myDatabilities (Data to the People 2018), developed by an Australia-based company, and a 10-question survey by the software company Qlik (2018). However, both are specific to the target employee populations of their respective business sectors, are not well suited to measuring the data literacy of students, are not freely available to the general public, and are not yet validated.

Given the lack of a reliable and validated scale to measure data literacy and the relatively short time between program inception and program implementation, we measured each participant’s subjective numeracy using the Subjective Numeracy Scale (SNS). In this instance, numeracy was defined as the perceived ability to perform various mathematical tasks and the preference for using numerical versus prose information (Fagerlin et al. 2007). The SNS correlates well with mathematical test measures of objective numeracy but shows much lower rates of missing or incomplete data and can be administered in less time and with less burden for participants (Fagerlin et al. 2007). The SNS was used to examine how comfortable participants reported being with data (Program Goal #3).

In addition to data literacy, the program also focused on public health research methodology and research knowledge. Increasing research literacy and research knowledge can better prepare underrepresented populations to engage in public health research as partners rather than participants (Fagerlin et al. 2007). Research literacy was defined as the ability to understand and critically appraise scientific research, including basic knowledge of research methodology, study design, and research terminology (Komaie et al. 2017). Research knowledge was defined as the theoretical and practical understanding of the concepts regarding scientific research covered in the course.

2.2 Other Data Literacy Projects

Many existing programs focus on increasing data literacy among high school and college students. For example, “Calling Bullshit” is an open educational course that introduces students to tools and techniques for making sense of the data economy by sorting through information (Bergstrom & West 2021). Although this program has a similar goal to our QPHDL program—to increase data literacy by introducing students to real-world case studies—it does not have a programming component where students can perform hands-on activities and work with data. Gould (2021) argues that one way to improve data literacy among primary and secondary students is to teach a course devoted to data science that covers data-scientific thinking with statistics at its core, coupled with computational thinking and basic mathematics. For high school students, Stanford University researchers developed an 8-unit, yearlong course called “Exploration in Data Science” (Youcubed n.d.). This project-based course provides students with opportunities to understand the data science process of asking questions, gathering relevant data, analyzing and synthesizing the data, and then communicating the findings (LaMar & Boaler 2021). Another example is the yearlong “Introduction to Data Science” course developed at UCLA and provided to school districts for high school students. In this course, students use their mobile devices to collect, share, and analyze data about their communities, gaining a greater understanding of their world. Although these are excellent programs, they last at least a semester or a full academic year. In addition, they require additional training for teachers as well as, in some cases, approval by the local school or university board.

Although several data literacy programs for youth exist, they often lack students from socioeconomically disadvantaged groups due to geographic location, lack of funding, and personnel shortages (Deahl 2014). In addition to needing funding, data literacy programs “rely on skilled personnel to devote considerable time to development and implementation. These difficulties leave underserved communities ill equipped to create and implement such initiatives” (Deahl 2014, p. 105). In addition, the above-mentioned semester-long or yearlong programs focus on recruiting undergraduates with some technical background. To our knowledge, before our program, there were no short-term data literacy programs focused on training students from non-technical backgrounds and/or socioeconomically disadvantaged groups. To address this need, we piloted a short course designed for underrepresented minorities that can be accessed outside of the traditional school/college environment.

Key elements for successful in-person data literacy programs have been identified for high school and college students. The specific definition of “success” varied within the literature but was broadly defined as learning gains in data/statistical literacy as well as indicators of potential changes in students’ attitudes toward statistics/data science (Carlson & Bracke 2015; Dichev & Dicheva 2017). The elements for program success include facilitating small group activities (Carlson & Bracke 2015; Dichev & Dicheva 2017), integrating relatable data (e.g., community-based survey data) into the curriculum (D’Ignazio 2017; Everson & Garfield 2008; Ridsdale et al. 2015), and conducting ongoing program evaluation (Everson & Garfield 2008). We incorporated these elements into the design of the QPHDL program.

3. MATERIALS AND METHODS

3.1 Program Recruitment

Program flyers (online Appendix B) for the free 4-week program were posted on social media outlets and distributed via email to New York City and New Jersey public schools, community colleges, the City University of New York, and Historically Black Colleges and Universities (HBCUs). The flyers encouraged the participation of students who were interested in learning data analysis software (R and Stata) and in getting hands-on experience in programming using real-world data. The software component was added because it is a tangible skillset that we knew would be attractive to students and would encourage them to apply. We had planned to give certificates to the students at the end of the program, but no compensation or certificate was advertised on the flyers. More details on certificates are provided in the Program Description (section 3.3) and Results (section 4) sections.

The program application included demographic questions (e.g., age, gender, race, and ethnicity). Applicants who were not currently enrolled in school were asked whether they were public health professionals or non–public health community members. Applicants were also asked to rate their skills in public health, quantitative data, R/RStudio, and Stata as beginner, intermediate, or advanced. The application was open for 12 days, from June 27 to July 8, 2020.

3.2 Program Admissions/Enrollment

After reviewing applications, the program’s staff assigned the candidates to one of three tracks: students (Track 1), non–public health professionals (Track 2), or public health professionals (Track 3). Since our program was initially designed for students, we selected only applicants from Track 1 for this pilot study, and we present only the results for Track 1. However, given the demand for such a program among the non-student population, indicated by the number of applicants in Track 2 (n = 117) and Track 3 (n = 67), we implemented the program for Track 2 in February 2021 and for Track 3 in summer 2022.

Even within the student population (Track 1, n = 511), we received more applications than anticipated. Since we had limited resources and one instructor, we selected our target population for participation in the live sessions, which became Track 1A: students who are less likely to have access or opportunity to participate in similar quantitative programs, including students from racial-ethnic groups who are underrepresented in data science or biostatistics (n = 206). These students were then sent a one-question survey informing them about the class dates/times and confirming their participation in the program. Most students (81%; n = 166) completed the survey, with 164 agreeing to participate in the program. These 164 students were sent the baseline assessment; 154 of them completed the assessment, and 118 of them attended the first session. Those who completed the baseline assessment but did not attend the first session were contacted to confirm their participation in the program. Two students contacted us and stated that they could not attend the first session but wanted to continue with the program; therefore, they were enrolled. The final class size for Track 1A was 120 students, for whom the results are presented. The selection process is also displayed in Figure 1.

Many of the students not placed in Track 1A were either current or incoming NYU students, or they attended specialized high schools in New York City where they had access to similar courses. Although we could accommodate only 120 students in the live sessions, we wanted to give the other students the opportunity to learn asynchronously. Therefore, we notified the students not assigned to Track 1A that, while we could not admit them to the live training, they could still receive session videos. Eighty-nine students (37%) indicated that they were interested in receiving lecture recordings; they completed the baseline assessment and were admitted to Track 1B. We present the results of the pilot for Track 1A and Track 1B.

3.3 Program Description

For Track 1A, the 4-week synchronous sessions (July 17–August 10, 2020) were held twice per week (Monday and Friday) for 2 hours each. Sessions were initially scheduled for 90 minutes, but we extended them to 2 hours based on feedback after the third session (when coding was first introduced). This extension allowed ample time for lecture, breakout group activities, and large group discussions. Table 1 lists session topics and learning objectives. The course material was adapted from a first-semester graduate course called “Introduction to Data Management and Statistical Computing,” which is taught by an instructor in the biostatistics Master of Public Health (MPH) program at NYU. For this course, we adapted a small portion of the course material (the first few class sessions) and focused on an introduction to R and Stata as well as data management and data visualization. The learning objectives for the original graduate-level MPH course (from which a subset of material was covered) included describing, summarizing, manipulating, and formatting datasets as well as performing other data management tasks and producing graphic and tabular reports using statistical software. A course with the goal of covering all aspects of data literacy and a longer duration would have covered additional data literacy competencies, including data discovery and collection, metadata creation and use, identification of problems using data, and data-driven decision-making. The QPHDL training session topics and learning objectives align with many of the data literacy competencies laid out by Ridsdale et al. (2015), as shown in Table 1. For the QPHDL course, we had one main instructor, a guest lecturer (Session 5), and 10 course assistants (CAs): four men and six women, of whom five were Asian, three were White, one was Black, and one identified as multiracial. Of the 10 CAs, three were biostatistics Ph.D. students, two were biostatistics M.S. students, and the rest had graduated with a master’s degree in biostatistics or epidemiology in May 2020.

Prior to the first session, participants received free access to Stata and instructions on how to install the statistical software programs used in the training: Stata version 16 (StataCorp 2019) and R version 4.0.2 with RStudio 1.3.1073 (RStudio Team 2020). Five of the eight sessions focused on statistical software (Stata and R/RStudio) and basic research methods, whereas the remaining sessions focused on data literacy, data visualization, and data ethics. Our course was quickly adapted from an existing data management/statistical computing course; therefore, we did not have time to focus on statistical terminology and the interpretation of results, which would be included in an ideal data literacy course. Moreover, since our intent was also to teach students programming and software skills to boost their résumés, we devoted the majority of the class sessions to hands-on practice with data using statistical software. Participants had five homework assignments (descriptive statistics in Stata, descriptive statistics in R, graphing in Stata, graphing in R, and data manipulation in R). We chose Stata because of its point-and-click interface and simple commands, which make it accessible to students with minimal statistical programming experience. Since Stata would require a paid license after the course was complete, we also taught how to execute the same statistical tasks in R. In addition, replicating the same calculations on the same datasets in two different software programs provided students with options and with content repetition to reinforce key concepts. It also accommodated different learning styles and preferences for students with different skill levels and training needs.
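To give a concrete sense of the parallel software instruction, the brief R sketch below shows the kind of descriptive task that was taught in both packages, with the corresponding Stata commands noted in comments. The data and variable names are illustrative only, not the course's actual exercises.

# Illustrative only: a descriptive task taught in both packages, written in R.
# Hypothetical toy data standing in for a survey extract.
dat <- data.frame(
  age    = c(19, 22, 25, 31, 28, 20),
  gender = c("F", "F", "M", "F", "M", "F")
)

# Stata "summarize age" -> numeric summaries in base R
summary(dat$age)
mean(dat$age)
sd(dat$age)

# Stata "tabulate gender" -> frequency and proportion tables in base R
table(dat$gender)
prop.table(table(dat$gender))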

Each coding session started with a 45-minute lecture that included a walk-through of an example problem using statistical software. During the five sessions that included coding, the instructor shared her screen to code live and walked students step by step through importing a dataset, performing descriptive statistics, and creating graphs. After the lecture, the program leaders randomized students to breakout groups to code using real-world data. CAs led each breakout room to help students with in-class activities. The breakout room activity usually involved importing an instructor-provided, real-world dataset (such as the National Health and Nutrition Examination Survey or the Behavioral Risk Factor Surveillance System) into statistical software and performing descriptive statistics or creating graphs as a group. Students reported back to the larger group at the end of each class to answer questions based on the small-group breakout room activities. To increase student engagement and participation (Kappers & Cutler 2014; Sarvary & Giffors 2017), the instructional team quizzed students during class using PollEV (www.pollev.com) software, which students could access via the PollEV website or by text message on their cell phones. We recorded the first half of each session, which was the lecture component. The recordings and lecture slides were shared with Track 1B participants. Five of the eight session recordings were shared with Track 1B students; the Stata programming sessions were excluded because Track 1B students did not have free access to Stata, owing to the limited number of free licenses. Since we originally did not plan to offer an asynchronous track and had limited resources, we did not require Track 1B students to submit the homework, and we did not give them certificates. The idea was to provide them with tools for self-learning, similar to the model of Massive Open Online Courses, rather than rejecting them from the program altogether. Therefore, although no replacement was offered to Track 1B for the breakout room or in-class activities, students were encouraged to watch the videos in a timely manner each week and to reach out to CAs via email. Track 1A students were required to submit four of the five homework assignments to receive honors in the course. CAs also held weekly office hours to help Track 1A students with homework and to answer questions.
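The breakout-room workflow (import a dataset, compute descriptive statistics, make a graph) can be illustrated with a minimal R sketch. The file name and variables below are hypothetical stand-ins for the instructor-provided NHANES/BRFSS extracts; simulated data are used so the sketch runs on its own.

# Illustrative sketch of the breakout-room workflow (hypothetical names).
# dat <- read.csv("nhanes_extract.csv")   # step 1: import a provided data extract

set.seed(1)                               # simulated stand-in so the sketch is self-contained
dat <- data.frame(
  bmi    = rnorm(200, mean = 27, sd = 5),
  smoker = sample(c("Yes", "No"), 200, replace = TRUE)
)

summary(dat$bmi)                          # step 2: descriptive statistics
table(dat$smoker)

hist(dat$bmi,                             # step 3: a simple graph of the distribution
     main = "Distribution of BMI (simulated data)",
     xlab = "BMI")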

3.4 Special Design Considerations

In order to create an environment that supports students from minoritized groups, unique decisions were made for program implementation. The first session began with an introduction video featuring Black, Brown, and LGBTQ+ faculty who use quantitative data to solve real-world problems (https://drive.google.com/file/d/1SycIvPfc3xigqkihw-yGG7sxEJxNqnQO/view). The video was created to show students a large number of diverse scholars who use quantitative data to address many different research questions. Drs. Johnson Dias and Goodman used their academic network and contacted colleagues via email with directions on how to submit their videos, resulting in 26 submissions. At the beginning of every session, as we awaited the arrival of students in the Zoom environment, music was played from a curated playlist developed by Drs. Johnson Dias and Goodman. This served as a vibrant and inclusive welcome ritual for each session. Additionally, although we did not intentionally set out to recruit diverse CAs, having diverse CAs helped serve the diverse student population. We created a pathway program designed to roll out a red carpet to those who have often felt unwelcome in public health. Such programs provide three main resources: 1) a window that allows participants to glimpse a career and what people working in the field do; 2) a mirror that allows participants to see their reflection—someone in the field who looks like them; and 3) an open, sliding glass door they can walk through to get hands-on experience. Having a diverse instructional team provided mirrors for the program participants. It allowed them to see the diversity of the field (and people who look like them) as public health (biostatistics and epidemiology) students.

3.5 Program Evaluation

To examine the feasibility of program evaluation, we conducted a comprehensive evaluation of the QPHDL program primarily focused on participant satisfaction. The New York University Institutional Review Board/University Committee on Activities Involving Human Subjects, Office of Research Compliance designated the QPHDL evaluation as exempt. R version 4.1.2 was used for analysis of quantitative evaluation data.

3.6 Self-Assessment of Participant Knowledge

Participants completed baseline and final evaluations in which they assessed their numeracy, research literacy, research knowledge, and understanding of class materials. The same set of questions was used for both the baseline and post-training evaluations. To assess numeracy, we used the Subjective Numeracy Scale, which consists of seven items, four of which are 5-point ranking-based questions and three of which have 6-point ranking response options (Fagerlin et al. 2007; Zikmund-Fisher et al. 2007). Numeracy scores were calculated by summing the scores of the seven items, with Item 7 reverse coded, resulting in a minimum of 7 points and a maximum of 38 points.
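For clarity, a minimal R sketch of this scoring rule is shown below. It assumes Item 7 is one of the three 6-point items; the item numbering and the example responses are illustrative, not actual participant data.

# Minimal sketch of the scoring rule described above (assumes Item 7 is a 6-point item).
score_sns <- function(items) {
  stopifnot(length(items) == 7)
  items[7] <- 7 - items[7]   # reverse-code Item 7 (1 <-> 6 on a 6-point item)
  sum(items)                 # possible total: 7 (minimum) to 38 (maximum)
}

score_sns(c(4, 3, 5, 4, 6, 5, 2))   # Item 7 becomes 5; total = 32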

No validated scale exists to assess research literacy and research knowledge. To assess research literacy, we used a 10-item scale and a 5-item data subscale developed by Dr. Goodman and colleagues. Items were adapted from existing measures from the Community Research Fellows Training (CRFT) research team (Coats et al. 2015; Haga et al. 2013) and the Test of Scientific Literacy Skills (TOSLS) (Gormally et al. 2012). Among the five questions in the subscale, one question (Item 4: interpretation of study findings) was created by the CRFT research team led by Dr. Goodman (Coats et al. 2015), one question (Item 1: genetics and disease risk) was modified from Haga et al. (2013), and the remaining three questions (Item 2: diet soda, Item 3: data-driven hypothesis development, and Item 5: interpreting results from a graph) were from the TOSLS (Gormally et al. 2012). One item (Item 3) was drawn directly from the TOSLS with no modifications, and the other two were modified slightly based on feedback from the cognitive interviews from the CRFT program evaluation (Coats et al. 2015). Each correct answer was given 1 point, resulting in a maximum score of 5 points for the data subscale and 10 points for the full research literacy scale. The 10-item version was developed for another training course and is focused on research literacy. We selected the five items related to the data literacy concepts covered in the training; the subscale provides a more relevant evaluation of the research literacy content taught in our short course (Program Goal #2).

To assess research knowledge, we used a 7-item scale with closed-ended questions. The seven questions were selected from a 20-item scale developed by the CRFT research team, which had been revised from 31 open-ended questions, with each item assessing a single topic covered in the CRFT (Coats et al. 2015; D’Agostino McGowan et al. 2015). Dr. Goodman selected the seven items based on the QPHDL program’s curriculum. Research knowledge scores were generated by summing the number of correct responses, resulting in a maximum of 7 points. The individual questions for each measure are included in Appendix C (online). Although the baseline assessment was mandatory for students to start the program, the final assessment was not required. For Track 1A, 100 (83%) students completed both the baseline and final assessment surveys; for Track 1B, 34 (38%) students did so (Table 4). Although most of the class time was spent learning how to code in Stata/R, our assessments did not directly measure data literacy or coding skills but focused more on self-perceived numeracy and research literacy/knowledge. This gap between what was taught in class and what was measured was due to the short time available between the program’s inception and its implementation; there was not enough time to develop or find validated measures of students’ coding and data literacy skills related to the course materials, so the constructs used do not measure data literacy directly. However, in most sessions we used PollEV to quiz students with questions based on the answers to practice problems, including questions on descriptive statistics and making graphs, which required students to code in Stata/R.
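The correct-answer scales (the research literacy data subscale and the research knowledge scale) can be scored as in the small R sketch below. The answer keys and responses shown are hypothetical; the actual items appear in online Appendix C.

# Sketch of scoring the correct-answer scales (hypothetical keys and responses).
score_correct <- function(responses, key) {
  sum(responses == key, na.rm = TRUE)   # one point per correct answer
}

score_correct(c("B", "A", "C", "D", "A"), key = c("B", "C", "C", "D", "A"))  # 4 of 5 on the data subscale
score_correct(c(1, 3, 2, 4, 1, 2, 3),     key = c(1, 3, 2, 2, 1, 2, 4))      # 5 of 7 on research knowledge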

3.7 Participant Evaluation

Participants from Track 1A and Track 1B were sent a session evaluation at the end of each session. Participants also completed a program evaluation after the last session. The session ratings were anonymous and optional. The evaluation included six ranking-based questions and four open-ended questions. Closed-ended questions assessed the overall session in addition to whether the session’s learning objectives were met, whether the session content was helpful, whether the concepts provided were grasped, whether the facilitator was organized, and whether the facilitator was knowledgeable. Open-ended questions assessed the three most meaningful things learned during the session, what was enjoyed most about the session, and what was disliked most about the session; the open-ended questions also requested any additional comments or suggestions. For each session, completion rates for the evaluation survey ranged from 58% to 98% for Track 1A and from 38% to 61% for Track 1B (Table 5).

3.8 Program Honors

Given how engaged some of the Track 1A students were in the program, we wanted to acknowledge this commitment to their own learning and thus created several honors categories based on levels of engagement in course activities (attendance, homework submissions, and class participation). Six honors categories were applied: High Honors with Distinction (attained perfect attendance, completed all homework assignments, and participated in all poll activities), High Honors (attained perfect attendance and completed all homework assignments), Honors with Distinction (attained perfect attendance and participated in all poll activities), Honors (attained perfect attendance), High Distinction (completed all homework assignments and participated in all poll activities), and Distinction (completed all homework assignments or participated in all poll activities). The homework assignments were graded for completion only, not correctness, and no individual feedback was provided because of the large class size. Students who did not fit into any of the above categories but attended at least five lessons and completed the baseline assessment were given a certificate of completion.
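These criteria can be restated compactly as the decision logic in the R sketch below. This is only a restatement of the rules as described above, not the program's actual assignment procedure.

# Compact restatement of the honors criteria (sketch only).
assign_honor <- function(perfect_attendance, all_homework, all_polls,
                         sessions_attended, baseline_done) {
  if (perfect_attendance && all_homework && all_polls) return("High Honors with Distinction")
  if (perfect_attendance && all_homework)              return("High Honors")
  if (perfect_attendance && all_polls)                 return("Honors with Distinction")
  if (perfect_attendance)                              return("Honors")
  if (all_homework && all_polls)                       return("High Distinction")
  if (all_homework || all_polls)                       return("Distinction")
  if (sessions_attended >= 5 && baseline_done)         return("Certificate of Completion")
  "No award"
}

assign_honor(TRUE, TRUE, FALSE, sessions_attended = 8, baseline_done = TRUE)  # "High Honors"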

4. RESULTS

4.1 Application Data Summary

We received 695 applications in the 12-day period the application was open (Table 2). Of the applicants, 511 (74%) identified as female, 299 (43%) as Black/African American, 218 (31%) as Asian, and 87 (13%) as Latino/a/x. The average age was 25 years, and most applicants (n = 457, 66%) were from the Northeast. Almost two-thirds rated their skills as beginner in public health (n = 398, 66%) and in working with quantitative data (n = 374, 63%). Most applicants rated their skills as beginner in R/RStudio (n = 528, 89%) and Stata (n = 546, 92%) and stated that they were interested in learning both (n = 494, 77%).

4.2 Participant Demographics, Attendance, and Homework Completion

Among the 209 total participants in Track 1A and Track 1B, 50 (24%) were in high school, and 159 (76%) were in college. The average age among all students was 21 years, and the majority of participants were female (n = 156, 75%). The largest racial-ethnic group was African American (n = 85, 41%), followed by Asian (n = 73, 35%) and Latino/a/x (n = 30, 14%). Table 3 summarizes the demographics of Tracks 1A (live sessions) and 1B (recorded sessions). The mean age was 22.5 years in Track 1A and 19.5 years in Track 1B, and 96 (80%) of the 120 students in Track 1A were in college, compared with 63 (71%) of the 89 students in Track 1B. Table 3 also shows the attendance and homework completion rates for Track 1A students. In Track 1A, all sessions had more than 70% attendance (n = 86 or more out of 120), and half of the sessions had more than 90% of students present (n = 108 out of 120) (Table 3). Track 1A students were also required to submit homework assignments; however, submission rates were low, ranging from 59 (49%) for one assignment to 79 (66%) for another. To earn the highest honors, students were required to submit at least four homework assignments (in addition to having perfect attendance and participating in all polls). Homework Assignment 3, which was associated with the session taught by a guest lecturer, was optional, giving students an opportunity to earn extra credit if they missed one of the required homework assignments but still wanted the highest honors.

4.3 Pre- and Post-Evaluation

When we compared students’ self-reported numeracy skills from baseline to final evaluation, we observed that Track 1A’s median numeracy score increased by 2 points (Table 4). The median numeracy score for Track 1B did not change (baseline = 32, final = 32). There was a slight increase in the mean self-reported research knowledge score for both Track 1A (baseline = 5.15, final = 5.53) and Track 1B (baseline = 4.85, final = 5.39). On the 5-point data subscale of self-reported research literacy, the median score increased from 3 to 3.5 for Track 1A but did not change for Track 1B (baseline = 4, final = 4). On the 10-point version, the median self-reported research literacy score remained the same for Track 1A (baseline = 8, final = 8) and decreased slightly for Track 1B, from 8.5 to 8.
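For readers who wish to reproduce this type of baseline-to-final summary, a small R sketch with hypothetical paired scores is shown below; the actual participant data are available in the repository listed in the Data Availability Statement.

# Hypothetical paired scores illustrating the baseline-to-final comparison.
set.seed(2)
baseline <- sample(20:35, 100, replace = TRUE)
final    <- pmin(baseline + sample(0:4, 100, replace = TRUE), 38)

median(baseline)
median(final)
median(final) - median(baseline)   # change in the median score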

4.4 Poll Evaluation Results

For each session, participation in the poll ranged from 68.9% to 92.1% among those who attended that session. The number of questions in each poll ranged from 4 to 10 (Table 5). Students had about 30 seconds to answer each question. The average score for each poll problem set was calculated as the percentage of students who answered the questions correctly among those who answered all the questions. Average scores on the poll questions ranged from 48.9% to 70.3%.
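Under one reading of this scoring definition, the average poll score can be computed as in the brief R sketch below, where the response matrix is hypothetical: rows are students, columns are poll questions, 1 = correct, 0 = incorrect, and NA = unanswered.

# One reading of the poll scoring definition above (hypothetical responses).
responses <- matrix(c(1, 0, 1, 1,
                      1, 1, NA, 0,
                      0, 1, 1, 1),
                    nrow = 3, byrow = TRUE)

answered_all <- complete.cases(responses)   # students who answered every question
mean(responses[answered_all, ]) * 100       # average percent correct among them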

4.5 Participant Session Evaluation

More than 80% of students in each track agreed or strongly agreed that the exercise objectives were met, that the information received in each session was helpful, and that the facilitators were well organized and appeared knowledgeable about the subject for all lessons except Lesson 3 (when students were first introduced to the software program; Table 6). More than 80% of students in each track rated the overall sessions as good or excellent for all lessons except Lesson 3. Students praised the instructor’s teaching approach, the program’s structure, the use of real-world data, the interaction of the class, and the exposure to new career paths.

Students evaluated the lecturer, course substance, and format. In an evaluation, a student stated,

I really enjoyed Dr. Goodman's teaching style and her ability to explain tough topics in an understanding manner. In addition, Dr. Goodman related what we were being taught to realistic applications in society, which further helped me understand. The breakout groups were my favorite part, as I was able to meet other students in the class, making it closer to an in-classroom experience. We were able to work on the problems together, or even struggle together.

Another student shared a similar sentiment about being pushed out of their comfort zone:

I liked being introduced into a world that I normally would not be so comfortable in. It was challenging because I do prefer words over numbers (although this is the combination of the two). I loved that we were able to get additional help and how the course assistants were so patient with us. I really do wish it could've ran longer.

The students also highlighted the importance of data literacy. As one student said,

Being data literate is a very important tool that positions us to better serve our underserved community of Blacks and people of color. This class forced me to think about the illusion of the data and how it is something used as a weapon to separate and chastise sections of the population. But it has also shown me that through clean data and asking the relevant questions, a truer picture can be presented of the facts, and this information can hopefully lead to important change.

Some students recognized the value of statistics and new career paths. A student mentioned,

I really enjoyed this program. The exercises were a great intro to statistics programming. I look forward to pursuing other opportunities related to biostatistics.

In contrast, the constructive feedback we received focused on the lack of extra time to code and the varying levels of engagement from other students in the breakout rooms. For example, one student stated,

Although we got introduced to Stata and learned commands, the session was too short to actually go over everything we needed.

As a result, we extended subsequent class times by 30 minutes.

4.6 Program Honors

Of the 120 students in Track 1A, 21% (n = 25) earned High Honors with Distinction, and 15% (n = 18) earned High Honors. Table 7 lists the percentages for the remaining categories. A little over a third (36%; n = 43) of the students did not fit in any of the categories but attended at least 5 lessons and completed the baseline assessment; these students were given a certificate of completion.

5. DISCUSSION

Data and statistics impact everyday life and future opportunities, yet only 43% of young people (16- to 21-year-olds across the UK, US, and Germany) consider themselves data literate, and 54% are unaware of the concept (Exasol 2021). Therefore, it is critical for young adults to acquire data literacy skills in order to understand the data generated during a pandemic and to seek employment in a data-driven workforce. As data become more prevalent in everyday life, the demand for data literacy and data literacy programs is likely to expand. Our goal with the QPHDL program was to help bridge the skills gap in data literacy, which is needed to navigate today’s data-driven world. We created this program to increase diversity in public health/quantitative fields and to reach students who would not otherwise have access to free quantitative summer programming. Thus, the majority of applicants being Black/African American or Latino/a/x reflects our successful efforts in reaching the target population of underrepresented racial/ethnic minorities. This was due to the large social and academic networks of the program creators, Dr. Johnson Dias and Dr. Goodman. Dr. Johnson Dias is the president of the GrassROOTS Community Foundation (GCF), whose programming over the last decade has focused on Black women and girls. Dr. Johnson Dias teaches at a Hispanic-Serving Institution, and their social media reach is over 100,000 people. Furthermore, GCF’s mission focuses on women and girls who are impoverished or who grew up in poverty. Therefore, although we did not intend to target more female participants, we expected a higher number of female applicants given GCF’s network.

We adapted a subset of the course material for this program from a semester-long MPH course on data management and statistical computing. Thus, given the 4-week duration, the current program could cover only a small portion of the materials from the MPH course. There is an undergraduate summer program in which students explore how statisticians approach large, complex problems and gain a basic understanding of computing and visualization tools (Nolan & Temple Lang 2015); however, our program was mainly focused on data literacy rather than statistical literacy. Nevertheless, by introducing students to statistical software and data visualizations, we also hoped to increase interest in obtaining advanced statistical education. Recent evidence suggests that there is little minority representation among those pursuing master’s degrees in biostatistics and epidemiology (D’Ignazio 2017; Goodman et al. 2022). Programs to reverse this trend do exist, including the “Fostering Diversity in Biostatistics Workshop,” which has been running for years and has been shown to build and sustain effective networks and mentoring relationships among underrepresented minority students and professionals (Benn et al. 2020). Programs like these have the potential to increase diversity in biostatistics and interest in graduate degrees in the field, leading to a more diverse public health workforce. This interest was a theme across QPHDL student feedback forms. One student said, “I think it is amazing to have a program like this to incite others to look into public health, especially into biostatistics.” Another student stated, “I hope that it continues to receive funding so that more people of color can become data efficient and enter the statistics field.”

Teaching a large (N > 100) online class with a programming component to non-technical students has its challenges. One challenge was statistical software coding. To keep all students on pace with the course, regardless of coding background, each of our CAs held weekly office hours. These office hours also accommodated students who lived in different time zones. Another challenge was building community among a virtual cohort. To foster community, we leveraged Slack, an online communication platform with options to create channels that served as separate forums. The main QPHDL channel allowed students to interact with other students, CAs, and the lead instructor. Other channels were created for students to ask clarifying homework questions and to share opportunities (e.g., jobs, internships, and fellowships). Among the 120 students in Track 1A, the weekly number of active users on the QPHDL Slack channel ranged from 45 (38%) to 92 (77%).

Aside from the feedback on the initial coding session, more than 80% of the students—from both the live and recorded sessions—rated the remaining lectures as good or excellent. Live-session breakout rooms were also helpful in executing a successful online class. These rooms allowed students to discuss concepts and to practice with data, both of which were critical to the program’s success. This type of live interaction with the lecturer, CAs, and classmates benefited Track 1A participants but was not available to Track 1B participants, who viewed lecture recordings. Despite not participating in the live sessions, Track 1B students still engaged with the instructors and CAs through email.

The outcomes of this pilot program should be interpreted with consideration given to several limitations. For the majority of the class time, we focused on quantitative data and less on research methodology or knowledge. Therefore, despite the lectures on data visualization, descriptive statistics, and data management, the course may have been only marginally successful in increasing students’ perceived ability to use, and preference for, numeric information. This could be because, while numeracy is a component of data literacy, it does not capture the full scope of data literacy. The numeracy measure is about self-efficacy and comfort with numbers. We know many of the students were challenged during the program, and despite their learning, they may not feel confident in their new knowledge. Some of what they learned may also have made them realize how much more there is to learn. Therefore, in a future implementation, data literacy should also be examined using validated scales that more accurately capture the concepts taught in class.

We spent only two lectures (25% of the course training) on public health research methodology and research literacy concepts, which may not have been enough time to increase students’ understanding of either construct. It is also possible that the data literacy concepts taught in class were not accurately measured by any of these existing constructs. Since the program was launched online during the global COVID-19 pandemic, our students may have experienced abnormal life stressors, such as those that arise from working at home, financial problems, or sick family members; this potential stress may have impacted our findings. Additionally, our program’s logistics require further refinement to optimize student engagement and to reduce technical difficulties. Distributing program assessments through web surveys increases the risk that questions may be answered with external support; therefore, our findings may not accurately capture the numeracy, research literacy, and knowledge level of our participants. Lastly, students self-reported their coding skills in R/Stata, so the increase in skill level may not be a true reflection of their actual ability but rather an increase in self-efficacy. However, we do present poll data for several practice problems that required students to code in R/Stata to perform descriptive statistics and make graphs. While competition with timed poll questions made for fun gamification, it may not have demonstrated true learning, because the limited time (30 seconds) to select a correct response may have been challenging for some students. Despite these limitations, the QPHDL program successfully engaged and retained a majority of the participants, with varying quantitative backgrounds, in the data-intensive, 4-week virtual course. Future programs could be designed with more tailored content for data literacy (rather than coding). Such programs should be evaluated for their effectiveness using constructs specifically created for data literacy.

The program concluded with a virtual closing ceremony to honor students’ achievements. During this ceremony, students shared how the program impacted their lives and their excitement to learn more about programming, public health, and possible career paths in quantitative fields. This feasibility study demonstrated an ability to recruit, engage, and retain a large cohort (N > 100) of students with heterogeneous technical abilities who are underrepresented in biostatistics and data science. We also demonstrated the ability to accomplish these tasks in a virtual data literacy training during the initial phase of a global pandemic. The program also provided students with an introduction to programming and data visualization using R/Stata in a safe environment. The training program has subsequently been adapted for different audiences (e.g., the general public, the public health workforce).

ACKNOWLEDGMENTS

The authors thank all the students enrolled in the Quantitative Public Health Data Literacy training program for their active participation throughout the program and for contributing data for this evaluation. We also thank the GrassROOTS Community Foundation for its support in program promotion, implementation, and funding. Furthermore, we thank Erin Young and Lillian Jones for creating and editing the introductory video for the course, and Dr. Sharese Terrell Willis for editing assistance with this manuscript. We would also like to thank Chuck Huber from StataCorp for his guest lecture and all course assistants (including the first four authors). The other course assistants are Carolyn Winskill, Christopher Yoon, Dennis Hilgendorf, Jessica Randazzo, Zoe Haskell-Craig, and Siyuan Dong. We are immensely grateful to the 26 scholars who submitted welcome videos. Finally, we would like to thank StataCorp for providing free licenses for the training program.

CONFLICT OF INTEREST

The authors declared no potential financial or non-financial conflicts of interest with respect to the research, authorship, and/or publication of this article.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available at https://www.openicpsr.org/openicpsr/project/191001/version/V2/view

Table 1. Table of Session Topics and Learning Objectivesa

Table 2. Data Summary of Applications to the Quantitative Public Health Data Literacy Training Programa

Table 3. Demographic Characteristics, Attendance, and Homework Completion of Participants in Quantitative Public Health Data Literacy Training, Stratified by Tracka

Table 4. Analysis of Baseline and Final Scores of Participants in the Quantitative Public Health Data Literacy Training Program

Table 5. Analysis of In-Class Poll Activity Data of Track 1A Participants in the Quantitative Public Health Data Literacy Training Program

Table 6. Summary of Participants’ Evaluationsa,b

Table 7. Summary of Awards for Graduating Students (Track 1A, the live course)

Figure 1. Flowchart of program admission and track assignment. The number of total applications and students assigned to each track is displayed in the figure. The flowchart displays how the final number of students enrolled in the course was reached for Track 1.

Supplemental material

Supplemental material for this article (appendices and figure files) is available online.

REFERENCES