Challenges Assessing Data Science
This post originally appeared on the Software Carpentry website.
The Assessment Network was established as a space for those working on assessment within the open source/research computing community to collaborate and share resources. During our quarterly meeting in November, we discussed data science education. The meeting was organized and hosted online by Kari Jordan, and six community members attended.
First, we discussed the definitions of data scientist, data analyst, and data engineer; second, we worked in pairs on a set of questions about assessing data science education.
The session was exciting and fruitful, as it combined two topical efforts: on one hand, our organization’s focus on assessment and, on the other hand, our contribution to the global effort in defining, understanding, and shaping the rising field of data science.
Kari Jordan had attended a meeting of collaborators from industry, academia, and the non-profit sector to brainstorm the challenges and vision for keeping data science broad. During a brainstorming session at that meeting, attendees were asked to come up with core competencies for data science. This was difficult, as each sector identified the competencies most important to its own interests. Kari thought it would be a good idea to bring the question to the Assessment Network.
What is Data Science?
So, what is data science? What are the core competencies? For a positive definition, we turn to the seminal “Data Science Venn Diagram” by Drew Conway, as reproduced by Jake VanderPlas in the preface of his Python Data Science Handbook. Data science lies at the intersection of statistics, computer science, and domain expertise (the industry-friendly term; in academic terms, traditional research). Data science is cross-disciplinary by definition. Hardly anyone gets formal training in all three areas. Most working data scientists are self-taught to a certain extent. Basically, it takes a growth mindset to be a data scientist!
For a negative definition (in the logician’s sense, i.e., what data science is not), we turn to industry job descriptions. Marianne Corvellec served on a panel dedicated to defining these emerging occupations. The panel was held in 2016 with Québec’s Sectoral Committee for the ICT Workforce. It brought together industry professionals and HR specialists, who framed the discussion, and resulted in this report (in French; note that “architecte de(s) données” == data engineer and “scientifique de(s) données” == data scientist).
This report is in line with academic sources (e.g., data science curricula at U.S. universities), insofar as a data scientist is not a data engineer. A data engineer takes care of data storage and warehousing; s/he builds, tests, and maintains a data pipeline, which integrates diverse data and then transforms, cleans, and structures them. S/he masters big data technologies, such as Apache Hadoop, Apache Spark, and Amazon S3. Data engineers ensure the data are available (and in good shape) for data scientists to work with.
What is a Data Scientist?
More subtly, a data scientist is more than a data analyst. It takes an aptitude for collecting, organizing, and curating data, as well as for thinking analytically. A strong quantitative background is useful but not necessary. Principles and practices from the social sciences or digital humanities are valuable assets; data scientists should be good writers, good storytellers, and good communicators. Perhaps surprisingly, attention to detail is not a key item in a data scientist’s skillset; the ability to grasp the big picture matters much more, as data scientists will find themselves working at the interface of very different departments or fields (in an industry context, these could be engineering, marketing, or business intelligence).
A data scientist does not master any specific technology to perfection, since s/he dabbles in everything! Unlike the traditional data (or business intelligence) analyst, s/he resorts to several different frameworks and programming languages (as opposed to a given domain-specific platform) in order to leverage data. Plus, the data scientist typically works with datasets coming from multiple sources (as opposed to the traditional data analyst who usually works with a single data source already populated by an ETL solution). Data scientists are flexible with their tools and approaches.
Challenges Assessing Data Science Education
In the second part of the meeting, we split into breakout pairs to discuss the challenges of assessing data science education with respect to the Carpentries’ workshops. Brainstorming in parallel let us cover more ground (breadth), while interacting one-on-one let us explore different avenues (depth).
One pair focused on the industry perspective, another on the education system, and the third on assessment practices. Kari offered a list of questions to frame the discussion.
Working groups identified challenges for assessing data science education at the object level (i.e., what should this assessment consist of?) and at the meta level (i.e., what favors or hinders the application of assessment?).
At the meta level, the following prompts were discussed (pulled from South Big Data Hub’s Data Divide workshop):
- Vision for Assessing Data Science Education
- Stakeholders for Data Science Education
- What specific skills or resources are most important/lacking to address this challenge?
- How do our challenges fit into the national landscape?
- What is the broader impact of addressing our challenges?
Check out the notes from our working groups to see what we came up with!
Now is your chance to tell us what you think. We opened several issues on the Carpentries assessment repo. We’d love to engage you in a rich discussion around this topic. Comment on an issue, and tweet us your thoughts using the hashtag #carpentriesassessment.