Data Science for Social Good: an Experiment in Data Science Training

This post originally appeared on the Software Carpentry website.

Data science faces many challenges in the traditional academic setting. At the same time, many research fields are becoming increasingly dependent on data science tools and techniques. A key element in tackling these challenges is the education of a new generation of researchers who are fluent in both their research domain and in data science methodologies. In this post, we discuss an immersive approach to training in data science, the University of Washington eScience Institute's inaugural Data Science for Social Good (DSSG) program.

The eScience Institute promotes data science within the university setting, and more broadly across traditional boundaries of disciplinary and institutional research. With support from the Alfred P. Sloan Foundation and the Gordon & Betty Moore Foundation, we have, together with the UC Berkeley Institute for Data Science and the NYU Center for Data Science, embarked on an experiment to create an environment for data science on the university campus. You can learn more about this effort here.

In our DSSG program, we took elements of our successful Data Science Incubator model and combined them with elements from fellowship programs at the University of Chicago and Georgia Tech focused on DSSG. To give the program a coherent thematic focus, and dovetailing with our involvement in a campus-wide initiative on urban studies, we collaborated with organizations focused on urban issues in the Seattle region that could be tackled using data science. In collaboration with Third Place Technologies, we created a neighborhood well-being report based on open data and social media. We assessed factors that influence the chances of homeless families transitioning into permanent housing in collaboration with the Bill & Melinda Gates Foundation. Two projects, conducted in collaboration with the UW Taskar Center for Accessible Technology, addressed access for individuals with low mobility: the efficient deployment of paratransit and the creation of routing maps that take into account accessibility.

Each project had a 4-student team of DSSG Fellows working full-time throughout the summer. Applications came from students with a wide variety of academic backgrounds, and we were able to select an excellent and diverse cohort of undergraduate and graduate students from social and natural sciences (sociology, geography, astronomy), professional fields (mechanical engineering, business, planning, GIS) and mathematical sciences (computer science, statistics, applied math, informatics).

There is much to say about the goals and impact of the DSSG projects themselves, but here we focus on our experiences with the pedagogical component of the program. An important part of the fellows' experience was the opportunity to learn a variety of data science tools in a formal setting, through lectures and tutorials. For example, the second week of the program included the option to join the Software Carpentry workshop that was held at the eScience Institute that week. Participating in this workshop proved helpful in "leveling the playing field", especially for those entering the program with little previous programming experience. Several other tutorials focused on additional tools, ranging from Tableau and D3, to applications of machine learning (a tutorial led by one of the student fellows!), and to more advanced topics such as object-oriented programming. A half-day workshop run by members of the eScience Reproducibility and Open Science working group focused specifically on techniques for reproducible research.

In addition to these formal settings, the program provided daily opportunities for students to learn from each other and from their data science mentors through the experience of conducting research, writing analysis software, and communicating the results to each other and to project stakeholders.

We see several advantages to this type of training:

  1. Learning by doing is particularly important when it comes to computational thinking. Lorena Barba explains this in a recent talk she gave at the UC BIDS.

  2. Motivation is a huge factor in learning. Not many class assignments are as motivating as a real-world problem with tangible social impact.

  3. Buying into tools and practices that increase productivity and reproducibility takes time. Instructors of Software Carpentry know this: the first time you teach students how and why to use tools like Git and GitHub, you often get blank stares. Students often only appreciate the advantages of these tools after a few weeks of use. In this case, the extended period of using these tools on a daily basis, and the specific context of a team project, allowed the students time to "buy in". A common theme in the feedback we solicited at the end of the program was how much students enjoyed using Git and GitHub, but only after it really "clicked" for them.

  4. Collaboration in teams leads naturally to peer-instruction. Intentionally selecting teams of students with diverse backgrounds and varying levels of technical skill can create some challenges, but also allows for tremendous growth as each individual can be both a mentor and a mentee.

  5. Close mentorship enables scaffolding of skills. The notion of scaffolding grew out of Vygotsky's theory of development and learning. It refers to teaching that is based on things that a learner is able to do when working together with a mentor, even though that learner may not be able to do these things without the mentor’s presence. In Vygotsky’s theory, this "Zone of Proximal Development" serves as a crucial step in the development of skills. In our case, a relatively high ratio of mentors to students allowed students to build up their skills very quickly, through the process of close work with the mentors.

  6. Collaboration builds "soft skills". The nature of the DSSG teamwork allows for the development of skills that are essential in research collaborations, and are best learned through direct experience. For example, there were occasions for the students to manage workflows, communicate with numerous stakeholders, accommodate a variety of personal work styles, and translate between discipline-specific language and knowledge bases.

  7. Real-world application makes critical reflection about the role of data science less abstract. In the DSSG setting, students had the opportunity to critically reflect on the application of data science to real-world social issues. They participated in both spontaneous and facilitated group discussions about ethical, political, methodological, and epistemological challenges of collecting and analyzing data about the social world.

We also see the future impact that this type of educational activity can have when our student fellows come back to the Data Science Studio to take part in Software Carpentry as helpers and instructors.

If you want to learn more about the project, please read the paper we wrote and presented at the Bloomberg Data for Good Exchange.
