How to Prepare for the Data Incubator

This post originally appeared on the Software Carpentry website.

At The Data Incubator, we receive thousands of applications to join our data science fellowship. Our admissions bar is very high and we are often asked, "What can I do to prepare for the fellowship application process?"

Here are five important skills to develop, along with some resources to help you develop them. While we don't expect our applicants to possess all of these skills, most applicants already have a strong background in many of them.

Scraping: There's a lot of data out there, so you'll need to learn how to get access to it. Whether it's JSON, HTML, or some homebrew format, you should be able to handle it with ease. Modern scripting languages like Python are ideal for this. In Python, look at packages like urllib2 (urllib.request in Python 3), requests, simplejson, re, and Beautiful Soup to make handling web requests and data formats easier. More advanced topics include error handling (retrying) and parallelization (multiprocessing).
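
To make the retrying idea concrete, here is a minimal sketch using only the standard library. The function name `fetch_with_retry`, its `fetch` hook, the fake downloader, and the example URL are all assumptions made for illustration; they do not come from any particular package. The fake downloader lets the sketch run without a network connection.

```python
import json
import time
import urllib.request
from urllib.error import URLError


def fetch_with_retry(url, fetch=None, retries=3, backoff=0.5):
    """Return the text at `url`, retrying when a network error occurs.

    `fetch` is a hypothetical hook that lets us swap in a fake downloader;
    by default the URL is read with urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.read().decode("utf-8")

    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(backoff * 2 ** attempt)  # wait longer after each failure


def make_flaky_fetch(failures, body):
    """Build a fake downloader that fails `failures` times, then returns `body`."""
    state = {"calls": 0}

    def fake(url):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise URLError("simulated outage")
        return body

    return fake, state


# Simulate two failed requests followed by a successful JSON response.
fake, state = make_flaky_fetch(2, '{"fellows": ["Ada", "Grace"]}')
text = fetch_with_retry("http://example.invalid/data.json", fetch=fake, backoff=0)
payload = json.loads(text)
print(payload["fellows"], "after", state["calls"], "attempts")
```

In a real scraper you would drop the `fetch` argument (or pass a function built on requests) and keep a non-zero `backoff` so that you do not hammer a struggling server.
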
SQL: Once you have a large amount of structured data, you will want to store and process it. SQL is the standard query language for relational data, and its syntax is so prevalent that there are SQL query interfaces for everything from sqldf for R data frames to Hive for MapReduce.

Normally, you would have to go through a painful install process to play with SQL. Fortunately, there's a nice online interactive tutorial where you can submit your queries and learn as you go. Additionally, Mode Analytics has a great tutorial geared towards data scientists, although it is not interactive. When you're ready to use SQL locally, SQLite offers a simple-to-install SQL database engine.
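
If you already have Python installed, its built-in sqlite3 module lets you try SQLite without installing anything else. The following is a small sketch; the `applicants` table, its columns, and the rows are made up for illustration.

```python
import sqlite3

# An in-memory database disappears when the connection closes,
# which makes it convenient for experiments.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applicants (name TEXT, field TEXT, score REAL)")

# Parameterized inserts (the ? placeholders) keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?, ?)",
    [
        ("Ada", "physics", 9.1),
        ("Grace", "math", 8.7),
        ("Alan", "math", 9.5),
        ("Rosalind", "physics", 7.9),
    ],
)

# Count applicants and average their scores within each field.
rows = conn.execute(
    "SELECT field, COUNT(*), ROUND(AVG(score), 2) "
    "FROM applicants GROUP BY field ORDER BY field"
).fetchall()

for field, count, avg_score in rows:
    print(field, count, avg_score)

conn.close()
```

Replacing ":memory:" with a file name would persist the table to disk, so the same queries can be rerun later.
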
Data frames: SQL is great for handling large amounts of data, but unfortunately it lacks machine learning and visualization capabilities. So the workflow is often to use SQL or MapReduce to get data to a manageable size and then process it using a library like R's data frames or Python's pandas. Wes McKinney, who created pandas, has a great video tutorial on YouTube. Watch it and follow along by checking out the accompanying code on GitHub.
Machine-Learning: A lot of data science can be done with select, join, and groupby (or equivalently, map and reduce), but sometimes you need to do some non-trivial machine learning. Before you jump into fancier algorithms, try out simpler ones like Naive Bayes and regularized linear regression, and make sure you understand these basics really well. In Python, these are implemented in scikit-learn. In R, look at the glm function and the gbm package.
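
To see what "regularized" means before reaching for a library, here is a minimal pure-Python sketch of ridge regression with a single feature. This is a hand-rolled illustration, not scikit-learn's API; the function name `fit_ridge_1d`, the `alpha` parameter name, and the toy data are assumptions. The penalty `alpha` is applied to the slope only and shrinks it toward zero.

```python
def fit_ridge_1d(xs, ys, alpha=0.0):
    """Fit y ~ intercept + slope * x, adding alpha * slope**2 to the squared error.

    With alpha = 0 this is ordinary least squares; a larger alpha
    pulls the slope toward zero. The intercept is not penalized.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Centered sums: covariance-like and variance-like terms.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Toy data lying exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

for alpha in (0.0, 10.0):
    slope, intercept = fit_ridge_1d(xs, ys, alpha)
    print("alpha =", alpha, "slope =", slope, "intercept =", intercept)
```

With no penalty the fit recovers the true slope; with the penalty the slope is visibly smaller, which is the trade-off regularization makes to reduce overfitting on noisy data.
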
Visualization: Data science is about communicating your findings, and data visualization is an incredibly valuable part of that. Python offers MATLAB-like plotting via matplotlib, which is functional, even if it's aesthetically lacking. R offers ggplot2, which is prettier. Of course, if you're really serious about dynamic visualizations, try D3.

These are some of the foundational skills that will be invaluable to your career as a data scientist. While they cover only a subset of what we talk about at The Data Incubator (there's a lot more to cover in stats, machine learning, and MapReduce), this is a great start.

Tianhui Michael Li 2014-09-17
