Ten Simple Rules for Digital Data Storage

This post originally appeared on the Software Carpentry website.

Edmund Hart, Pauline Barmby, David LeBauer, François Michonneau, Sarah Mount, Timothée Poisot, Kara Woo, Naupaka Zimmerman, and Jeff Hollister have just posted a pre-print on PeerJ titled “Ten Simple Rules for Digital Data Storage”. The paper is a distributed, collaborative effort that grew out of a thread on the Software Carpentry instructors mailing list and continued on GitHub. There are a lot of good ideas in it, many of which we should fold back into our lessons, and we hope it will spark more collaborations in our community.

Data is the central currency of science, but the nature of scientific data has changed dramatically with the rapid pace of technology. This change has led to a wide variety of data formats, dataset sizes, levels of data complexity, data use cases, and data sharing practices. Improvements in high-throughput DNA sequencing, sustained institutional support for large sensor networks, and sky surveys with large-format digital cameras have created massive quantities of data. At the same time, the combination of increasingly diverse research teams and data aggregation in portals (e.g., GBIF or iDigBio for biodiversity data) necessitates increased coordination among data collectors and institutions. As a consequence, “data” can now mean anything from petabytes of information stored in professionally maintained databases, through spreadsheets on a single computer, to hand-written tables in lab notebooks on shelves. All remain important, but data curation practices must continue to keep pace with the changes brought about by new forms and practices of data collection and storage.