How to approach selecting a license for data release

Suggestions and perspectives on how to select a license for your data publication

This post originally appeared on the Data Carpentry website

A question recently came up on the SWC Discuss mailing list about how to select a license for publicly released data. Answering these questions without knowing the life story of the data in question can only be done in vast generalities, but there are some nearly universal issues that everyone should consider.

No person will be able to tell you definitively which license is the best. Your selection will be determined by the kind of data you are using, community standards, and your personal preference. There are pros and cons to permissive versus restrictive licenses, which you will need to evaluate using your own preferences.

The first step is not to select a license

The first step is to determine if you are actually able to release the data in a public repository. You need to determine if the data you are releasing is subject to copyright, contractual, or legal sensitivities. You acquired your data somehow, and that method or the content may carry restrictions on its redistribution. Just because you can access the data source without paying money or logging into a website doesn’t mean that it is public data and that you are free to harvest and distribute it.

Some starter questions to ask include:

  • Did you have permission to gather the data and/or are you abiding by any applicable Terms of Service by gathering, using, or publishing that data?
  • Are you including data values where entities, such as publishers or users, hold copyright?
  • Was your access to the original data you’ve processed under a contract that restricts or has stipulations about how derivatives are released?
  • Does your home institution have policies on how data products and other intellectual property are released and licensed?

These scenarios will impact your ability to make the data public and which kind of license you can attach to it, which is why I always suggest working through this process in the hypothetical when you begin a project. Data copyright and IP control are thorny issues that, like other copyright domains, vary by country, institution, and year of creation.

Once you have determined that you can make your data public and need to select a license, here are some steps to help guide you through that process.

First, look at your community

What does your normal community of practice release under? You may also be depositing your data into a repository that has some opinions or other stock licenses to choose from. Let this group guide you, if you have one.

Looking to the larger general community of datasets published with DataCite DOIs (think FigShare and Zenodo), most of the declared rights statements are within the Creative Commons family of licenses. That doesn’t mean that Creative Commons is the best for data, but it is one of the most recognised license systems out there.

Second, select a license that you understand

When you select a license, you (and your data users!) should understand the terms for release, access, reuse, and republication that the license grants. Creative Commons and Open Data Commons have a serious advantage here because they each focus on making ‘human readable’ versions to help authors make informed selections, but many other licensing schemes (particularly in the software communities) have their own canonical and well understood license types.

Selecting the individual license type you’d like to release under means that you need to carefully consider how restrictive or permissive you’d like to be. Generally, options like a public domain release or a simple attribution requirement are recommended for data. Data copyright is a complex issue that differs by country, state, and age of the content, with no brief statement doing it justice. The legal implications of data licenses are also still being sorted out and differ from country to country. While it is true that facts and other such data cannot be copyrighted, the assembly and selection of datasets can be (depending on the country). Given the international audience of this blog, I find it best to leave each reader to investigate the applicable copyright law. Be sure to read about the impact of attribution stacking if you select an attribution requirement license or anything more restrictive.

Do not be afraid of selecting a public domain license, such as CC0. This declaration does not mean that future users of the data are off the hook for citation from a scholarly perspective, only from a legal one. You may still include a request to be cited, along with suggested citation information, with your data deposit in most data repositories.

Third, declare the license

There isn’t much to declaring the license other than stating it prominently wherever the data is located and/or within the data files themselves. Most data repositories have an option to select the license you’d like to release the data under, which is the most minimal way of making a declaration.
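
If you are not depositing through a repository form, or you want the declaration to travel with the files, one option is a small machine-readable file stored next to the data. The sketch below is only an illustration: the function names, the field names, and the `LICENSE.json` filename are assumptions made for this example, not a formal metadata standard or any repository's required format. It uses CC0 with its SPDX identifier (`CC0-1.0`) as the example license.

```python
import json
import tempfile
from pathlib import Path


def build_license_declaration(title, license_id, license_url, suggested_citation=None):
    """Return a plain dict describing the license for a dataset.

    The field names here are illustrative assumptions, not a metadata standard.
    """
    declaration = {
        "title": title,
        "license": license_id,
        "license_url": license_url,
    }
    if suggested_citation:
        # A citation request is a scholarly courtesy, not a term of the license.
        declaration["suggested_citation"] = suggested_citation
    return declaration


def write_license_declaration(data_dir, declaration, filename="LICENSE.json"):
    """Write the declaration as JSON next to the data files and return its path."""
    path = Path(data_dir) / filename
    path.write_text(json.dumps(declaration, indent=2) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical dataset details, used only to show the shape of the output.
    example = build_license_declaration(
        title="Example survey dataset",
        license_id="CC0-1.0",
        license_url="https://creativecommons.org/publicdomain/zero/1.0/",
        suggested_citation="Please cite this dataset using the DOI assigned by the repository.",
    )
    with tempfile.TemporaryDirectory() as tmp:
        written = write_license_declaration(tmp, example)
        print(written.read_text(encoding="utf-8"))
```

A plain-text README stating the same information serves the same purpose; the point is simply that the license and any citation request sit where a future user of the data will find them.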

Where to get help

The UK’s Digital Curation Centre has a lengthy guide on selecting licenses that you can look to for more information on specific licenses. However, I cannot suggest strongly enough that you make connections to your local data services groups at your university or institute. You may need to speak to multiple people in order to answer all the questions posed in the previous sections, but having a point of contact for a consultation or other data services can be an important starting point. While this list is geared toward larger universities, and people in these positions may have a variety of job titles, you can look for:

  • a research data service unit, scholarly publishing commons, or copyright librarian within your university’s library system.
  • research support staff within your research unit, institute, department, or college.
  • your institutional review board if you are dealing with human subject data.
  • your office of technology transfer, if you are dealing with data that may be commercialised or the basis of a patent.
