Open Data

This is the second part of Peter Cameron’s post on Open Publication, Open Data and Open Software (the fourth part of the trilogy, called Open Research Funding, appeared later). If you would like to leave comments for the author, please leave them here.

I recently wrote about open data on my own blog; some of the points I made there will be repeated.

Open data is an important issue. For example, drug companies invest huge sums of money in new drugs, and have traditionally been reluctant to advertise the failure of a drug in clinical trials. It can be worse; sometimes they can cherry-pick the results of the trials and claim that the drug is useful in some situations. Ben Goldacre and others are campaigning for more transparency in the publication of clinical trials.

More generally, scientific journals tend to have a publication bias towards positive results, so that an important study finding no connection between certain factors may be suppressed, leading to repetition and waste of resources. This has led to calls for all experimental data to be made freely available to all researchers, subject to various safeguards.

I began to think about this when I was asked to comment on a draft concordat proposed by the UK research councils on open data. Here is their definition of research data:

Research Data are quantitative information or qualitative statements collected by researchers in the course of their work by experimentation, observation, interview or other methods. Data may be raw or primary (e.g. direct from measurement or collection) or derived from primary data for subsequent analysis or interpretation (e.g. cleaned up or as an extract from a larger data set). The purpose of open research data is to provide the information necessary to support or validate a research project’s observations, findings or outputs. Data may include, for example, statistics, collections of digital images, sound recordings, transcripts of interviews, survey data and fieldwork observations with appropriate annotations.

And here are the principles:

  1. Open access to research data is an enabler of high quality research, a facilitator of innovation and safeguards good research practice.
  2. Good data management is fundamental to all stages of the research process and should be established at the outset.
  3. Data must be curated so that they are accessible, discoverable and useable.
  4. Open access to research data carries a significant cost, which should be respected by all parties.
  5. There are sound reasons why the openness of research data may need to be restricted but any restrictions must be justified and justifiable.
  6. The right of the creators of research data to reasonable first use is recognised.
  7. Use of others’ data should always conform to legal, ethical and regulatory frameworks including appropriate acknowledgement.
  8. Data supporting publications should be accessible by the publication date and should be in a citeable form.
  9. Support for the development of appropriate data skills is recognised as a responsibility for all stakeholders.
  10. Regular reviews of progress towards open access to research data should be undertaken.

There are difficult issues here, related to data management and curation. In what format should data be presented so as to be not only accessible but usable by other researchers? Data formatted so as to be read into a proprietary program obviously will not do. But data formats change, and any data repository has to track these changes.

The document is aimed primarily at experimental or observational scientists, much of whose work involves gathering data, sometimes (as in the case of astronomers or geneticists) in vast quantities. Are there issues here for pure mathematics?

I believe there are, and that pure mathematics already exhibits examples of both good and bad data curation. One of the greatest achievements in pure mathematics is the classification of the finite simple groups. The “ATLAS of Finite Groups” project, begun by John Conway and continued in Web form as the ATLAS of Finite Group Representations by Rob Wilson and his colleagues, presents data about these groups, and is a prime example of good practice. It is a huge repository of information about groups which are close to simple, including generating matrices and permutations in a wide variety of representations. It is carefully checked, and satisfies principle 3 above: computer algebra systems such as Magma and GAP can directly import the appropriate generators from the site, in a way which is almost transparent to the user.

John Cremona also recommended to me the database LMFDB of L-functions and modular forms, though I haven’t used this myself.

However, most data in pure mathematics is the result of a computer search by an individual or a small team, and is probably just put on a web page somewhere, in a format which may be more or less usable by others. I don’t know whether we could (or should) develop common standards, or at least recommendations for good practice.
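To make the format question concrete, here is a minimal sketch in Python of one possible convention for search output of this kind: a plain-text JSON record that carries an explicit format version, so that a reader can still load files written under an older layout. Everything in it is an assumption made for illustration, not an existing standard: the field names (`format_version`, `name`, `degree`, `generators`), the older layout using a `gens` field, and the choice of permutations written as 1-based image lists.

```python
import json

CURRENT_VERSION = 2  # hypothetical version number of this illustrative layout


def dump_record(name, generators):
    """Serialise permutations (1-based image lists) as a plain-text JSON string."""
    record = {
        "format_version": CURRENT_VERSION,
        "name": name,
        "degree": len(generators[0]),
        "generators": generators,
    }
    return json.dumps(record, indent=2, sort_keys=True)


def load_record(text):
    """Parse a record, upgrading the (hypothetical) version 1 layout if needed."""
    record = json.loads(text)
    version = record.get("format_version", 1)
    if version == 1:
        # Assumed older layout: generators stored under "gens", no degree field.
        gens = record.pop("gens")
        record["generators"] = gens
        record["degree"] = len(gens[0])
        record["format_version"] = CURRENT_VERSION
    elif version != CURRENT_VERSION:
        raise ValueError(f"unsupported format_version: {version}")
    # Basic integrity check: each generator must be a permutation of 1..degree.
    expected = list(range(1, record["degree"] + 1))
    for g in record["generators"]:
        if sorted(g) != expected:
            raise ValueError(f"not a permutation of degree {record['degree']}: {g}")
    return record
```

A file written with `dump_record("S3", [[2, 3, 1], [2, 1, 3]])` can be read back with `load_record`, and so can an older file that only has `name` and `gens`. The point of the sketch is only that an open text format with a stated version and a simple validity check is cheap to produce; whether any such convention should be recommended is exactly the open question.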

Continued in Part 3: Open Software