Open science data

Open science data is a type of open data focused on making the observations and results of scientific activities available for anyone to analyze and reuse. A major purpose of the drive for open data is to allow the verification of scientific claims, by allowing others to look at the reproducibility of results,[1] and to allow data from many sources to be integrated to give new knowledge.[2] While the idea of open science data has been actively promoted since the 1950s, the rise of the Internet has significantly lowered the cost and time required to publish or obtain data.


The concept of open access to scientific data was institutionally established with the formation of the World Data Center system (now the World Data System), in preparation for the International Geophysical Year of 1957–1958.[3] The International Council of Scientific Unions (now the International Council for Science) established several World Data Centers to minimize the risk of data loss and to maximize data accessibility, further recommending in 1955 that data be made available in machine-readable form.[4]

The first initiative to create an electronic bibliographic database of open access data was the Educational Resources Information Center (ERIC) in 1966. In the same year, MEDLINE was created, a free access online database managed by the National Library of Medicine and the National Institutes of Health (USA) with bibliographical citations from journals in the biomedical area, which later would be called PubMed, currently with over 14 million complete articles.[5]

In 1995 GCDIS (US) put its position clearly in On the Full and Open Exchange of Scientific Data (A publication of the Committee on Geophysical and Environmental Data - National Research Council):

"The Earth's atmosphere, oceans, and biosphere form an integrated system that transcends national boundaries. To understand the elements of the system, the way they interact, and how they have changed with time, it is necessary to collect and analyze environmental data from all parts of the world. Studies of the global environment require international collaboration for many reasons:

  • to address global issues, it is essential to have global data sets and products derived from these data sets;
  • it is more efficient and cost-effective for each nation to share its data and information than to collect everything it needs independently; and
  • the implementation of effective policies addressing issues of the global environment requires the involvement from the outset of nearly all nations of the world.

International programs for global change research and environmental monitoring crucially depend on the principle of full and open data exchange (i.e., data and information are made available without restriction, on a non-discriminatory basis, for no more than the cost of reproduction and distribution)." [6]

The last phrase highlights the traditional cost of disseminating information by print and post. It is the removal of this cost through the Internet which has made data vastly easier to disseminate technically. It is correspondingly cheaper to create, sell and control many data resources and this has led to the current concerns over non-open data.

More recent uses of the term include:

  • SAFARI 2000 (South Africa, 2001) used a license informed by ICSU and NASA policies[7]
  • The human genome[8] (Kent, 2002)
  • An Open Data Consortium on geospatial data[9] (2003)
  • Manifesto for Open Chemistry[10] (Murray-Rust and Rzepa, 2004)
  • Presentations to JISC and OAI under the title "open data"[11] (Murray-Rust, 2005)
  • Science Commons launch[12] (2004)
  • First Open Knowledge Forums (London, UK) run by the Open Knowledge Foundation (London UK) on open data in relation to civic information and geodata[13] (February and April 2005)
  • The Blue Obelisk group in chemistry (mantra: Open Data, Open Source, Open Standards) (2005) doi:10.1021/ci050400b
  • The Petition for Open Data in Crystallography is launched by the Crystallography Open Database Advisory Board[14] (2005)
  • XML Conference & Exposition 2005[15] (Connolly 2005)
  • SPARC Open Data mailing list[16] (2005)
  • First draft of the Open Knowledge Definition explicitly references "Open Data"[17] (2005)
  • XTech[18] (Dumbill, 2005),[19] (Bray and O'Reilly 2006)

In 2004, the Science Ministers of all nations of the OECD (Organisation for Economic Co-operation and Development), which includes most developed countries of the world, signed a declaration which essentially states that all publicly funded archive data should be made publicly available.[20] Following a request and an intense discussion with data-producing institutions in member states, the OECD published in 2007 the OECD Principles and Guidelines for Access to Research Data from Public Funding as a soft-law recommendation.[21]

In 2005 Edd Dumbill introduced an "Open Data" theme in XTech.

In 2006 Science Commons[22] ran a 2-day conference in Washington where the primary topic could be described as Open Data. It was reported that the amount of micro-protection of data (e.g. by license) in areas such as biotechnology was creating a tragedy of the anticommons, in which the cost of obtaining licenses from a large number of owners makes it uneconomic to do research in the area.

In 2007 SPARC and Science Commons announced a consolidation and enhancement of their author addenda.[23]

In 2007 the OECD (Organisation for Economic Co-operation and Development) published the Principles and Guidelines for Access to Research Data from Public Funding.[24] The Principles state that:

Access to research data increases the returns from public investment in this area; reinforces open scientific inquiry; encourages diversity of studies and opinion; promotes new areas of work and enables the exploration of topics not envisioned by the initial investigators.

In 2010 the Panton Principles were launched,[25] advocating Open Data in science and setting out four principles with which providers must comply for their data to be considered Open.

In 2011 an initiative was launched to realize the Linked Open Science[26] approach of openly sharing and interconnecting scientific assets such as datasets, methods, tools and vocabularies.

In 2012, the Royal Society published a major report, "Science as an Open Enterprise"[27], advocating open scientific data and considering its benefits and requirements.

In 2013 the G8 Science Ministers released a Statement[28] supporting a set of principles for open scientific research data.

In 2015 the World Data System of the International Council for Science adopted a new set of Data Sharing Principles[29][30] to embody the spirit of 'open science'. These Principles are in line with data policies of national and international initiatives and they express core ethical commitments operationalized in the WDS Certification of trusted data repositories and service.

Relation to open access

Much data is made available through scholarly publication, which now attracts intense debate under "Open Access" and semantically open formats, such as offering scientific articles in JATS format. The Budapest Open Access Initiative (2001) coined this term:

By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

The logic of the declaration permits re-use of the data although the term "literature" has connotations of human-readable text and can imply a scholarly publication process. In Open Access discourse the term "full-text" is often used which does not emphasize the data contained within or accompanying the publication.

Some Open Access publishers do not require the authors to assign copyright and the data associated with these publications can normally be regarded as Open Data. Some publishers have Open Access strategies where the publisher requires assignment of the copyright and where it is unclear that the data in publications can be truly regarded as Open Data.

The ALPSP and STM publishers have issued a statement about the desirability of making data freely available:[31]

Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one ‘view’ of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other ‘views’ – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data.


We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question.

However, this statement has had no effect on the open availability of primary data related to publications in journals of ALPSP and STM members. Data tables provided by authors as supplements to a paper are still available to subscribers only.

Relation to peer review

In an effort to address issues with the reproducibility of research results, some scholars are asking that authors agree to share their raw data as part of the scholarly peer review process.[32] As far back as 1962, for example, a number of psychologists have attempted to obtain raw data sets from other researchers, with mixed results, in order to reanalyze them. A recent attempt resulted in only seven data sets out of fifty requests. The notion of obtaining, let alone requiring, open data as a condition of peer review remains controversial.[33]

Open research computation

To make sense of scientific data they must be analysed. In all but the simplest cases, this is done by software. The extensive use of software poses problems for the reproducibility of research. To keep research reproducible, it is necessary to publish not only all data, but also the source code of all software used, and all the parametrization used in running this software. Presently, these requests are rarely ever met. Ways to come closer to reproducible scientific computation are discussed under the catchword "open research computation".
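One way to approach the requirement above is to publish a small machine-readable record alongside each result. The following is a minimal sketch, not an established standard: the function names, the manifest keys and the sample parameters are all hypothetical, chosen only to illustrate recording the exact data, the software version and the parametrization of a run.

```python
import hashlib
import json
import platform
import sys


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest that identifies an exact input data set."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(data: bytes, parameters: dict, code_version: str) -> dict:
    """Collect the items a reader would need to rerun an analysis.

    The key names are illustrative assumptions, not part of any standard.
    """
    return {
        "data_sha256": sha256_of_bytes(data),           # which data was analysed
        "parameters": parameters,                       # how the software was run
        "code_version": code_version,                   # which source code was used
        "python_version": platform.python_version(),    # runtime environment
    }


if __name__ == "__main__":
    # Hypothetical input data and parameters, for illustration only.
    raw = b"sample,value\na,1.0\nb,2.5\n"
    manifest = build_manifest(raw, {"smoothing_window": 5, "seed": 42}, "v0.1.0")
    json.dump(manifest, sys.stdout, indent=2, sort_keys=True)
```

A checksum does not replace publishing the data itself; it only lets a reader confirm that the data they obtained is the data that was analysed.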

See also


  1. ^ Spiegelhalter, D. Open data and trust in the literature. The Scholarly Kitchen. Retrieved 7 September 2018.
  2. ^ Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L.B.; Bourne, P.E.; Bouwman, J.; Brookes, A.J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, Scott; Evelo, C. T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A.J.G.; Groth, P.; Goble, C.; Grethe, J. S.; Heringa, J.; ’t Hoen, P.A.C; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S. J.; Martone, M.E.; Mons, A.; Packer, A.L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M. A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; Mons, B. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data. 3: 160018. doi:10.1038/sdata.2016.18. ISSN 2052-4463. PMC 4792175. PMID 26978244.
  3. ^ Committee on Scientific Accomplishments of Earth Observations from Space, National Research Council (2008). Earth Observations from Space: The First 50 Years of Scientific Achievements. The National Academies Press. p. 6. ISBN 978-0-309-11095-2. Retrieved 2010-11-24.
  4. ^ World Data Center System (2009-09-18). "About the World Data Center System". NOAA, National Geophysical Data Center. Retrieved 2010-11-24.
  5. ^ Machado, Jorge. "Open data and open science". In Albagli, Maciel, Abdo. "Open Science, Open Questions", 2015
  6. ^ National Research Council (1995). On the Full and Open Exchange of Scientific Data. Washington, DC: The National Academies Press. doi:10.17226/18769. ISBN 978-0-309-30427-6.
  7. ^ "Safari 2000 Data Policy" (PDF). Archived from the original (PDF) on September 29, 2006. Retrieved May 28, 2011.
  8. ^ Bruce Stewart (2002). "Keeping Genome Data Open; An Interview with Jim Kent".
  9. ^ Open Data Consortium ca. 2003
  10. ^ Peter Murray-Rust, Henry Rzepa 2004
  11. ^ "Open Data" at CERN Workshop on Innovations in Scholarly Communication (OAI4) Peter Murray-Rust, 2005
  12. ^ Report on Science Commons Dec 2004
  13. ^ Open Knowledge Forums
  14. ^
  15. ^ Semantic Web Data Integration with hCalendar and GRDDL; Dan Connolly | From Syntax to Semantics (XML 2005) Atlanta, GA, USA
  16. ^ SPARC Open Data Mailing list
  17. ^ [1]
  18. ^ XTech 2005
  19. ^ Tim Bray and Tim O'Reilly
  20. ^ OECD Declaration on Open Access to publicly funded data Archived 20 April 2010 at the Wayback Machine
  21. ^ OECD Principles and Guidelines for Access to Research Data from Public Funding
  22. ^ Science Commons in Washington 2006
  23. ^ SPARC-OAF forum
  24. ^ "OECD Principles and Guidelines for Access to Research Data from Public Funding". OECD.
  25. ^ Launch of the Panton Principles for Open Data in Science and 'Is It Open Data?' Web Service
  26. ^ Kauppinen, T.; Espindola, G. M. D. (2011). "Linked Open Science-Communicating, Sharing and Evaluating Data, Methods and Results for Executable Papers". Procedia Computer Science. 4: 726–731. doi:10.1016/j.procs.2011.04.076.
  27. ^ "Final report - Science as an open enterprise". Retrieved 2017-09-29.
  28. ^ "G8 Science Ministers Statement". Foreign & Commonwealth Office.
  29. ^ "Global Data Organization Adopts Open Data Sharing Principles". AlphaGalileo. Retrieved 8 January 2016.
  30. ^ Emerson, Claudia; Faustman, Elaine M.; Mokrane, Mustapha; Harrison, Sandy (2015). "World Data System (WDS) Data Sharing Principles". doi:10.5281/zenodo.34354.
  31. ^ A statement by the Association of Learned and Professional Society Publishers (ALPSP) and the International Association of Scientific, Technical and Medical Publishers (STM), Association of Learned and Professional Society Publishers
  32. ^ "The PRO Initiative for Open Science". Peer Reviewers' Openness Initiative. Retrieved 15 September 2018.
  33. ^ Witkowski, Tomasz (2017). "A Scientist Pushes Psychology Journals toward Open Data". Skeptical Inquirer. 41 (4): 6–7.

External links

Aaron Swartz

Aaron Hillel Swartz (November 8, 1986 – January 11, 2013) was an American computer programmer, entrepreneur, writer, political organizer, and Internet hacktivist. He was involved in the development of the web feed format RSS, the Markdown publishing format, the organization Creative Commons, and a website framework, and was a co-founder of the social news site Reddit. He was given the title of co-founder by Y Combinator owner Paul Graham after the formation of Not a Bug, Inc. (a merger of Swartz's project Infogami and a company run by Alexis Ohanian and Steve Huffman).

Swartz's work also focused on civic awareness and activism. He helped launch the Progressive Change Campaign Committee in 2009 to learn more about effective online activism. In 2010, he became a research fellow at Harvard University's Safra Research Lab on Institutional Corruption, directed by Lawrence Lessig. He founded the online group Demand Progress, known for its campaign against the Stop Online Piracy Act.

In 2011, Swartz was arrested by Massachusetts Institute of Technology (MIT) police on state breaking-and-entering charges, after connecting a computer to the MIT network in an unmarked and unlocked closet, and setting it to download academic journal articles systematically from JSTOR using a guest user account issued to him by MIT. Federal prosecutors later charged him with two counts of wire fraud and eleven violations of the Computer Fraud and Abuse Act, carrying a cumulative maximum penalty of $1 million in fines, 35 years in prison, asset forfeiture, restitution, and supervised release. Swartz declined a plea bargain under which he would have served six months in federal prison. Two days after the prosecution rejected a counter-offer by Swartz, he was found dead in his Brooklyn apartment, where he had hanged himself. In 2013, Swartz was inducted posthumously into the Internet Hall of Fame.

Data-driven journalism

Data-driven journalism, often shortened to "ddj", a term in use since 2009, is a journalistic process based on analyzing and filtering large data sets for the purpose of creating or elevating a news story. Many data-driven stories begin with newly available resources such as open source software, open access publishing and open data, while others are products of public records requests or leaked materials. This approach to journalism builds on older practices, most notably on computer-assisted reporting (CAR), a label used mainly in the US for decades. Another label for a partially similar approach is "precision journalism", based on a book by Philip Meyer, published in 1972, in which he advocated the use of techniques from the social sciences in researching stories.

Data-driven journalism has a wider approach. At its core, the process builds on the growing availability of open data that is freely available online and analyzed with open source tools. Data-driven journalism strives to reach new levels of service for the public, helping the general public or specific groups or individuals to understand patterns and make decisions based on the findings. As such, data-driven journalism might help to put journalists into a role relevant for society in a new way.

Since the introduction of the concept, a number of media companies have created "data teams" which develop visualizations for newsrooms. Notable examples include teams at Reuters, ProPublica, and La Nacion (Argentina). In Europe, The Guardian and Berliner Morgenpost have very productive teams, as do public broadcasters.

As projects like the MP expense scandal (2009) and the 2013 release of the "offshore leaks" demonstrate, data-driven journalism can assume an investigative role, on occasion dealing with "not-so open", that is, secret, data.

The annual Data Journalism Awards recognize outstanding reporting in the field of data journalism, and numerous Pulitzer Prizes in recent years have been awarded to data-driven storytelling, including the 2018 Pulitzer Prize in International Reporting and the 2017 Pulitzer Prize in Public Service.


GenBank

The GenBank sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). The National Center for Biotechnology Information is a part of the National Institutes of Health in the United States.

GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown in recent years at an exponential rate, doubling roughly every 18 months. Release 194, produced in February 2013, contained over 150 billion nucleotide bases in more than 162 million sequences. GenBank is built by direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centers.
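To make the stated growth rate concrete, the sketch below works out what "doubling roughly every 18 months" implies arithmetically. It is an illustration only: the function name is hypothetical, the starting figure is the rounded 150 billion bases quoted for Release 194, and the results are extrapolations under the assumption that the doubling period stays constant, not reported GenBank statistics.

```python
def projected_bases(initial_bases: float, months_elapsed: float,
                    doubling_months: float = 18.0) -> float:
    """Size after months_elapsed if the size doubles every doubling_months.

    Implements N(t) = N0 * 2 ** (t / T) for a constant doubling period T.
    """
    return initial_bases * 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    start = 150e9  # rounded figure quoted for Release 194 (February 2013)
    for months in (0, 18, 36):
        # Hypothetical extrapolation, assuming the doubling period holds.
        print(months, "months:", f"{projected_bases(start, months):.3g}", "bases")
```

Under this assumption the yearly growth factor is 2 ** (12 / 18), or about 1.59.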

Journal Article Tag Suite

The Journal Article Tag Suite (JATS) is an XML format used to describe scientific literature published online. It is a technical standard developed by the National Information Standards Organization (NISO) and approved by the American National Standards Institute with the code Z39.96-2012.

The NISO project was a continuation of the work done by NLM/NCBI, and popularized by the NLM's PubMed Central as a de facto standard for archiving and interchange of scientific open-access journals and their contents with XML.

With the NISO standardization the NLM initiative has gained a wider reach, and several other repositories, such as SciELO and Redalyc, adopted the XML formatting for scientific articles.

JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles, including research and non-research articles, as well as some non-article material such as letters, editorials, and book and product reviews. JATS allows for descriptions of the full article content or just the article header metadata.
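As an illustration of the header-only case, the sketch below parses a small hand-written fragment that uses JATS element names (article, front, journal-meta, article-meta, title-group, article-title) with Python's standard library. The journal and article titles are invented placeholders, the function name is hypothetical, and the fragment is not a complete or validated JATS document.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment with header metadata only; all values are placeholders.
SAMPLE = """<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Example Journal</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def read_header(xml_text: str) -> dict:
    """Extract a few header fields from a JATS-style article element."""
    root = ET.fromstring(xml_text)
    return {
        "article_type": root.get("article-type"),
        "journal_title": root.findtext(
            "front/journal-meta/journal-title-group/journal-title"),
        "article_title": root.findtext(
            "front/article-meta/title-group/article-title"),
    }


if __name__ == "__main__":
    print(read_header(SAMPLE))
```

Because the metadata is explicitly tagged, an indexing service can read such fields without interpreting the article body.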

Mertonian norms

In 1942, Robert K. Merton introduced "four sets of institutional imperatives taken to comprise the ethos of modern science... communism, universalism, disinterestedness, and organized skepticism." The subsequent portion of his book, The Sociology of Science, elaborated on these principles at "the heart of the Mertonian paradigm—the powerful juxtaposition of the normative structure of science with its institutionally distinctive reward system".

Open Commons Consortium

The Open Commons Consortium (aka OCC - formerly the Open Cloud Consortium) is a 501(c)(3) non-profit venture which provides cloud computing and data commons resources to support "scientific, environmental, medical and health care research." OCC manages and operates resources including the Open Science Data Cloud (aka OSDC), which is a multi-petabyte scientific data sharing resource. The consortium is based in Chicago, Illinois, and is managed by the 501(c)(3) Center for Computational Science Research.

Open Knowledge International

Open Knowledge International (OKI), known as the Open Knowledge Foundation (OKF) until April 2014 and then Open Knowledge until May 2016, is a global, non-profit network that promotes and shares information at no charge, including both content and data. It was founded by Rufus Pollock on 24 May 2004 in Cambridge, UK. Its slogan is "Sonnets to statistics, genes to geodata ..."

Open data

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open-source data movement are similar to those of other "open(-source)" movements such as open-source software, hardware, open content, open education, open educational resources, open government, open knowledge, open access, open science, and the open web. Paradoxically, the growth of the open data movement is paralleled by a rise in intellectual property rights. The philosophy behind open data has been long established (for example in the Mertonian tradition of science), but the term "open data" itself is recent, gaining popularity with the rise of the Internet and World Wide Web and, especially, with the launch of open-data government initiatives.

Open data can also be linked data; when it is, it is linked open data. One of the most important forms of open data is open government data (OGD), which is open data created by ruling government institutions. The importance of open government data stems from its being a part of citizens' everyday lives, down to the most routine and mundane tasks that are seemingly far removed from government.

Open science

Open science is the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional. Open science is transparent and accessible knowledge that is shared and developed through collaborative networks. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.

Open Science can be seen as a continuation of, rather than a revolution in, practices begun in the 17th century with the advent of the academic journal, when the societal demand for access to scientific knowledge reached a point at which it became necessary for groups of scientists to share resources with each other so that they could collectively do their work. In modern times there is debate about the extent to which scientific information should be shared. The conflict that led to the Open Science movement is between the desire of scientists to have access to shared resources versus the desire of individual entities to profit when other entities partake of their resources. Additionally, the status of open access and resources that are available for its promotion are likely to differ from one field of academic inquiry to another.

Open source

Open source is a term denoting that a product includes permission to use its source code, design documents, or content. It most commonly refers to the open-source model, in which open-source software or other products are released under an open-source license as part of the open-source-software movement. Use of the term originated with software, but has expanded beyond the software sector to cover other open content and forms of open collaboration.

Tracy Teal

Tracy Teal is an American bioinformatician and the Executive Director of Data Carpentry. She is known for her work in open science and biomedical data science education.


This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.