Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed for harvesting metadata descriptions of records in an archive so that services can be built using metadata from many archives. An implementation of OAI-PMH must support representing metadata in Dublin Core, but may also support additional representations.[1]

The protocol is usually just referred to as the OAI Protocol.

OAI-PMH uses XML over HTTP. Version 2.0 of the protocol was released in 2002; the document was last updated in 2015. It is licensed under Creative Commons BY-SA.


In the late 1990s, Herbert Van de Sompel (Ghent University) was working with researchers and librarians at Los Alamos National Laboratory (US) and called a meeting to address interoperability problems among e-print servers and digital repositories. The meeting was held in Santa Fe, New Mexico, in October 1999. A key development from the meeting was the definition of an interface that permitted e-print servers to expose metadata for the papers they held in a structured fashion, so that other repositories could identify and copy papers of interest. This interface/protocol was named the "Santa Fe Convention".[1]

Several workshops were held in 2000 at the ACM Digital Libraries conference[2] and elsewhere to share the ideas from the Santa Fe Convention. It was discovered at the workshops that the problems faced by the e-print community were also shared by libraries, museums, journal publishers, and others who needed to share distributed resources. To address these needs, the Coalition for Networked Information[3] and the Digital Library Federation[4] provided funding to establish an Open Archives Initiative (OAI) secretariat managed by Herbert Van de Sompel and Carl Lagoze. The OAI held a meeting at Cornell University (Ithaca, New York) in September 2000 to improve the interface developed at the Santa Fe Convention. The specifications were refined over e-mail.

OAI-PMH version 1.0 was introduced to the public in January 2001 at a workshop in Washington, D.C., and at another in February in Berlin, Germany. Subsequent modifications to the XML standard by the W3C required minor modifications to OAI-PMH, resulting in version 1.1. The current version, 2.0, was released in June 2002. It contained several technical changes and enhancements and is not backward compatible.


The OAI Protocol was adopted by many digital libraries, institutional repositories, and digital archives. Although registration is not mandatory, it is encouraged.

There are several large registries of OAI-compliant repositories:

  1. The Open Archives list of registered OAI repositories
  2. The OAI registry at University of Illinois at Urbana-Champaign
  3. The Celestial OAI registry
  4. Eprint’s Institutional Archives Registry
  5. The European Guide to OAI-PMH compliant repositories in the world
  6. A worldwide service and registry
  7. The material library of Finnish archives, libraries and museums


Some commercial search engines use OAI-PMH to acquire more resources. Google initially supported OAI-PMH when it launched Sitemaps, but in May 2008 decided to support only the standard XML Sitemaps format.[5] In 2004, Yahoo! acquired content from OAIster (University of Michigan) that had been obtained through metadata harvesting with OAI-PMH. Wikimedia uses an OAI-PMH repository to provide feeds of updates to Wikipedia and related sites for search engines and other bulk analysis or republishing efforts.[6] When thousands of files are harvested every day, OAI-PMH can reduce network traffic and other resource usage through incremental harvesting, in which only records changed since the previous harvest are requested.[7] NASA's Mercury metadata search system uses OAI-PMH to index thousands of metadata records from the Global Change Master Directory (GCMD) every day.[8]

The mod_oai project uses OAI-PMH to expose content that is accessible from Apache web servers to web crawlers.


OAI-PMH is based on a client–server architecture, in which "harvesters" request information on updated records from "repositories". Requests for data can be based on a datestamp range, and can be restricted to named sets defined by the provider. Data providers are required to provide XML metadata in Dublin Core format, and may also provide it in other XML formats.
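The request model described above can be illustrated with a short sketch. It builds a ListRecords request for a datestamp range and an optional set, then reads record identifiers and the resumption token from a response. The base URL https://example.org/oai, the set name "physics", the helper function names, and the sample response are all hypothetical and written only for illustration; the argument names (verb, metadataPrefix, from, until, set, resumptionToken) and the oai_dc prefix for Dublin Core are those defined in the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, from_date=None, until_date=None,
                           set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower bound of the datestamp range
    if until_date:
        params["until"] = until_date    # upper bound of the datestamp range
    if set_spec:
        params["set"] = set_spec        # restrict the harvest to a named set
    return f"{base_url}?{urlencode(params)}"


def build_resume_url(base_url, token):
    """Build the follow-up request for the next chunk of a long list.

    A resumptionToken is an exclusive argument: it is sent with the verb only.
    """
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in
                   root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hand-written sample response, not output from a real repository.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2015-01-02</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2015-01-15</datestamp>
      </header>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://example.org/oai",
                                 from_date="2015-01-01",
                                 until_date="2015-01-31",
                                 set_spec="physics"))
    ids, token = parse_list_records(SAMPLE_RESPONSE)
    print(ids, token)
```

A harvester that stores the date of its last successful run and passes it as the from argument on the next run only receives records changed since then, which is the basis of incremental harvesting.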

A number of software systems support OAI-PMH, including Fedora, EThOS from the British Library, GNU EPrints from the University of Southampton, Open Journal Systems from the Public Knowledge Project, Desire2Learn, DSpace from MIT, HyperJournal from the University of Pisa, Digibib from Digibis, MyCoRe, Primo, DigiTool, Rosetta and MetaLib from Ex Libris, ArchivalWare from PTFS, DOOR[9] from the eLab[10] in Lugano, Switzerland, panFMP from PANGAEA,[11] SimpleDL from Roaring Development, and jOAI.[12]


A number of large archives support the protocol, including arXiv and the CERN Document Server.


A dedicated workshop, the CERN Workshop on Innovations in Scholarly Communication, has been held at CERN in Geneva on a regular basis since 2001. It is now co-organised by the University of Geneva and CERN every two years in June. OAI8 was held on June 19–21, 2013; OAI9 on June 17–19, 2015; and OAI10 on June 21–23, 2017.



  1. Marshall Breeding (September 2002). "Understanding the Protocol for Metadata Harvesting of the Open Archives Initiative". Computers in Libraries. 8 (24): 24–29. Retrieved October 11, 2013.
  2. ACM Digital Libraries conference
  3. Coalition for Networked Information
  4. Digital Library Federation
  5. Google Webmaster blog
  6. "Wikimedia update feed service". Wikimedia Meta-Wiki. Retrieved 14 July 2013.
  7. incremental harvesting
  8. R. Devarakonda; G. Palanisamy; J. Green; B. Wilson (2010). "Data sharing and retrieval uses OAI-PMH". Earth Science Informatics. Springer Berlin / Heidelberg. 4 (1): 1–5. doi:10.1007/s12145-010-0073-0.
  9. DOOR
  10. eLab
  11. panFMP
  12. jOAI




ALTO (XML)

ALTO (Analyzed Layout and Text Object) is an open XML Schema developed by the EU-funded project called METAe.

The standard was initially developed to describe the OCR text and layout information of pages of digitized material. The goal was to describe the layout and text in a form that allows the original appearance to be reconstructed from the digitized information, similar to the approach of a lossless image saving operation.

ALTO is often used in combination with Metadata Encoding and Transmission Standard (METS) for the description of the whole digitized object and creation of references across the ALTO files, e.g. reading sequence description.

The standard has been hosted by the Library of Congress since 2010 and is maintained by an editorial board established at the same time.

From the final version of the ALTO standard in June 2004 (version 1.0) up to version 1.4, ALTO was maintained by CCS Content Conversion Specialists GmbH, Hamburg.

BASE (search engine)

BASE (Bielefeld Academic Search Engine) is a multi-disciplinary search engine for scholarly internet resources, created by Bielefeld University Library in Bielefeld, Germany. It is based on free and open-source software such as Apache Solr and VuFind. It harvests OAI metadata from institutional repositories and other academic digital libraries that implement the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and then normalizes and indexes the data for searching. In addition to OAI metadata, the library indexes selected web sites and local data collections, all of which can be searched via a single search interface.

Users can search bibliographic metadata including abstracts, if available. However, BASE does not currently offer full text search. It contrasts with commercial search engines in multiple ways, including in the types and kinds of resources it searches and the information it offers about the results it finds. Results can be narrowed down using drill down menus (faceted search). Bibliographic data is provided in several formats, and the results may be sorted by multiple fields, such as by author or year of publication.

Paying customers include EBSCO Information Services, which integrated BASE into its EBSCO Discovery Service (EDS). Non-commercial services can integrate BASE search for free using an API. BASE has become an increasingly important component of open access initiatives concerned with enhancing the visibility of their digital archive collections. On 6 October 2016, BASE surpassed the 100 million documents threshold, having indexed 100,183,705 documents from 4,695 content sources.

Dove Medical Press

Dove Medical Press is an academic publisher of open access peer-reviewed scientific and medical journals, with offices in Manchester, London (United Kingdom), Princeton, New Jersey (United States), and Auckland (New Zealand). In September 2017, Dove Medical Press was acquired by the Taylor and Francis Group (Informa PLC). As an open access publisher, Dove charges a publication fee to authors or their institutions or funders. This charge allows Dove to recover its editorial and production costs and to create a pool of funds that can be used to provide fee waivers for authors from less developed countries. Articles published are available via an interface following the Open Archives Initiative Protocol for Metadata Harvesting, a set of uniform standards promulgated by the Open Archives Initiative that allow metadata on archive holdings to be harvested. Dove is a member of the Association of Learned and Professional Society Publishers, the Committee on Publication Ethics, and the Open Archives Initiative. As of September 2016, it publishes over 100 journals.


EPrints

EPrints is a free and open-source software package for building open access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It shares many of the features commonly seen in document management systems, but is primarily used for institutional repositories and scientific journals. EPrints has been developed at the University of Southampton School of Electronics and Computer Science and released under a GPL license. The EPrints software is not to be confused with "Eprints" (or "e-prints"), which are preprints (before peer review) and postprints (after peer review) of research journal articles (eprints = preprints + postprints).

Iberian Books

Iberian Books is a bibliographical research project set up to chart the development of printing in Spain, Portugal and the New World in the early-modern period. It offers a catalogue of what was known to have been printed, along with a survey of surviving copies and links to digital editions. It is funded by the Andrew W. Mellon Foundation. The records created are made available in an open-access database under a Creative Commons license. Established in 2007 and based in the School of History at University College Dublin, as of December 2016 the project has made available data for the period from the beginning of printing in the Iberian Peninsula around 1472 to the middle of the seventeenth century. In late 2017, the project expects to publish the datasets for the second half of the seventeenth century. The datasets currently available online (1472–1650) hold information on 66,000 items, 339,000 copies, and 15,000 digital copies. The project works in partnership with the Digital Library Group at University College Dublin and with the Universal Short Title Catalogue Project based at the University of St Andrews.

Libertas Academica

Libertas Academica is an open access academic journal publisher specializing in the biological sciences and clinical medicine. It was acquired by SAGE Publications in September 2016.

Library portal

A library portal is an interface to access library resources and services through a single access and management point for users, combining the circulation and catalog functions of an integrated library system (ILS) with additional tools and facilities.

List of SIMILE projects

The following is a list of SIMILE projects.

The SIMILE tools assist in the storage, querying, transformation and mapping of very large collections of RDF data. The tools developed within SIMILE are meant to allow people who are not Semantic Web developers to create ontologies which describe their specialized metadata, create RDF and convert other types of metadata into RDF. These open source tools are designed to be scalable and provide for cross-community sharing of metadata at low cost.

Nature Precedings

Nature Precedings was an open access electronic preprint repository of scholarly work in the fields of biomedical sciences, chemistry, and earth sciences. It ceased accepting new submissions as of April 3, 2012.

Nature Precedings functioned as a permanent, citable archive for pre-publication research and preliminary findings. It was a place for researchers to share documents, including presentations, posters, white papers, technical papers, supplementary findings, and non-peer-reviewed manuscripts. It provided a rapid way to share preliminary findings, disseminate emerging results, solicit community feedback, and claim priority over discoveries. The content was curated and developed by the Nature Publishing Group.

OPUS (software)

OPUS is an open source software package under the GNU General Public License used for creating Open Access repositories that are compliant with the Open Archives Initiative Protocol for Metadata Harvesting. It provides tools for creating collections of digital resources, as well as for their storage and dissemination. It is usually used at universities, libraries and research institutes as a platform for institutional repositories.

Open-access repository

An open-access repository or open archive is a digital platform that holds research output and provides free, immediate and permanent access to research results for anyone to use, download and distribute. To facilitate open access, such repositories must be interoperable according to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Search engines harvest the content of open access repositories, constructing a database of research from around the world that is available free of charge. As opposed to a simple institutional repository or disciplinary repository, open-access repositories provide free access to research for users outside the institutional community and are one of the recommended ways to achieve the open access vision described in the Budapest Open Access Initiative definition of open access. This is sometimes referred to as the self-archiving or "green" route to open access.

Open Research Online

Open Research Online (ORO) is a repository of research publications run by The Open University (OU). It uses the GNU EPrints software and its repositories use the Open Archives Initiative Protocol for Metadata Harvesting. It is an open access repository and it accepts books, journal articles, patents, conference articles, and theses. As of 21 September 2008, 496 of its publications are from the Mathematics and Computing Department of the OU, while over two thousand are from the Science department.


Postprint

In academic publishing, a postprint is a digital draft of a research journal article after it has been peer reviewed. A digital draft before peer review is called a preprint. Jointly, postprints and preprints are called eprints. Expressed in the CrossRef terminology, any draft starting from the author's original version but prior to the accepted version is a preprint, whereas any draft from the accepted version onward, including the version of record or definitive work, is a postprint.

Since the advent of the Open Archives Initiative, preprints and postprints have been deposited in institutional repositories, which are interoperable because they are compliant with the Open Archives Initiative Protocol for Metadata Harvesting.

Eprints are at the heart of the open access initiative to make research freely accessible online. Eprints were first deposited or self-archived on arbitrary websites and then harvested by virtual archives such as CiteSeer (and, more recently, Google Scholar), or they were deposited in central disciplinary archives such as arXiv or PubMed Central.


SIGraDi

The Sociedad Iberoamericana de Gráfica Digital, SIGraDi (Iberoamerican Society of Digital Graphics), gathers researchers, educators and professionals in architecture, urban design, communication design, product design and art whose work involves the new digital media.

It is a sister organization to ACADIA, eCAADe, CAADRIA and ASCAAD.

SIGraDi organizes a yearly congress at which the most recent state-of-the-art digital technologies and applications are presented and debated.

SWORD (protocol)

SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard that allows digital repositories to accept the deposit of content from multiple sources in different formats (such as XML documents) via a standardized protocol. In the same way that the HTTP protocol allows any web browser to talk to any web server, so SWORD allows clients to talk to repository servers. SWORD is a profile (specialism) of the Atom Publishing Protocol, but restricts itself solely to the scope of depositing resources into scholarly systems.


ScientificCommons

ScientificCommons is a project of the University of St. Gallen Institute for Media and Communications Management. The major aim of the project is to develop the world’s largest archive of scientific knowledge with fulltexts freely accessible to the public.

ScientificCommons includes a search engine for publications and author profiles. It also allows the user to turn searches into customized RSS feeds of new publications. ScientificCommons also provides a fulltext caching service for researchers.

Since the beginning of 2013, ScientificCommons has been inaccessible. All visitors are forwarded to an administration login for the server virtualization management software Proxmox VE, and the site no longer presents a valid TLS certificate.

Search as a service

Search as a service is a branch of software as a service (SaaS), focussed on enterprise search or site-specific web search.


Z39.50

Z39.50 is an international standard client–server, application layer communications protocol for searching and retrieving information from a database over a TCP/IP computer network. It is covered by ANSI/NISO standard Z39.50, and ISO standard 23950. The standard's maintenance agency is the Library of Congress.

Z39.50 is widely used in library environments, often incorporated into integrated library systems and personal bibliographic reference software. Interlibrary catalogue searches for interlibrary loan are often implemented with Z39.50 queries.

Work on the Z39.50 protocol began in the 1970s, and led to successive versions in 1988, 1992, 1995 and 2003. The Contextual Query Language (formerly called the Common Query Language) is based on Z39.50 semantics.


This page is based on a Wikipedia article.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.