Data Format

Overview

The database snapshot, Simple Query Tool, REST API, and Data Feed products all return JSON-formatted data. For simplicity, that data is organized under the same schema in all cases; that schema is informally described on this page.

Regardless of the source, each record returned consists of one DOI Object, containing resource metadata. Each DOI Object in turn contains a list of zero or more OA Location Objects.

New fields may be added at any time. This won't be a problem for existing code in most cases since they will simply go unused, but you shouldn't rely on the number of fields being fixed.
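As a minimal sketch of code that tolerates new fields, the Python below reads only the fields it needs and ignores everything else. The `WANTED_FIELDS` tuple, the `pick_fields` helper, and the sample record (including `some_future_field`) are made up for illustration and are not part of the schema.

```python
import json

# Hypothetical subset of DOI object fields that this consumer cares about.
WANTED_FIELDS = ("doi", "is_oa", "oa_status")


def pick_fields(record: dict) -> dict:
    """Keep only the fields we use; any other field in the record is ignored."""
    return {name: record.get(name) for name in WANTED_FIELDS}


# Made-up record containing a field this code has never heard of.
raw = (
    '{"doi": "10.1234/example", "is_oa": true, '
    '"oa_status": "gold", "some_future_field": 42}'
)
summary = pick_fields(json.loads(raw))
```

Selecting fields by name, rather than by position or by count, means an extra field in the response does not change the result.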

Fields marked (beta) may have their behavior changed without warning. Changes to other fields will be announced on the Unpaywall mailing list.

DOI object

The DOI object is more or less a row in our main database: it's everything we know about a given DOI-assigned resource, including metadata about the resource itself and information about its OA status. It includes a list of zero or more OA Location Objects, as well as a best_oa_location property that's probably the OA Location you'll want to use.

best_oa_location Object|null The best OA Location Object we could find for this DOI.

The "best" location is determined using an algorithm that prioritizes publisher-hosted content first (e.g., Hybrid or Gold), then versions closer to the version of record (publishedVersion over acceptedVersion), then more authoritative repositories (PubMed Central over CiteSeerX).

Returns null if we couldn't find any OA Locations.
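The prioritization described above can be approximated with a sort key. This is only a rough sketch and not the actual selection algorithm: `pick_best`, `HOST_RANK`, and `VERSION_RANK` are hypothetical names, and the third tier (repository authoritativeness) is left out because the OA Location fields documented here do not expose it directly.

```python
from typing import List, Optional

# Lower rank means higher priority. Both tables are assumptions for illustration.
HOST_RANK = {"publisher": 0, "repository": 1}
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def pick_best(locations: List[dict]) -> Optional[dict]:
    """Return the highest-priority location, or None when the list is empty."""
    if not locations:
        return None

    def sort_key(location: dict) -> tuple:
        # Unknown values sort after every known value.
        host = HOST_RANK.get(location.get("host_type"), len(HOST_RANK))
        version = VERSION_RANK.get(location.get("version"), len(VERSION_RANK))
        return (host, version)

    return min(locations, key=sort_key)
```

In practice you would read `best_oa_location` from the record rather than recompute it; the sketch only illustrates the ordering of the criteria.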

data_standard Integer Indicates the data collection approaches used for this resource.

Possible values

  • 1 First-generation hybrid detection. Uses only data from the Crossref API to determine hybrid status. Does a good job for Elsevier articles and a few other publishers, but most publishers are not checked for hybrid.
  • 2 Second-generation hybrid detection. Uses additional sources and checks all publishers for hybrid. Finds about 10x as much hybrid content. data_standard==2 is the version used in the paper we wrote about the dataset.
doi String The DOI of this resource. This is always lowercase.
doi_url String The DOI in hyperlink form. This field simply contains "https://doi.org/" prepended to the doi field. It expresses the DOI in its correct format according to the Crossref DOI display guidelines.
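The relationship between doi and doi_url can be sketched in a few lines. The helper name `to_doi_url` and the sample DOI are hypothetical; the lowercasing step is an assumption that is only useful when you start from a DOI obtained elsewhere, since the doi field itself is already lowercase.

```python
DOI_URL_PREFIX = "https://doi.org/"


def to_doi_url(doi: str) -> str:
    """Lowercase a DOI and prepend the resolver prefix, mirroring doi_url."""
    return DOI_URL_PREFIX + doi.strip().lower()


# Made-up DOI, used only to show the transformation.
example_url = to_doi_url("10.1234/ABC.Example")
```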
genre String|null The type of resource. Currently the genre is identical to the Crossref-reported type of a given resource. The "journal-article" type is most common, but there are many others.
is_paratext Boolean Is the item an ancillary part of a journal, like a table of contents? See here for more information on how we determine whether an article is paratext.
is_oa Boolean Is there an OA copy of this resource? Convenience attribute; returns true when best_oa_location is not null.
journal_is_in_doaj Boolean Is this resource published in a DOAJ-indexed journal? Useful for defining whether a resource is Gold OA (depending on your definition, see also journal_is_oa).
journal_is_oa Boolean Is this resource published in a completely OA journal? Useful for defining whether a resource is Gold OA. Includes any fully-OA journal, regardless of inclusion in DOAJ. This includes journals by all-OA publishers and journals that would otherwise be all Hybrid or Bronze OA. See here for more information on OA journals.
journal_issns String|null Any ISSNs assigned to the journal publishing this resource. Separate ISSNs are sometimes assigned to print and electronic versions of the same journal. If there are multiple ISSNs, they are separated by commas. Example: 1232-1203,1532-6203
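Because journal_issns is a single comma-separated string that may also be null, it usually needs splitting before use. A minimal sketch, with `split_issns` as a hypothetical helper name:

```python
from typing import List, Optional


def split_issns(journal_issns: Optional[str]) -> List[str]:
    """Split the comma-separated journal_issns value; null yields an empty list."""
    if not journal_issns:
        return []
    # Strip whitespace defensively and drop any empty pieces.
    return [part.strip() for part in journal_issns.split(",") if part.strip()]


issns = split_issns("1232-1203,1532-6203")
```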
journal_issn_l String|null A single ISSN for the journal publishing this resource. An ISSN-L can be used as a primary key for a journal when more than one ISSN is assigned to it. Resources' journal_issns are mapped to ISSN-Ls using the issn.org table, with some manual corrections.
journal_name String|null The name of the journal publishing this resource. The same journal may have multiple name strings (e.g., "J. Foo", "Journal of Foo", "JOURNAL OF FOO", etc.). These have not been fully normalized within our database, so use with care.
oa_locations List List of all the OA Location objects associated with this resource. This list is unnecessary for the vast majority of use cases, since you probably just want the best_oa_location. It's included primarily for research purposes.
oa_locations_embargoed (beta) List List of OA Location objects associated with this resource that are not yet available. This list includes locations that we expect to be available in the future based on information like license metadata and journals' delayed OA policies. They do not affect the resource's oa_status and cannot be the best_oa_location or first_oa_location.
first_oa_location Object|null The OA Location Object with the earliest oa_date.

Returns null if we couldn't find any OA Locations.

oa_status String The OA status, or color, of this resource. Classifies OA resources by location and license terms as one of: gold, hybrid, bronze, green or closed. See here for more information on how we assign an oa_status.
has_repository_copy Boolean Whether there is a copy of this resource in a repository. True if this resource has at least one OA Location with host_type = "repository". False otherwise.
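The rule for has_repository_copy can be restated as a one-line check over oa_locations. This is a sketch of the stated definition, not the service's own code; `has_repository_copy` here is a hypothetical helper and the sample record is made up.

```python
def has_repository_copy(record: dict) -> bool:
    """True if at least one OA Location has host_type == "repository"."""
    locations = record.get("oa_locations") or []
    return any(loc.get("host_type") == "repository" for loc in locations)


# Made-up record with one publisher-hosted and one repository-hosted location.
sample = {
    "oa_locations": [
        {"host_type": "publisher"},
        {"host_type": "repository"},
    ]
}
in_repository = has_repository_copy(sample)
```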
published_date String|null The date this resource was published. The date is as reported by the publishers, who unfortunately have inconsistent definitions of what counts as officially "published." Returned as an ISO8601-formatted timestamp, generally with only year-month-day.
publisher String|null The name of this resource's publisher. Keep in mind that publisher name strings change over time, particularly as publishers are acquired or split up.
title String|null The title of this resource. It's the title. Pretty straightforward.
updated String Time when the data for this resource was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663
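Both published_date and updated are ISO8601 strings, so in Python they can be parsed with the standard library. A minimal sketch, assuming the values look like the examples on this page (a bare date, or a timestamp without a timezone offset); `parse_timestamp` is a hypothetical helper and the published_date value below is made up.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO8601 date or timestamp string; pass null through as None."""
    if value is None:
        return None
    # datetime.fromisoformat accepts both "YYYY-MM-DD" and full timestamps.
    return datetime.fromisoformat(value)


updated = parse_timestamp("2017-08-17T23:43:27.753663")
published = parse_timestamp("2017-08-17")  # made-up published_date value
```

The example timestamp carries no timezone offset, so the parsed value is a naive datetime.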
year Integer|null The year this resource was published. Just the year part of the published_date.
z_authors List of Crossref Contributor objects, or null The authors of this resource. These are formatted as a list of Crossref Contributor objects, which are described in the Crossref API docs here.
Contributor objects may also contain sequence elements, which at the time of writing are not included in the Crossref API docs.

OA Location object

The OA Location object describes a particular place where we found a given OA article. The same article is often available from multiple locations, and there may be differences in format, version, and license depending on the location; the OA Location object describes these key attributes. An OA Location Object is always a child of a DOI Object.

evidence String How we found this OA location.

Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. Example values:

  • oa journal (via journal title in doaj) We found the name of the journal that publishes this article in the DOAJ database.
  • oa repository (via pmcid lookup) We found this article in an index of PubMed Central articles.
host_type String The type of host that serves this OA location.

There are two possible values:

  • publisher means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to).
  • repository means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.
is_best Boolean Is this location the best_oa_location for its resource? See the DOI object's best_oa_location description for more on how we select which location is "best."
license String|null The license under which this copy is published.

We return several types of licenses:

  • Creative Commons licenses are uniformly abbreviated and lowercased. Example: cc-by-nc
  • Publisher-specific licenses are normalized using this format: acs-specific: authorchoice/editors choice usage agreement
  • When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns implied-oa
  • If we are unable to determine a license, or it's not an OA license, this field is null.
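One way to branch on the four cases above is sketched below. The category labels returned by `license_kind` are hypothetical, and treating every value that starts with "cc" as a Creative Commons license is an assumption based on the cc-by-nc example rather than a documented rule.

```python
from typing import Optional


def license_kind(license_value: Optional[str]) -> str:
    """Map a license field value to a coarse, made-up category label."""
    if license_value is None:
        return "unknown-or-not-oa"
    if license_value == "implied-oa":
        return "implied-oa"
    if license_value.startswith("cc"):
        # Assumption: Creative Commons abbreviations all begin with "cc".
        return "creative-commons"
    # Anything else is treated as a publisher-specific license.
    return "publisher-specific"


kinds = [
    license_kind("cc-by-nc"),
    license_kind("acs-specific: authorchoice/editors choice usage agreement"),
    license_kind("implied-oa"),
    license_kind(None),
]
```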
oa_date String|null When this document first became available at this location.

oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details.

pmh_id String|null OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It's null for locations that weren't found using OAI-PMH.
updated String Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663
url String The url_for_pdf if there is one; otherwise the landing page URL.

When we can't find a url_for_pdf (or there isn't one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.
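The fallback that url encodes can be written out explicitly, which is handy if you want to know which kind of link you ended up with. A minimal sketch; `resolve_url` is a hypothetical helper and the URLs are placeholders.

```python
from typing import Tuple


def resolve_url(location: dict) -> Tuple[str, bool]:
    """Return (url, is_pdf), preferring url_for_pdf over url_for_landing_page."""
    pdf_url = location.get("url_for_pdf")
    if pdf_url:
        return pdf_url, True
    return location["url_for_landing_page"], False


# Placeholder location with no PDF link, so the landing page is used.
landing_only = {
    "url_for_pdf": None,
    "url_for_landing_page": "https://example.org/article/1",
}
chosen_url, is_pdf = resolve_url(landing_only)
```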

url_for_landing_page String The URL for a landing page describing this OA copy.

When the host_type is "publisher" the landing page usually includes HTML fulltext.

url_for_pdf String|null The URL with a PDF version of this OA copy.

Pretty much what it says.

version String The content version accessible at this location.

We use the DRIVER Guidelines v2.0 VERSION standard to define versions of a given article; see those docs for complete definitions of terms. Here's the basic idea, though, for the three version types we support:

  • submittedVersion is not yet peer-reviewed.
  • acceptedVersion is peer-reviewed, but lacks publisher-specific formatting.
  • publishedVersion is the version of record.