This vignette explains how common scholarly identifiers are formally defined, what their structural components are, and what it means for them to be valid in a programmatic context.
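When working with identifiers in R, it is essential to distinguish between structural validity (the string has the expected form) and stronger notions of validity, such as a matching checksum.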
The functions in scholid operate at the structural level. The regexes shown below describe the structural form that an identifier must match.
Governing body: International DOI Foundation
Standard: ISO 26324
A DOI has two parts:
prefix/suffix
The prefix always starts with 10.
Example:
10.1000
10.1038
Example:
10.1000/182
10.1038/s41586-020-2649-2
A commonly accepted structural regex:
^10\.\d{4,9}/\S+$
This checks:
- Prefix starts with 10.
- 4–9 digits
- A slash
- Non-whitespace suffix
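As a minimal sketch of how this pattern could be applied in R (the helper name `looks_like_doi()` is hypothetical and is not claimed to be part of scholid):

```r
# Hypothetical helper for illustration; structural check only.
doi_pattern <- "^10\\.\\d{4,9}/\\S+$"

looks_like_doi <- function(x) {
  # perl = TRUE so that \d and \S are interpreted as PCRE character classes
  grepl(doi_pattern, x, perl = TRUE)
}

looks_like_doi(c("10.1000/182", "10.1038/s41586-020-2649-2",
                 "11.1000/182", "10.12/abc"))
# TRUE TRUE FALSE FALSE
```

A `TRUE` result only means the string has the form of a DOI; it does not show that the DOI is registered or resolves.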
Governing body: ORCID, Inc.
Standard basis: ISO 7064 (checksum algorithm)
An ORCID iD consists of 16 characters, displayed in four hyphen-separated groups:
0000-0002-1825-0097
The final character is either a digit or X.
Internally (without hyphens):
0000000218250097
The check character is computed with the ISO 7064 MOD 11-2 algorithm.
A structurally correct ORCID may still be invalid if the checksum does not match.
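To make the checksum step concrete, here is a sketch of the ISO 7064 MOD 11-2 calculation as it is commonly described for ORCID iDs. The function names are hypothetical and the helpers accept a single string only; this illustrates the arithmetic and is not a description of how scholid implements it.

```r
# Hypothetical helpers for illustration; scalar input only.
orcid_check_char <- function(x) {
  compact <- gsub("-", "", x, fixed = TRUE)
  # The first 15 characters are the base digits
  digits <- as.integer(strsplit(substr(compact, 1, 15), "")[[1]])
  total <- 0
  for (d in digits) {
    total <- (total + d) * 2
  }
  result <- (12 - (total %% 11)) %% 11
  # A result of 10 is written as "X"
  if (result == 10) "X" else as.character(result)
}

orcid_checksum_ok <- function(x) {
  compact <- gsub("-", "", x, fixed = TRUE)
  # Require the structural form first, then compare the check character
  if (!grepl("^\\d{15}[\\dX]$", compact, perl = TRUE)) return(FALSE)
  identical(orcid_check_char(compact), substr(compact, 16, 16))
}

orcid_check_char("0000-0002-1825-0097")   # "7"
orcid_checksum_ok("0000-0002-1825-0097")  # TRUE
orcid_checksum_ok("0000-0002-1825-0098")  # FALSE: well-formed, wrong checksum
```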
Hyphenated form:
^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$
Unhyphenated internal form:
^\d{15}[\dX]$
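The two patterns can be combined into a single structural test that accepts either form. The sketch below assumes a hypothetical helper name, `looks_like_orcid()`, and checks form only, not the checksum.

```r
# Hypothetical helper for illustration; structural check only.
orcid_hyphenated <- "^\\d{4}-\\d{4}-\\d{4}-\\d{3}[\\dX]$"
orcid_compact    <- "^\\d{15}[\\dX]$"

looks_like_orcid <- function(x) {
  grepl(orcid_hyphenated, x, perl = TRUE) |
    grepl(orcid_compact, x, perl = TRUE)
}

looks_like_orcid(c("0000-0002-1825-0097", "0000000218250097",
                   "0000-0002-1825-009", "0000-0002-1825-009x"))
# TRUE TRUE FALSE FALSE
```

The last input fails because the patterns accept only an uppercase X in the final position.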
Governing body: International ISBN Agency
Standard: ISO 2108
ISBN-10 has 10 characters; the final character may be a digit or X.
Example:
0306406152
030640615X
ISBN-13 has 13 digits.
Example:
9780306406157
ISBN-10:
^\d{9}[\dX]$
ISBN-13:
^\d{13}$
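As an illustration, the two patterns can be used to report which ISBN form a string matches. The function name `isbn_form()` is hypothetical, and the sketch assumes input without hyphens or spaces, as the patterns above do.

```r
# Hypothetical helper for illustration; structural check only.
isbn_form <- function(x) {
  out <- rep(NA_character_, length(x))
  out[grepl("^\\d{9}[\\dX]$", x, perl = TRUE)] <- "ISBN-10"
  out[grepl("^\\d{13}$", x, perl = TRUE)] <- "ISBN-13"
  out
}

isbn_form(c("0306406152", "030640615X", "9780306406157", "978-0-306-40615-7"))
# "ISBN-10" "ISBN-10" "ISBN-13" NA
```

The hyphenated input returns `NA` because neither pattern allows separators; a caller would need to strip them first.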
Governing body: ISSN International Centre
Standard: ISO 3297
An ISSN has 8 characters, written as two groups of four separated by a hyphen:
1234-567X
Compact form (without the hyphen):
1234567X
Hyphenated:
^\d{4}-\d{3}[\dX]$
Compact form:
^\d{7}[\dX]$
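A sketch of how both patterns might be used to normalize an ISSN to its compact form, returning `NA` for strings that match neither. The helper name `issn_to_compact()` is hypothetical.

```r
# Hypothetical helper for illustration; structural check only.
issn_to_compact <- function(x) {
  ok <- grepl("^\\d{4}-\\d{3}[\\dX]$", x, perl = TRUE) |
    grepl("^\\d{7}[\\dX]$", x, perl = TRUE)
  # Remove the single hyphen, then blank out anything that did not match
  out <- sub("-", "", x, fixed = TRUE)
  out[!ok] <- NA_character_
  out
}

issn_to_compact(c("1234-567X", "1234567X", "12345-67X"))
# "1234567X" "1234567X" NA
```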
Authority: arXiv (Cornell University)
New-style identifiers have the form:
YYMM.NNNN
YYMM.NNNNN
Optional version suffix (vN), for example:
YYMM.NNNNv2
Components:
- 4-digit year/month
- Dot
- 4–5 digit submission number
- Optional version vN
Structural regex:
^\d{4}\.\d{4,5}(v\d+)?$
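The components can be pulled apart once the structural pattern matches. The sketch below uses a hypothetical function, `parse_arxiv_new()`, that takes a single string. The pattern does not check that the month is in the range 01 to 12.

```r
# Hypothetical helper for illustration; scalar input only.
arxiv_new_pattern <- "^\\d{4}\\.\\d{4,5}(v\\d+)?$"

parse_arxiv_new <- function(x) {
  # Return NULL when the structural pattern does not match
  if (!grepl(arxiv_new_pattern, x, perl = TRUE)) return(NULL)
  has_version <- grepl("v\\d+$", x, perl = TRUE)
  core <- sub("v\\d+$", "", x, perl = TRUE)
  list(
    yymm    = substr(core, 1, 4),
    number  = substr(core, 6, nchar(core)),
    version = if (has_version) {
      as.integer(sub("^.*v", "", x, perl = TRUE))
    } else {
      NA_integer_
    }
  )
}

str(parse_arxiv_new("2101.00001v2"))
```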
Old-style identifiers have the form:
archive/YYMMNNN
Example:
hep-th/9901001
Structural regex:
^[a-z\-]+/\d{7}(v\d+)?$
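A short sketch of applying the old-style pattern, using a hypothetical helper name:

```r
# Hypothetical helper for illustration; structural check only.
arxiv_old_pattern <- "^[a-z\\-]+/\\d{7}(v\\d+)?$"

looks_like_arxiv_old <- function(x) {
  grepl(arxiv_old_pattern, x, perl = TRUE)
}

looks_like_arxiv_old(c("hep-th/9901001", "hep-th/9901001v3",
                       "hep-th/990100", "math.GT/0309136"))
# TRUE TRUE FALSE FALSE
```

The last input is rejected because this pattern admits only lowercase letters and hyphens before the slash, so identifiers that carry a subject class such as `math.GT` would need a broader character class.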
Authority: U.S. National Library of Medicine
Example:
12345678
Structural regex:
^\d+$
Authority: PubMed Central
Example:
PMC1234567
Components:
- Literal prefix PMC
- One or more digits
Structural regex:
^PMC\d+$
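The PMID and PMCID patterns can be sketched side by side. The helper names are hypothetical and both checks are purely structural.

```r
# Hypothetical helpers for illustration; structural checks only.
looks_like_pmid  <- function(x) grepl("^\\d+$", x, perl = TRUE)
looks_like_pmcid <- function(x) grepl("^PMC\\d+$", x, perl = TRUE)

looks_like_pmid(c("12345678", "PMC1234567", "1234 5678"))
# TRUE FALSE FALSE

looks_like_pmcid(c("PMC1234567", "pmc1234567", "PMC"))
# TRUE FALSE FALSE

# The PMID pattern accepts any digit string, including an ISBN-13
looks_like_pmid("9780306406157")
# TRUE
```

Because the PMID pattern is so permissive, a structural match alone cannot tell a PMID apart from other all-digit identifiers; the caller has to know which identifier type is expected.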