Parse PubMed XML export files (NCBI DTD format) into an sm_corpus object. Extracts article metadata from PubmedArticle nodes including PMID, title, abstract, journal, authors, MeSH terms, DOI, and publication type.

Usage

sm_read_pubmed_xml(
  path,
  encoding = "UTF-8",
  engine = c("native", "bibliometrix", "auto"),
  verbose = TRUE,
  call = rlang::caller_env()
)

Arguments

path

Character scalar. Path to a PubMed XML file.

encoding

Character scalar. File encoding (default "UTF-8").

engine

Character scalar. One of "native" (built-in parser using xml2), "bibliometrix" (delegate to bibliometrix::convert2df()), or "auto" (try bibliometrix first, fall back to native).

verbose

Logical. Print progress messages?

call

Caller environment for error reporting.
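
The "auto" engine's try-first, fall-back-second behaviour can be illustrated with a minimal sketch in base R. This is not the package's actual implementation: read_with_fallback(), failing_engine(), and native_engine() are hypothetical names invented for this example, and the stand-in engines only mimic a failing delegate and a working built-in parser.

```r
# Hypothetical helper: try a primary engine, and use the fallback if it errors.
read_with_fallback <- function(path, primary, fallback, verbose = TRUE) {
  tryCatch(
    primary(path),
    error = function(e) {
      if (verbose) {
        message("Primary engine failed (", conditionMessage(e), "); falling back.")
      }
      fallback(path)
    }
  )
}

# Stand-in engines for demonstration only (assumptions, not real parsers).
failing_engine <- function(path) stop("engine unavailable")
native_engine <- function(path) paste("parsed", path, "natively")

result <- read_with_fallback(
  "pubmed_result.xml",
  primary = failing_engine,
  fallback = native_engine,
  verbose = FALSE
)
```

Catching the error with tryCatch() keeps the fallback decision in one place, so the caller gets a result from whichever engine succeeds.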

Value

An sm_corpus object.

Implementation

The native parser uses xml2::read_xml() to read files that follow the NCBI PubMed XML DTD. From each PubmedArticle element it extracts:

  • MedlineCitation/PMID for the PubMed identifier

  • MedlineCitation/Article/ArticleTitle for the title

  • MedlineCitation/Article/Abstract/AbstractText for the abstract (multiple sections are concatenated)

  • MedlineCitation/Article/Journal/Title and MedlineCitation/Article/Journal/ISSN for the journal

  • MedlineCitation/Article/Journal/JournalIssue/PubDate/Year for the publication year

  • MedlineCitation/Article/AuthorList/Author for authors

  • MedlineCitation/MeshHeadingList/MeshHeading for MeSH terms

  • PubmedData/ArticleIdList/ArticleId[@IdType='doi'] for the DOI

  • MedlineCitation/Article/PublicationTypeList for the document type
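
The extraction described in this list can be sketched with xml2 directly. The following is an illustrative example under stated assumptions, not the package's own parser: the inline XML is invented sample data, first_text() is a hypothetical helper, and the result is a plain data frame instead of an sm_corpus object. Only a subset of the fields is shown.

```r
library(xml2)

# Invented sample record in the PubMed XML layout (illustration only).
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal>
          <ISSN>0000-0000</ISSN>
          <JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">12345</ArticleId>
        <ArticleId IdType="doi">10.0000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(sample_xml)
articles <- xml_find_all(doc, ".//PubmedArticle")

# Hypothetical helper: text of the first matching node, NA if none exists.
first_text <- function(node, xpath) {
  xml_text(xml_find_first(node, xpath))
}

records <- lapply(articles, function(a) {
  # Structured abstracts have several AbstractText nodes; join them with spaces.
  abstract_parts <- xml_text(
    xml_find_all(a, "MedlineCitation/Article/Abstract/AbstractText")
  )
  data.frame(
    pmid = first_text(a, "MedlineCitation/PMID"),
    title = first_text(a, "MedlineCitation/Article/ArticleTitle"),
    abstract = paste(abstract_parts, collapse = " "),
    journal = first_text(a, "MedlineCitation/Article/Journal/Title"),
    year = first_text(a, "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"),
    doi = first_text(a, "PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
    stringsAsFactors = FALSE
  )
})

works <- do.call(rbind, records)
```

Because xml_find_first() returns a missing node when nothing matches, fields such as the DOI come back as NA for records that lack them, instead of raising an error.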

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. doi:10.1016/j.joi.2017.08.007

Examples

if (FALSE) { # \dontrun{
corpus <- sm_read_pubmed_xml("pubmed_result.xml")
corpus$works
} # }