Parse PubMed XML export files (NCBI DTD format) into an sm_corpus
object. Extracts article metadata from PubmedArticle nodes including
PMID, title, abstract, journal, authors, MeSH terms, DOI, and
publication type.
Usage
sm_read_pubmed_xml(
  path,
  encoding = "UTF-8",
  engine = c("native", "bibliometrix", "auto"),
  verbose = TRUE,
  call = rlang::caller_env()
)
Arguments
- path
Character scalar. Path to a PubMed XML file.
- encoding
Character scalar. File encoding (default "UTF-8").
- engine
Character scalar. One of "native" (built-in parser using xml2), "bibliometrix" (delegate to bibliometrix::convert2df()), or "auto" (try bibliometrix first, fall back to native).
- verbose
Logical. Print progress messages?
- call
Caller environment for error reporting.
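The engine choices described above could be dispatched roughly as follows. This is a minimal sketch, not the package's actual code: dispatch_engine(), native_parser, and bibliometrix_parser are hypothetical names, and it assumes that "auto" falls back to the native parser whenever the bibliometrix route raises an error.

```r
# Hypothetical sketch of engine dispatch; not the package's actual code.
# `native_parser` and `bibliometrix_parser` are placeholder functions
# supplied by the caller, each taking a file path.
dispatch_engine <- function(path,
                            engine = c("native", "bibliometrix", "auto"),
                            native_parser,
                            bibliometrix_parser) {
  # match.arg() validates the value and picks the first choice
  # ("native") when `engine` is left at its default vector.
  engine <- match.arg(engine)
  switch(engine,
    native = native_parser(path),
    bibliometrix = bibliometrix_parser(path),
    # Assumption: any error from the bibliometrix route triggers the fallback.
    auto = tryCatch(
      bibliometrix_parser(path),
      error = function(e) native_parser(path)
    )
  )
}
```

Passing the parsers in as arguments keeps the sketch free of package dependencies; a real implementation would more likely call its internal parsers directly.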
Value
An sm_corpus object.
Implementation
The native parser uses xml2::read_xml() to read the file, which is expected
to follow the NCBI PubMed XML DTD. Each PubmedArticle element is processed to extract:
- MedlineCitation/PMID for the PubMed identifier
- MedlineCitation/Article/ArticleTitle for the title
- MedlineCitation/Article/Abstract/AbstractText for the abstract (multiple sections are concatenated)
- MedlineCitation/Article/Journal/Title and /ISSN for the journal
- MedlineCitation/Article/Journal/JournalIssue/PubDate/Year for year
- MedlineCitation/Article/AuthorList/Author for authors
- MedlineCitation/MeshHeadingList/MeshHeading for MeSH terms
- PubmedData/ArticleIdList/ArticleId[@IdType='doi'] for DOI
- MedlineCitation/Article/PublicationTypeList for document type
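The XPath lookups listed above can be illustrated with xml2 on a tiny inline record. This is a hedged sketch rather than the package's parser: the sample XML values are made up, the column names are hypothetical, and only a subset of the fields is extracted.

```r
library(xml2)

# Tiny inline record in the PubMed layout; all values are made up.
sample_xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal>
          <ISSN>0000-0000</ISSN>
          <JournalIssue><PubDate><Year>2020</Year></PubDate></JournalIssue>
          <Title>Example Journal</Title>
        </Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">12345</ArticleId>
        <ArticleId IdType="doi">10.0000/example</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(sample_xml)
articles <- xml_find_all(doc, "//PubmedArticle")

# One value per article; xml_find_first() yields a missing node (text NA)
# when an article lacks the element.
first_text <- function(nodes, xpath) xml_text(xml_find_first(nodes, xpath))

# Join every AbstractText section of one article into a single string
# (an article with no abstract yields an empty string).
abstract_of <- function(node) {
  parts <- xml_text(xml_find_all(node, "MedlineCitation/Article/Abstract/AbstractText"))
  paste(parts, collapse = " ")
}

result <- data.frame(
  pmid     = first_text(articles, "MedlineCitation/PMID"),
  title    = first_text(articles, "MedlineCitation/Article/ArticleTitle"),
  abstract = vapply(seq_along(articles),
                    function(i) abstract_of(articles[[i]]), character(1)),
  journal  = first_text(articles, "MedlineCitation/Article/Journal/Title"),
  issn     = first_text(articles, "MedlineCitation/Article/Journal/ISSN"),
  year     = first_text(articles,
                        "MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"),
  doi      = first_text(articles,
                        "PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
  stringsAsFactors = FALSE
)
```

Each row of result corresponds to one PubmedArticle element. Authors, MeSH terms, and publication types are repeated elements, so they would need the same collect-and-join treatment as the abstract.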
References
Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. doi:10.1016/j.joi.2017.08.007