Auto-detect bibliographic file format and read the file

Detect the bibliographic file format from the file extension and content, then dispatch to the appropriate reader function.

Usage

sm_read_auto(
  path,
  encoding = "UTF-8",
  engine = c("native", "bibliometrix", "auto"),
  verbose = TRUE,
  call = rlang::caller_env()
)

Arguments

path: Character scalar. Path to a bibliographic file.
encoding: Character scalar. File encoding (default "UTF-8").
engine: Character scalar. One of "native" (built-in parser), "bibliometrix" (delegate to bibliometrix::convert2df()), or "auto" (try bibliometrix first, then fall back to the native parser). The value is passed through to the selected reader and is ignored for formats without engine support (OpenAlex JSON, Zotero, EndNote XML).
verbose: Logical. Print progress messages?
call: Caller environment for error reporting.

Value

An sm_corpus object.

Implementation

Format detection proceeds in two stages:

1. Extension-based: .bib (BibTeX), .ris (RIS), .json/.jsonl (OpenAlex JSON), .xml (PubMed XML or EndNote XML).
2. Content-based: For .csv, .tsv, and .txt files, the first few lines are inspected for format-specific signatures:
- WoS plaintext: begins with FN or PT tags
- Scopus CSV: contains EID column header
- Lens CSV: contains Lens ID column header
- Dimensions CSV: contains Dimensions ID or PubYear header
- Cochrane CSV: header contains Cochrane, or the content has a record-like structure
- Zotero CSV: contains Key and Item Type columns
- RIS: RIS-formatted content in a file without a .ris extension
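
The two stages above can be sketched in base R as follows. This is an illustrative sketch only, not the package's implementation: the function name detect_format_sketch(), the returned labels, the number of lines inspected, and the exact regular expressions are all assumptions chosen to mirror the signatures listed above. The "record-like structure" check for Cochrane files is not specified here and is left out.

```r
# Hypothetical helper illustrating two-stage format detection.
# Names, labels, and patterns are assumptions, not the package's actual rules.
detect_format_sketch <- function(path, n_lines = 5L) {
  ext <- tolower(tools::file_ext(path))

  # Stage 1: extensions that identify the format (or format family) directly.
  by_ext <- c(bib = "bibtex", ris = "ris", json = "openalex_json",
              jsonl = "openalex_json", xml = "xml")
  if (ext %in% names(by_ext)) {
    return(by_ext[[ext]])
  }

  # Stage 2: ambiguous extensions are resolved from the first few lines.
  if (ext %in% c("csv", "tsv", "txt")) {
    head_lines <- readLines(path, n = n_lines, warn = FALSE)
    has <- function(pattern, fixed = FALSE) {
      any(grepl(pattern, head_lines, fixed = fixed, useBytes = TRUE))
    }

    # WoS plaintext: the first line starts with an FN or PT tag.
    if (length(head_lines) > 0L &&
        grepl("^(FN|PT) ", head_lines[1], useBytes = TRUE)) {
      return("wos_plaintext")
    }
    if (has("\\bEID\\b")) return("scopus_csv")
    if (has("Lens ID", fixed = TRUE)) return("lens_csv")
    if (has("Dimensions ID|PubYear")) return("dimensions_csv")
    if (has("Cochrane", fixed = TRUE)) return("cochrane_csv")
    if (has("\\bKey\\b") && has("Item Type", fixed = TRUE)) return("zotero_csv")
    # RIS records start with a type tag of the form "TY  - ".
    if (has("^TY  - ")) return("ris")
  }

  "unknown"
}
```

The order of the checks matters in a sketch like this: the more specific signatures are tested first so that a generic pattern does not claim a file that a more specific one would have matched.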

For XML files, the root element or DTD is inspected to distinguish PubMed XML (PubmedArticleSet or PubmedArticle) from EndNote XML (xml/records or records).
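
A minimal base-R sketch of the XML check, assuming that reading the first few lines is enough to see either the DOCTYPE declaration or the root start tag. The function name detect_xml_flavour_sketch(), the n_lines cutoff, and the returned labels are hypothetical; the package may well use a proper XML parser instead of regular expressions.

```r
# Hypothetical helper: distinguish PubMed XML from EndNote XML by root element.
# Regex-based for illustration only; not the package's actual implementation.
detect_xml_flavour_sketch <- function(path, n_lines = 20L) {
  txt <- paste(readLines(path, n = n_lines, warn = FALSE), collapse = "\n")

  # A DOCTYPE declaration, when present, names the root element.
  doctype <- regmatches(
    txt, regexpr("<!DOCTYPE\\s+[A-Za-z_][A-Za-z0-9_.-]*", txt)
  )
  # Otherwise use the first start tag. Declarations, processing instructions,
  # and comments begin with "<?" or "<!" and are therefore skipped.
  first_tag <- regmatches(txt, regexpr("<[A-Za-z_][A-Za-z0-9_.-]*", txt))

  root <- if (length(doctype) == 1L) {
    sub("^<!DOCTYPE\\s+", "", doctype)
  } else if (length(first_tag) == 1L) {
    sub("^<", "", first_tag)
  } else {
    ""
  }

  if (root %in% c("PubmedArticleSet", "PubmedArticle")) {
    "pubmed_xml"
  } else if (root == "records" ||
             (root == "xml" && grepl("<records", txt, fixed = TRUE))) {
    # EndNote exports wrap <records> in an outer <xml> element.
    "endnote_xml"
  } else {
    "unknown_xml"
  }
}
```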

References

Aria, M. & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975. doi:10.1016/j.joi.2017.08.007

Examples

if (FALSE) { # \dontrun{
corpus <- sm_read_auto("references.bib")
corpus <- sm_read_auto("exported_data.csv")
} # }