A tested, extensible institution matcher for the authorships table. It
tags each authorship with the institution it belongs to, using a dictionary
of name variants (multilingual and synonym-aware) and an optional
email-domain fallback for records whose affiliation string is missing.
Because each authorship row is matched independently, authors with secondary or multiple affiliations are handled without special treatment.
Usage
sm_affiliation_match(
corpus,
patterns = sm_affiliation_dict,
fields = NULL,
email_domain_fallback = TRUE,
postcode_signal = FALSE,
call = rlang::caller_env()
)
Arguments
- corpus
An sm_corpus object.
- patterns
A dictionary of institution name variants. Either:
a named list mapping each canonical institution name to a character vector of case-insensitive regex variant patterns, or
a data frame with columns institution, pattern, and an optional email_domain (the form of the bundled sm_affiliation_dict).
Defaults to sm_affiliation_dict.
- fields
Character vector of authorships columns to search for affiliation text. Defaults to "raw_affiliation" (plus "email" if such a column exists). Values across multiple fields are concatenated per row.
- email_domain_fallback
Logical; if TRUE (default) and an authorship has no pattern match but does have an email address (in an email column), match the email's domain against the dictionary's email_domain entries.
- postcode_signal
Logical (default FALSE). When TRUE and the dictionary carries a postcode column, a postcode match is attempted as a last resort (lowest priority, after name tokens and email domains) so existing matches do not shift.
- call
Caller environment for error reporting.
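The two accepted shapes of patterns can be sketched in base R as follows. The institution names and regexes here are invented for illustration and are not entries of sm_affiliation_dict; the grepl() call only shows what "case-insensitive regex variant" means and is not the package's internal matching code.

```r
# Hypothetical dictionary in named-list form: canonical name -> regex variants.
patterns_list <- list(
  "Example University" = c("example univ(ersity)?", "universitaet beispiel"),
  "Sample Institute"   = c("sample inst(itute)?")
)

# The same dictionary in data-frame form (institution / pattern columns).
patterns_df <- data.frame(
  institution = rep(names(patterns_list), lengths(patterns_list)),
  pattern     = unlist(patterns_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Case-insensitive test of one affiliation string against every variant.
affiliation <- "Dept. of Physics, EXAMPLE UNIVERSITY, Springfield"
hits <- vapply(patterns_df$pattern, grepl, logical(1),
               x = affiliation, ignore.case = TRUE, perl = TRUE)
matched_institutions <- unique(patterns_df$institution[hits])
matched_institutions
```

Either object could then be passed as patterns; the data-frame form is the one that can also carry an email_domain column.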
Value
The corpus with its authorships table gaining (or having
updated) four columns:
- institution_match
Canonical institution name, or NA.
- match_method
"pattern", "email_domain", "postcode", or "none".
- match_signal
The signal that fired: "name_token", "email_domain", "postcode", or "none". Precedence (highest first): name token, email domain, postcode.
- match_evidence
The actual substring / domain / postcode that triggered the match (an audit trail), or NA.
Type-stable: a corpus with no authorships is still returned with the four
columns present (and 0 rows). See sm_affiliation_summary() for a tidy
breakdown.
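The documented precedence (name token, then email domain, then postcode) and the four output fields can be illustrated with a small base-R sketch. match_one(), result(), and the dictionary rows below are hypothetical and written only for this illustration; they are not the package's implementation and make no claim about how it handles edge cases.

```r
# Hypothetical dictionary; all rows are invented for this sketch.
dict <- data.frame(
  institution  = c("Example University", "Sample Institute"),
  pattern      = c("example univ(ersity)?", "sample inst(itute)?"),
  email_domain = c("example.edu", "sample.org"),
  postcode     = c("12345", "67890"),
  stringsAsFactors = FALSE
)

# Bundle the four documented output fields for one authorship row.
result <- function(institution, method, signal, evidence) {
  list(institution_match = institution, match_method = method,
       match_signal = signal, match_evidence = evidence)
}

# Illustrative matcher: try each signal in precedence order, stop at the first hit.
match_one <- function(affiliation, email = NA_character_, dict,
                      email_domain_fallback = TRUE, postcode_signal = FALSE) {
  text <- if (is.na(affiliation)) "" else affiliation

  # 1. Name token: case-insensitive regex over the affiliation text.
  for (i in seq_len(nrow(dict))) {
    m <- regexpr(dict$pattern[i], text, ignore.case = TRUE, perl = TRUE)
    if (as.integer(m) > 0) {
      return(result(dict$institution[i], "pattern", "name_token",
                    regmatches(text, m)))
    }
  }

  # 2. Email domain: only consulted when no pattern matched.
  if (email_domain_fallback && !is.na(email)) {
    domain <- tolower(sub("^.*@", "", email))
    i <- match(domain, tolower(dict$email_domain))
    if (!is.na(i)) {
      return(result(dict$institution[i], "email_domain", "email_domain", domain))
    }
  }

  # 3. Postcode: opt-in, lowest priority.
  if (postcode_signal && "postcode" %in% names(dict)) {
    for (i in seq_len(nrow(dict))) {
      code <- dict$postcode[i]
      if (!is.na(code) && grepl(code, text, fixed = TRUE)) {
        return(result(dict$institution[i], "postcode", "postcode", code))
      }
    }
  }

  result(NA_character_, "none", "none", NA_character_)
}

by_name  <- match_one("Dept. of Chemistry, Example University", dict = dict)
by_email <- match_one(NA, "a.b@sample.org", dict = dict)
no_match <- match_one("Main St 1, 12345 Town", dict = dict)
by_post  <- match_one("Main St 1, 12345 Town", dict = dict, postcode_signal = TRUE)
```

Because the postcode step runs only after the other two have failed, switching postcode_signal on in this sketch can turn a "none" into a match but cannot change a row that already matched by name or email, which mirrors the "existing matches do not shift" guarantee described above.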
Details
To extend the dictionary, append rows to sm_affiliation_dict (or build your
own data frame with the same columns) and pass it as patterns. For example:
rbind(sm_affiliation_dict, tibble::tibble(institution = "My Uni", pattern = "my university", email_domain = "myuni.edu"))
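A slightly fuller sketch of the same idea, runnable without the package: base_dict is a made-up stand-in for sm_affiliation_dict (its single row is invented, not a real dictionary entry), and several variants are added for one institution by letting data.frame() recycle the shorter columns.

```r
# Stand-in for the bundled dictionary so the snippet runs on its own;
# with the package loaded, use sm_affiliation_dict here instead.
base_dict <- data.frame(
  institution  = "Example University",
  pattern      = "example univ(ersity)?",
  email_domain = "example.edu",
  stringsAsFactors = FALSE
)

# Two name variants for one new institution; the scalar columns are recycled.
new_rows <- data.frame(
  institution  = "My Uni",
  pattern      = c("my university", "my uni\\b"),
  email_domain = "myuni.edu",
  stringsAsFactors = FALSE
)

extended <- rbind(base_dict, new_rows)

# Cheap sanity checks before handing the dictionary to the matcher.
stopifnot(identical(names(extended), names(base_dict)))
stopifnot(!anyDuplicated(extended[c("institution", "pattern")]))

# With the package loaded, the call would then be:
# matched <- sm_affiliation_match(corpus, patterns = extended)
```

Keeping the column names identical to the bundled dictionary is what lets rbind() line the rows up; a mismatch in names would make rbind() fail rather than silently misalign the columns.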
Examples
corpus <- sm_example_corpus(n_works = 5, n_authors = 5)
corpus$authorships$raw_affiliation[1] <- "Bundeswehrkrankenhaus Berlin"
matched <- sm_affiliation_match(corpus)
#> ✔ Affiliation matching flagged 1 authorship across 1 institution.
#> ℹ By signal: name_token: 1. See `sm_affiliation_summary()` for the full
#> breakdown.
matched$authorships$institution_match[1]
#> [1] "Bundeswehr Hospital"