Computes a symmetric set difference between two corpora (or coercible data
frames) using the shared DOI-then-title matching engine. This generalises
sm_diff_corpora(), which compares strictly by internal work_id:
sm_reconcile() instead matches on content (normalised DOI with a fuzzy
title fallback), so it works across corpora from different sources that do
not share identifiers.
Usage
sm_reconcile(
  corpus_a,
  corpus_b,
  match = c("doi_then_title", "doi", "title"),
  threshold = 0.9,
  call = rlang::caller_env()
)
# S3 method for class 'sm_reconciliation'
print(x, ...)
# S3 method for class 'sm_reconciliation'
summary(object, ...)
# S3 method for class 'sm_reconciliation'
autoplot(object, ...)
Arguments
- corpus_a, corpus_b
An sm_corpus or a data frame with DOI and/or title columns (see sm_coverage_audit() for accepted column aliases).
- match
Matching strategy: "doi_then_title" (default), "doi", or "title".
- threshold
Minimum Jaro-Winkler title similarity, in [0, 1], required to accept a title match (default 0.9).
- call
Caller environment for error reporting.
- x
An sm_reconciliation object.
- ...
Ignored.
- object
An sm_reconciliation object.
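The match and threshold arguments describe a two-pass strategy: compare normalised DOIs first, then fall back to fuzzy title similarity. The following is a minimal base-R sketch of that idea, not the package's implementation. The helpers normalise_doi() and title_similarity() are hypothetical, the DOI normalisation rules are an assumption, and because base R has no Jaro-Winkler function the sketch substitutes a normalised Levenshtein similarity computed with utils::adist().

```r
# Hypothetical helper: reduce a DOI to a comparable key
# (lower-case, resolver prefix stripped). The exact rules are an assumption.
normalise_doi <- function(doi) {
  doi <- tolower(trimws(doi))
  sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", doi)
}

# Stand-in for Jaro-Winkler: 1 - Levenshtein distance / longer title length.
title_similarity <- function(x, y) {
  x <- tolower(trimws(x))
  y <- tolower(trimws(y))
  d <- utils::adist(x, y)[1, 1]
  1 - d / max(nchar(x), nchar(y), 1)
}

a <- data.frame(
  id = c("a1", "a2", "a3"),
  doi = c("10.1000/ABC", NA, NA),
  title = c("Graph methods", "Citation network analysis", "Unrelated work"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  id = c("b1", "b2"),
  doi = c("https://doi.org/10.1000/abc", NA),
  title = c("Graph methods (reprint)", "Citation network analysys"),
  stringsAsFactors = FALSE
)

threshold <- 0.9
found <- list()

for (i in seq_len(nrow(a))) {
  # Pass 1: exact match on the normalised DOI.
  hit <- which(!is.na(a$doi[i]) & !is.na(b$doi) &
    normalise_doi(b$doi) == normalise_doi(a$doi[i]))
  type <- "doi"
  score <- 1

  # Pass 2: best title similarity, accepted only at or above the threshold.
  if (length(hit) == 0) {
    sims <- vapply(
      b$title,
      function(t) title_similarity(a$title[i], t),
      numeric(1)
    )
    best <- which.max(sims)
    if (sims[[best]] >= threshold) {
      hit <- best
      type <- "title"
      score <- sims[[best]]
    }
  }

  if (length(hit) > 0) {
    found[[length(found) + 1]] <- data.frame(
      a_id = a$id[i], b_id = b$id[hit[1]],
      match_type = type, match_score = score,
      stringsAsFactors = FALSE
    )
  }
}

matches <- do.call(rbind, found)
only_a <- a[!a$id %in% matches$a_id, ]
only_b <- b[!b$id %in% matches$b_id, ]
```

In this toy data, a1 pairs with b1 by DOI even though the titles differ, a2 pairs with b2 on title alone, and a3 is left in only_a. The sketch is deliberately greedy: it does not stop one record in b from being claimed by several records in a, which a real matching engine would need to handle.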
Value
An sm_reconciliation S3 object with components:
- in_both
Tibble of matched pairs: a_id, b_id, title, match_type, match_score.
- only_a
Tibble of corpus_a records absent from corpus_b (id, doi, title, year).
- only_b
Tibble of corpus_b records absent from corpus_a.
- matches
Match provenance tibble (a_id, b_id, match_type, match_score), the same shape as sm_coverage_audit()'s.
- summary
One-row tibble with counts.
print returns x invisibly.
summary returns the one-row summary tibble.
autoplot returns a ggplot set-size bar chart.
Examples
a <- sm_example_corpus(n_works = 20, seed = 1)
b <- sm_example_corpus(n_works = 20, seed = 1)[5:20]
rec <- sm_reconcile(a, b)
rec
#>
#> ── <sm_reconciliation> ─────────────────────────────────────────────────────────
#> A: 20 records B: 16 records
#> In both: 16 Only A: 4 Only B: 0
#> Jaccard overlap: 0.8
#> Matched via: 16 DOI, 0 title
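The Jaccard overlap in the printed output is consistent with the usual definition, the size of the intersection divided by the size of the union. The short check below reproduces it from the three printed counts, assuming that is how the package derives the figure (the source does not state the formula).

```r
# Counts copied from the printed reconciliation above.
in_both <- 16
only_a <- 4
only_b <- 0

# Standard Jaccard index: |A and B| / |A or B|.
# Assumption: the package uses this definition.
jaccard <- in_both / (in_both + only_a + only_b)
jaccard
```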