Audit corpus coverage against a ground-truth reference
Source: R/coverage-audit.R
sm_coverage_audit.Rd
Computes recall, precision, and F1 of an sm_corpus against an external
ground-truth reference (a manual tracker, an ORCID works set, an
institutional-repository export, or another sm_corpus). This is the core
primitive for a coverage / completeness audit: "how much of what we should
have did we actually capture, and how much of what we captured is real?"
Matching is by normalised DOI with a fuzzy-title fallback (see
sm_reconcile() for the shared matching engine). Full match provenance is
retained so every decision can be inspected.
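The DOI-first, title-fallback idea can be sketched in base R. This is only an illustration, not the sm_reconcile() implementation: the helper names normalise_doi(), normalise_title(), and match_records() are hypothetical, the normalisation rules (lower-casing, stripping a doi.org prefix, collapsing punctuation) are assumptions, and the title step here is exact rather than fuzzy.

```r
# Hypothetical helpers; the package's real normalisation rules may differ.
normalise_doi <- function(x) {
  x <- tolower(trimws(x))
  x <- sub("^(https?://)?(dx\\.)?doi\\.org/", "", x)  # drop resolver prefix
  sub("^doi:\\s*", "", x)                             # drop "doi:" label
}

normalise_title <- function(x) {
  # Lower-case and collapse every run of non-alphanumerics to one space
  trimws(gsub("[^a-z0-9]+", " ", tolower(x)))
}

# For each corpus row: which reference row matched, and by which rule.
match_records <- function(corpus, reference) {
  doi_hit <- match(normalise_doi(corpus$doi), normalise_doi(reference$doi),
                   incomparables = c(NA, ""))
  title_hit <- match(normalise_title(corpus$title),
                     normalise_title(reference$title),
                     incomparables = c(NA, ""))
  data.frame(
    corpus_row = seq_len(nrow(corpus)),
    # Prefer the DOI match; fall back to the title match when there is none
    reference_row = ifelse(is.na(doi_hit), title_hit, doi_hit),
    match_type = ifelse(!is.na(doi_hit), "doi",
                        ifelse(!is.na(title_hit), "title", "none")),
    stringsAsFactors = FALSE
  )
}

# Toy data, invented for illustration only
corpus_df <- data.frame(
  doi = c("https://doi.org/10.1000/ABC", NA, "10.1000/zzz"),
  title = c("A Study of Things", "Another Paper: Results", "Unmatched work"),
  stringsAsFactors = FALSE
)
reference_df <- data.frame(
  doi = c("10.1000/abc", NA),
  title = c("A study of things", "Another paper - results"),
  stringsAsFactors = FALSE
)

res <- match_records(corpus_df, reference_df)
res
```

Recording the rule that produced each match alongside the matched row is what makes every decision inspectable afterwards.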
Usage
sm_coverage_audit(
corpus,
reference,
by = NULL,
match = c("doi_then_title", "doi", "title"),
threshold = 0.9,
index_table = NULL,
call = rlang::caller_env()
)
# S3 method for class 'sm_coverage'
print(x, ...)
# S3 method for class 'sm_coverage'
summary(object, ...)
# S3 method for class 'sm_coverage'
autoplot(object, dim = NULL, ...)
Arguments
- corpus
An sm_corpus object (the corpus under audit).
- reference
The ground truth: an sm_corpus or a data frame with at least a DOI and/or title column (column names matched case-insensitively against doi/di and title/ti/display_name; an id/work_id/reference_id column is used as the identifier if present).
- by
Optional character vector of breakdown dimensions, any of "year", "source", "affiliation". Per-slice recall is reported for each supplied dimension.
- match
Matching strategy: "doi_then_title" (default), "doi", or "title".
- threshold
Minimum Jaro-Winkler title similarity ([0, 1]) to accept a title match. Default 0.9. When the optional stringdist package is not installed, the title fallback degrades to normalised exact matching.
- index_table
Optional journal index master list (same contract as sm_journal_in_index(); see that function for the expected columns). When supplied, the audit additionally reports, per corpus record, whether its journal is indexable, so that "is this journal in the index" and "is this record captured" are assessed together. When NULL (default), no indexability columns or summary are added.
- call
Caller environment for error reporting.
- x
An sm_coverage object.
- ...
Ignored.
- object
An sm_coverage object.
- dim
Which breakdown dimension to plot (defaults to the first available). If no breakdowns exist, a recall/precision summary bar is drawn.
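How the threshold argument and the optional stringdist dependency interact can be sketched as follows. The helper titles_match() is hypothetical, its normalisation is an assumption, and the Winkler prefix weight p = 0.1 is a conventional choice rather than a value confirmed by the package.

```r
# Hypothetical helper: TRUE where two titles are similar enough to match.
titles_match <- function(a, b, threshold = 0.9) {
  norm <- function(x) trimws(gsub("[^a-z0-9]+", " ", tolower(x)))
  a <- norm(a)
  b <- norm(b)
  if (requireNamespace("stringdist", quietly = TRUE)) {
    # Jaro-Winkler similarity in [0, 1]; p is the Winkler prefix weight
    # (assumed value, not taken from the package)
    stringdist::stringsim(a, b, method = "jw", p = 0.1) >= threshold
  } else {
    # Degraded mode without stringdist: normalised exact matching only
    a == b
  }
}

same <- titles_match("Deep Learning for Text", "deep learning: for text")
different <- titles_match("Deep learning", "Quantum chemistry")
```

Titles that are identical after normalisation match in both modes, so the degraded mode only loses the near-miss matches that fuzzy scoring would have accepted.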
Value
An sm_coverage S3 object (a list) with components:
- recall
Matched reference records / total reference records.
- precision
Matched corpus records / total corpus records.
- f1
Harmonic mean of recall and precision.
- n_corpus, n_reference, n_matched
Integer counts.
- n_corpus_only, n_reference_only
Unmatched counts.
- matches
Tibble with one row per corpus record: corpus_id, reference_id, match_type ("doi"/"title"/"none"), match_score. When index_table is supplied, also issn, indexable (logical), and indexed_title.
- corpus_only
Tibble of corpus records absent from the reference.
- reference_only
Tibble of reference records absent from the corpus.
- breakdowns
A single flat tibble (dimension, level, n_reference, n_matched, recall) across all by dimensions. Use sm_coverage_breakdowns() to access/filter it.
- breakdowns_nested
The legacy named list of per-dimension tibbles (slice, n_reference, n_matched, recall). Retained for one release; prefer breakdowns.
- indexability
When index_table is supplied, a summary tibble of indexable vs non-indexable record counts; otherwise NULL.
print returns x invisibly.
summary returns a one-row tibble of headline metrics.
autoplot returns a ggplot object.
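The headline metrics follow directly from the counts defined above. The sketch below is plain arithmetic rather than the package's code: coverage_metrics() is a hypothetical helper, and returning an F1 of 0 when both recall and precision are 0 is an assumed convention.

```r
# Hypothetical helper: headline metrics from raw counts.
coverage_metrics <- function(n_matched_reference, n_reference,
                             n_matched_corpus, n_corpus) {
  recall <- n_matched_reference / n_reference
  precision <- n_matched_corpus / n_corpus
  # Harmonic mean; defined here as 0 when both inputs are 0 (assumption)
  f1 <- if (recall + precision > 0) {
    2 * recall * precision / (recall + precision)
  } else {
    0
  }
  c(recall = recall, precision = precision, f1 = f1)
}

# Counts from the example on this page: 25/25 reference records found,
# 25/30 corpus records present in the reference
m <- coverage_metrics(25, 25, 25, 30)
round(m, 4)  # recall 1, precision 0.8333, f1 0.9091
```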
Examples
corpus <- sm_example_corpus(n_works = 30, seed = 1)
# Pretend the reference is the corpus minus a few works
ref <- corpus$works[1:25, c("work_id", "doi", "title", "year")]
cov <- sm_coverage_audit(corpus, ref, by = "year")
cov
#>
#> ── <sm_coverage> ───────────────────────────────────────────────────────────────
#> Match strategy: doi_then_title (title threshold 0.9)
#> Recall: 1 (25/25 reference records found)
#> Precision: 0.8333 (25/30 corpus records in reference)
#> F1: 0.9091
#>
#> Corpus-only: 5 Reference-only: 0
#>
#>
#> ── Worst-covered slices by year
#> 2015: recall 1 (1/1)
#> 2016: recall 1 (2/2)
#> 2017: recall 1 (2/2)
#> 2018: recall 1 (2/2)
#> 2020: recall 1 (6/6)
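A per-slice recall table like the one printed above can be approximated in base R from a reference table that carries a matched flag. The data below are invented for illustration, and the column layout simply mirrors the documented breakdowns tibble (dimension, level, n_reference, n_matched, recall); it is not the package's own code.

```r
# Toy reference records with a flag saying whether each one was matched
reference <- data.frame(
  year = c(2015, 2016, 2016, 2017, 2017, 2017),
  matched = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Count reference records and matched records within each year
n_reference <- tapply(reference$matched, reference$year, length)
n_matched <- tapply(reference$matched, reference$year, sum)

breakdown <- data.frame(
  dimension = "year",
  level = names(n_reference),
  n_reference = as.integer(n_reference),
  n_matched = as.integer(n_matched),
  stringsAsFactors = FALSE
)
breakdown$recall <- breakdown$n_matched / breakdown$n_reference

# Worst-covered slices first
breakdown[order(breakdown$recall), ]
```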