Audits the geographic distribution of a corpus, tabulating the representation of countries, regions, or World Bank income tiers. The analysis can be weighted by work count, total citations, first-authorship, or corresponding authorship.

The result includes a Gini coefficient for concentration and a coverage metric (proportion of works with known geography).
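For intuition about the concentration measure, here is a minimal base R sketch of a Gini coefficient computed over per-group totals. It uses a standard textbook formula and a hypothetical helper named gini(); the source does not show how the package computes its own value, so treat this as an illustration rather than the package's implementation.

```r
# Gini coefficient of a non-negative numeric vector (textbook formula;
# not necessarily what sm_audit_geographic() uses internally).
gini <- function(x) {
  x <- sort(x[!is.na(x)])          # ascending order, missing values dropped
  n <- length(x)
  if (n == 0 || sum(x) == 0) return(NA_real_)
  # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

gini(c(10, 10, 10, 10))  # perfectly even across four groups: 0
gini(c(0, 0, 0, 40))     # everything in one of four groups: 0.75
```

Values near 0 indicate that groups are represented roughly equally; values approaching 1 indicate that a few groups dominate.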

Usage

sm_audit_geographic(
  corpus,
  by = c("country", "region", "income_tier"),
  weight = c("count", "citations", "first-author", "corresponding"),
  call = rlang::caller_env()
)

# S3 method for class 'sm_audit_geographic'
print(x, ...)

Arguments

corpus

An sm_corpus object.

by

Character. Grouping variable: "country" (ISO 3166-1 alpha-2 codes from authorships), "region" (from institutions table), or "income_tier" (World Bank income classification from institutions).

weight

Character. How to weight each work: "count" (one per work), "citations" (weighted by cited_by_count), "first-author" (only first-author affiliations), "corresponding" (only corresponding-author affiliations).

call

Caller environment for error reporting.

x

An sm_audit_geographic object to print.

...

Ignored.
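The difference between the weight options described above can be illustrated on a toy table. The sketch below is base R only and does not call the package. The authorships data frame, its column names (other than cited_by_count), and the rule that a work adds one unit to each country it is affiliated with are all assumptions made for illustration; the package's internal handling of multi-country works may differ.

```r
# Hypothetical authorship-level table; column names are illustrative only.
authorships <- data.frame(
  work_id        = c("W1", "W1", "W2", "W3", "W3"),
  country        = c("DE", "FI", "DE", "BR", "DE"),
  is_first       = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  cited_by_count = c(10, 10, 3, 7, 7),
  stringsAsFactors = FALSE
)

# One row per (work, country) pair, so a work is not double-counted
# when several of its authors share a country.
pairs <- unique(authorships[, c("work_id", "country", "cited_by_count")])

# "count": each work adds one unit to each of its countries (assumption).
count_tally <- table(pairs$country)

# "citations": each work adds its cited_by_count instead of one unit.
citation_tally <- tapply(pairs$cited_by_count, pairs$country, sum)

# "first-author": keep only first-author affiliations before counting.
first <- authorships[authorships$is_first, ]
first_tally <- table(first$country)

count_tally
citation_tally
first_tally
```

In this toy data FI appears under "count" and "citations" but drops out under "first-author", because its only authorship is not a first-author position.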

Value

An sm_audit_geographic S3 object containing:

distribution

Tibble with columns group, count, pct, citations.

gini

Gini coefficient of the distribution.

coverage

Proportion of works with at least one known geography.

by

The grouping variable used.

weight

The weighting method used.

The print method returns x invisibly.

Examples

corpus <- sm_example_corpus()
geo <- sm_audit_geographic(corpus)
print(geo)
#> 
#> ── <sm_audit_geographic> ───────────────────────────────────────────────────────
#> Grouping: country
#> Weighting: count
#> Coverage: 100% of authorships have known geography
#> Gini coefficient: 0.143
#> 
#> 
#> ── Distribution (top 10) 
#> DE: 77 (10.2%) [1204 cit.]
#> AU: 50 (6.62%) [727 cit.]
#> FI: 49 (6.49%) [771 cit.]
#> NO: 43 (5.7%) [524 cit.]
#> IT: 41 (5.43%) [571 cit.]
#> GB: 40 (5.3%) [572 cit.]
#> CA: 38 (5.03%) [647 cit.]
#> DK: 38 (5.03%) [576 cit.]
#> NL: 37 (4.9%) [530 cit.]
#> BR: 35 (4.64%) [556 cit.]
#> ... and 10 more
#> 
#> 
#> ── Limitations 
#> • Country codes are derived from author affiliation metadata, which may be
#> incomplete or inaccurate.
#> • Multi-country affiliations may be under- or over-counted depending on the
#> data source.
#> • Region and income tier classifications rely on institutional metadata which
#> may not be populated for all works.
#> • Geographic representation does not capture diaspora researchers or
#> researchers with affiliations in multiple countries.
#> • The Gini coefficient measures concentration but does not account for
#> population size or research funding differences.