Constructs an undirected co-word network. Nodes represent terms and an edge connects two terms if they co-occur within the same work. Edge weight equals the number of works in which both terms appear.
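The weighting rule above can be illustrated with a minimal base R sketch. It is not the package's implementation: the `work_terms` data frame and its column names are hypothetical, and the cross-product approach is just one way to count works in which two terms appear together.

```r
# Toy data: one row per (work, term) pair. Names are hypothetical.
work_terms <- data.frame(
  work_id = c("w1", "w1", "w1", "w2", "w2", "w3", "w3"),
  term    = c("gene expression", "machine learning", "drug resistance",
              "gene expression", "machine learning",
              "gene expression", "drug resistance"),
  stringsAsFactors = FALSE
)

# Drop duplicates so a term counts at most once per work.
work_terms <- unique(work_terms)

# Work-by-term incidence matrix (entries are 0 or 1 after de-duplication).
incidence <- unclass(table(work_terms$work_id, work_terms$term))

# Term-by-term matrix: off-diagonal entries are the number of works in which
# both terms appear; the diagonal holds each term's document frequency.
cooc <- crossprod(incidence)

# Keep each unordered pair once (upper triangle) to get an undirected edge list.
idx <- which(upper.tri(cooc) & cooc > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(cooc)[idx[, 1]],
  to     = colnames(cooc)[idx[, 2]],
  weight = cooc[idx],
  stringsAsFactors = FALSE
)
edges
```

Taking only the upper triangle keeps the edge list undirected: each pair of terms appears once, regardless of order.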
Usage
sm_network_coword(
corpus,
field = c("concepts", "title", "abstract", "keywords"),
ngram = 1L,
min_freq = 5L,
stopwords = NULL,
call = rlang::caller_env()
)
Arguments
- corpus
An sm_corpus object.
- field
Character; the text field to extract terms from. One of "concepts" (default, uses corpus$concepts), "title", "abstract", or "keywords".
- ngram
Integer; for free-text fields ("title", "abstract"), the n-gram size to tokenise into. Defaults to 1L (unigrams).
- min_freq
Integer; minimum document frequency a term must have to be included. Defaults to 5L.
- stopwords
Character vector of additional stopwords to remove when tokenising free text. Combined with tidytext::stop_words. Set to NULL (default) for no additional stopwords.
- call
Caller environment for error reporting.
Value
A tidygraph::tbl_graph object (undirected). Nodes carry name
(the term) and freq (document frequency). Edges carry a weight column
(co-occurrence count).
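As a rough illustration of working with a graph of this shape, the sketch below builds a small stand-in tbl_graph by hand, using the node and edge columns described above, and keeps only the stronger edges. The toy terms, the counts, and the threshold of 4 are assumptions made for the example; in practice the graph would come from sm_network_coword().

```r
library(tidygraph)
library(dplyr)

# Hand-built stand-in for the returned graph (values are made up).
nodes <- data.frame(
  name = c("term a", "term b", "term c"),
  freq = c(10L, 8L, 6L)
)
edges <- data.frame(
  from   = c(1L, 1L, 2L),
  to     = c(2L, 3L, 3L),
  weight = c(5L, 2L, 4L)
)
g <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)

# Keep edges whose co-occurrence count reaches a chosen threshold.
strong <- g |>
  activate(edges) |>
  filter(weight >= 4L)

# Pull the node and edge tables out as tibbles for further analysis.
node_tbl <- g |> activate(nodes) |> as_tibble()
edge_tbl <- strong |> activate(edges) |> as_tibble()
edge_tbl
```

Filtering on edges leaves the node table untouched, so terms that lose all their edges remain in the graph as isolated nodes unless they are removed in a separate step.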
Details
When field = "concepts", terms come from the concept_name column in
corpus$concepts (no tokenisation needed).
When field is "title", "abstract", or "keywords", the text is
tokenised with tidytext::unnest_tokens() using the chosen ngram size,
stopwords are removed, and terms below min_freq are dropped.
Empty input returns an empty undirected tbl_graph.
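The free-text path described above can be approximated as follows. This is a sketch under assumptions, not the function's actual code: the `works` table, its column names, and the extra stopword are invented for illustration, and only the unigram case is shown.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per work with a free-text title.
works <- tibble::tibble(
  work_id = c("w1", "w2", "w3"),
  title   = c("Gene expression in tumor cells",
              "Tumor gene expression and the immune response",
              "Immune response of the tumor")
)

extra_stopwords <- c("cells")  # user-supplied additions (assumed)
min_freq <- 2L                 # minimum document frequency

# Tokenise into lower-cased unigrams, remove stopwords, and keep one row per
# (work, term) so each term counts at most once per work.
terms <- works |>
  unnest_tokens(term, title, token = "words") |>
  anti_join(stop_words, by = c("term" = "word")) |>
  filter(!term %in% extra_stopwords) |>
  distinct(work_id, term)

# Document frequency per term, then drop terms below the threshold.
doc_freq <- count(terms, term, name = "freq")
kept     <- filter(doc_freq, freq >= min_freq)
terms    <- semi_join(terms, kept, by = "term")
terms
```

For n-grams larger than one, unnest_tokens() accepts token = "ngrams" with an n argument; stopword handling for multi-word terms needs more care and is left out of this sketch.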
Examples
corpus <- sm_example_corpus()
g <- sm_network_coword(corpus, field = "concepts", min_freq = 5L)
g
#> # A tbl_graph: 10 nodes and 45 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 10 × 2 (active)
#> name freq
#> <chr> <int>
#> 1 biomarker discovery 74
#> 2 clinical outcomes 72
#> 3 colorectal cancer 82
#> 4 drug resistance 76
#> 5 gene expression 57
#> 6 immune checkpoint 79
#> 7 machine learning 66
#> 8 single-cell rna-seq 74
#> 9 spatial transcriptomics 67
#> 10 tumor microenvironment 58
#> #
#> # Edge Data: 45 × 3
#> from to weight
#> <int> <int> <int>
#> 1 1 2 21
#> 2 1 3 26
#> 3 1 4 30
#> # ℹ 42 more rows