K-means clustering of works — sm_cluster_kmeans • scimapR

Clusters works with K-means via stats::kmeans(), optionally reducing the embedding dimensions first with UMAP or PCA.

Usage

sm_cluster_kmeans(
  corpus,
  k,
  reducer = c("umap", "pca", "none"),
  n_components = 5L,
  call = rlang::caller_env()
)

Arguments

corpus: An sm_corpus object with embeddings.
k: Integer; the number of clusters. Required.
reducer: Character; dimensionality reduction method to apply before clustering. One of "umap" (default), "pca", or "none".
n_components: Integer; number of dimensions to reduce to. Defaults to 5L.
call: Caller environment for error reporting.

Value

The input corpus with a cluster_id column added to corpus$works.

Details

Requires embeddings in corpus$embeddings. Compute them first with sm_embed_works(), or load cached embeddings with sm_embed_load().
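
The reduce-then-cluster flow this function describes can be sketched with base R alone. This is a minimal illustration under stated assumptions, not the package's internal code: the matrix emb is synthetic stand-in data for corpus$embeddings, the works data frame is hypothetical, and PCA via stats::prcomp() stands in for the reducer = "pca" path.

```r
# Synthetic stand-in for an embedding matrix: 200 works x 32 dimensions.
# (Assumption: real embeddings would come from corpus$embeddings.)
set.seed(1)
emb <- matrix(rnorm(200 * 32), nrow = 200)

# Reduce to 5 principal components, analogous to
# reducer = "pca", n_components = 5L.
pca <- stats::prcomp(emb, center = TRUE, scale. = FALSE)
reduced <- pca$x[, 1:5, drop = FALSE]

# Cluster the reduced coordinates into k = 5 groups.
# nstart and iter.max are raised here only to make the sketch robust.
fit <- stats::kmeans(reduced, centers = 5, nstart = 10, iter.max = 50)

# One integer label per work, comparable to a cluster_id column.
works <- data.frame(work = seq_len(nrow(emb)), cluster_id = fit$cluster)
table(works$cluster_id)
```

Clustering on a handful of components rather than the full embedding keeps the Euclidean distances that K-means relies on from being dominated by noise dimensions.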

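stats::kmeans() chooses its initial cluster centers at random, so cluster assignments can differ between runs. The result is deterministic only for a fixed random seed; call set.seed() before this function to make it reproducible.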
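
The effect of the seed can be checked directly on stats::kmeans(). The sketch below uses a synthetic matrix (emb, a hypothetical stand-in for real embeddings) and does not call sm_cluster_kmeans() itself; it only shows that resetting the same seed before each fit reproduces the assignments.

```r
# Synthetic stand-in for an embedding matrix: 200 works x 16 dimensions.
set.seed(1)
emb <- matrix(rnorm(200 * 16), nrow = 200)

# Reset the same seed before each call so both fits draw the same
# initial centers.
set.seed(42)
fit_a <- stats::kmeans(emb, centers = 5, iter.max = 50)
set.seed(42)
fit_b <- stats::kmeans(emb, centers = 5, iter.max = 50)

# TRUE when both runs assign every work to the same cluster.
identical(fit_a$cluster, fit_b$cluster)
```

Without the second set.seed() call, the two fits would start from different random centers and could label the works differently.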

See also

Other clustering: sm_cluster_evolution(), sm_cluster_hdbscan(), sm_cluster_label(), sm_cluster_leiden()

Examples

# \donttest{
corpus <- sm_example_corpus(with_embeddings = TRUE)
corpus <- sm_cluster_kmeans(corpus, k = 5)
#> ✔ K-means clustering complete.
#> ℹ 5 clusters, sizes range from 27 to 49.
table(corpus$works$cluster_id)
#> 
#>  1  2  3  4  5 
#> 40 27 39 45 49 
# }