Clusters works using K-means clustering via stats::kmeans(). Optionally
reduces embedding dimensions first with UMAP or PCA.
Usage
sm_cluster_kmeans(
  corpus,
  k,
  reducer = c("umap", "pca", "none"),
  n_components = 5L,
  call = rlang::caller_env()
)
Arguments
- corpus
An sm_corpus object with embeddings.
- k
Integer; the number of clusters. Required.
- reducer
Character; dimensionality reduction method to apply before clustering. One of
"umap" (default), "pca", or "none".
- n_components
Integer; number of dimensions to reduce to. Defaults to 5L.
- call
Caller environment for error reporting.
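The reduce-then-cluster idea behind reducer = "pca" can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the matrix emb is a hypothetical stand-in for the embedding matrix, and stats::prcomp() is used here as one possible PCA implementation.

```r
# Hypothetical stand-in for an embedding matrix: 200 works, 20 dimensions.
set.seed(1)
emb <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)

n_components <- 5L  # mirrors the n_components argument
k <- 5L             # mirrors the k argument

# Project onto the first n_components principal components.
pca <- stats::prcomp(emb, center = TRUE, scale. = FALSE)
reduced <- pca$x[, seq_len(n_components), drop = FALSE]

# Cluster the reduced coordinates with K-means.
fit <- stats::kmeans(reduced, centers = k, iter.max = 50)

dim(reduced)         # rows = works, columns = retained components
length(fit$cluster)  # one cluster assignment per work
```

Clustering on a handful of components rather than the full embedding usually lowers cost and can reduce the noise that high-dimensional distances carry, which is presumably why a reducer is applied by default.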
Details
Requires embeddings in corpus$embeddings. Compute them first with
sm_embed_works() or load from cache with sm_embed_load().
K-means starts from randomly chosen centers, so results can differ between runs. With a fixed random seed the results are reproducible; consider calling set.seed() before this function.
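The effect of the seed can be checked directly with stats::kmeans(). The sketch below uses a hypothetical random matrix in place of real embeddings and assumes the clustering draws from R's global random number generator, which is how stats::kmeans() picks its starting centers.

```r
# Hypothetical stand-in for an embedding matrix: 200 works, 10 dimensions.
set.seed(1)
emb <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Setting the same seed before each call yields the same starting centers.
set.seed(42)
fit_a <- stats::kmeans(emb, centers = 5, iter.max = 50)

set.seed(42)
fit_b <- stats::kmeans(emb, centers = 5, iter.max = 50)

# The cluster assignments match exactly.
same_clusters <- identical(fit_a$cluster, fit_b$cluster)
same_clusters
#> [1] TRUE
```

The same pattern should apply to this function: call set.seed() immediately before sm_cluster_kmeans(). Note that a stochastic reducer such as UMAP would also consume random numbers, so the seed needs to be set before the whole call rather than only before the K-means step.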
See also
Other clustering:
sm_cluster_evolution(),
sm_cluster_hdbscan(),
sm_cluster_label(),
sm_cluster_leiden()
Examples
# \donttest{
corpus <- sm_example_corpus(with_embeddings = TRUE)
corpus <- sm_cluster_kmeans(corpus, k = 5)
#> ✔ K-means clustering complete.
#> ℹ 5 clusters, sizes range from 27 to 49.
table(corpus$works$cluster_id)
#>
#> 1 2 3 4 5
#> 40 27 39 45 49
# }