Clusters the works in a corpus with K-means via stats::kmeans(). Optionally reduces the embedding dimensions first with UMAP or PCA.

Usage

sm_cluster_kmeans(
  corpus,
  k,
  reducer = c("umap", "pca", "none"),
  n_components = 5L,
  call = rlang::caller_env()
)

Arguments

corpus

An sm_corpus object with embeddings.

k

Integer; the number of clusters. Required.

reducer

Character; dimensionality reduction method to apply before clustering. One of "umap" (default), "pca", or "none".

n_components

Integer; number of dimensions to reduce to. Defaults to 5L.

call

Caller environment for error reporting.

Value

The input corpus with a cluster_id column added to corpus$works.

Details

Requires embeddings in corpus$embeddings. Compute them first with sm_embed_works() or load from cache with sm_embed_load().

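stats::kmeans() chooses its initial cluster centers at random, so cluster assignments can differ between runs. Results are reproducible only for a fixed random seed; call set.seed() before this function if you need the same clusters each time.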
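
The effect of the seed can be illustrated with base R alone. The sketch below is not this function's implementation: it assumes the "pca" reducer behaves roughly like stats::prcomp() followed by stats::kmeans(), and it uses a synthetic matrix (the hypothetical `embeddings` object) as a stand-in for corpus$embeddings.

```r
# Synthetic stand-in for corpus$embeddings: 150 works, 20 dimensions.
# (Assumption: real embeddings are a numeric matrix with one row per work.)
set.seed(1)
embeddings <- matrix(rnorm(150 * 20), nrow = 150, ncol = 20)

# Reduce to 5 principal components, analogous to
# reducer = "pca", n_components = 5L.
reduced <- stats::prcomp(embeddings, center = TRUE)$x[, 1:5]

# kmeans() draws its starting centers at random, so reset the seed
# before each run to obtain the same assignments.
set.seed(42)
fit_a <- stats::kmeans(reduced, centers = 5)
set.seed(42)
fit_b <- stats::kmeans(reduced, centers = 5)

identical(fit_a$cluster, fit_b$cluster)
#> [1] TRUE

# Cluster sizes for one run.
table(fit_a$cluster)
```

Without the second set.seed(42) call, the two fits may start from different centers and can label or split the clusters differently.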

Examples

# \donttest{
corpus <- sm_example_corpus(with_embeddings = TRUE)
corpus <- sm_cluster_kmeans(corpus, k = 5)
#>  K-means clustering complete.
#>  5 clusters, sizes range from 27 to 49.
table(corpus$works$cluster_id)
#> 
#>  1  2  3  4  5 
#> 40 27 39 45 49 
# }