K-means clustering on PhiSpace principal components

Performs k-means clustering on principal components derived from PhiSpace scores or any other matrix-like data. This is useful for identifying spatial niches or clusters based on cell state compositions or gene expression patterns.

clusterPhiSpace(
  x,
  k = NULL,
  k_range = NULL,
  select_k_method = c("silhouette", "elbow"),
  ncomp = NULL,
  reducedDimName = "PhiSpace",
  use_assay = NULL,
  nstart = 20,
  iter.max = 500,
  algorithm = c("Lloyd", "Hartigan-Wong", "MacQueen", "Forgy"),
  center = TRUE,
  scale = FALSE,
  seed = NULL,
  return_pca = TRUE,
  store_in_colData = FALSE,
  cluster_name = "PhiClust"
)

Arguments

x: Either a SpatialExperiment/SingleCellExperiment object containing PhiSpace scores in reducedDim, OR a matrix-like object (matrix, sparse matrix, data frame) with features in columns and observations in rows.
k: Integer specifying the number of clusters. Provide either k or k_range, not both.
k_range: Integer vector of length 2 giving the minimum and maximum k to test (e.g., c(5, 15)). The optimal k is chosen with the method set in select_k_method. Provide either k or k_range, not both.
select_k_method: Character string specifying method to select optimal k when k_range is provided. Options: "elbow" (total within-cluster SS) or "silhouette" (average silhouette width). Default is "silhouette".
ncomp: Integer specifying the number of principal components to use for clustering, or the string "all". If NULL (default), uses min(30, nfeatures - 1). If "all", uses nfeatures - 1 components.
reducedDimName: Character string specifying which reducedDim slot contains the PhiSpace scores. Only used when x is a SingleCellExperiment or SpatialExperiment. Default is "PhiSpace".
use_assay: Character string specifying which assay to use if extracting data from a SingleCellExperiment/SpatialExperiment object instead of using reducedDim. If NULL (default), uses reducedDim specified by reducedDimName. Common options: "logcounts", "counts", "normcounts".
nstart: Integer specifying number of random starts for k-means. Default is 20.
iter.max: Integer specifying maximum number of iterations. Default is 500.
algorithm: Character string specifying k-means algorithm. Options are "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen". Default is "Lloyd".
center: Logical indicating whether to center data before PCA. Default is TRUE.
scale: Logical indicating whether to scale data before PCA. Default is FALSE.
seed: Integer seed for reproducibility. Default is NULL (no seed set).
return_pca: Logical indicating whether to return PCA results. Default is TRUE.
store_in_colData: Logical indicating whether to store cluster assignments in the colData of the input object (only applicable when x is an SCE/SPE). Default is FALSE.
cluster_name: Character string specifying the column name for cluster assignments if store_in_colData is TRUE. Default is "PhiClust".

Value

A list with class "PhiSpaceClustering" containing:

clusters: Factor vector of cluster assignments
cluster_centers: Matrix of cluster centers in PC space
kmeans_result: Full kmeans object from stats::kmeans
pca_result: PCA results (if return_pca = TRUE)
pc_scores: Matrix of PC scores used for clustering
optimal_k: Selected k value (relevant when k_range is used)
k_selection: List with selection metrics (if k_range was used)
parameters: List of parameters used
spe: Updated object (if store_in_colData = TRUE and x is SCE/SPE)

Details

This function implements a common workflow for spatial clustering:

1. Extract data (from reducedDim, assay, or use directly if matrix)
2. Perform PCA to reduce dimensionality
3. Select top principal components
4. Apply k-means clustering
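
The workflow steps above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the package's implementation: the simulated `scores` matrix stands in for PhiSpace scores, and the calls to `stats::prcomp` and `stats::kmeans` are assumptions about how such a pipeline is typically written.

```r
# Minimal base-R sketch of: extract -> PCA -> select top PCs -> k-means.
# Simulated data stands in for PhiSpace scores (assumption, toy example).
set.seed(123)

# 1. "Extract" data: 300 observations (rows) x 12 features (columns),
#    with three groups shifted apart so that clusters exist.
group  <- rep(1:3, each = 100)
scores <- matrix(rnorm(300 * 12), nrow = 300) + 3 * group

# 2. PCA, centred but unscaled (mirrors center = TRUE, scale = FALSE).
pca <- stats::prcomp(scores, center = TRUE, scale. = FALSE)

# 3. Keep the top components using the documented default rule.
ncomp     <- min(30, ncol(scores) - 1)
pc_scores <- pca$x[, seq_len(ncomp), drop = FALSE]

# 4. k-means on the retained PC scores.
km <- stats::kmeans(pc_scores, centers = 3, nstart = 20,
                    iter.max = 500, algorithm = "Lloyd")
clusters <- factor(km$cluster)

# Cross-tabulate recovered clusters against the simulated groups.
tab <- table(clusters, group)
print(tab)
```

Because the simulated groups are well separated, the cross-tabulation should show each cluster dominated by a single group; real PhiSpace scores will rarely be this clean.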

The function can either:

Use a fixed k value (specify k parameter)
Automatically select k from a range (specify k_range parameter)

When k_range is provided, the function tests all k values in the range and selects the optimal k using either:

Silhouette method (default): Maximizes average silhouette width
Elbow method: Identifies elbow point in total within-cluster sum of squares
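
To make the two selection criteria concrete, the sketch below scans a range of k on simulated data and records both metrics. It is an illustration under stated assumptions, not the function's own code: `avg_silhouette()` is a hypothetical base-R helper (the `cluster` package's `silhouette()` is the usual tool for this quantity), and the elbow is located with one common heuristic, the largest second difference of the total within-cluster sum of squares, which may differ from the rule used by `clusterPhiSpace`.

```r
set.seed(42)

# Toy data: four well-separated groups in two dimensions (assumption).
centers <- rbind(c(0, 0), c(8, 0), c(0, 8), c(8, 8))
X <- centers[rep(1:4, each = 50), ] + matrix(rnorm(200 * 2), ncol = 2)

# Hypothetical helper: average silhouette width from a dist object and labels.
avg_silhouette <- function(d, cl) {
  d  <- as.matrix(d)
  ks <- sort(unique(cl))
  widths <- vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)              # singleton clusters score 0
    a <- sum(d[i, own]) / (sum(own) - 1)      # mean distance within own cluster
    b <- min(vapply(setdiff(ks, cl[i]),       # mean distance to nearest other cluster
                    function(k) mean(d[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

k_range  <- c(2, 8)
k_values <- seq(k_range[1], k_range[2])
d <- stats::dist(X)

# Run k-means for every k in the range and collect both metrics.
metrics <- do.call(rbind, lapply(k_values, function(k) {
  km <- stats::kmeans(X, centers = k, nstart = 20,
                      iter.max = 500, algorithm = "Lloyd")
  data.frame(k = k,
             avg_sil = avg_silhouette(d, km$cluster),
             tot_withinss = km$tot.withinss)
}))

# Silhouette: pick the k with the largest average silhouette width.
best_k_silhouette <- metrics$k[which.max(metrics$avg_sil)]

# Elbow (heuristic): largest second difference of total within-cluster SS.
curvature    <- diff(metrics$tot_withinss, differences = 2)
best_k_elbow <- metrics$k[which.max(curvature) + 1]

print(metrics)
```

On this toy data both criteria are expected to recover the four simulated groups. On real data they can disagree, and the elbow in particular is often ambiguous, which is one reason to inspect the full table of metrics instead of relying on the selected k alone.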

Input Types

The function accepts multiple input types:

SingleCellExperiment/SpatialExperiment: Uses reducedDim (default) or assay
Matrix: Standard R matrix with observations in rows
Sparse matrix: dgCMatrix or similar sparse formats
Data frame: Coerced to matrix
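
As a rough illustration of what coercing matrix-like input involves, the sketch below normalises an input to a dense numeric matrix with observations in rows. The helper `to_numeric_matrix()` is hypothetical and only an assumption about the kind of check such a function might perform; it is not taken from the package.

```r
# Hypothetical helper: coerce matrix-like input to a dense numeric matrix.
to_numeric_matrix <- function(x) {
  # as.matrix() handles data frames and base matrices; sparse Matrix classes
  # are also handled when the Matrix package is loaded.
  x <- as.matrix(x)
  if (!is.numeric(x)) {
    stop("x must contain only numeric columns")
  }
  x
}

# A numeric data frame becomes a numeric matrix with the same dimensions.
df <- data.frame(score_a = c(0.1, 0.5, 0.9), score_b = c(1.2, 0.3, 0.7))
m  <- to_numeric_matrix(df)
dim(m)

# A data frame with a character column is rejected instead of silently
# producing a character matrix.
bad <- data.frame(score_a = c(0.1, 0.5), label = c("x", "y"))
rejected <- inherits(try(to_numeric_matrix(bad), silent = TRUE), "try-error")
```

The explicit numeric check matters because `as.matrix()` on a mixed data frame returns a character matrix, which PCA cannot use.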

PCA Parameters

The number of PCs to use affects clustering resolution:

Fewer PCs (e.g., 10-15): Broader, more general clusters
More PCs (e.g., 30-50): Finer, more specific clusters
Default (min(30, nfeatures - 1)): Good balance for most applications
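
One way to judge how many PCs to keep is to look at the cumulative variance they explain. The sketch below assumes toy data and a hypothetical helper, `resolve_ncomp()`, that mirrors the documented rule for `ncomp` (NULL, "all", or an integer); the package's own handling may differ in detail.

```r
set.seed(1)
X <- matrix(rnorm(200 * 40), nrow = 200)   # 200 observations x 40 features (toy data)

# Hypothetical helper mirroring the documented ncomp rule.
resolve_ncomp <- function(ncomp, nfeatures) {
  if (is.null(ncomp)) return(min(30, nfeatures - 1))       # default
  if (identical(ncomp, "all")) return(nfeatures - 1)       # all usable components
  as.integer(ncomp)                                        # user-supplied value
}

pca <- stats::prcomp(X, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC, and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

ncomp <- resolve_ncomp(NULL, ncol(X))
cat("PCs kept:", ncomp, "\n")
cat("Cumulative variance explained by kept PCs:", round(cum_var[ncomp], 3), "\n")
```

If the cumulative variance flattens well before the chosen `ncomp`, the later PCs mostly carry noise and a smaller value may give more stable clusters.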

K-means Algorithm

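Lloyd (default): Standard batch algorithm, good balance of speed and quality
Hartigan-Wong: Generally finds better solutions (it is the stats::kmeans default) but can be slower
MacQueen: Online variant that updates centers after each reassignment; often fast
Forgy: Alternative name for Lloyd in stats::kmeans; both run the same algorithm

All of these can converge to a local optimum; using nstart > 1 reduces that risk.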
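
The algorithms can be compared directly with `stats::kmeans`, which is what the Value section says the function relies on. The sketch below uses toy data and a hypothetical wrapper, `run_kmeans()`, that resets the seed before each fit so every algorithm starts from the same random centers.

```r
set.seed(7)
# Toy data: two groups of 100 observations in five dimensions (assumption).
X <- rbind(matrix(rnorm(100 * 5), ncol = 5),
           matrix(rnorm(100 * 5, mean = 4), ncol = 5))

# Hypothetical wrapper: identical seed, hence identical starting centers.
run_kmeans <- function(alg) {
  set.seed(99)
  stats::kmeans(X, centers = 2, nstart = 5, iter.max = 100, algorithm = alg)
}

algs <- c("Lloyd", "Forgy", "Hartigan-Wong", "MacQueen")
fits <- lapply(algs, run_kmeans)
names(fits) <- algs

# Total within-cluster sum of squares and iteration counts per algorithm.
summary_tab <- data.frame(
  algorithm    = algs,
  tot_withinss = vapply(fits, function(f) f$tot.withinss, numeric(1)),
  iterations   = vapply(fits, function(f) as.numeric(f$iter), numeric(1))
)
print(summary_tab)

# "Lloyd" and "Forgy" select the same routine, so the fits coincide.
same_lloyd_forgy <- identical(fits$Lloyd$cluster, fits$Forgy$cluster)
```

On easy data like this the algorithms usually agree; differences in `tot_withinss` tend to appear only with many clusters, overlapping groups, or few random starts.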

Examples

if (FALSE) { # \dontrun{
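# lung_data is assumed to be a SingleCellExperiment/SpatialExperiment with
# PhiSpace scores stored in reducedDim(lung_data, "PhiSpace").
library(SingleCellExperiment) # provides reducedDim()

# Example 1: Using SingleCellExperiment with PhiSpace scores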
result <- clusterPhiSpace(
  x = lung_data,
  k = 9,
  ncomp = 30,
  seed = 123
)

# Example 2: Using matrix directly
phi_matrix <- reducedDim(lung_data, "PhiSpace")
result <- clusterPhiSpace(
  x = phi_matrix,
  k = 9,
  ncomp = 30
)

# Example 3: Cluster on normalized counts instead
result <- clusterPhiSpace(
  x = lung_data,
  use_assay = "logcounts",
  k = 9,
  ncomp = 50
)

# Example 4: Using data frame
df <- as.data.frame(reducedDim(lung_data, "PhiSpace"))
result <- clusterPhiSpace(x = df, k = 9)

# Example 5: Automatic k selection
result <- clusterPhiSpace(
  x = phi_matrix,
  k_range = c(5, 15),
  select_k_method = "silhouette"
)
} # }