Performs k-means clustering on principal components derived from PhiSpace scores or any other matrix-like data. This is useful for identifying spatial niches or clusters based on cell state compositions or gene expression patterns.

clusterPhiSpace(
  x,
  k = NULL,
  k_range = NULL,
  select_k_method = c("silhouette", "elbow"),
  ncomp = NULL,
  reducedDimName = "PhiSpace",
  use_assay = NULL,
  nstart = 20,
  iter.max = 500,
  algorithm = c("Lloyd", "Hartigan-Wong", "MacQueen", "Forgy"),
  center = TRUE,
  scale = FALSE,
  seed = NULL,
  return_pca = TRUE,
  store_in_colData = FALSE,
  cluster_name = "PhiClust"
)

Arguments

x

Either a SpatialExperiment/SingleCellExperiment object containing PhiSpace scores in reducedDim, OR a matrix-like object (matrix, sparse matrix, data frame) with features in columns and observations in rows.

k

Integer specifying the number of clusters. Provide either k or k_range, but not both.

k_range

Integer vector of length 2 specifying the range of k values to test (e.g., c(5, 15)). The optimal k is chosen with the method given in select_k_method ("silhouette" or "elbow"). Provide either k or k_range, but not both.

select_k_method

Character string specifying method to select optimal k when k_range is provided. Options: "elbow" (total within-cluster SS) or "silhouette" (average silhouette width). Default is "silhouette".

ncomp

Integer specifying the number of principal components to use for clustering, or the string "all". If NULL (default), uses min(30, nfeatures - 1). If "all", uses nfeatures - 1 components.

reducedDimName

Character string specifying which reducedDim slot contains the PhiSpace scores. Only used when x is a SingleCellExperiment or SpatialExperiment. Default is "PhiSpace".

use_assay

Character string specifying which assay to use if extracting data from a SingleCellExperiment/SpatialExperiment object instead of using reducedDim. If NULL (default), uses reducedDim specified by reducedDimName. Common options: "logcounts", "counts", "normcounts".

nstart

Integer specifying number of random starts for k-means. Default is 20.

iter.max

Integer specifying maximum number of iterations. Default is 500.

algorithm

Character string specifying k-means algorithm. Options are "Hartigan-Wong", "Lloyd", "Forgy", "MacQueen". Default is "Lloyd".

center

Logical indicating whether to center data before PCA. Default is TRUE.

scale

Logical indicating whether to scale data before PCA. Default is FALSE.

seed

Integer seed for reproducibility. Default is NULL (no seed set).

return_pca

Logical indicating whether to return PCA results. Default is TRUE.

store_in_colData

Logical indicating whether to store cluster assignments in the colData of the input object (only applicable when x is an SCE/SPE). Default is FALSE.

cluster_name

Character string specifying the column name for cluster assignments if store_in_colData is TRUE. Default is "PhiClust".

Value

A list with class "PhiSpaceClustering" containing:

clusters

Factor vector of cluster assignments

cluster_centers

Matrix of cluster centers in PC space

kmeans_result

Full kmeans object from stats::kmeans

pca_result

PCA results (if return_pca = TRUE)

pc_scores

Matrix of PC scores used for clustering

optimal_k

Selected k value (relevant when k_range is used)

k_selection

List with selection metrics (if k_range was used)

parameters

List of parameters used

spe

Updated object (if store_in_colData = TRUE and x is SCE/SPE)

Details

This function implements a common workflow for spatial clustering:

  1. Extract data (from reducedDim or an assay, or use x directly if it is matrix-like)

  2. Perform PCA to reduce dimensionality

  3. Select top principal components

  4. Apply k-means clustering
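The steps above can be sketched with base R alone. This is a minimal illustration on simulated data, not the package's implementation: the toy matrix and the choice of three clusters are assumptions made for the example, and only stats::prcomp and stats::kmeans are used.

```r
# Step 1 (stand-in): toy data, 200 observations x 12 features with three
# shifted groups, simulated purely for illustration.
set.seed(1)
n_per_group <- c(70, 70, 60)
shift <- rep(c(0, 3, 6), times = n_per_group)
x <- matrix(rnorm(200 * 12), nrow = 200, ncol = 12) + shift

# Step 2: PCA, centered and unscaled to match the documented defaults.
pca <- stats::prcomp(x, center = TRUE, scale. = FALSE)

# Step 3: keep the top components; min(30, nfeatures - 1) is the
# documented default for ncomp.
ncomp <- min(30, ncol(x) - 1)
pc_scores <- pca$x[, seq_len(ncomp), drop = FALSE]

# Step 4: k-means on the PC scores, using the documented defaults for
# nstart, iter.max and algorithm.
km <- stats::kmeans(pc_scores, centers = 3, nstart = 20, iter.max = 500,
                    algorithm = "Lloyd")
clusters <- factor(km$cluster)
table(clusters)
```

Cluster centers from such a run live in PC space (km$centers), which corresponds to how cluster_centers is described in the Value section.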

The function can either:

  • Use a fixed k value (specify k parameter)

  • Automatically select k from a range (specify k_range parameter)

When k_range is provided, the function tests all k values in the range and selects the optimal k using either:

  • Silhouette method (default): Maximizes average silhouette width

  • Elbow method: Identifies elbow point in total within-cluster sum of squares
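As a rough sketch of the selection step, the code below scores each candidate k on simulated PC scores. It assumes the cluster package is installed and uses cluster::silhouette for the average silhouette width; the toy data and the range 2:8 are illustrative. The total within-cluster sum of squares is recorded as the input to an elbow rule, but no elbow detector is shown because this page does not say how the elbow point is located.

```r
library(cluster)  # provides silhouette()

set.seed(42)
# Toy PC scores: four well-separated groups in two dimensions.
centers <- rbind(c(0, 0), c(6, 0), c(0, 6), c(6, 6))
pc_scores <- do.call(rbind, lapply(seq_len(nrow(centers)), function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

k_values <- 2:8                 # analogous to k_range = c(2, 8)
d <- stats::dist(pc_scores)     # pairwise distances, computed once

# One row per candidate k: average silhouette width and total within-cluster SS.
metrics <- t(sapply(k_values, function(k) {
  km <- stats::kmeans(pc_scores, centers = k, nstart = 20, iter.max = 500)
  sil <- cluster::silhouette(km$cluster, d)
  c(k = k,
    avg_silhouette = mean(sil[, "sil_width"]),
    tot_withinss = km$tot.withinss)
}))

# Silhouette rule: pick the k with the largest average silhouette width.
best_k <- metrics[which.max(metrics[, "avg_silhouette"]), "k"]
metrics
best_k
```

Computing the distance matrix once and reusing it across k keeps the silhouette step cheap, but it scales quadratically with the number of observations, so large datasets may need subsampling.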

Input Types

The function accepts multiple input types:

  • SingleCellExperiment/SpatialExperiment: Uses reducedDim (default) or assay

  • Matrix: Standard R matrix with observations in rows

  • Sparse matrix: dgCMatrix or similar sparse formats

  • Data frame: Coerced to matrix

PCA Parameters

The number of PCs to use affects clustering resolution:

  • Fewer PCs (e.g., 10-15): Broader, more general clusters

  • More PCs (e.g., 30-50): Finer, more specific clusters

  • Default (min(30, nfeatures - 1)): Good balance for most applications
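To make the ncomp rules concrete, here is a small sketch. The helper resolve_ncomp is hypothetical (it is not part of the package) and mirrors the rules documented for the ncomp argument; capping an explicit integer at nfeatures - 1 is an assumption. The second half shows one common way to judge how many PCs are worth keeping, via cumulative variance explained from stats::prcomp.

```r
# Hypothetical helper mirroring the documented ncomp rules; not part of the package.
resolve_ncomp <- function(ncomp, nfeatures) {
  if (is.null(ncomp)) return(min(30, nfeatures - 1))        # documented default
  if (identical(ncomp, "all")) return(nfeatures - 1)        # documented "all" rule
  min(as.integer(ncomp), nfeatures - 1)                     # capping is an assumption
}

resolve_ncomp(NULL, 40)    # many features: falls back to 30
resolve_ncomp(NULL, 12)    # few features: limited by nfeatures - 1
resolve_ncomp("all", 12)

# Cumulative proportion of variance explained by the leading PCs
# (simulated data, for illustration only).
set.seed(7)
x <- matrix(rnorm(100 * 12), nrow = 100)
pca <- stats::prcomp(x, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(cum_var, 2)
```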

K-means Algorithm

  • Lloyd (default): Standard algorithm, good balance of speed and quality

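  • Hartigan-Wong: Often better results but slower; this is the default in stats::kmeans

  • MacQueen: Updates centers after every single reassignment; may be faster, but like all k-means variants it can converge to a local optimum

  • Forgy: In stats::kmeans, an alternative name for the Lloyd algorithm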
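One way to compare the options on your own data is to run stats::kmeans with each algorithm from the same seed and compare the total within-cluster sum of squares (lower is better). The sketch below uses simulated data and is only illustrative. Because stats::kmeans treats "Lloyd" and "Forgy" as the same algorithm, those two entries should agree exactly when the seed is fixed.

```r
set.seed(3)
# Toy data: three Gaussian groups in two dimensions.
x <- rbind(
  matrix(rnorm(100 * 2, mean = 0), ncol = 2),
  matrix(rnorm(100 * 2, mean = 4), ncol = 2),
  matrix(rnorm(100 * 2, mean = 8), ncol = 2)
)

algorithms <- c("Lloyd", "Hartigan-Wong", "MacQueen", "Forgy")

# Resetting the seed inside the loop gives every algorithm the same
# sequence of random starting centers.
tot_withinss <- sapply(algorithms, function(alg) {
  set.seed(123)
  stats::kmeans(x, centers = 3, nstart = 20, iter.max = 500,
                algorithm = alg)$tot.withinss
})
tot_withinss
```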

Examples

if (FALSE) { # \dontrun{
# Example 1: Using a SingleCellExperiment with PhiSpace scores
# (lung_data is assumed to be an SCE/SPE with a "PhiSpace" reducedDim)
result <- clusterPhiSpace(
  x = lung_data,
  k = 9,
  ncomp = 30,
  seed = 123
)

# Example 2: Using a matrix directly
library(SingleCellExperiment)  # provides reducedDim()
phi_matrix <- reducedDim(lung_data, "PhiSpace")
result <- clusterPhiSpace(
  x = phi_matrix,
  k = 9,
  ncomp = 30
)

# Example 3: Cluster on log-normalized counts instead of PhiSpace scores
result <- clusterPhiSpace(
  x = lung_data,
  use_assay = "logcounts",
  k = 9,
  ncomp = 50
)

# Example 4: Using data frame
df <- as.data.frame(reducedDim(lung_data, "PhiSpace"))
result <- clusterPhiSpace(x = df, k = 9)

# Example 5: Automatic k selection
result <- clusterPhiSpace(
  x = phi_matrix,
  k_range = c(5, 15),
  select_k_method = "silhouette"
)
} # }