Vector clustering discovers natural groupings in your embeddings, enabling automatic categorization, topic discovery, and semantic organization of documents.
Installation
Clustering is included in the standard installation:
pip install simplevecdb
Dependencies included:
- scikit-learn>=1.3.0 — K-means and MiniBatch K-means algorithms
- hdbscan>=0.8.33 — Density-based clustering
No extra installation steps required!
Quick Start
from simplevecdb import VectorDB
db = VectorDB("products.db")
collection = db.collection("items")
# Add documents with embeddings
collection.add_texts(texts=descriptions, embeddings=embeddings)
# Cluster into 5 groups
result = collection.cluster(n_clusters=5)
# Auto-generate descriptive tags
tags = collection.auto_tag(result, method="tfidf", n_keywords=3)
# {'0': ['electronics', 'wireless', 'bluetooth'], '1': ['clothing', ...], ...}
# Persist cluster IDs to document metadata
collection.assign_cluster_metadata(result, tags)
# Retrieve all documents in cluster 0
docs = collection.get_cluster_members(0)
Algorithms
K-Means (kmeans)
Classic centroid-based clustering. Best for balanced, spherical clusters.
result = collection.cluster(
n_clusters=5,
algorithm="kmeans",
random_state=42 # Reproducible results
)
Pros:
- Fast and deterministic
- Works well with balanced clusters
- Provides cluster centroids for assignment
Cons:
- Requires specifying n_clusters upfront
- Sensitive to outliers
- Assumes spherical cluster shapes
Best for: Product categorization, customer segmentation, content organization
MiniBatch K-Means (minibatch_kmeans, default)
Scalable variant of K-means using mini-batches. 3-10x faster on large datasets.
result = collection.cluster(
n_clusters=10,
algorithm="minibatch_kmeans",
sample_size=5000, # Use subset for speed
random_state=42
)
Pros:
- Scales to millions of documents
- Memory-efficient
- Nearly identical quality to K-means
Cons:
- Slightly less stable than K-means
- Still requires n_clusters
Best for: Large-scale document clustering, real-time categorization
HDBSCAN (hdbscan)
Density-based clustering that automatically discovers cluster count and handles noise.
result = collection.cluster(
algorithm="hdbscan",
min_cluster_size=10 # Minimum documents per cluster
)
Pros:
- Automatically determines optimal cluster count
- Handles noise (assigns label -1 to outliers)
- Discovers non-spherical clusters
Cons:
- Slower than K-means variants
- No centroids (cannot assign new documents)
- Requires tuning min_cluster_size
Best for: Exploratory analysis, topic discovery, anomaly detection
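Because the number of discovered clusters depends heavily on min_cluster_size, a quick parameter sweep can help find a useful granularity. A minimal sketch using only the call shown above (the candidate values are arbitrary):
# Compare how many clusters emerge at different minimum sizes
# (outliers receive label -1 and do not form a cluster of their own)
for size in (5, 10, 25, 50):
    result = collection.cluster(algorithm="hdbscan", min_cluster_size=size)
    print(f"min_cluster_size={size}: {result.n_clusters} clusters")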
Cluster Quality Metrics
Evaluate clustering quality with built-in metrics:
result = collection.cluster(n_clusters=5)
# Silhouette Score: -1 to 1 (higher is better)
# Measures how well-separated clusters are
print(f"Silhouette: {result.silhouette_score:.2f}")
# > 0.7: Strong clustering
# 0.5-0.7: Reasonable clustering
# < 0.5: Weak clustering
# Inertia: Sum of squared distances to centroids (K-means only)
# Lower is better (indicates tighter clusters)
print(f"Inertia: {result.inertia:.2f}")
# Get all metrics as dict
metrics = result.metrics()
# {'inertia': 1523.45, 'silhouette_score': 0.62}
Auto-Tagging
Generate human-readable labels for clusters:
TF-IDF Method (default)
Extracts keywords with highest TF-IDF scores per cluster.
tags = collection.auto_tag(result, method="tfidf", n_keywords=5)
# {'0': ['machine', 'learning', 'neural', 'network', 'deep'], ...}
Best for: Text documents with distinct vocabulary per cluster
Frequency Method
Extracts most common words per cluster.
tags = collection.auto_tag(result, method="frequency", n_keywords=3)
Best for: Short documents, social media posts
Custom Callback
Implement custom tagging logic:
def custom_tagger(cluster_id: int, texts: list[str]) -> list[str]:
# Your logic here (e.g., LLM-based summarization)
return ["tag1", "tag2", "tag3"]
tags = collection.auto_tag(result, custom_callback=custom_tagger)
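For instance, a concrete callback can be written with the standard library alone. A minimal sketch (the keyword heuristic and the regex are illustrative, not part of simplevecdb):
import re
from collections import Counter

def keyword_tagger(cluster_id: int, texts: list[str]) -> list[str]:
    # Count lowercase words of five or more letters across the cluster's texts
    words = re.findall(r"[a-z]{5,}", " ".join(texts).lower())
    # Return the three most frequent words as the cluster's tags
    return [word for word, _ in Counter(words).most_common(3)]

tags = collection.auto_tag(result, custom_callback=keyword_tagger)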
Cluster Persistence
Save cluster configurations for fast assignment of new documents without re-clustering.
Save Cluster State
result = collection.cluster(n_clusters=5)
tags = collection.auto_tag(result)
collection.save_cluster(
"product_categories",
result,
metadata={"tags": tags, "version": 1, "created_at": "2026-01-17"}
)
Load Cluster State
loaded = collection.load_cluster("product_categories")
if loaded:
result, metadata = loaded
print(f"Loaded {result.n_clusters} clusters")
print(f"Tags: {metadata['tags']}")
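Since the example above guards with if loaded:, load_cluster appears to return nothing when no state has been saved, which lends itself to a lazy-initialization pattern. A sketch built from the calls above, assuming that return behavior:
loaded = collection.load_cluster("product_categories")
if not loaded:
    # First run: cluster once, tag, and persist the state
    result = collection.cluster(n_clusters=5, random_state=42)
    tags = collection.auto_tag(result)
    collection.save_cluster("product_categories", result, metadata={"tags": tags})
else:
    # Subsequent runs: reuse the saved clustering
    result, metadata = loaded
    tags = metadata["tags"]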
Assign New Documents
# Add new documents
new_ids = collection.add_texts(new_texts, embeddings=new_embeddings)
# Assign to nearest cluster centroids
assigned_count = collection.assign_to_cluster("product_categories", new_ids)
print(f"Assigned {assigned_count} documents")
# Retrieve assigned documents
docs = collection.get_cluster_members(0)
List and Delete
# List all saved clusters
clusters = collection.list_clusters()
for c in clusters:
print(f"{c['name']}: {c['n_clusters']} clusters, {c['algorithm']}")
# Delete when no longer needed
collection.delete_cluster("product_categories")
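Putting these calls together, a periodic refresh can re-cluster the current documents, replace the saved state, and bump the version. A sketch composed only of methods documented above (the name and versioning scheme are up to you):
# Re-cluster and re-tag the current documents
result = collection.cluster(n_clusters=5, random_state=42)
tags = collection.auto_tag(result)
# Replace the previously saved state with a new version
collection.delete_cluster("product_categories")
collection.save_cluster(
    "product_categories",
    result,
    metadata={"tags": tags, "version": 2}
)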
Filtering and Sampling
Cluster subsets of your collection:
# Cluster only verified documents
result = collection.cluster(
n_clusters=3,
filter={"verified": True}
)
# Use random sample for speed (large collections)
result = collection.cluster(
n_clusters=10,
sample_size=10000 # Cluster 10k random documents
)
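The two options should combine naturally; a sketch assuming filter and sample_size are accepted in the same call (verify against the API reference):
# Cluster a 10k random sample drawn only from verified documents
result = collection.cluster(
    n_clusters=10,
    filter={"verified": True},
    sample_size=10000
)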
Async Support
All clustering methods have async equivalents:
from simplevecdb import AsyncVectorDB
async with AsyncVectorDB("products.db") as db:
collection = db.collection("items")
result = await collection.cluster(n_clusters=5)
tags = await collection.auto_tag(result)
await collection.save_cluster("categories", result)
new_ids = await collection.add_texts(texts, embeddings=embeddings)
await collection.assign_to_cluster("categories", new_ids)
Use Cases
Product Categorization
# Cluster products by description embeddings
result = collection.cluster(n_clusters=20, algorithm="minibatch_kmeans")
tags = collection.auto_tag(result, n_keywords=5)
collection.assign_cluster_metadata(result, tags)
# Save for new products
collection.save_cluster("product_taxonomy", result, metadata={"tags": tags})
Topic Discovery
# Let HDBSCAN discover natural topics
result = collection.cluster(algorithm="hdbscan", min_cluster_size=50)
tags = collection.auto_tag(result, method="tfidf", n_keywords=10)
# Persist cluster IDs so members can be retrieved below
collection.assign_cluster_metadata(result, tags)
# Analyze cluster sizes
for cluster_id in range(result.n_clusters):
docs = collection.get_cluster_members(cluster_id)
print(f"Topic {cluster_id}: {len(docs)} docs - {tags[str(cluster_id)]}")
Customer Segmentation
# Cluster customer profiles
result = collection.cluster(n_clusters=8, random_state=42)
collection.assign_cluster_metadata(result)
# Target marketing campaigns per segment
segment_0_customers = collection.get_cluster_members(0)
Duplicate Detection
# Use high cluster count to find near-duplicates
result = collection.cluster(n_clusters=1000, algorithm="minibatch_kmeans")
collection.assign_cluster_metadata(result)
# Find potential duplicates in same cluster
for cluster_id in range(result.n_clusters):
docs = collection.get_cluster_members(cluster_id)
if len(docs) > 1:
print(f"Potential duplicates in cluster {cluster_id}: {len(docs)} docs")
Best Practices
Choosing Cluster Count
- Elbow Method: Plot inertia vs. n_clusters and look for the "elbow"
- Silhouette Analysis: Maximize the silhouette score (see the sketch after the elbow example below)
- Domain Knowledge: Use business requirements (e.g., 10 product categories)
- HDBSCAN: Let algorithm decide
# Elbow method
inertias = []
for k in range(2, 20):
result = collection.cluster(n_clusters=k)
inertias.append(result.inertia)
# Plot and find elbow
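A similar sweep works for silhouette analysis: keep the k with the highest score. A sketch using only cluster() and silhouette_score as documented above (the None check covers degenerate clusterings):
# Silhouette analysis: pick the k that best separates the clusters
best_k, best_score = None, -1.0
for k in range(2, 20):
    result = collection.cluster(n_clusters=k, random_state=42)
    score = result.silhouette_score
    if score is not None and score > best_score:
        best_k, best_score = k, score
print(f"Best k: {best_k} (silhouette {best_score:.2f})")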
Performance Optimization
# Large collections: use sampling + MiniBatch K-means
result = collection.cluster(
n_clusters=50,
algorithm="minibatch_kmeans",
sample_size=50000
)
# Small collections: use K-means for stability
result = collection.cluster(
n_clusters=5,
algorithm="kmeans",
random_state=42
)
Reproducibility
Always set random_state for deterministic results:
result = collection.cluster(n_clusters=5, random_state=42)
Metadata Organization
Pick one metadata key for cluster labels and use it consistently across your pipeline; the metadata_key parameter controls where assignments are stored:
# Store cluster labels under "category" for a product catalog...
collection.assign_cluster_metadata(result, tags, metadata_key="category")
# ...or under "topic" for a content collection; choose one key and reuse it
collection.assign_cluster_metadata(result, tags, metadata_key="topic")
API Reference
See VectorCollection API for complete method signatures and parameters.
Troubleshooting
ValueError: n_clusters must be >= 2
K-means requires at least 2 clusters. Use HDBSCAN for single-cluster detection.
Silhouette score is None
Occurs when clustering produces only 1 cluster or all documents in separate clusters. Adjust n_clusters or min_cluster_size.
HDBSCAN assigns all documents to noise (cluster -1)
Decrease min_cluster_size or increase document count. HDBSCAN needs sufficient density.
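For example, retry with a smaller minimum (the value here is arbitrary):
result = collection.cluster(algorithm="hdbscan", min_cluster_size=5)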
Slow clustering on large collections
Use sample_size parameter or switch to minibatch_kmeans:
result = collection.cluster(n_clusters=10, sample_size=10000)
Cluster assignments change between runs
Set random_state for reproducibility:
result = collection.cluster(n_clusters=5, random_state=42)