Core type definitions and data classes used throughout SimpleVecDB.

Document

simplevecdb.types.Document dataclass

Represents a document with text content, an embedding, and arbitrary metadata.

Example:

from simplevecdb.types import Document

doc = Document(
    id=1,
    page_content="Paris is the capital of France.",
    metadata={"category": "geography", "verified": True},
    embedding=[0.1, 0.2, 0.3, ...]
)
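Fields are plain dataclass attributes and can be read directly off the instance created above:

print(doc.id)                      # 1
print(doc.page_content)            # "Paris is the capital of France."
print(doc.metadata["category"])    # "geography"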

ClusterResult

simplevecdb.types.ClusterResult dataclass

Result of a clustering operation, with quality metrics.

Attributes: n_clusters, labels, centroids, algorithm, inertia (default None), silhouette_score (default None). See the Fields table below.

Methods: summary() returns a cluster_id -> member count mapping; metrics() returns clustering quality metrics. See the Methods section below.

Example:

result = collection.cluster(n_clusters=5)

print(f"Clusters: {result.n_clusters}")
print(f"Algorithm: {result.algorithm}")
print(f"Silhouette: {result.silhouette_score:.2f}")
print(f"Inertia: {result.inertia:.2f}")

# Cluster size distribution
print(result.summary())
# {0: 42, 1: 38, 2: 15, 3: 3, 4: 2}

# All metrics
metrics = result.metrics()
# {'inertia': 1523.45, 'silhouette_score': 0.62}

Fields

| Field | Type | Description |
|---|---|---|
| n_clusters | int | Number of clusters discovered |
| labels | np.ndarray | Cluster ID for each document (shape: [n_docs]) |
| centroids | np.ndarray \| None | Cluster centroids (shape: [n_clusters, dim]); None for HDBSCAN |
| algorithm | ClusterAlgorithm | Algorithm used: "kmeans", "minibatch_kmeans", or "hdbscan" |
| inertia | float \| None | Sum of squared distances to centroids (K-means only; lower is better) |
| silhouette_score | float \| None | Cluster separation metric (-1 to 1; higher is better) |
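A minimal sketch of working with these fields: grouping document positions by cluster label and, when centroids are available, finding the centroid nearest to a new vector. Only the fields documented above and plain NumPy are used; the query vector is a random placeholder:

import numpy as np

result = collection.cluster(n_clusters=5)

# Group document positions by cluster label.
members = {int(c): np.flatnonzero(result.labels == c) for c in np.unique(result.labels)}

# K-means variants provide centroids; find the one closest to an
# arbitrary query embedding (random placeholder here).
if result.centroids is not None:
    query = np.random.rand(result.centroids.shape[1]).astype(result.centroids.dtype)
    nearest_cluster = int(np.argmin(np.linalg.norm(result.centroids - query, axis=1)))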

Methods

summary() -> dict[int, int]

Returns the cluster size distribution as a mapping from cluster ID to member count.

result = collection.cluster(n_clusters=3)
sizes = result.summary()
# {0: 50, 1: 30, 2: 20}

metrics() -> dict[str, float | None]

Returns all quality metrics as a dictionary.

metrics = result.metrics()
# {'inertia': 1523.45, 'silhouette_score': 0.62}

Enums

DistanceStrategy

simplevecdb.types.DistanceStrategy

Bases: StrEnum

Distance metrics for vector similarity, as supported by the usearch backend.

| Value | Description | Use Case |
|---|---|---|
| COSINE | Cosine similarity (default) | Text embeddings, normalized vectors |
| EUCLIDEAN | L2 distance | Image embeddings, spatial data |
| INNER_PRODUCT | Dot product | Pre-normalized embeddings |

Example:

from simplevecdb import VectorDB, DistanceStrategy

collection = db.collection("docs", distance_strategy=DistanceStrategy.COSINE)

Quantization

simplevecdb.types.Quantization

Bases: StrEnum

Vector compression strategies.

| Value | Precision | Compression | Speed | Use Case |
|---|---|---|---|---|
| FLOAT | 32-bit | 1x | Baseline | High precision required |
| FLOAT16 | 16-bit | 2x | Fast | Recommended default |
| INT8 | 8-bit | 4x | Faster | Large collections |
| BIT | 1-bit | 32x | Fastest | Massive scale, approximate search |

Example:

from simplevecdb import VectorDB, Quantization

# 2x memory savings, minimal precision loss
collection = db.collection("docs", quantization=Quantization.FLOAT16)

# 32x compression for massive scale
collection = db.collection("docs", quantization=Quantization.BIT)

ClusterAlgorithm

Clustering algorithms (string literals accepted by VectorCollection.cluster).

| Value | Description | Requires n_clusters | Provides Centroids |
|---|---|---|---|
| kmeans | Classic K-means | Yes | Yes |
| minibatch_kmeans | Scalable K-means (default) | Yes | Yes |
| hdbscan | Density-based clustering | No | No |

Example:

# Auto-discover cluster count
result = collection.cluster(algorithm="hdbscan", min_cluster_size=10)

# Fixed cluster count with centroids
result = collection.cluster(n_clusters=5, algorithm="minibatch_kmeans")

Type Aliases

EmbeddingVector

EmbeddingVector = list[float] | np.ndarray

Represents a single embedding vector.
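Either form satisfies the alias; a short illustration (importing the alias from simplevecdb.types mirrors the other names on this page and is assumed to be available):

import numpy as np
from simplevecdb.types import EmbeddingVector

vec_as_list: EmbeddingVector = [0.1, 0.2, 0.3]
vec_as_array: EmbeddingVector = np.array([0.1, 0.2, 0.3], dtype=np.float32)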

MetadataDict

MetadataDict = dict[str, Any]

Document metadata with arbitrary key-value pairs.
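A small illustration reusing the Document constructor shown earlier; the import path and the specific keys are assumptions for the sketch:

from simplevecdb.types import Document, MetadataDict

meta: MetadataDict = {"source": "wiki", "year": 2024, "verified": True}
doc = Document(
    id=2,
    page_content="Berlin is the capital of Germany.",
    metadata=meta,
    embedding=[0.1, 0.2, 0.3],
)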

FilterDict

FilterDict = dict[str, Any]

Metadata filter for search operations. Supports equality matching.

Example:

results = collection.similarity_search(
    query_vector,
    k=10,
    filter={"category": "tech", "verified": True}
)