

ClusterDescriptionModel

Model for generating cluster descriptions using LLMs. Similar to SummaryModel, this handles the LLM interaction for generating cluster names and descriptions with configurable parameters.

Constructor

ClusterDescriptionModel(
    model: Union[str, "KnownModelName"] = "openai/gpt-4o-mini",
    max_concurrent_requests: int = 50,
    temperature: float = 0.2,
    checkpoint_filename: str = "clusters",
    console: Optional[Console] = None,
)
  • model (Union[str, KnownModelName], default: "openai/gpt-4o-mini"): Model identifier (e.g., "openai/gpt-4o-mini")
  • max_concurrent_requests (int, default: 50): Maximum number of concurrent API requests
  • temperature (float, default: 0.2): LLM sampling temperature for generation
  • checkpoint_filename (str, default: "clusters"): Filename for checkpointing
  • console (Optional[Console], default: None): Rich console for progress tracking

Methods

generate_clusters()

Generate clusters from a mapping of cluster IDs to summaries.
async def generate_clusters(
    cluster_id_to_summaries: Dict[int, List[ConversationSummary]],
    prompt: str = DEFAULT_CLUSTER_PROMPT,
    max_contrastive_examples: int = 10,
) -> List[Cluster]
  • cluster_id_to_summaries (Dict[int, List[ConversationSummary]], required): Mapping of cluster IDs to their conversation summaries
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • max_contrastive_examples (int, default: 10): Number of contrastive examples drawn from other clusters
  • Returns List[Cluster]: List of clusters with generated names and descriptions
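The max_contrastive_examples parameter controls how many summaries from other clusters are shown to the model as contrast when naming a cluster. A minimal sketch of that selection, using plain strings in place of ConversationSummary objects (the helper name pick_contrastive_examples is illustrative, not part of Kura's API):

```python
import random

def pick_contrastive_examples(cluster_id, cluster_id_to_summaries, max_contrastive_examples=10):
    """Gather up to max_contrastive_examples summaries from *other* clusters.

    Illustrative sketch only; Kura's actual selection logic may differ.
    """
    # Pool every summary that does not belong to the target cluster
    others = [
        summary
        for cid, summaries in cluster_id_to_summaries.items()
        if cid != cluster_id
        for summary in summaries
    ]
    if len(others) <= max_contrastive_examples:
        return others
    # Sample without replacement when the pool exceeds the cap
    return random.sample(others, max_contrastive_examples)
```

The cap matters because contrastive examples consume prompt tokens: a larger value sharpens cluster names against neighbors at the cost of longer, more expensive LLM calls.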

generate_cluster_description()

Generate a cluster description from summaries with contrastive examples.
async def generate_cluster_description(
    summaries: List[ConversationSummary],
    contrastive_examples: List[ConversationSummary],
    semaphore: Semaphore,
    client: "AsyncInstructor",
    prompt: str = DEFAULT_CLUSTER_PROMPT,
) -> Cluster
  • summaries (List[ConversationSummary], required): Summaries in this cluster
  • contrastive_examples (List[ConversationSummary], required): Examples from other clusters used for contrast
  • semaphore (Semaphore, required): Asyncio semaphore for rate limiting
  • client (AsyncInstructor, required): Instructor async client for LLM calls
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • Returns Cluster: Cluster with generated name and description
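The semaphore argument follows the standard asyncio rate-limiting pattern: at most N description calls run at any one time. A self-contained sketch of that pattern, where fake_llm_call stands in for the real Instructor/LLM call:

```python
import asyncio

async def fake_llm_call(name: str) -> str:
    # Placeholder for the real LLM request (and its network latency)
    await asyncio.sleep(0)
    return f"description for {name}"

async def describe(semaphore: asyncio.Semaphore, name: str) -> str:
    # The semaphore is released automatically on exit, even on errors
    async with semaphore:
        return await fake_llm_call(name)

async def describe_all(names: list[str], max_concurrent: int = 2) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    # All tasks are created up front; the semaphore throttles execution
    return await asyncio.gather(*(describe(sem, n) for n in names))

results = asyncio.run(describe_all(["cluster-0", "cluster-1", "cluster-2"]))
```

This is the same mechanism ClusterDescriptionModel uses to honor its max_concurrent_requests setting.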

generate_base_clusters_from_conversation_summaries()

Cluster conversation summaries using embeddings. This is the main entry point for generating base clusters from conversation summaries.
async def generate_base_clusters_from_conversation_summaries(
    summaries: List[ConversationSummary],
    embedding_model: Optional[BaseEmbeddingModel] = None,
    clustering_method: Optional[BaseClusteringMethod] = None,
    clustering_model: Optional[BaseClusterDescriptionModel] = None,
    checkpoint_manager: Optional[BaseCheckpointManager] = None,
    max_contrastive_examples: int = 10,
    prompt: str = DEFAULT_CLUSTER_PROMPT,
    **kwargs,
) -> List[Cluster]
  • summaries (List[ConversationSummary], required): List of conversation summaries to cluster
  • embedding_model (Optional[BaseEmbeddingModel], default: None): Model for generating embeddings (defaults to OpenAI)
  • clustering_method (Optional[BaseClusteringMethod], default: None): Clustering algorithm (defaults to K-means)
  • clustering_model (Optional[BaseClusterDescriptionModel], default: None): Model for generating cluster descriptions
  • checkpoint_manager (Optional[BaseCheckpointManager], default: None): Optional checkpoint manager for caching
  • max_contrastive_examples (int, default: 10): Number of contrastive examples to use
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • Returns List[Cluster]: List of clusters with generated names and descriptions
Example:
from kura.cluster import (
    generate_base_clusters_from_conversation_summaries,
    ClusterDescriptionModel,
    KmeansClusteringModel
)
from kura.embedding import OpenAIEmbeddingModel

# Use default models
clusters = await generate_base_clusters_from_conversation_summaries(
    summaries=conversation_summaries
)

# Or customize each component
clusters = await generate_base_clusters_from_conversation_summaries(
    summaries=conversation_summaries,
    embedding_model=OpenAIEmbeddingModel(model_name="text-embedding-3-large"),
    clustering_method=KmeansClusteringModel(clusters_per_group=15),
    clustering_model=ClusterDescriptionModel(model="openai/gpt-4o"),
    max_contrastive_examples=20
)

KmeansClusteringModel

K-means based clustering method for grouping conversation summaries.

Constructor

KmeansClusteringModel(clusters_per_group: int = 10)
  • clusters_per_group (int, default: 10): Target number of items per cluster; the actual cluster count is calculated as ceil(n_items / clusters_per_group)
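For example, the ceiling division means 95 items at the default clusters_per_group of 10 still produce 10 clusters:

```python
import math

n_items = 95
clusters_per_group = 10

# ceiling division: 95 items / 10 per cluster -> 10 clusters
n_clusters = math.ceil(n_items / clusters_per_group)
print(n_clusters)  # -> 10
```

Note that clusters_per_group sets an approximate group size, not the number of clusters directly.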

Methods

cluster()

Perform K-means clustering on items with embeddings.
def cluster(
    items: list[dict[str, Union[ConversationSummary, list[float]]]]
) -> dict[int, list[ConversationSummary]]
  • items (list[dict[str, Union[ConversationSummary, list[float]]]], required): List of items with "embedding" and "item" keys
  • Returns dict[int, list[ConversationSummary]]: Dictionary mapping cluster IDs to their conversation summaries
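A sketch of the expected input shape, using strings in place of ConversationSummary objects and made-up two-dimensional vectors in place of real embeddings:

```python
# Stand-ins: real code would use ConversationSummary objects and an
# embedding model's output instead of these toy values.
summaries = ["billing question", "refund request", "API error"]
embeddings = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]

items = [
    {"item": summary, "embedding": embedding}
    for summary, embedding in zip(summaries, embeddings)
]
# Each entry now carries the "embedding" and "item" keys that cluster()
# expects, so the list can be passed as model.cluster(items).
```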

Clustering Method Classes

These classes implement different clustering algorithms that can be used with ClusterDescriptionModel or directly for custom implementations.

KmeansClusteringMethod

Standard K-means clustering implementation using scikit-learn. Best for medium-sized datasets (up to 10k items) where you want consistent, reproducible results.
from kura.k_means import KmeansClusteringMethod

method = KmeansClusteringMethod(clusters_per_group=10)
clusters_per_group
int
default:"10"
Target number of items per cluster. Number of clusters = ceil(n_items / clusters_per_group)

Usage

from kura.k_means import KmeansClusteringMethod
from kura.cluster import ClusterDescriptionModel

# Initialize clustering method
clustering_method = KmeansClusteringMethod(clusters_per_group=15)

# Use with ClusterDescriptionModel
model = ClusterDescriptionModel()

# The clustering method is used internally during cluster generation
# to group similar conversation summaries

MiniBatchKmeansClusteringMethod

Memory-efficient MiniBatch K-means implementation for large datasets (100k+ items). Processes data in configurable batches to reduce memory usage.
from kura.k_means import MiniBatchKmeansClusteringMethod

method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,
    batch_size=1000,
    max_iter=100,
    random_state=42
)
  • clusters_per_group (int, default: 10): Target number of items per cluster
  • batch_size (int, default: 1000): Size of mini-batches for processing. Larger batches use more memory but may be more stable; adjust based on available RAM.
  • max_iter (int, default: 100): Maximum number of iterations for convergence
  • random_state (int, default: 42): Random seed; set to a fixed value for reproducible results across runs

Key Differences from Standard K-means

  • Lower memory usage: Processes data in configurable batch sizes
  • Faster convergence: Updates centroids incrementally
  • Slightly less accurate: Results may vary between runs due to stochastic processing
  • Better scalability: Handles large datasets without memory issues

When to Use

  • Processing 100k+ conversation summaries
  • Limited RAM availability
  • Need faster clustering speed
  • Can tolerate minor result variations

Usage Example

from kura.k_means import MiniBatchKmeansClusteringMethod
from kura.cluster import ClusterDescriptionModel

# For large-scale clustering
method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=20,     # Larger groups for big datasets
    batch_size=5000,           # Process 5k items at a time
    max_iter=50,               # Fewer iterations for speed
    random_state=42            # Reproducible results
)

# Use in pipeline
model = ClusterDescriptionModel()
# Pass method to clustering operations

HDBSCANClusteringMethod

Density-based clustering that automatically determines the number of clusters and identifies outliers. Best for datasets with natural density variations and noise.
from kura.hdbscan import HDBSCANClusteringMethod

method = HDBSCANClusteringMethod(
    min_cluster_size=10,
    min_samples=None,
    cluster_selection_epsilon=0.0,
    alpha=1.0,
    cluster_selection_method="eom",
    metric="euclidean"
)
  • min_cluster_size (int, default: 10): Minimum size of clusters; smaller groupings are treated as noise/outliers
  • min_samples (Optional[int], default: None): Number of samples in a neighborhood for a point to be considered a core point; if None, defaults to min_cluster_size
  • cluster_selection_epsilon (float, default: 0.0): Distance threshold for merging clusters; clusters closer than this value are merged
  • alpha (float, default: 1.0): Distance scaling parameter for robust single linkage; higher values make clustering more conservative
  • cluster_selection_method (str, default: "eom"): Method for selecting clusters from the tree: "eom" (Excess of Mass; the default, more conservative) or "leaf" (leaf clustering; produces more, smaller clusters)
  • metric (str, default: "euclidean"): Distance metric (euclidean, manhattan, cosine, l1, l2, etc.)

Key Features

  • Automatic cluster detection: No need to specify number of clusters
  • Outlier identification: Identifies noise points that don’t fit any cluster
  • Variable density: Handles clusters of different densities
  • No assumptions: Doesn’t assume spherical cluster shapes

When to Use

  • Don’t know how many clusters to expect
  • Dataset has natural outliers or noise
  • Clusters have varying densities
  • Need more flexible cluster shapes than K-means provides

Usage Example

from kura.hdbscan import HDBSCANClusteringMethod
from kura.cluster import ClusterDescriptionModel

# Initialize with appropriate parameters
method = HDBSCANClusteringMethod(
    min_cluster_size=15,                  # Minimum 15 conversations per cluster
    cluster_selection_epsilon=0.5,        # Merge very similar clusters
    cluster_selection_method="eom",       # Conservative clustering
    metric="cosine"                       # Good for text embeddings
)

# Use in custom clustering pipeline
# Note: Outliers are automatically reassigned to a separate cluster

Noise Handling

HDBSCAN identifies outliers (noise points) and assigns them cluster ID -1. Kura automatically reassigns these to a separate cluster with the next available ID to maintain compatibility with the pipeline.
# Example: 1000 items clustered
# - 950 items in 10 dense clusters (IDs 0-9)
# - 50 outliers automatically moved to cluster ID 10
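The reassignment described above can be sketched as a small label transform (illustrative only; Kura's internal implementation may differ):

```python
def reassign_noise(labels: list[int]) -> list[int]:
    """Move HDBSCAN's noise label (-1) into the next unused cluster ID."""
    real_labels = [label for label in labels if label != -1]
    # If everything was noise, the outlier cluster becomes ID 0
    next_id = max(real_labels, default=-1) + 1
    return [next_id if label == -1 else label for label in labels]

# Labels from clusters 0-2 plus two outliers: outliers land in cluster 3
print(reassign_noise([0, 1, -1, 2, -1]))  # -> [0, 1, 3, 2, 3]
```

Keeping outliers in their own cluster, rather than discarding them, means every input summary survives into the downstream description step.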