

ClusterDescriptionModel

Model for generating cluster descriptions using LLMs. Similar to SummaryModel, this handles the LLM interaction for generating cluster names and descriptions with configurable parameters.

Constructor

ClusterDescriptionModel(
    model: Union[str, "KnownModelName"] = "openai/gpt-4o-mini",
    max_concurrent_requests: int = 50,
    temperature: float = 0.2,
    checkpoint_filename: str = "clusters",
    console: Optional[Console] = None,
)
  • model (Union[str, KnownModelName], default: "openai/gpt-4o-mini"): Model identifier (e.g., "openai/gpt-4o-mini")
  • max_concurrent_requests (int, default: 50): Maximum number of concurrent API requests
  • temperature (float, default: 0.2): LLM sampling temperature for generation
  • checkpoint_filename (str, default: "clusters"): Filename for checkpointing
  • console (Optional[Console], default: None): Rich console for progress tracking

Methods

generate_clusters()

Generate clusters from a mapping of cluster IDs to summaries.
async def generate_clusters(
    cluster_id_to_summaries: Dict[int, List[ConversationSummary]],
    prompt: str = DEFAULT_CLUSTER_PROMPT,
    max_contrastive_examples: int = 10,
) -> List[Cluster]
  • cluster_id_to_summaries (Dict[int, List[ConversationSummary]], required): Mapping of cluster IDs to their conversation summaries
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • max_contrastive_examples (int, default: 10): Number of contrastive examples drawn from other clusters
  • Returns List[Cluster]: List of clusters with generated names and descriptions
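The max_contrastive_examples parameter controls how many summaries from other clusters are shown to the model as contrast when naming a cluster. A minimal sketch of that selection, using plain strings in place of ConversationSummary objects (the helper name pick_contrastive_examples is illustrative, not part of Kura's API):

```python
import random

def pick_contrastive_examples(cluster_id, cluster_id_to_summaries, max_contrastive_examples=10):
    """Gather up to max_contrastive_examples summaries from *other* clusters.

    Illustrative sketch only; Kura's actual selection logic may differ.
    """
    # Pool every summary that does not belong to the target cluster
    others = [
        summary
        for cid, summaries in cluster_id_to_summaries.items()
        if cid != cluster_id
        for summary in summaries
    ]
    if len(others) <= max_contrastive_examples:
        return others
    # Sample without replacement when the pool exceeds the cap
    return random.sample(others, max_contrastive_examples)
```

The cap matters because contrastive examples consume prompt tokens: a larger value sharpens cluster names against neighbors at the cost of longer, more expensive LLM calls.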

generate_cluster_description()

Generate a cluster description from summaries with contrastive examples.
async def generate_cluster_description(
    summaries: List[ConversationSummary],
    contrastive_examples: List[ConversationSummary],
    semaphore: Semaphore,
    client: "AsyncInstructor",
    prompt: str = DEFAULT_CLUSTER_PROMPT,
) -> Cluster
  • summaries (List[ConversationSummary], required): Summaries in this cluster
  • contrastive_examples (List[ConversationSummary], required): Examples from other clusters used for contrast
  • semaphore (Semaphore, required): Asyncio semaphore for rate limiting
  • client (AsyncInstructor, required): Instructor async client for LLM calls
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • Returns Cluster: Cluster with generated name and description
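The semaphore argument follows the standard asyncio rate-limiting pattern: at most N description calls run at any one time. A self-contained sketch of that pattern, where fake_llm_call stands in for the real Instructor/LLM call:

```python
import asyncio

async def fake_llm_call(name: str) -> str:
    # Placeholder for the real LLM request (and its network latency)
    await asyncio.sleep(0)
    return f"description for {name}"

async def describe(semaphore: asyncio.Semaphore, name: str) -> str:
    # The semaphore is released automatically on exit, even on errors
    async with semaphore:
        return await fake_llm_call(name)

async def describe_all(names: list[str], max_concurrent: int = 2) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)
    # All tasks are created up front; the semaphore throttles execution
    return await asyncio.gather(*(describe(sem, n) for n in names))

results = asyncio.run(describe_all(["cluster-0", "cluster-1", "cluster-2"]))
```

This is the same mechanism ClusterDescriptionModel uses to honor its max_concurrent_requests setting.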

generate_base_clusters_from_conversation_summaries()

Cluster conversation summaries using embeddings. This is the main entry point for generating base clusters from conversation summaries.
async def generate_base_clusters_from_conversation_summaries(
    summaries: List[ConversationSummary],
    embedding_model: Optional[BaseEmbeddingModel] = None,
    clustering_method: Optional[BaseClusteringMethod] = None,
    clustering_model: Optional[BaseClusterDescriptionModel] = None,
    checkpoint_manager: Optional[BaseCheckpointManager] = None,
    max_contrastive_examples: int = 10,
    prompt: str = DEFAULT_CLUSTER_PROMPT,
    **kwargs,
) -> List[Cluster]
  • summaries (List[ConversationSummary], required): List of conversation summaries to cluster
  • embedding_model (Optional[BaseEmbeddingModel], default: None): Model for generating embeddings (defaults to OpenAI)
  • clustering_method (Optional[BaseClusteringMethod], default: None): Clustering algorithm (defaults to K-means)
  • clustering_model (Optional[BaseClusterDescriptionModel], default: None): Model for generating cluster descriptions
  • checkpoint_manager (Optional[BaseCheckpointManager], default: None): Optional checkpoint manager for caching
  • max_contrastive_examples (int, default: 10): Number of contrastive examples to use
  • prompt (str, default: DEFAULT_CLUSTER_PROMPT): Custom prompt for cluster generation
  • Returns List[Cluster]: List of clusters with generated names and descriptions
Example:
from kura.cluster import (
    generate_base_clusters_from_conversation_summaries,
    ClusterDescriptionModel,
    KmeansClusteringModel
)
from kura.embedding import OpenAIEmbeddingModel

# Use default models
clusters = await generate_base_clusters_from_conversation_summaries(
    summaries=conversation_summaries
)

# Or customize each component
clusters = await generate_base_clusters_from_conversation_summaries(
    summaries=conversation_summaries,
    embedding_model=OpenAIEmbeddingModel(model_name="text-embedding-3-large"),
    clustering_method=KmeansClusteringModel(clusters_per_group=15),
    clustering_model=ClusterDescriptionModel(model="openai/gpt-4o"),
    max_contrastive_examples=20
)

KmeansClusteringModel

K-means based clustering method for grouping conversation summaries.

Constructor

KmeansClusteringModel(clusters_per_group: int = 10)
  • clusters_per_group (int, default: 10): Target number of items per cluster; the actual cluster count is calculated as ceil(n_items / clusters_per_group)
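For example, the ceiling division means 95 items at the default clusters_per_group of 10 still produce 10 clusters:

```python
import math

n_items = 95
clusters_per_group = 10

# ceiling division: 95 items / 10 per cluster -> 10 clusters
n_clusters = math.ceil(n_items / clusters_per_group)
print(n_clusters)  # -> 10
```

Note that clusters_per_group sets an approximate group size, not the number of clusters directly.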

Methods

cluster()

Perform K-means clustering on items with embeddings.
def cluster(
    items: list[dict[str, Union[ConversationSummary, list[float]]]]
) -> dict[int, list[ConversationSummary]]
  • items (list[dict[str, Union[ConversationSummary, list[float]]]], required): List of items with "embedding" and "item" keys
  • Returns dict[int, list[ConversationSummary]]: Dictionary mapping cluster IDs to their conversation summaries
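A sketch of the expected input shape, using strings in place of ConversationSummary objects and made-up two-dimensional vectors in place of real embeddings:

```python
# Stand-ins: real code would use ConversationSummary objects and an
# embedding model's output instead of these toy values.
summaries = ["billing question", "refund request", "API error"]
embeddings = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]

items = [
    {"item": summary, "embedding": embedding}
    for summary, embedding in zip(summaries, embeddings)
]
# Each entry now carries the "embedding" and "item" keys that cluster()
# expects, so the list can be passed as model.cluster(items).
```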

Clustering Method Classes

These classes implement different clustering algorithms that can be used with ClusterDescriptionModel or directly for custom implementations.

KmeansClusteringMethod

Standard K-means clustering implementation using scikit-learn. Best for medium-sized datasets (up to 10k items) where you want consistent, reproducible results.
from kura.k_means import KmeansClusteringMethod

method = KmeansClusteringMethod(clusters_per_group=10)
clusters_per_group
int
default:"10"
Target number of items per cluster. Number of clusters = ceil(n_items / clusters_per_group)

Usage

from kura.k_means import KmeansClusteringMethod
from kura.cluster import ClusterDescriptionModel

# Initialize clustering method
clustering_method = KmeansClusteringMethod(clusters_per_group=15)

# Use with ClusterDescriptionModel
model = ClusterDescriptionModel()

# The clustering method is used internally during cluster generation
# to group similar conversation summaries

MiniBatchKmeansClusteringMethod

Memory-efficient MiniBatch K-means implementation for large datasets (100k+ items). Processes data in configurable batches to reduce memory usage.
from kura.k_means import MiniBatchKmeansClusteringMethod

method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,
    batch_size=1000,
    max_iter=100,
    random_state=42
)
  • clusters_per_group (int, default: 10): Target number of items per cluster
  • batch_size (int, default: 1000): Size of mini-batches for processing. Larger batches use more memory but may be more stable; adjust based on available RAM.
  • max_iter (int, default: 100): Maximum number of iterations for convergence
  • random_state (int, default: 42): Random seed; set to a fixed value for reproducible results across runs

Key Differences from Standard K-means

  • Lower memory usage: Processes data in configurable batch sizes
  • Faster convergence: Updates centroids incrementally
  • Slightly less accurate: Results may vary between runs due to stochastic processing
  • Better scalability: Handles large datasets without memory issues

When to Use

  • Processing 100k+ conversation summaries
  • Limited RAM availability
  • Need faster clustering speed
  • Can tolerate minor result variations

Usage Example

from kura.k_means import MiniBatchKmeansClusteringMethod
from kura.cluster import ClusterDescriptionModel

# For large-scale clustering
method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=20,     # Larger groups for big datasets
    batch_size=5000,           # Process 5k items at a time
    max_iter=50,               # Fewer iterations for speed
    random_state=42            # Reproducible results
)

# Use in pipeline
model = ClusterDescriptionModel()
# Pass method to clustering operations

HDBSCANClusteringMethod

Density-based clustering that automatically determines the number of clusters and identifies outliers. Best for datasets with natural density variations and noise.
from kura.hdbscan import HDBSCANClusteringMethod

method = HDBSCANClusteringMethod(
    min_cluster_size=10,
    min_samples=None,
    cluster_selection_epsilon=0.0,
    alpha=1.0,
    cluster_selection_method="eom",
    metric="euclidean"
)
  • min_cluster_size (int, default: 10): Minimum size of clusters; smaller groupings are treated as noise/outliers
  • min_samples (Optional[int], default: None): Number of samples in a neighborhood for a point to be considered a core point; if None, defaults to min_cluster_size
  • cluster_selection_epsilon (float, default: 0.0): Distance threshold for merging clusters; clusters closer than this value are merged
  • alpha (float, default: 1.0): Distance scaling parameter for robust single linkage; higher values make clustering more conservative
  • cluster_selection_method (str, default: "eom"): Method for selecting clusters from the tree: "eom" (Excess of Mass; the default, more conservative) or "leaf" (leaf clustering; produces more, smaller clusters)
  • metric (str, default: "euclidean"): Distance metric (euclidean, manhattan, cosine, l1, l2, etc.)

Key Features

  • Automatic cluster detection: No need to specify number of clusters
  • Outlier identification: Identifies noise points that don’t fit any cluster
  • Variable density: Handles clusters of different densities
  • No assumptions: Doesn’t assume spherical cluster shapes

When to Use

  • Don’t know how many clusters to expect
  • Dataset has natural outliers or noise
  • Clusters have varying densities
  • Need more flexible cluster shapes than K-means provides

Usage Example

from kura.hdbscan import HDBSCANClusteringMethod
from kura.cluster import ClusterDescriptionModel

# Initialize with appropriate parameters
method = HDBSCANClusteringMethod(
    min_cluster_size=15,                  # Minimum 15 conversations per cluster
    cluster_selection_epsilon=0.5,        # Merge very similar clusters
    cluster_selection_method="eom",       # Conservative clustering
    metric="cosine"                       # Good for text embeddings
)

# Use in custom clustering pipeline
# Note: Outliers are automatically reassigned to a separate cluster

Noise Handling

HDBSCAN identifies outliers (noise points) and assigns them cluster ID -1. Kura automatically reassigns these to a separate cluster with the next available ID to maintain compatibility with the pipeline.
# Example: 1000 items clustered
# - 950 items in 10 dense clusters (IDs 0-9)
# - 50 outliers automatically moved to cluster ID 10
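The reassignment described above can be sketched as a small label transform (illustrative only; Kura's internal implementation may differ):

```python
def reassign_noise(labels: list[int]) -> list[int]:
    """Move HDBSCAN's noise label (-1) into the next unused cluster ID."""
    real_labels = [label for label in labels if label != -1]
    # If everything was noise, the outlier cluster becomes ID 0
    next_id = max(real_labels, default=-1) + 1
    return [next_id if label == -1 else label for label in labels]

# Labels from clusters 0-2 plus two outliers: outliers land in cluster 3
print(reassign_noise([0, 1, -1, 2, -1]))  # -> [0, 1, 3, 2, 3]
```

Keeping outliers in their own cluster, rather than discarding them, means every input summary survives into the downstream description step.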