Documentation Index
Fetch the complete documentation index at: https://mintlify.com/jxnl/kura/llms.txt
Use this file to discover all available pages before exploring further.
ClusterDescriptionModel
Model for generating cluster descriptions using LLMs. Similar to SummaryModel, this handles the LLM interaction for generating cluster names and descriptions with configurable parameters.
Constructor
- Model identifier (e.g., “openai/gpt-4o-mini”)
- Maximum concurrent API requests
- LLM temperature for generation
- Filename for checkpointing
- Rich console for progress tracking
Methods
generate_clusters()
Generate clusters from a mapping of cluster IDs to summaries.
Parameters:
- Mapping of cluster IDs to their conversation summaries
- Custom prompt for cluster generation
- Number of contrastive examples from other clusters to use
Returns: List of clusters with generated names and descriptions
generate_cluster_description()
Generate a cluster description from summaries with contrastive examples.
Parameters:
- Summaries in this cluster
- Examples from other clusters for contrast
- Asyncio semaphore for rate limiting
- Instructor async client for LLM calls
- Custom prompt for cluster generation
Returns: Cluster with generated name and description
generate_base_clusters_from_conversation_summaries()
Cluster conversation summaries using embeddings. This is the main entry point for generating base clusters from conversation summaries.
Parameters:
- List of conversation summaries to cluster
- Model for generating embeddings (defaults to OpenAI)
- Clustering algorithm (defaults to K-means)
- Model for generating cluster descriptions
- Optional checkpoint manager for caching
- Number of contrastive examples to use
- Custom prompt for cluster generation
Returns: List of clusters with generated names and descriptions
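The entry point ties together three stages: embed the summaries, group the embeddings, then describe each group. A minimal stand-alone sketch of that flow, using hypothetical stand-in functions rather than Kura's real models:

```python
def embed(summaries):
    # Stand-in for an embedding model: each summary maps to a trivial
    # 1-D "embedding" based on its length.
    return [[float(len(s))] for s in summaries]

def kmeans_like_cluster(items, clusters_per_group=2):
    # Stand-in for K-means: sort by the 1-D embedding and chunk
    # neighbouring items into groups of clusters_per_group.
    items = sorted(items, key=lambda it: it["embedding"][0])
    groups = {}
    for idx, it in enumerate(items):
        groups.setdefault(idx // clusters_per_group, []).append(it["item"])
    return groups

def describe(cluster_id, summaries):
    # Stand-in for the LLM call that names and describes a cluster.
    return {"id": cluster_id, "name": f"cluster-{cluster_id}", "items": summaries}

summaries = ["short", "a medium summary", "quite a long summary here", "tiny"]
items = [{"embedding": e, "item": s} for e, s in zip(embed(summaries), summaries)]
clusters = [describe(cid, group) for cid, group in kmeans_like_cluster(items).items()]
```

The real pipeline swaps each stand-in for the corresponding model above (embedding model, clustering method, description model) and adds checkpointing and concurrency control.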
KmeansClusteringModel
K-means based clustering method for grouping conversation summaries.
Constructor
- Target number of items per cluster (the actual cluster count is calculated as ceiling(n_items / clusters_per_group))
Methods
cluster()
Perform K-means clustering on items with embeddings.
Parameters:
- List of items with “embedding” and “item” keys
Returns: Dictionary mapping cluster IDs to their conversation summaries
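The constructor's clusters_per_group parameter determines the cluster count via the ceiling formula above. A quick check of the arithmetic:

```python
import math

def expected_cluster_count(n_items: int, clusters_per_group: int) -> int:
    # ceiling(n_items / clusters_per_group), per the constructor description above
    return math.ceil(n_items / clusters_per_group)

print(expected_cluster_count(100, 10))  # 10
print(expected_cluster_count(105, 10))  # 11 (the remainder forces one extra cluster)
print(expected_cluster_count(5, 10))    # 1 (never fewer than one cluster)
```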
Clustering Method Classes
These classes implement different clustering algorithms that can be used with ClusterDescriptionModel or directly for custom implementations.
KmeansClusteringMethod
Standard K-means clustering implementation using scikit-learn. Best for medium-sized datasets (up to 10k items) where you want consistent, reproducible results.
Constructor
- Target number of items per cluster. Number of clusters = ceil(n_items / clusters_per_group)
Usage
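Kura's wrapper is configured with clusters_per_group rather than a cluster count; the scikit-learn step it builds on looks roughly like this (data and parameter values here are illustrative, not Kura's defaults):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D "embeddings".
embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
])

# clusters_per_group = 3 on 6 items -> ceil(6 / 3) = 2 clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(embeddings)
# Each group of nearby points receives the same label.
```

A fixed random_state is what makes the results reproducible across runs, per the description above.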
MiniBatchKmeansClusteringMethod
Memory-efficient MiniBatch K-means implementation for large datasets (100k+ items). Processes data in configurable batches to reduce memory usage.
Constructor
- Target number of items per cluster
- Size of mini-batches for processing. Larger batches use more memory but may be more stable; adjust based on available RAM.
- Maximum number of iterations for convergence
- Random seed for reproducibility. Set a fixed value for consistent results across runs.
Key Differences from Standard K-means
- Lower memory usage: Processes data in configurable batch sizes
- Faster convergence: Updates centroids incrementally
- Slightly less accurate: Results may vary between runs due to stochastic processing
- Better scalability: Handles large datasets without memory issues
When to Use
- Processing 100k+ conversation summaries
- Limited RAM availability
- Need faster clustering speed
- Can tolerate minor result variations
Usage Example
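Kura's exact constructor signature is not shown here; as a sketch of the underlying scikit-learn call, with a small synthetic dataset standing in for summary embeddings:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two dense blobs of 8-D points, standing in for conversation embeddings.
blob_a = rng.normal(loc=0.0, scale=0.2, size=(500, 8))
blob_b = rng.normal(loc=4.0, scale=0.2, size=(500, 8))
embeddings = np.vstack([blob_a, blob_b])

# Batches of 100 bound memory usage; a fixed random_state keeps runs reproducible.
mbk = MiniBatchKMeans(n_clusters=2, batch_size=100, max_iter=100,
                      random_state=42, n_init=3)
labels = mbk.fit_predict(embeddings)
```

Centroids are updated incrementally from each mini-batch, which is why memory stays flat as the dataset grows but results can drift slightly between runs without a fixed seed.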
HDBSCANClusteringMethod
Density-based clustering that automatically determines the number of clusters and identifies outliers. Best for datasets with natural density variations and noise.
Constructor
- Minimum size of clusters. Smaller clusters are considered noise/outliers.
- Number of samples in a neighborhood for a point to be considered a core point. If None, defaults to min_cluster_size.
- Distance threshold for merging clusters. Clusters closer than this value will be merged.
- Distance scaling parameter for robust single linkage. Higher values make clustering more conservative.
- Method for selecting clusters from the tree:
  - eom: Excess of Mass (default, more conservative)
  - leaf: Leaf clustering (creates more, smaller clusters)
- Distance metric: euclidean, manhattan, cosine, l1, l2, etc.
Key Features
- Automatic cluster detection: No need to specify number of clusters
- Outlier identification: Identifies noise points that don’t fit any cluster
- Variable density: Handles clusters of different densities
- No assumptions: Doesn’t assume spherical cluster shapes
When to Use
- Don’t know how many clusters to expect
- Dataset has natural outliers or noise
- Clusters have varying densities
- Need more flexible cluster shapes than K-means provides
Usage Example
Noise Handling
HDBSCAN identifies outliers (noise points) and assigns them cluster ID -1. Kura automatically reassigns these to a separate cluster with the next available ID to maintain compatibility with the pipeline.
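The reassignment described above can be sketched in a few lines (the exact helper Kura uses is not shown here; this illustrates the stated behavior):

```python
def reassign_noise(labels: list[int]) -> list[int]:
    # HDBSCAN marks outliers with -1; move them into a fresh cluster
    # whose ID is one past the highest existing cluster ID.
    non_noise = [l for l in labels if l != -1]
    next_id = max(non_noise) + 1 if non_noise else 0
    return [next_id if l == -1 else l for l in labels]

print(reassign_noise([0, 0, 1, -1, 2, -1]))  # [0, 0, 1, 3, 2, 3]
```

Grouping all noise points into a single extra cluster keeps every item addressable by a non-negative cluster ID, which is what the rest of the pipeline expects.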