Overview
Kura provides a flexible checkpoint management system for persisting intermediate pipeline results. The system supports multiple backends with different performance and feature characteristics.
Available Checkpoint Managers
JSONLCheckpointManager
The traditional JSONL file-based checkpoint system. This is the default and most compatible option.
Parameters:
- Directory path for saving checkpoints
- Whether checkpointing is enabled
Methods
Load data from a checkpoint file if it exists.
Parameters:
- filename (str): Name of the checkpoint file
- model_class (type[T]): Pydantic model class for deserializing the data
Save data to a checkpoint file.
Parameters:
- filename (str): Name of the checkpoint file
- data (List[T]): List of model instances to save
List all available checkpoint files.
Returns: List of checkpoint filenames
Delete a checkpoint file.
Parameters:
- filename (str): Name of the checkpoint file to delete
Example Usage
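A minimal, standard-library-only sketch of the JSONL round trip such a manager performs. The helper names `save_jsonl` and `load_jsonl` are illustrative and not part of Kura's API:

```python
import json
from pathlib import Path

def save_jsonl(path: Path, records: list[dict]) -> None:
    # One JSON object per line -- the JSONL checkpoint format.
    with path.open("w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def load_jsonl(path: Path) -> list[dict]:
    # Return an empty list when the checkpoint does not exist,
    # mirroring the "load if it exists" behaviour described above.
    if not path.exists():
        return []
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]
```

In Kura itself, serialization goes through the Pydantic `model_class` shown in the method signatures above rather than raw dicts.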
HFDatasetCheckpointManager
HuggingFace datasets-based checkpoint system with advanced features like streaming, versioning, and cloud storage integration.
Parameters:
- Directory path for saving checkpoints locally
- Whether checkpointing is enabled
- HuggingFace Hub repository name for cloud storage (e.g., “username/repo-name”)
- HuggingFace Hub authentication token for private repositories
- Whether to use streaming mode by default for memory-efficient processing
- Compression algorithm to use. Options: "gzip", "lz4", "zstd", or None
Features
- Memory-mapped files: Efficient access without loading everything into memory
- Streaming support: Process datasets larger than available RAM
- Built-in compression: Automatic compression with multiple algorithms
- Version control: Track changes via HuggingFace Hub
- Rich querying: Filter and search capabilities
- Schema validation: Type safety with predefined schemas
Methods
Load data from a checkpoint using HuggingFace datasets.
Parameters:
- filename (str): Name of the checkpoint
- model_class (type[T]): Pydantic model class for deserializing
- streaming (Optional[bool]): Override default streaming mode
- checkpoint_type (Optional[str]): Type hint for deserialization
Save data to a checkpoint using HuggingFace datasets.
Parameters:
- filename (str): Name of the checkpoint (without extension)
- data (List[T]): List of model instances to save
- checkpoint_type (Optional[str]): Type hint for schema selection
Get information about a checkpoint dataset.
Parameters:
- filename (str): Name of the checkpoint
Returns:
- num_rows: Number of records
- num_columns: Number of columns
- column_names: List of column names
- features: Dataset schema information
- size_bytes: Total size on disk
Filter a checkpoint dataset without loading everything into memory.
Parameters:
- filename (str): Name of the checkpoint
- filter_fn (Callable[[Dict[str, Any]], bool]): Function to filter rows
- model_class (type[T]): Pydantic model class for results
- checkpoint_type (Optional[str]): Type hint for deserialization
Example Usage
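A usage sketch assembled from the parameters and methods described above. The import path, constructor keywords, and method names (`save_checkpoint`, `filter_dataset`) are assumptions and may differ from the released API:

```python
# Hypothetical sketch -- names follow the descriptions above,
# not a verified Kura API surface.
from kura.checkpoints import HFDatasetCheckpointManager  # assumed import path

manager = HFDatasetCheckpointManager(
    checkpoint_dir="./checkpoints",
    compression="zstd",   # "gzip", "lz4", "zstd", or None
    streaming=True,       # memory-efficient default for large datasets
)

# Save summaries with an explicit type hint for schema selection.
manager.save_checkpoint("summaries", data=summaries, checkpoint_type="summaries")

# Filter without materializing the full dataset in RAM.
short_summaries = manager.filter_dataset(
    "summaries",
    filter_fn=lambda row: len(row["summary"]) < 200,
    model_class=ConversationSummary,
)
```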
Supported Data Types
The HFDatasetCheckpointManager has predefined schemas for:
- conversations: Conversation objects with messages and metadata
- summaries: ConversationSummary objects with embeddings
- clusters: Cluster objects with hierarchical relationships
- projected_clusters: ProjectedCluster objects with 2D coordinates
ParquetCheckpointManager
Requires the optional parquet dependency group: pip install kura[parquet]
Parameters:
- Directory path for saving Parquet files
- Whether checkpointing is enabled
- Compression algorithm: snappy, gzip, brotli, lz4, or zstd
Features
- 50% space savings: Better compression than JSONL
- Faster loading: Columnar format optimized for analytics
- Type-safe schemas: PyArrow schema validation
- Multiple compression options: Choose speed vs size tradeoff
Example Usage
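A usage sketch assuming a constructor that mirrors the parameters above; the import path and keyword names are illustrative, not a verified API:

```python
# Hypothetical sketch -- requires: pip install kura[parquet]
from kura.checkpoints import ParquetCheckpointManager  # assumed import path

manager = ParquetCheckpointManager(
    checkpoint_dir="./checkpoints",
    compression="snappy",  # snappy, gzip, brotli, lz4, or zstd
)

manager.save_checkpoint("clusters", data=clusters)
clusters = manager.load_checkpoint("clusters", model_class=Cluster)
```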
Compression Options
| Algorithm | Speed | Compression | Best For |
|---|---|---|---|
| snappy | Fastest | Good | Default choice |
| lz4 | Very Fast | Good | Speed-critical |
| gzip | Medium | Better | Balanced |
| zstd | Medium | Best | Max compression |
| brotli | Slow | Best | Archival |
SQLCheckpointManager
Requires the optional sqlmodel dependency: pip install sqlmodel
SQLAlchemy database URL:
- SQLite: sqlite:///path/to/database.db
- PostgreSQL: postgresql://user:pass@localhost/dbname
- MySQL: mysql://user:pass@localhost/dbname
Whether checkpointing is enabled
Features
- Run tracking: Track multiple analysis runs with metadata
- Rich querying: SQL queries for filtering and analysis
- Relational data: Proper foreign keys and relationships
- Production-ready: PostgreSQL support for multi-user deployments
- ACID compliance: Reliable transactions and data integrity
Methods
Create a new analysis run for organizing checkpoints.
Parameters:
- name (str): Human-readable name for the run
- description (Optional[str]): Detailed description
- config (Optional[dict]): Configuration dictionary
Retrieve a specific run by ID.
Parameters:
- run_id (str): The run’s unique identifier
List all analysis runs.
Returns: List of RunTable objects ordered by creation time
Example Usage
PostgreSQL Production Setup
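A rough outline of the pieces involved, assuming standard SQLAlchemy conventions: install the optional dependency plus a PostgreSQL driver, then point the manager at a PostgreSQL URL of the form shown above (e.g. postgresql://user:pass@db-host/kura).

```shell
# Install the optional dependency plus a common PostgreSQL driver.
pip install sqlmodel psycopg2-binary
```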
MultiCheckpointManager
Coordinates multiple checkpoint backends simultaneously for redundancy and performance optimization.
Base Class
BaseCheckpointManager
Abstract base class that all checkpoint managers inherit from.
Configuration
Environment Variables
Checkpoint managers can be configured via environment variables.
In Pipeline
Checkpoint managers integrate with the Kura pipeline.
Performance Comparison
| Feature | JSONL | HuggingFace Datasets |
|---|---|---|
| Compatibility | High | Requires datasets package |
| Compression | None | 50%+ space savings |
| Loading Speed | Fast for small data | Faster for large data |
| Memory Usage | Loads all into RAM | Memory-mapped |
| Streaming | No | Yes |
| Cloud Storage | Manual | Built-in (Hub) |
| Querying | Manual filtering | Rich query support |
Best Practices
Use JSONLCheckpointManager for:
- Small to medium datasets (< 100K conversations)
- Maximum compatibility
- Simple local storage needs
Use HFDatasetCheckpointManager for:
- Large datasets (> 100K conversations)
- Memory-constrained environments
- Cloud storage and versioning needs
- Advanced filtering and querying
Enable compression when using HFDatasetCheckpointManager to reduce disk usage by 50% or more.
Use streaming mode for datasets that don’t fit in available RAM.
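The memory argument for streaming can be seen with plain Python iteration: a generator holds one record at a time instead of the whole file. The helpers `iter_jsonl` and `streaming_filter` below are illustrative, not Kura functions:

```python
import json
from collections.abc import Iterator

def iter_jsonl(path: str) -> Iterator[dict]:
    # Yields one record at a time; peak memory is bounded by the
    # largest single record, not by the size of the checkpoint.
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def streaming_filter(path: str, predicate) -> list[dict]:
    # Only rows that pass the predicate are ever accumulated.
    return [row for row in iter_jsonl(path) if predicate(row)]
```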