Kura supports multiple data sources for loading conversations, including HuggingFace datasets, Claude conversation exports, and custom data formats. This guide shows you how to load data from each source.
## Loading from HuggingFace Datasets
The `from_hf_dataset` method allows you to load conversations from any HuggingFace dataset with a compatible structure:
```python
from kura.types import Conversation

# Basic usage
conversations = Conversation.from_hf_dataset(
    "ivanleomk/synthetic-gemini-conversations",
    split="train"
)
```
### Limiting the Number of Conversations
For testing or smaller analyses, you can limit the number of conversations loaded:
```python
conversations = Conversation.from_hf_dataset(
    "ivanleomk/synthetic-gemini-conversations",
    split="train",
    max_conversations=1000  # Only load the first 1000 conversations
)
```
### Custom Field Mapping
If your dataset has a different structure, you can provide custom mapping functions:
```python
conversations = Conversation.from_hf_dataset(
    "your-dataset-name",
    split="train",
    chat_id_fn=lambda x: x["conversation_id"],
    created_at_fn=lambda x: x["timestamp"],
    messages_fn=lambda x: x["dialogue"],
    metadata_fn=lambda x: {
        "user_type": x.get("user_category"),
        "session_length": len(x["dialogue"])
    }
)
```
### Adding Metadata

Metadata enriches your analysis by providing additional context about each conversation:
```python
conversations = Conversation.from_hf_dataset(
    "allenai/WildChat-nontoxic",
    split="train",
    metadata_fn=lambda x: {
        "model": x["model"],
        "toxic": x["toxic"],
        "redacted": x["redacted"],
    }
)
```
This metadata will be carried through the analysis pipeline and can be used for filtering or additional analysis.
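For example, once loaded, you can filter conversations on their metadata before running the rest of the pipeline. A minimal sketch, using plain dicts as stand-ins for `Conversation` objects (the field names mirror the WildChat example above, but the filtering logic is generic):

```python
# Plain-dict stand-ins for Conversation objects with metadata attached
# via metadata_fn (hypothetical sample data for illustration).
conversations = [
    {"chat_id": "a", "metadata": {"model": "gpt-4", "toxic": False}},
    {"chat_id": "b", "metadata": {"model": "gpt-3.5-turbo", "toxic": True}},
    {"chat_id": "c", "metadata": {"model": "gpt-4", "toxic": False}},
]

# Keep only non-toxic GPT-4 conversations for a focused analysis
filtered = [
    c for c in conversations
    if c["metadata"]["model"] == "gpt-4" and not c["metadata"]["toxic"]
]
print(len(filtered))  # 2
```

The same pattern works for any metadata key you attach at load time.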
## Loading from Claude Conversation Dumps
If you have exported your Claude conversation history, Kura can parse it directly:
```python
from kura.types import Conversation

# Basic usage
conversations = Conversation.from_claude_conversation_dump("conversations.json")
```
You can extract additional metadata from the Claude export:
```python
conversations = Conversation.from_claude_conversation_dump(
    "conversations.json",
    metadata_fn=lambda x: {
        "project_name": x.get("project"),
        "has_attachments": len(x.get("attachments", [])) > 0,
    }
)
```
The Claude loader automatically:

- Converts message timestamps to ISO format
- Maps the `human` sender to the `user` role
- Maps the `assistant` sender to the `assistant` role
- Extracts text content from message objects
- Sorts messages by timestamp
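The normalization steps above can be sketched in plain Python. This is an illustrative reimplementation, not Kura's actual code, and the raw field names (`sender`, `text`, numeric `created_at`) are assumptions about the export format:

```python
from datetime import datetime, timezone

# Hypothetical sketch of what the Claude loader does per message.
ROLE_MAP = {"human": "user", "assistant": "assistant"}

def normalize_message(raw: dict) -> dict:
    return {
        "role": ROLE_MAP[raw["sender"]],
        "content": raw["text"],
        # Convert a Unix timestamp to an ISO-format string
        "created_at": datetime.fromtimestamp(
            raw["created_at"], tz=timezone.utc
        ).isoformat(),
    }

raw_messages = [
    {"sender": "assistant", "text": "Hi! How can I help?", "created_at": 1700000060},
    {"sender": "human", "text": "Hello", "created_at": 1700000000},
]

# Normalize, then sort by timestamp so replies follow their prompts
messages = sorted(
    (normalize_message(m) for m in raw_messages),
    key=lambda m: m["created_at"],
)
print([m["role"] for m in messages])  # ['user', 'assistant']
```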
## Creating Custom Conversations
For custom data sources, create Conversation objects directly:
```python
from kura.types import Conversation, Message
from datetime import datetime
from uuid import uuid4

# Create messages
messages = [
    Message(
        created_at=datetime.now(),
        role="user",
        content="How do I implement authentication?"
    ),
    Message(
        created_at=datetime.now(),
        role="assistant",
        content="To implement authentication, you can use..."
    )
]

# Create conversation
conversation = Conversation(
    chat_id=str(uuid4()),
    created_at=datetime.now(),
    messages=messages,
    metadata={
        "source": "support_system",
        "priority": "high"
    }
)

conversations = [conversation]  # Add to a list for processing
```
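For a real custom source you would typically build many conversations in a loop. A minimal sketch of that pattern, using plain dicts in place of `Conversation`/`Message` objects and hypothetical support-ticket rows:

```python
from datetime import datetime, timezone
from uuid import uuid4

# Hypothetical rows from a custom source, e.g. a support-ticket export.
rows = [
    {"question": "How do I reset my password?", "answer": "Go to Settings..."},
    {"question": "Can I export my data?", "answer": "Yes, under Account..."},
]

def row_to_conversation(row: dict) -> dict:
    """Turn one Q&A row into a conversation (plain-dict stand-in)."""
    now = datetime.now(timezone.utc)
    return {
        "chat_id": str(uuid4()),
        "created_at": now,
        "messages": [
            {"role": "user", "content": row["question"], "created_at": now},
            {"role": "assistant", "content": row["answer"], "created_at": now},
        ],
        "metadata": {"source": "support_system"},
    }

conversations = [row_to_conversation(r) for r in rows]
print(len(conversations))  # 2
```

Swapping the dicts for the real `Conversation` and `Message` constructors shown above gives you a bulk loader for any tabular source.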
## Loading from Multiple Sources
You can combine conversations from different sources:
```python
from kura.types import Conversation

# Load from multiple sources
hf_conversations = Conversation.from_hf_dataset(
    "dataset1",
    metadata_fn=lambda x: {"source": "hf_dataset"}
)

claude_conversations = Conversation.from_claude_conversation_dump(
    "conversations.json",
    metadata_fn=lambda x: {"source": "claude_export"}
)

# Combine them
all_conversations = hf_conversations + claude_conversations
print(f"Total conversations: {len(all_conversations)}")
```
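When merging sources, the same conversation can appear in more than one of them. A simple dedupe on `chat_id`, keeping the first occurrence, is sketched below with plain dicts standing in for `Conversation` objects:

```python
# Hypothetical overlap: "b" appears in both sources.
hf_conversations = [{"chat_id": "a"}, {"chat_id": "b"}]
claude_conversations = [{"chat_id": "b"}, {"chat_id": "c"}]

seen = set()
all_conversations = []
for conv in hf_conversations + claude_conversations:
    if conv["chat_id"] not in seen:
        seen.add(conv["chat_id"])
        all_conversations.append(conv)

print(len(all_conversations))  # 3
```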
## Saving and Loading Conversation Dumps
You can save conversations to disk for faster reloading:
```python
from kura.types import Conversation

# Save conversations
Conversation.generate_conversation_dump(
    conversations,
    "my_conversations.json"
)

# Load them back later
conversations = Conversation.from_conversation_dump("my_conversations.json")
```
## Best Practices
When working with large datasets, start with a small sample to test your pipeline:
```python
conversations = Conversation.from_hf_dataset(
    "large-dataset",
    max_conversations=100
)
```
Add metadata that will be useful for analysis:
- User segments or types
- Conversation source (support, sales, etc.)
- Time periods or seasons
- Product versions
Check that your conversations loaded correctly:
```python
print(f"Loaded {len(conversations)} conversations")
print(f"First conversation has {len(conversations[0].messages)} messages")
print(f"Sample content: {conversations[0].messages[0].content[:100]}...")
```
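Beyond spot-checking with prints, a small validation pass can catch empty or malformed conversations before they reach the pipeline. A sketch using plain dicts as stand-ins for `Conversation` objects (the `user`/`assistant` role set matches the examples in this guide):

```python
VALID_ROLES = {"user", "assistant"}

def validate(conversations: list) -> list:
    """Return a list of human-readable problems (empty list = all good)."""
    problems = []
    for i, conv in enumerate(conversations):
        if not conv["messages"]:
            problems.append(f"conversation {i} has no messages")
        for msg in conv["messages"]:
            if msg["role"] not in VALID_ROLES:
                problems.append(f"conversation {i}: unexpected role {msg['role']!r}")
    return problems

sample = [
    {"messages": [{"role": "user", "content": "hi"}]},
    {"messages": []},
]
print(validate(sample))  # ['conversation 1 has no messages']
```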
## Troubleshooting
### Missing `datasets` Dependency

If you get an import error when loading from HuggingFace:

```
ImportError: Please install hf datasets to load conversations from a dataset
```

Install the `datasets` package with `pip install datasets`.
### Incompatible Message Structure

If your dataset has a different message structure, use custom mapping functions to transform it:
```python
from kura.types import Conversation, Message

def custom_messages_fn(item):
    # Transform your message format into Kura's expected Message objects
    return [
        Message(
            role=msg["speaker"],
            content=msg["text"],
            created_at=msg["timestamp"]
        )
        for msg in item["dialogue"]
    ]

conversations = Conversation.from_hf_dataset(
    "your-dataset",
    messages_fn=custom_messages_fn
)
```
### Memory Issues with Large Datasets
For very large datasets, use streaming mode:
```python
# The from_hf_dataset method uses streaming by default
conversations = Conversation.from_hf_dataset(
    "very-large-dataset",
    split="train",
    max_conversations=10000  # Stop after 10,000 conversations
)
```
## Next Steps
Now that you’ve loaded your conversations, you can: