Fine-tuning a large language model means training it further on your own data so it behaves the way you need, rather than relying only on its general pretraining. The process itself is increasingly straightforward. The part that trips people up is preparing the training data in exactly the right format.
Every provider has its own JSONL format, its own required fields, and its own quirks. Submit data in the wrong format and you get cryptic validation errors. Submit data in the right format but with poor examples and you get a fine-tuned model that performs worse than the base model.
This guide covers the exact data formats for OpenAI, Anthropic, and Google fine-tuning, with copy-paste examples and validation tips. Use the [Fine-Tuning Formatter](/tools/fine-tuning-formatter) to convert your data into the correct format without manual JSONL editing.
## What is JSONL and Why Fine-Tuning Uses It
JSONL (JSON Lines) is a text format where each line is a valid JSON object. It is the standard for fine-tuning data because it is streamable (you can process one example at a time without loading the entire file into memory), easy to validate (each line is independently valid JSON), and simple to append to (just add a new line).
A JSONL file looks like this:
```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}]}
```
Each line is one training example. The file has no header, no trailing comma, and no wrapping array. This is the single most common mistake: wrapping your examples in a JSON array (`[{...}, {...}]`) instead of putting each one on its own line.
Use the [JSON Formatter](/tools/json-formatter) to validate individual JSON objects before assembling them into a JSONL file.
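The write-one-object-per-line pattern is easy to get right with the standard library. Here is a minimal sketch (the filename `train.jsonl` is just an example):

```python
import json

examples = [
    {"messages": [{"role": "user", "content": "What is 2+2?"},
                  {"role": "assistant", "content": "4"}]},
    {"messages": [{"role": "user", "content": "What is the capital of France?"},
                  {"role": "assistant", "content": "Paris."}]},
]

# Write JSONL: one json.dumps() per line. Do NOT use json.dump(examples, f),
# which would produce a wrapping array and fail provider validation.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Read it back: each line parses independently.
with open("train.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(len(parsed))  # 2
```

Because each line is independent, you can append new examples to the file later without re-serializing anything.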
## OpenAI Fine-Tuning Format
OpenAI uses a chat-based JSONL format for fine-tuning GPT-4o, GPT-4o mini, and GPT-3.5 Turbo. Each training example is a conversation with system, user, and assistant messages.
### Basic Format
```json
{
  "messages": [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "I want to return my order"},
    {"role": "assistant", "content": "I can help with that. Could you provide your order number?"}
  ]
}
```
### Multi-Turn Conversations
You can include multiple turns in a single example. The model learns from the _assistant_ messages only; user and system messages provide context.
```json
{
  "messages": [
    {"role": "system", "content": "You are a technical support agent."},
    {"role": "user", "content": "My app keeps crashing"},
    {"role": "assistant", "content": "Which version of the app are you running?"},
    {"role": "user", "content": "Version 3.2.1"},
    {"role": "assistant", "content": "Version 3.2.1 has a known memory leak. Please update to 3.2.2 which resolves this issue."}
  ]
}
```
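OpenAI's chat format also accepts an optional `weight` field (0 or 1) on assistant messages: a turn with `"weight": 0` is kept for conversational context but excluded from training. For example, to train only on the final resolution:

```json
{
  "messages": [
    {"role": "system", "content": "You are a technical support agent."},
    {"role": "user", "content": "My app keeps crashing"},
    {"role": "assistant", "content": "Which version of the app are you running?", "weight": 0},
    {"role": "user", "content": "Version 3.2.1"},
    {"role": "assistant", "content": "Version 3.2.1 has a known memory leak. Please update to 3.2.2 which resolves this issue.", "weight": 1}
  ]
}
```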
### Function Calling Fine-Tuning
OpenAI also supports fine-tuning with function/tool calls:
```json
{
  "messages": [
    {"role": "system", "content": "You have access to a weather API."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp\": 22, \"condition\": \"sunny\"}"},
    {"role": "assistant", "content": "It is currently 22 degrees C and sunny in Tokyo."}
  ]
}
```
### OpenAI Requirements
- Minimum 10 examples (recommended: 50-100 for noticeable improvement)
- System message is optional but should be consistent across examples if used
- Each JSONL line must be under 4MB
- Total training data can be up to 50M tokens for GPT-4o fine-tuning
- The `messages` array must end with an `assistant` message
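The structural requirements above can be checked before upload. Here is a sketch of a per-line validator (the helper name is illustrative, and the checks cover only the basics, not OpenAI's full validation):

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}

def check_openai_example(line: str) -> list[str]:
    """Return a list of problems with one JSONL line (empty list = OK)."""
    problems = []
    try:
        ex = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = ex.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    for i, m in enumerate(messages):
        if m.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {m.get('role')!r}")
        if not m.get("content") and not m.get("tool_calls"):
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("conversation must end with an assistant message")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "bot", "content": "Hi"}]}'
print(check_openai_example(good))  # []
print(check_openai_example(bad))   # flags the bad role and the missing assistant turn
```

Run this over every line of your file and you catch the most common rejections (wrong role names, conversations that do not end with the assistant) before the upload round-trip.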
### Key Takeaway
OpenAI's chat JSONL format keeps the system prompt inside the `messages` array, and every training example must end with an `assistant` message.
## Anthropic Fine-Tuning Format
Anthropic offers fine-tuning for Claude models using a similar chat format, but with some differences in structure and naming.
### Basic Format
```json
{
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```
### With System Prompt
Anthropic separates the system prompt from the messages array:
```json
{
  "system": "You are a legal document analyst. Respond concisely.",
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```
### Structured Outputs and Prefill
At inference time, Anthropic supports "prefill" patterns where you start the assistant's response and the model continues it. In training data, the equivalent is an example that demonstrates the complete structured output you want the model to produce:
```json
{
  "messages": [
    {"role": "user", "content": "Classify this email: [email text]"},
    {"role": "assistant", "content": "Category: Support Request\nPriority: High\nSummary: Customer reports billing discrepancy on invoice #4521."}
  ]
}
```
### Anthropic Requirements
- Messages must alternate between `user` and `assistant` roles
- The conversation must start with a `user` message and end with an `assistant` message
- The system prompt goes in a separate `system` field, not as a message
- Minimum training examples depend on the fine-tuning tier
- Content blocks can be strings or structured content arrays
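Converting OpenAI-style examples to this shape is mostly a matter of lifting the system message out of the array. A minimal sketch (the function name is illustrative):

```python
import json

def openai_to_anthropic(example: dict) -> dict:
    """Move a leading system message into a top-level 'system' field,
    leaving only the alternating user/assistant turns in 'messages'."""
    messages = list(example["messages"])
    out = {}
    if messages and messages[0]["role"] == "system":
        out["system"] = messages.pop(0)["content"]
    out["messages"] = messages
    return out

openai_ex = {"messages": [
    {"role": "system", "content": "You are a legal document analyst."},
    {"role": "user", "content": "Summarize this clause."},
    {"role": "assistant", "content": "It establishes a 12-month non-compete."},
]}
converted = openai_to_anthropic(openai_ex)
print(json.dumps(converted, indent=2))
```

A real converter would also verify the remaining turns actually alternate and end on `assistant`, per the requirements above.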
## Google (Gemini) Fine-Tuning Format
Google uses a slightly different format for Gemini fine-tuning through Vertex AI.
### Basic Format
```json
{
  "contents": [
    {"role": "user", "parts": [{"text": "Translate to French: Good morning"}]},
    {"role": "model", "parts": [{"text": "Bonjour"}]}
  ]
}
```
The key differences from OpenAI/Anthropic: Google uses `contents` instead of `messages`, `model` instead of `assistant`, and wraps text in a `parts` array with `text` objects.
### With System Instruction
```json
{
  "systemInstruction": {
    "parts": [{"text": "You are a French translator. Translate accurately and naturally."}]
  },
  "contents": [
    {"role": "user", "parts": [{"text": "How are you?"}]},
    {"role": "model", "parts": [{"text": "Comment allez-vous ?"}]}
  ]
}
```
### Google Requirements
- Use the `model` role instead of `assistant`
- Content must be wrapped in a `parts` array with `text` objects
- System instructions use the `systemInstruction` field
- Fine-tuning is available through Vertex AI, not the standard Gemini API
- Supports both text and multimodal fine-tuning examples
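The renaming and wrapping can be mechanized the same way. A sketch of an OpenAI-to-Gemini converter (the function name is illustrative, and it assumes plain-string content):

```python
def openai_to_gemini(example: dict) -> dict:
    """Rename 'assistant' to 'model', wrap each content string in a
    'parts' array, and move any system message into 'systemInstruction'."""
    out = {"contents": []}
    for m in example["messages"]:
        if m["role"] == "system":
            out["systemInstruction"] = {"parts": [{"text": m["content"]}]}
        else:
            role = "model" if m["role"] == "assistant" else m["role"]
            out["contents"].append({"role": role, "parts": [{"text": m["content"]}]})
    return out

converted = openai_to_gemini({"messages": [
    {"role": "system", "content": "You are a French translator."},
    {"role": "user", "content": "Good morning"},
    {"role": "assistant", "content": "Bonjour"},
]})
print(converted["contents"][1])  # {'role': 'model', 'parts': [{'text': 'Bonjour'}]}
```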
### Key Takeaway
Gemini training data uses `contents` instead of `messages`, the `model` role instead of `assistant`, and wraps all text in `parts` arrays; fine-tuning runs through Vertex AI rather than the standard Gemini API.
## Format Comparison at a Glance
| Feature | OpenAI | Anthropic | Google (Gemini) |
|---|---|---|---|
| File format | JSONL | JSONL | JSONL |
| Messages field | `messages` | `messages` | `contents` |
| Assistant role name | `assistant` | `assistant` | `model` |
| System prompt | Inside `messages` array | Separate `system` field | Separate `systemInstruction` field |
| Content structure | Plain string | String or content blocks | `parts` array with `text` objects |
| Tool/function calls | Supported | Supported | Supported |
| Min examples | 10 | Varies | Varies |
| Multi-turn | Yes | Yes | Yes |
The differences are small but critical. A training file formatted for OpenAI will fail validation on Anthropic or Google because of field naming differences. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) handles these conversions automatically.
## Best Practices for Training Data Quality
The format gets your data accepted. The _quality_ of your data determines whether the fine-tuned model actually improves.
### Consistency Over Volume
50 high-quality, consistent examples outperform 500 sloppy ones. Every example should demonstrate the exact behavior you want. If your assistant messages vary in tone, length, or format across examples, the model learns that inconsistency.
### Match Your Production Format
Your training examples should mirror exactly how users will interact with the model in production. If your app always includes a specific system prompt, include it in every training example. If users typically send short messages, do not train with long, detailed prompts.
### Include Edge Cases
Do not just train on the happy path. Include examples where the user asks something outside the model's scope, provides incomplete information, or makes mistakes. Show the model how to handle these gracefully.
### Balance Your Categories
If you are fine-tuning a classifier, ensure each category has roughly equal representation. A dataset with 90% positive examples and 10% negative examples will produce a model biased toward positive classifications.
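Checking the balance takes a few lines with `collections.Counter`. This sketch assumes each example's label is the final assistant message, as in a typical classification fine-tune:

```python
from collections import Counter

examples = [
    {"messages": [{"role": "user", "content": "Great product!"},
                  {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "Never again."},
                  {"role": "assistant", "content": "negative"}]},
    {"messages": [{"role": "user", "content": "Love it."},
                  {"role": "assistant", "content": "positive"}]},
]

# Count each example's label (here: the final assistant message).
labels = Counter(ex["messages"][-1]["content"] for ex in examples)
total = sum(labels.values())
for label, n in labels.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

If one category dominates, either collect more examples for the minority classes or downsample the majority before uploading.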
### Validate Before Uploading
Common validation errors that waste time:
- Trailing commas in JSON objects
- Missing closing brackets or braces
- Empty `content` fields
- Wrong role names (`bot` instead of `assistant`)
- Conversations that do not end with the `assistant` role
- Lines that exceed the provider's size limit
Run your file through the [JSON Formatter](/tools/json-formatter) line by line to catch syntax errors before uploading.
### Key Takeaway
The format gets your data accepted; the quality and consistency of your examples determine whether the fine-tuned model actually improves.
## FAQ
### How many training examples do I need for fine-tuning?
OpenAI requires a minimum of 10 examples but recommends 50-100 for a noticeable improvement. For complex tasks like code generation or domain-specific reasoning, 200-500 high-quality examples typically produce strong results. Beyond 1,000 examples, improvements plateau unless you are addressing a very diverse set of tasks. Quality matters more than quantity.
### Can I fine-tune open-source models like Llama with the same data format?
Not directly. Open-source models use different training frameworks (like Hugging Face Transformers, Axolotl, or LLaMA Factory), each with their own data format conventions. However, the conceptual structure is similar: input-output pairs organized as conversations. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) can convert between common formats.
### How long does fine-tuning take?
OpenAI fine-tuning typically completes in 30 minutes to 2 hours depending on dataset size and model. Anthropic and Google timelines vary. Self-hosted fine-tuning on Llama 70B can take 4-24 hours on a single A100 GPU depending on the dataset and training configuration.
### Is fine-tuning better than prompt engineering?
They solve different problems. Prompt engineering is faster, cheaper, and more flexible. Fine-tuning is better for consistent formatting, domain-specific behavior, and reducing token usage (because you no longer need lengthy few-shot examples in every prompt). Start with prompt engineering. Fine-tune only when you have identified a specific behavior that prompting alone cannot achieve reliably.
### What happens if my training data has errors?
The model learns from your errors. If 10% of your training examples have incorrect outputs, the fine-tuned model will reproduce those errors approximately 10% of the time. This is why data quality review is critical. Have a domain expert validate a random sample of your training data before uploading it.
### Can I fine-tune a model to forget its safety training?
No. All major providers have safeguards that prevent fine-tuning from overriding safety behavior. Attempting to do so will result in your fine-tuning job being rejected or your API access being revoked. Fine-tuning is for specialization within the model's existing safety boundaries.