
How to Fine-Tune LLMs: Data Format Guide for 2026

Fine-tuning a large language model means training it on your specific data so it behaves the way you need, rather than the way it was generally trained. The process itself is increasingly straightforward. The part that trips people up is preparing the training data in exactly the right format.

Every provider has its own JSONL format, its own required fields, and its own quirks. Submit data in the wrong format and you get cryptic validation errors. Submit data in the right format but with poor examples and you get a fine-tuned model that performs worse than the base model.

This guide covers the exact data formats for OpenAI, Anthropic, and Google fine-tuning, with copy-paste examples and validation tips. Use the [Fine-Tuning Formatter](/tools/fine-tuning-formatter) to convert your data into the correct format without manual JSONL editing.

What is JSONL and Why Fine-Tuning Uses It

JSONL (JSON Lines) is a text format where each line is a valid JSON object. It is the standard for fine-tuning data because it is streamable (you can process one example at a time without loading the entire file into memory), easy to validate (each line is independently valid JSON), and simple to append to (just add a new line).

A JSONL file looks like this:

```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}]}
```

Each line is one training example. The file has no header, no trailing comma, and no wrapping array. This is the single most common mistake: wrapping your examples in a JSON array [{...}, {...}] instead of putting each on its own line.
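A quick way to catch both mistakes, sketched here as a small Python helper (the function name and error format are illustrative, not part of any provider's tooling):

```python
import json

def validate_jsonl_lines(lines):
    """Return (line_number, error) tuples for lines that are not
    valid standalone JSON objects. `lines` is any iterable of strings."""
    errors = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines; some validators reject them, so avoid them anyway
        if lineno == 1 and line.startswith("["):
            # The classic mistake: a wrapping JSON array instead of one object per line.
            errors.append((1, "file starts with '[': this is a JSON array, not JSONL"))
            break
        try:
            obj = json.loads(line)
            if not isinstance(obj, dict):
                errors.append((lineno, "valid JSON but not an object"))
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc}"))
    return errors

# Usage: with open("train.jsonl") as f: errors = validate_jsonl_lines(f)
```

Because each line is checked independently, the error report points you at the exact line to fix.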

Use the [JSON Formatter](/tools/json-formatter) to validate individual JSON objects before assembling them into a JSONL file.

OpenAI Fine-Tuning Format

OpenAI uses a chat-based JSONL format for fine-tuning GPT-4o, GPT-4o mini, and GPT-3.5 Turbo. Each training example is a conversation with system, user, and assistant messages.

Basic Format

```json
{
  "messages": [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "I want to return my order"},
    {"role": "assistant", "content": "I can help with that. Could you provide your order number?"}
  ]
}
```

Multi-Turn Conversations

You can include multiple turns in a single example. The model learns from the _assistant_ messages only; user and system messages provide context.

```json
{
  "messages": [
    {"role": "system", "content": "You are a technical support agent."},
    {"role": "user", "content": "My app keeps crashing"},
    {"role": "assistant", "content": "Which version of the app are you running?"},
    {"role": "user", "content": "Version 3.2.1"},
    {"role": "assistant", "content": "Version 3.2.1 has a known memory leak. Please update to 3.2.2 which resolves this issue."}
  ]
}
```

Function Calling Fine-Tuning

OpenAI also supports fine-tuning with function/tool calls:

```json
{
  "messages": [
    {"role": "system", "content": "You have access to a weather API."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp\": 22, \"condition\": \"sunny\"}"},
    {"role": "assistant", "content": "It is currently 22 degrees C and sunny in Tokyo."}
  ]
}
```

OpenAI Requirements

  • Minimum 10 examples (recommended: 50-100 for noticeable improvement)
  • System message is optional but should be consistent across examples if used
  • Each JSONL line must be under 4MB
  • Total training data can be up to 50M tokens for GPT-4o fine-tuning
  • The messages array must end with an assistant message
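The structural requirements above can be pre-checked locally before uploading. This is a minimal sketch based only on the rules listed in this section, not a reimplementation of OpenAI's actual validator:

```python
import json

MAX_LINE_BYTES = 4 * 1024 * 1024  # each JSONL line must stay under 4 MB

def check_openai_example(line):
    """Return a list of problems with one OpenAI-format training line."""
    problems = []
    if len(line.encode("utf-8")) >= MAX_LINE_BYTES:
        problems.append("line exceeds 4 MB")
    example = json.loads(line)
    messages = example.get("messages", [])
    if not messages:
        problems.append("missing or empty 'messages' array")
        return problems
    allowed_roles = {"system", "user", "assistant", "tool"}
    for m in messages:
        if m.get("role") not in allowed_roles:
            problems.append(f"unknown role: {m.get('role')!r}")
    if messages[-1].get("role") != "assistant":
        problems.append("conversation must end with an assistant message")
    return problems
```

Run this over every line of the file and reject the upload if any example returns problems.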

Anthropic Fine-Tuning Format

Anthropic offers fine-tuning for Claude models using a similar chat format, but with some differences in structure and naming.

Basic Format

```json
{
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```

With System Prompt

Anthropic separates the system prompt from the messages array:

```json
{
  "system": "You are a legal document analyst. Respond concisely.",
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```

Structured Outputs with Prefill

Anthropic supports "prefill" patterns, where the assistant's response opens with a fixed scaffold. In training data, this appears as assistant messages that consistently begin with the same structure:

```json
{
  "messages": [
    {"role": "user", "content": "Classify this email: [email text]"},
    {"role": "assistant", "content": "Category: Support Request\nPriority: High\nSummary: Customer reports billing discrepancy on invoice #4521."}
  ]
}
```

Anthropic Requirements

  • Messages must alternate between user and assistant roles
  • The conversation must start with a user message and end with an assistant message
  • System prompt goes in a separate system field, not as a message
  • Minimum training examples depend on the fine-tuning tier
  • Content blocks can be strings or structured content arrays
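The ordering rules above are easy to violate when assembling data programmatically, so they are worth checking up front. A sketch covering just the rules listed in this section (assuming string content, not structured content blocks):

```python
def check_anthropic_example(example):
    """Check the ordering rules for one parsed Anthropic training example."""
    problems = []
    messages = example.get("messages", [])
    if not messages:
        return ["empty 'messages' array"]
    if messages[0]["role"] != "user":
        problems.append("must start with a user message")
    if messages[-1]["role"] != "assistant":
        problems.append("must end with an assistant message")
    # Roles must strictly alternate: user, assistant, user, assistant, ...
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"]:
            problems.append(f"roles must alternate (two '{curr['role']}' messages in a row)")
    # The system prompt belongs in a top-level field, not inside the array.
    if any(m["role"] == "system" for m in messages):
        problems.append("system prompt must go in the separate 'system' field")
    return problems
```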

Google (Gemini) Fine-Tuning Format

Google uses a slightly different format for Gemini fine-tuning through Vertex AI.

Basic Format

```json
{
  "contents": [
    {"role": "user", "parts": [{"text": "Translate to French: Good morning"}]},
    {"role": "model", "parts": [{"text": "Bonjour"}]}
  ]
}
```

The key differences from OpenAI/Anthropic: Google uses contents instead of messages, model instead of assistant, and wraps text in a parts array with text objects.

With System Instruction

```json
{
  "systemInstruction": {
    "parts": [{"text": "You are a French translator. Translate accurately and naturally."}]
  },
  "contents": [
    {"role": "user", "parts": [{"text": "How are you?"}]},
    {"role": "model", "parts": [{"text": "Comment allez-vous ?"}]}
  ]
}
```

Google Requirements

  • Use model role instead of assistant
  • Content must be wrapped in parts array with text objects
  • System instructions use systemInstruction field
  • Fine-tuning is available through Vertex AI, not the standard Gemini API
  • Supports both text and multimodal fine-tuning examples
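Because the Gemini shape differs only in field names and nesting, converting from the OpenAI format is mostly mechanical renaming. A sketch for plain-text examples (tool calls and multimodal parts are out of scope here):

```python
def openai_to_gemini(example):
    """Convert one OpenAI-format example to the Gemini/Vertex AI shape:
    messages -> contents, assistant -> model, strings -> parts arrays."""
    result = {}
    contents = []
    for m in example["messages"]:
        role = m["role"]
        if role == "system":
            # System prompts move to a top-level systemInstruction field.
            result["systemInstruction"] = {"parts": [{"text": m["content"]}]}
            continue
        contents.append({
            "role": "model" if role == "assistant" else "user",
            "parts": [{"text": m["content"]}],
        })
    result["contents"] = contents
    return result
```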

Format Comparison at a Glance

| Feature | OpenAI | Anthropic | Google (Gemini) |
|---|---|---|---|
| File format | JSONL | JSONL | JSONL |
| Messages field | messages | messages | contents |
| Assistant role name | assistant | assistant | model |
| System prompt | Inside messages array | Separate system field | Separate systemInstruction field |
| Content structure | Plain string | String or content blocks | parts array with text objects |
| Tool/function calls | Supported | Supported | Supported |
| Min examples | 10 | Varies | Varies |
| Multi-turn | Yes | Yes | Yes |

The differences are small but critical. A training file formatted for OpenAI will fail validation on Anthropic or Google because of field naming differences. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) handles these conversions automatically.
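In the OpenAI-to-Anthropic direction, the main change is hoisting the system message out of the messages array. A sketch for plain-string content (structured content blocks and tool calls would need extra handling):

```python
def openai_to_anthropic(example):
    """Move the system message to a top-level 'system' field,
    leaving the user/assistant turns in place."""
    result = {}
    messages = []
    for m in example["messages"]:
        if m["role"] == "system":
            result["system"] = m["content"]
        else:
            messages.append({"role": m["role"], "content": m["content"]})
    result["messages"] = messages
    return result
```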

Best Practices for Training Data Quality

The format gets your data accepted. The _quality_ of your data determines whether the fine-tuned model actually improves.

Consistency Over Volume

50 high-quality, consistent examples outperform 500 sloppy ones. Every example should demonstrate the exact behavior you want. If your assistant messages vary in tone, length, or format across examples, the model learns that inconsistency.

Match Your Production Format

Your training examples should mirror exactly how users will interact with the model in production. If your app always includes a specific system prompt, include it in every training example. If users typically send short messages, do not train with long, detailed prompts.

Include Edge Cases

Do not just train on the happy path. Include examples where the user asks something outside the model's scope, provides incomplete information, or makes mistakes. Show the model how to handle these gracefully.

Balance Your Categories

If you are fine-tuning a classifier, ensure each category has roughly equal representation. A dataset with 90% positive examples and 10% negative examples will produce a model biased toward positive classifications.
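Checking balance is a one-liner with a counter. The sketch below assumes each example's label is the final assistant message, which is a common layout for classifier fine-tuning data but not a universal one; swap in whatever label extractor matches your dataset:

```python
from collections import Counter

def label_distribution(examples, get_label):
    """Return each category's share of the dataset, as a fraction of 1.0."""
    counts = Counter(get_label(ex) for ex in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical dataset where the last assistant message holds the label:
data = [
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "negative"}]},
]
dist = label_distribution(data, lambda ex: ex["messages"][-1]["content"])
# A skew like this 2:1 split suggests collecting more "negative" examples.
```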

Validate Before Uploading

Common validation errors that waste time:

  • Trailing commas in JSON objects
  • Missing closing brackets or braces
  • Empty content fields
  • Wrong role names (bot instead of assistant)
  • Conversations that do not end with the assistant role
  • Lines that exceed the provider's size limit

Run your file through the [JSON Formatter](/tools/json-formatter) line by line to catch syntax errors before uploading.

FAQ

How many training examples do I need for fine-tuning?

OpenAI requires a minimum of 10 examples but recommends 50-100 for a noticeable improvement. For complex tasks like code generation or domain-specific reasoning, 200-500 high-quality examples typically produce strong results. Beyond 1,000 examples, improvements plateau unless you are addressing a very diverse set of tasks. Quality matters more than quantity.

Can I fine-tune open-source models like Llama with the same data format?

Not directly. Open-source models use different training frameworks (like Hugging Face Transformers, Axolotl, or LLaMA Factory), each with their own data format conventions. However, the conceptual structure is similar: input-output pairs organized as conversations. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) can convert between common formats.

How long does fine-tuning take?

OpenAI fine-tuning typically completes in 30 minutes to 2 hours depending on dataset size and model. Anthropic and Google timelines vary. Self-hosted fine-tuning on Llama 70B can take 4-24 hours on a single A100 GPU depending on the dataset and training configuration.

Is fine-tuning better than prompt engineering?

They solve different problems. Prompt engineering is faster, cheaper, and more flexible. Fine-tuning is better for consistent formatting, domain-specific behavior, and reducing token usage (because you no longer need lengthy few-shot examples in every prompt). Start with prompt engineering. Fine-tune only when you have identified a specific behavior that prompting alone cannot achieve reliably.

What happens if my training data has errors?

The model learns from your errors. If 10% of your training examples have incorrect outputs, the fine-tuned model will reproduce those errors approximately 10% of the time. This is why data quality review is critical. Have a domain expert validate a random sample of your training data before uploading it.

Can I fine-tune a model to forget its safety training?

No. All major providers have safeguards that prevent fine-tuning from overriding safety behavior. Attempting to do so will result in your fine-tuning job being rejected or your API access being revoked. Fine-tuning is for specialization within the model's existing safety boundaries.