
How to Fine-Tune LLMs: Data Format Guide for 2026

Fine-tuning a large language model means training it on your specific data so it behaves the way you need, rather than the way it was generally trained. The process itself is increasingly straightforward. The part that trips people up is preparing the training data in exactly the right format.

Every provider has its own JSONL format, its own required fields, and its own quirks. Submit data in the wrong format and you get cryptic validation errors. Submit data in the right format but with poor examples and you get a fine-tuned model that performs worse than the base model.

This guide covers the exact data formats for OpenAI, Anthropic, and Google fine-tuning, with copy-paste examples and validation tips. Use the [Fine-Tuning Formatter](/tools/fine-tuning-formatter) to convert your data into the correct format without manual JSONL editing.

What is JSONL and Why Fine-Tuning Uses It

JSONL (JSON Lines) is a text format where each line is a valid JSON object. It is the standard for fine-tuning data because it is streamable (you can process one example at a time without loading the entire file into memory), easy to validate (each line is independently valid JSON), and simple to append to (just add a new line).

A JSONL file looks like this:

```
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris."}]}
```

Each line is one training example. The file has no header, no trailing comma, and no wrapping array. This is the single most common mistake: wrapping your examples in a JSON array [{...}, {...}] instead of putting each on its own line.
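A quick way to catch both mistakes, sketched here as a small Python helper (the function name and error format are illustrative, not part of any provider's tooling):

```python
import json

def validate_jsonl_lines(lines):
    """Return (line_number, error) tuples for lines that are not
    valid standalone JSON objects. `lines` is any iterable of strings."""
    errors = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines; some validators reject them, so avoid them anyway
        if lineno == 1 and line.startswith("["):
            # The classic mistake: a wrapping JSON array instead of one object per line.
            errors.append((1, "file starts with '[': this is a JSON array, not JSONL"))
            break
        try:
            obj = json.loads(line)
            if not isinstance(obj, dict):
                errors.append((lineno, "valid JSON but not an object"))
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc}"))
    return errors

# Usage: with open("train.jsonl") as f: errors = validate_jsonl_lines(f)
```

Because each line is checked independently, the error report points you at the exact line to fix.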

Use the [JSON Formatter](/tools/json-formatter) to validate individual JSON objects before assembling them into a JSONL file.

OpenAI Fine-Tuning Format

OpenAI uses a chat-based JSONL format for fine-tuning GPT-4o, GPT-4o mini, and GPT-3.5 Turbo. Each training example is a conversation with system, user, and assistant messages.

Basic Format

```json
{
  "messages": [
    {"role": "system", "content": "You are a customer support agent for Acme Corp."},
    {"role": "user", "content": "I want to return my order"},
    {"role": "assistant", "content": "I can help with that. Could you provide your order number?"}
  ]
}
```

Multi-Turn Conversations

You can include multiple turns in a single example. The model learns from the _assistant_ messages only; user and system messages provide context.

```json
{
  "messages": [
    {"role": "system", "content": "You are a technical support agent."},
    {"role": "user", "content": "My app keeps crashing"},
    {"role": "assistant", "content": "Which version of the app are you running?"},
    {"role": "user", "content": "Version 3.2.1"},
    {"role": "assistant", "content": "Version 3.2.1 has a known memory leak. Please update to 3.2.2 which resolves this issue."}
  ]
}
```

Function Calling Fine-Tuning

OpenAI also supports fine-tuning with function/tool calls:

```json
{
  "messages": [
    {"role": "system", "content": "You have access to a weather API."},
    {"role": "user", "content": "What is the weather in Tokyo?"},
    {"role": "assistant", "content": null, "tool_calls": [{"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp\": 22, \"condition\": \"sunny\"}"},
    {"role": "assistant", "content": "It is currently 22 degrees C and sunny in Tokyo."}
  ]
}
```

OpenAI Requirements

  • Minimum 10 examples (recommended: 50-100 for noticeable improvement)
  • System message is optional but should be consistent across examples if used
  • Each JSONL line must be under 4MB
  • Total training data can be up to 50M tokens for GPT-4o fine-tuning
  • The messages array must end with an assistant message
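The structural requirements above can be pre-checked locally before uploading. This is a minimal sketch based only on the rules listed in this section, not a reimplementation of OpenAI's actual validator:

```python
import json

MAX_LINE_BYTES = 4 * 1024 * 1024  # each JSONL line must stay under 4 MB

def check_openai_example(line):
    """Return a list of problems with one OpenAI-format training line."""
    problems = []
    if len(line.encode("utf-8")) >= MAX_LINE_BYTES:
        problems.append("line exceeds 4 MB")
    example = json.loads(line)
    messages = example.get("messages", [])
    if not messages:
        problems.append("missing or empty 'messages' array")
        return problems
    allowed_roles = {"system", "user", "assistant", "tool"}
    for m in messages:
        if m.get("role") not in allowed_roles:
            problems.append(f"unknown role: {m.get('role')!r}")
    if messages[-1].get("role") != "assistant":
        problems.append("conversation must end with an assistant message")
    return problems
```

Run this over every line of the file and reject the upload if any example returns problems.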

Anthropic Fine-Tuning Format

Anthropic offers fine-tuning for Claude models using a similar chat format, but with some differences in structure and naming.

Basic Format

```json
{
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```

With System Prompt

Anthropic separates the system prompt from the messages array:

```json
{
  "system": "You are a legal document analyst. Respond concisely.",
  "messages": [
    {"role": "user", "content": "Summarize this legal clause: [clause text here]"},
    {"role": "assistant", "content": "This clause establishes a non-compete period of 12 months..."}
  ]
}
```

Structured Outputs with Prefill

Anthropic supports "prefill" patterns, where the assistant's response opens with a fixed scaffold. In training data, this appears as assistant messages that consistently begin with the same structure:

```json
{
  "messages": [
    {"role": "user", "content": "Classify this email: [email text]"},
    {"role": "assistant", "content": "Category: Support Request\nPriority: High\nSummary: Customer reports billing discrepancy on invoice #4521."}
  ]
}
```

Anthropic Requirements

  • Messages must alternate between user and assistant roles
  • The conversation must start with a user message and end with an assistant message
  • System prompt goes in a separate system field, not as a message
  • Minimum training examples depend on the fine-tuning tier
  • Content blocks can be strings or structured content arrays
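The ordering rules above are easy to violate when assembling data programmatically, so they are worth checking up front. A sketch covering just the rules listed in this section (assuming string content, not structured content blocks):

```python
def check_anthropic_example(example):
    """Check the ordering rules for one parsed Anthropic training example."""
    problems = []
    messages = example.get("messages", [])
    if not messages:
        return ["empty 'messages' array"]
    if messages[0]["role"] != "user":
        problems.append("must start with a user message")
    if messages[-1]["role"] != "assistant":
        problems.append("must end with an assistant message")
    # Roles must strictly alternate: user, assistant, user, assistant, ...
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"]:
            problems.append(f"roles must alternate (two '{curr['role']}' messages in a row)")
    # The system prompt belongs in a top-level field, not inside the array.
    if any(m["role"] == "system" for m in messages):
        problems.append("system prompt must go in the separate 'system' field")
    return problems
```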

Google (Gemini) Fine-Tuning Format

Google uses a slightly different format for Gemini fine-tuning through Vertex AI.

Basic Format

```json
{
  "contents": [
    {"role": "user", "parts": [{"text": "Translate to French: Good morning"}]},
    {"role": "model", "parts": [{"text": "Bonjour"}]}
  ]
}
```

The key differences from OpenAI/Anthropic: Google uses contents instead of messages, model instead of assistant, and wraps text in a parts array with text objects.

With System Instruction

```json
{
  "systemInstruction": {
    "parts": [{"text": "You are a French translator. Translate accurately and naturally."}]
  },
  "contents": [
    {"role": "user", "parts": [{"text": "How are you?"}]},
    {"role": "model", "parts": [{"text": "Comment allez-vous ?"}]}
  ]
}
```

Google Requirements

  • Use model role instead of assistant
  • Content must be wrapped in parts array with text objects
  • System instructions use systemInstruction field
  • Fine-tuning is available through Vertex AI, not the standard Gemini API
  • Supports both text and multimodal fine-tuning examples
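Because the Gemini shape differs only in field names and nesting, converting from the OpenAI format is mostly mechanical renaming. A sketch for plain-text examples (tool calls and multimodal parts are out of scope here):

```python
def openai_to_gemini(example):
    """Convert one OpenAI-format example to the Gemini/Vertex AI shape:
    messages -> contents, assistant -> model, strings -> parts arrays."""
    result = {}
    contents = []
    for m in example["messages"]:
        role = m["role"]
        if role == "system":
            # System prompts move to a top-level systemInstruction field.
            result["systemInstruction"] = {"parts": [{"text": m["content"]}]}
            continue
        contents.append({
            "role": "model" if role == "assistant" else "user",
            "parts": [{"text": m["content"]}],
        })
    result["contents"] = contents
    return result
```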

Format Comparison at a Glance

| Feature | OpenAI | Anthropic | Google (Gemini) |
|---|---|---|---|
| File format | JSONL | JSONL | JSONL |
| Messages field | messages | messages | contents |
| Assistant role name | assistant | assistant | model |
| System prompt | Inside messages array | Separate system field | Separate systemInstruction field |
| Content structure | Plain string | String or content blocks | parts array with text objects |
| Tool/function calls | Supported | Supported | Supported |
| Min examples | 10 | Varies | Varies |
| Multi-turn | Yes | Yes | Yes |

The differences are small but critical. A training file formatted for OpenAI will fail validation on Anthropic or Google because of field naming differences. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) handles these conversions automatically.
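In the OpenAI-to-Anthropic direction, the main change is hoisting the system message out of the messages array. A sketch for plain-string content (structured content blocks and tool calls would need extra handling):

```python
def openai_to_anthropic(example):
    """Move the system message to a top-level 'system' field,
    leaving the user/assistant turns in place."""
    result = {}
    messages = []
    for m in example["messages"]:
        if m["role"] == "system":
            result["system"] = m["content"]
        else:
            messages.append({"role": m["role"], "content": m["content"]})
    result["messages"] = messages
    return result
```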

Best Practices for Training Data Quality

The format gets your data accepted. The _quality_ of your data determines whether the fine-tuned model actually improves.

Consistency Over Volume

50 high-quality, consistent examples outperform 500 sloppy ones. Every example should demonstrate the exact behavior you want. If your assistant messages vary in tone, length, or format across examples, the model learns that inconsistency.

Match Your Production Format

Your training examples should mirror exactly how users will interact with the model in production. If your app always includes a specific system prompt, include it in every training example. If users typically send short messages, do not train with long, detailed prompts.

Include Edge Cases

Do not just train on the happy path. Include examples where the user asks something outside the model's scope, provides incomplete information, or makes mistakes. Show the model how to handle these gracefully.

Balance Your Categories

If you are fine-tuning a classifier, ensure each category has roughly equal representation. A dataset with 90% positive examples and 10% negative examples will produce a model biased toward positive classifications.
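Checking balance is a one-liner with a counter. The sketch below assumes each example's label is the final assistant message, which is a common layout for classifier fine-tuning data but not a universal one; swap in whatever label extractor matches your dataset:

```python
from collections import Counter

def label_distribution(examples, get_label):
    """Return each category's share of the dataset, as a fraction of 1.0."""
    counts = Counter(get_label(ex) for ex in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical dataset where the last assistant message holds the label:
data = [
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "positive"}]},
    {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "negative"}]},
]
dist = label_distribution(data, lambda ex: ex["messages"][-1]["content"])
# A skew like this 2:1 split suggests collecting more "negative" examples.
```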

Validate Before Uploading

Common validation errors that waste time:

  • Trailing commas in JSON objects
  • Missing closing brackets or braces
  • Empty content fields
  • Wrong role names (bot instead of assistant)
  • Conversations that do not end with the assistant role
  • Lines that exceed the provider's size limit

Run your file through the [JSON Formatter](/tools/json-formatter) line by line to catch syntax errors before uploading.

FAQ

How many training examples do I need for fine-tuning?

OpenAI requires a minimum of 10 examples but recommends 50-100 for a noticeable improvement. For complex tasks like code generation or domain-specific reasoning, 200-500 high-quality examples typically produce strong results. Beyond 1,000 examples, improvements plateau unless you are addressing a very diverse set of tasks. Quality matters more than quantity.

Can I fine-tune open-source models like Llama with the same data format?

Not directly. Open-source models use different training frameworks (like Hugging Face Transformers, Axolotl, or LLaMA Factory), each with their own data format conventions. However, the conceptual structure is similar: input-output pairs organized as conversations. The [Fine-Tuning Formatter](/tools/fine-tuning-formatter) can convert between common formats.

How long does fine-tuning take?

OpenAI fine-tuning typically completes in 30 minutes to 2 hours depending on dataset size and model. Anthropic and Google timelines vary. Self-hosted fine-tuning on Llama 70B can take 4-24 hours on a single A100 GPU depending on the dataset and training configuration.

Is fine-tuning better than prompt engineering?

They solve different problems. Prompt engineering is faster, cheaper, and more flexible. Fine-tuning is better for consistent formatting, domain-specific behavior, and reducing token usage (because you no longer need lengthy few-shot examples in every prompt). Start with prompt engineering. Fine-tune only when you have identified a specific behavior that prompting alone cannot achieve reliably.

What happens if my training data has errors?

The model learns from your errors. If 10% of your training examples have incorrect outputs, the fine-tuned model will reproduce those errors approximately 10% of the time. This is why data quality review is critical. Have a domain expert validate a random sample of your training data before uploading it.

Can I fine-tune a model to forget its safety training?

No. All major providers have safeguards that prevent fine-tuning from overriding safety behavior. Attempting to do so will result in your fine-tuning job being rejected or your API access being revoked. Fine-tuning is for specialization within the model's existing safety boundaries.