The Great Token Efficiency Debate: Minified JSON, YAML, TOON or CSV?

TOON format for less token usage

My colleagues Ajay Joshi and Deepak Shah were discussing https://jsontoon.com/, which converts JSON to TOON, a format that is supposedly more compact and uses fewer LLM tokens.

TOON (Token-Oriented Object Notation) is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input as a lossless, drop-in representation of JSON data. Repository: https://github.com/toon-format/toon

The example on the https://jsontoon.com/ website compares TOON to "pretty" JSON, but not to minified JSON.

That made me curious: does TOON consistently use fewer tokens than minified JSON? So I searched Reddit.
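To make the "pretty" versus minified distinction concrete, here is a minimal sketch using Python's standard json module. The sample data is hypothetical, and character count is only a rough stand-in for token count; a real comparison would need the tokenizer of the target model.

```python
import json

# Hypothetical sample data, used only for illustration.
data = {"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}

# "Pretty" JSON: indentation and newlines add characters with no information.
pretty = json.dumps(data, indent=2)

# Minified JSON: no whitespace after separators.
minified = json.dumps(data, separators=(",", ":"))

# Character counts are a crude proxy; swap in a real tokenizer to count tokens.
print("pretty chars:  ", len(pretty))
print("minified chars:", len(minified))
```

Both strings decode to the same data, so any fair comparison against TOON should start from the minified form.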

Alternative approach #1

First, I found another idea that someone had shared. Here the OP compresses JSON by shortening the keys and trimming some of the values. The original:

{ "main_character": { "name": "Miles Piper", "traits": "middle-aged, wiry, musician" }, "setting": { "city": "Nooga", "season": "Spring" } }

becomes:

{"mc":{"n":"MilesPiper","t":"mid-aged,wiry,musician"},"set":{"c":"Nooga","s":"Spring"}}

Disclaimer for this approach: make sure your original JSON has enough self-contained context. When you shorten keys, for example main_character to mc, you're removing semantic hints [for the LLM]. To keep things clear for the LLM, your original JSON should include enough surrounding info or a parent scope so it's obvious what domain you're in.

I think we can solve this with a Taxonomy.md file added to the system prompt that simply lists the key mappings, e.g. mc=main_character;n=name;t=traits...
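A rough sketch of how such a key taxonomy could work in practice is below. The TAXONOMY mapping and the helper names (rename_keys, shorten, expand, glossary) are made up for illustration. Unlike the Reddit example, this version shortens only the keys and leaves the values untouched, so the round trip stays lossless as long as the mapping is one-to-one and no original key collides with a short key.

```python
import json

# Hypothetical taxonomy: short key -> full key.
TAXONOMY = {
    "mc": "main_character",
    "n": "name",
    "t": "traits",
    "set": "setting",
    "c": "city",
    "s": "season",
}

def rename_keys(value, mapping):
    """Recursively rename dict keys via mapping; unknown keys pass through."""
    if isinstance(value, dict):
        return {mapping.get(k, k): rename_keys(v, mapping) for k, v in value.items()}
    if isinstance(value, list):
        return [rename_keys(v, mapping) for v in value]
    return value

def shorten(data):
    """Replace full keys with their short forms."""
    inverse = {full: short for short, full in TAXONOMY.items()}
    return rename_keys(data, inverse)

def expand(data):
    """Restore full keys from their short forms."""
    return rename_keys(data, TAXONOMY)

original = {
    "main_character": {"name": "Miles Piper", "traits": "middle-aged, wiry, musician"},
    "setting": {"city": "Nooga", "season": "Spring"},
}

# Compact payload to send to the model.
compact = json.dumps(shorten(original), separators=(",", ":"))

# One-line glossary that could go into the system prompt.
glossary = ";".join(f"{short}={full}" for short, full in TAXONOMY.items())

print(compact)
print(glossary)
```

Whether the tokens saved in the payload outweigh the tokens spent on the glossary depends on how much data is sent per prompt, so this is worth measuring rather than assuming.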

Alternative approach #2

Next, TOON's GitHub README pointed me to this website: Format Tokenization Playground, a tokenization playground for different formats (JSON, TOON, CSV, etc.).

The takeaways from its comparison table:

When a lossless CSV representation of the data exists, CSV is always the most token-efficient.
If a lossless CSV representation doesn't exist, minified JSON is always the most token-efficient.
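The rule of thumb above could be sketched as a small selector: use CSV when the data is a flat table, otherwise fall back to minified JSON. The function names here (is_flat_table, to_csv, to_compact) are hypothetical, and "lossless" is interpreted loosely: CSV stores every value as text, so number and boolean types would have to be restored by the reader.

```python
import csv
import io
import json

def is_flat_table(data):
    """True if data is a non-empty list of dicts sharing the same keys,
    with no nested dicts or lists as values."""
    if not isinstance(data, list) or not data:
        return False
    if not all(isinstance(row, dict) for row in data):
        return False
    keys = set(data[0])
    for row in data:
        if set(row) != keys:
            return False
        if any(isinstance(v, (dict, list)) for v in row.values()):
            return False
    return True

def to_csv(rows):
    """Serialize a flat table to CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_compact(data):
    """Return (format_name, text): CSV when possible, else minified JSON."""
    if is_flat_table(data):
        return "csv", to_csv(data)
    return "json", json.dumps(data, separators=(",", ":"))

# Hypothetical sample inputs.
table = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
nested = {"a": {"b": 1}}

print(to_compact(table))
print(to_compact(nested))
```

For the flat table, the CSV form writes each key once in the header instead of once per record, which is where the savings over JSON come from.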

A cursory read of this table, with my limited understanding, suggests that CSV or minified JSON is the better choice. But TOON's README has a benchmarks section that looks at both token usage and accuracy, and it paints a different picture: https://github.com/toon-format/toon?tab=readme-ov-file#benchmarks

This isn't the end of it. I intend to dive deeper into this, but I've got a busy week ahead. I will update this post if and when I get time to explore the topic further.

Until next time...
