The Great Token Efficiency Debate: Minified JSON, YAML, TOON or CSV?

Table of Contents

TOON format for less token usage

My colleagues Ajay JoshiDeepak Shah were discussing https://jsontoon.com/ which converts JSON to TOON format which is supposedly a more compact format with less token usage on LLMs.

TOON: Token-Oriented Object Notation is a compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage. It's intended for LLM input as a lossless, drop-in representation of JSON data. Website: https://github.com/toon-format/toon

 
The example on https://jsontoon.com/ website compares TOON to "pretty json" (but not to minified json):
 
JSON to TOON
 
 
But I was curious to compare: if toon format is consistently using lesser tokens than "minified json". So I searched reddit:

Alternative approach #1

 
First I found another idea that someone has shared:
Here the OP is compressing json
{ "main_character": { "name": "Miles Piper", "traits": "middle-aged, wiry, musician" }, "setting": { "city": "Nooga", "season": "Spring" } }
 
to
{"mc":{"n":"MilesPiper","t":"mid-aged,wiry,musician"},"set":{"c":"Nooga","s":"Spring"}}
 
Disclaimer for this approach: Make sure your original JSONs have enough self‑contained context. When you shorten keys, like if main_character = mc, you’re removing semantic hints [for LLM model]. To keep things clear for the LLM, your original JSON should include enough surrounding info or a parent scope so it’s obvious what domain you’re in.
 
I think that we can solve this with a Taxanomy.md added in system prompt which simply has key value pairs e.g. mc:main_character;n=name;t=traits....

Alternative approach #2

 

Next, TOON's github readme then pointed me to this website: Format Tokenization Playground
Tokenization Playground for different formats (JSON, TOON, CSV etc)

  • When a CSV lossless representation of data exists, CSV is always most token efficient
  • If a CSV lossless representation doesn't exists, minified json is always most token efficient.
 
So with a cursory read / limited understanding of this table seems to suggest that CSV/minifiedJSON seems better, but then TOON's readme has benchmarks section (that looks at both token usage as well as accuracy) and paints a different story: https://github.com/toon-format/toon?tab=readme-ov-file#benchmarks
 

 
This isn't the end of it. I intend to dive deeper into this, but got a busy week ahead. I will update this post if and when I get time to explore more about this topic.
 
Until next time...

Related post