AI Fails Over Tables? Stop Using Markdown and Start Minifying

May 7th, 2025

Too Long; Didn't Read

Markdown tables can break multilingual RAG pipelines due to character limits in Cohere’s API. Learn how minifying to JSON boosts efficiency and prevents errors.


In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount: not just token counts, but character limits too.


When building a multilingual embedding pipeline, I faced a real challenge: the Cohere multilingual embedding model imposes a maximum of 2048 characters per input, a character limit rather than a token limit.


This article walks you through a clever solution: preprocessing Markdown tables into dense JSON blocks with pymtd2json, ensuring smooth, efficient embeddings without errors.

The Challenge: Character Limits vs Token Limits


Classical chunking methods, like SentenceSplitter from LlamaIndex, are token-focused: you set a maximum number of tokens per chunk, but not a maximum number of characters (see the sketch after the list below).

Why This Matters:

  • Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting.
  • A Markdown chunk might contain only 170 tokens yet still exceed 2048 characters.
  • The result: rejected API requests or wasteful extra splitting.
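
To see the gap concretely, here is a minimal sketch of token-based chunking with LlamaIndex's SentenceSplitter; markdown_text is a hypothetical table-heavy document, and note that nothing in the splitter bounds characters:

from llama_index.core.node_parser import SentenceSplitter

# chunk_size caps *tokens* per chunk; no parameter bounds characters
splitter = SentenceSplitter(chunk_size=170, chunk_overlap=0)
chunks = splitter.split_text(markdown_text)  # markdown_text: hypothetical table-heavy input

# A 170-token chunk of padded table rows can still exceed 2048 characters
print(max(len(chunk) for chunk in chunks))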



Important Note: Markdown tables are up to 3x less token-efficient than other formats, which further compounds the problem. 👉 Read more on the token inefficiency of Markdown tables here.

A Real-World Example: Measuring the Problem

Let's dive into a simple simulation:

Step 1: Create a Large Markdown Table

import pandas as pd

# Build data
data = {
    "Name": [f"Person{i}" for i in range(30)],
    "Age": [20 + i for i in range(30)],
    "City": [f"City{i}" for i in range(30)]
}

# Create DataFrame
df = pd.DataFrame(data)
df.columns = ["A very long row content, which leads to a lot of white spaces", "Age", "City"]

# Convert to Markdown
table_text = df.to_markdown(index=False)
print(table_text)

This generates a verbose table with 30 rows and a very long header.
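
The printed output looks roughly like this (abridged; exact padding depends on tabulate's column widths):

| A very long row content, which leads to a lot of white spaces   |   Age | City   |
|:----------------------------------------------------------------|------:|:-------|
| Person0                                                          |    20 | City0  |
| Person1                                                          |    21 | City1  |
...

Every row is padded out to the width of the long header, which is exactly where the wasted characters come from.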

Step 2: Analyze Token and Character Counts

Using Cohere’s tokenizer (available via Hugging Face):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")
encoded = tokenizer(table_text, return_tensors="pt", add_special_tokens=False)

num_tokens = encoded.input_ids.shape[-1]
num_chars = len(table_text)

print(f"Characters: {num_chars}")
print(f"Tokens: {num_tokens}")

Result:

  • Characters: 2719
  • Tokens: 432


Problem: While the token count is fine, the character count exceeds 2048, causing API errors like:

cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters
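
For reference, the failing request looks roughly like this (a minimal sketch against the v4 Python SDK; the API key is a placeholder):

import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# table_text is ~2700 characters but only ~432 tokens: the request is
# rejected on the character limit, not the token limit.
response = co.embed(
    texts=[table_text],
    model="embed-multilingual-v3.0",
    input_type="search_document",
)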

The Solution: Minifying Tables into JSON

Instead of traditional Markdown, why not store the data in a dense JSON block?

Benefits of Minifying Tables:

  • Remove pipes, dashes, and padding whitespace: all of the formatting overhead.
  • Preserve the semantic meaning.
  • Shrink the text to fit safely within the character limit.

Example of the compact JSON:

{"Name":["Person0","Person1","Person2",...],"Age":["20","21","22",...],"City":["City0","City1","City2",...]}

New Stats:

  • Characters: 1027
  • Tokens: 461

✅ Now well within Cohere’s input limit! (The token count ticks up slightly compared to Markdown, but tokens were never the binding constraint here; the hard 2048-character cap was.)

Applying Minification in Practice


Want to prepare documents before chunking? Here's how you can automatically process all Markdown files:

from pathlib import Path

from llama_index.core import SimpleDirectoryReader
from your_minifier import MinifyMDT  # the pymtd2json-based minifier

source_dir = Path("example_dir", "markdown")

# Load every Markdown file in the directory tree
documents = SimpleDirectoryReader(source_dir, required_exts=[".md"], recursive=True).load_data()

# Minify the Markdown tables inside each document
doc_texts = [MinifyMDT(doc.text).transform() for doc in documents]

👉 And voilà: Your data is compact, clean, and embedding-ready!
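
From here, the minified texts drop straight into the same token-based splitter as before, and the token-sized chunks now respect the character cap too (a sketch, reusing the SentenceSplitter from earlier):

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=170, chunk_overlap=0)
chunks = [chunk for text in doc_texts for chunk in splitter.split_text(text)]

# With tables minified, the longest chunk should now sit under 2048 characters
print(max(len(chunk) for chunk in chunks))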

Final Thoughts


Working with multilingual RAG systems means optimizing every byte. Whitespace-heavy Markdown tables might look nice for humans, but they’re expensive for machine understanding.

By minifying your tables with pymtd2json, you:

  • Cut down on API errors.
  • Slash character counts to fit hard input limits (even when token counts stay roughly flat).
  • Boost overall pipeline performance.

Efficiency isn't optional; it's a superpower. 🚀

