In retrieval-augmented generation (RAG) pipelines, input efficiency is paramount, not just in terms of tokens but also in terms of character limits.
When building a multilingual embedding pipeline, I faced a real challenge: the Cohere multilingual embedding model imposes a maximum of 2048 characters per input, a character limit rather than a token limit.
This article walks you through a clever solution: preprocessing Markdown tables into dense JSON blocks using pymtd2json, to ensure smooth, efficient embeddings without errors.
The Challenge: Character Limits vs Token Limits
Classical chunking methods, like SentenceSplitter from LlamaIndex, are token-focused: you set a maximum number of tokens per chunk, but no maximum number of characters.
Why This Matters:
- Markdown (especially GitHub-Flavored Markdown, GFM) wastes space with formatting.
- A Markdown chunk might have only 170 tokens but still exceed 2048 characters.
- This results in rejected API requests or inefficient extra splitting.
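To see this gap in practice, here is a minimal sketch. I'm assuming a recent LlamaIndex where SentenceSplitter lives under llama_index.core.node_parser, and markdown_text is just an illustrative stand-in for your own content:
from llama_index.core.node_parser import SentenceSplitter

# Whitespace-heavy Markdown: padded cells cost many characters but few tokens
row = "| " + " | ".join(f"cell {i}".ljust(40) for i in range(5)) + " |"
markdown_text = "\n".join([row] * 40)

# chunk_size is a token budget; nothing here limits characters
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=0)

for chunk in splitter.split_text(markdown_text):
    # Only the token count is bounded; the character count varies freely
    print(f"Characters in chunk: {len(chunk)}")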
Important Note: Markdown tables are up to 3x less token-efficient than other formats, further compounding the problem. 👉 Read more on the token inefficiency of Markdown tables here.
A Real-World Example: Measuring the Problem
Let's dive into a simple simulation:
Step 1: Create a Large Markdown Table
import pandas as pd

# Build 30 rows of sample data
data = {
    "Name": [f"Person{i}" for i in range(30)],
    "Age": [20 + i for i in range(30)],
    "City": [f"City{i}" for i in range(30)],
}

# Create the DataFrame and give it a deliberately verbose header
df = pd.DataFrame(data)
df.columns = ["A very long row content, which leads to a lot of white spaces", "Age", "City"]

# Convert to Markdown (to_markdown requires the tabulate package)
table_text = df.to_markdown(index=False)
print(table_text)
This generates a verbose table with 30 rows and a very long header.
Step 2: Analyze Token and Character Counts
Using Cohere’s tokenizer (available via Hugging Face):
from transformers import AutoTokenizer

# Tokenizer matching Cohere's embed-multilingual-v3.0 model
tokenizer = AutoTokenizer.from_pretrained("Cohere/Cohere-embed-multilingual-v3.0")

# Count tokens and characters for the Markdown table
encoded = tokenizer(table_text, return_tensors="pt", add_special_tokens=False)
num_tokens = encoded.input_ids.shape[-1]
num_chars = len(table_text)

print(f"Characters: {num_chars}")
print(f"Tokens: {num_tokens}")
Result:
- Characters: 2719
- Tokens: 432
⚡ Problem: While token count is fine, character count exceeds 2048, causing API errors like:
cohere.error.CohereAPIError: input text exceeds maximum allowed size of 2048 characters
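Before reaching for a fix, a simple pre-flight check at least avoids burning API calls on inputs that will be rejected anyway. MAX_CHARS and check_embeddable below are illustrative helpers of my own, not part of the Cohere SDK:
MAX_CHARS = 2048  # Cohere's per-input character limit for this model

def check_embeddable(texts: list[str]) -> list[str]:
    """Keep only the texts that fit the character limit and report the rest."""
    ok = [t for t in texts if len(t) <= MAX_CHARS]
    rejected = len(texts) - len(ok)
    if rejected:
        print(f"{rejected} chunk(s) exceed {MAX_CHARS} characters and would be rejected")
    return ok

safe_chunks = check_embeddable([table_text])  # the 2719-character table is filtered out here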
The Solution: Minifying Tables into JSON
Instead of traditional Markdown, why not store the data in a dense JSON block?
Benefits of Minifying Tables:
- Remove pipes, dashes, and whitespace: pure formatting overhead.
- Preserve semantic meaning.
- Shrink text to meet character limits safely.
Example of the compact JSON:
{"Name":["Person0","Person1","Person2",...],"Age":["20","21","22",...],"City":["City0","City1","City2",...]}
New Stats:
- Characters: 1027
- Tokens: 461
✅ Now well within Cohere’s input limit!
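How does the minification work under the hood? Here is a minimal sketch of the idea, not pymtd2json's actual implementation: parse the pipe table, keep only the cell values, and dump them as compact, column-oriented JSON.
import json

def minify_markdown_table(md: str) -> str:
    """Turn a GFM pipe table into compact, column-oriented JSON."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in md.strip().splitlines()
        if line.strip() and not set(line) <= set("|-: ")  # drop the |---| separator row
    ]
    header, body = rows[0], rows[1:]
    columns = {name: [row[i] for row in body] for i, name in enumerate(header)}
    # separators=(",", ":") removes the default padding for maximum density
    return json.dumps(columns, separators=(",", ":"), ensure_ascii=False)

minified = minify_markdown_table(table_text)
print(f"Characters: {len(minified)}")  # far below the original 2719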
Applying Minification in Practice
Want to prepare documents before chunking? Here's how you can automatically process all Markdown files:
from pathlib import Path

from llama_index import SimpleDirectoryReader
from your_minifier import MinifyMDT  # minifies Markdown tables into compact JSON

source_dir = Path("example_dir", "markdown")

# Load every Markdown file in the directory tree
documents = SimpleDirectoryReader(source_dir, required_exts=[".md"], recursive=True).load_data()

# Minify the tables in each document before chunking and embedding
doc_texts = []
for doc in documents:
    doc_texts.append(MinifyMDT(doc.text_resource.text).transform())
👉 And voilà: Your data is compact, clean, and embedding-ready!
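From here, the minified documents go through your usual chunking step and into the embedding call. A rough sketch, assuming the official Cohere Python SDK and that each text already fits the 2048-character limit:
import cohere  # assuming the official Cohere Python SDK

co = cohere.Client("YOUR_API_KEY")  # placeholder key

# doc_texts stands in here for your final, chunked inputs (each <= 2048 characters)
response = co.embed(
    texts=doc_texts,
    model="embed-multilingual-v3.0",
    input_type="search_document",
)
print(f"Embedded {len(response.embeddings)} texts")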
Final Thoughts
Working with multilingual RAG systems means optimizing every byte. Whitespace-heavy Markdown tables might look nice for humans, but they’re expensive for machine understanding.
By minifying your tables with pymtd2json, you:
- Cut down API errors.
- Reduce character overhead and stay within input limits.
- Boost overall performance.
Efficiency isn't optional; it's a superpower. 🚀