LOGBOOK LOG-301
WORKING · SOFTWARE · LLM · OPENAI · GEMINI · LANGCHAIN · ORCHESTRATION · COST-OPTIMIZATION · AI · PIPELINE

Multi-LLM Orchestration — Routing Tasks to the Right Model

Assigning different LLMs to different pipeline stages based on what each stage actually needs — extraction precision, generation quality, review accuracy — and how the model choices evolved with cost pressure.

The Problem with One Model for Everything

A pipeline that sends every task to the same LLM is easy to set up and painful to optimize. Different stages have fundamentally different requirements: extraction needs determinism and structured output, generation needs creativity and depth, review needs critical accuracy. Optimizing for one trades off against the others.

The alternative: route each stage to the model best suited for it.

The Initial Model Assignment

from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI

# Deterministic extraction — structure matters, creativity doesn't
gpt4_turbo_llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

# Context-building — light reasoning, fast
openai_llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

# Note generation — quality and depth are the whole point
gemini_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro-preview-05-06",
    temperature=0.3,
)

Three models, three roles:

Stage                              | Model           | Temp | Why
Extraction (OCR → structured JSON) | GPT-4-Turbo     | 0    | Needs consistent output, no hallucinated fields
Context builder                    | GPT-4o          | 0.3  | Fast, cheap, good at topic classification
First-pass note generation         | Gemini 2.5 Pro  | 0.3  | Best output quality on this domain
Teacher evaluation                 | GPT-4o          | 0.3  | Critical review, doesn’t need creativity

Temperature 0 for anything that needs to be reliable and repeatable. Temperature 0.3 for anything that benefits from some variation — notes that all sound identical aren’t useful study material.
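
To make the routing concrete, here is roughly how a stage binds to its model with LangChain's LCEL pipe syntax. A minimal sketch: the prompt text and input values are illustrative, and gemini_llm is the instance defined above.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Illustrative prompt; the real pipeline's prompts are longer.
note_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an MBA tutor. Write substantive, well-structured study notes."),
    ("human", "Question: {question}\n\nContext: {context}"),
])

# A stage is just prompt | model | parser, so rerouting a stage to a
# different model is a one-line change.
generation_chain = note_prompt | gemini_llm | StrOutputParser()

notes = generation_chain.invoke({
    "question": "Explain the time value of money.",
    "context": "Finance fundamentals, week 1",
})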

Why Gemini for Generation

The note generation stage was the quality bottleneck. First-pass notes had to be substantive, well-structured, and academically appropriate for first-semester MBA students. GPT-4o produced notes that were technically correct but somewhat generic.

Gemini 2.5 Pro performed noticeably better on this specific task — richer explanations, better use of examples, output that felt like a human had actually thought about the question. Empirical, not theoretical.

This is the practical argument for multi-model pipelines: you’re not locked to one provider’s strengths.

The Cost Squeeze

Running GPT-4-Turbo for extraction + GPT-4o for context + Gemini for generation + GPT-4o for review on a full exam paper of 20+ questions adds up fast. The v2 iteration simplified the model lineup:

from langchain_openai import ChatOpenAI

# v2: everything through GPT-4o-mini
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

It also disabled the two most expensive downstream stages (clarity booster, final review). The trade-off: lower per-run cost, slightly reduced note polish. For a prototype, that’s the right call. For production with real users, the teacher evaluation stage alone probably justifies Gemini on generation.
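
Toggling stages rather than deleting them keeps that trade-off explicit. A hypothetical config sketch, not the project's actual structure; the stage names are taken from the pipeline description:

# Hypothetical stage registry: model choice and on/off toggle in one
# place, so a cost squeeze becomes a config edit instead of a refactor.
STAGES = {
    "extraction":   {"llm": llm, "enabled": True},
    "context":      {"llm": llm, "enabled": True},
    "generation":   {"llm": llm, "enabled": True},
    "evaluation":   {"llm": llm, "enabled": True},
    "clarity":      {"llm": llm, "enabled": False},  # disabled in v2
    "final_review": {"llm": llm, "enabled": False},  # disabled in v2
}

def run_stage(name: str, prompt: str):
    stage = STAGES[name]
    if not stage["enabled"]:
        return None  # caller skips this stage's contribution
    return stage["llm"].invoke(prompt)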

What Each Temperature Value Does

Temperature 0 doesn’t mean the model gives the same answer every time — it means it picks the highest-probability token at each step rather than sampling. For structured extraction, this matters: you want the model to always produce valid JSON with consistent field names, not occasionally decide to rename question_text to question_content because that seemed more natural.
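
The usual guard against field-name drift is to pin the schema rather than trust the prompt. A minimal sketch with Pydantic and LangChain's with_structured_output; question_text comes from this pipeline's example, the other fields are hypothetical:

from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class ExtractedQuestion(BaseModel):
    question_number: int = Field(description="Position in the paper")  # hypothetical field
    question_text: str = Field(description="Verbatim question text")
    marks: int = Field(description="Marks allocated, if printed")      # hypothetical field

# The response is parsed and validated against the schema, so
# question_text can't silently become question_content between runs.
extractor = ChatOpenAI(
    model="gpt-4-turbo", temperature=0
).with_structured_output(ExtractedQuestion)

q = extractor.invoke("Q1. Define opportunity cost. (5 marks)")
print(q.question_text)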

Temperature 0.3 is low enough that outputs are coherent and on-topic, high enough that generating notes for 20 questions on the same subject doesn’t produce 20 identical-sounding answers.

What to Watch

Provider latency variance. Gemini API latency is less consistent than OpenAI’s. On a 20-question paper, a few slow Gemini calls stretch total pipeline time significantly. Worth measuring p95 latency per model per stage, not just average.
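
A sketch of what that measurement could look like, assuming every model call goes through one wrapper; the index into statistics.quantiles is the only subtle line:

import time
import statistics
from collections import defaultdict

latencies = defaultdict(list)  # (stage, model) -> list of seconds

def timed_invoke(stage: str, model_name: str, llm, prompt):
    start = time.perf_counter()
    result = llm.invoke(prompt)
    latencies[(stage, model_name)].append(time.perf_counter() - start)
    return result

def p95(samples):
    # n=20 yields 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(samples, n=20)[18]

# After a full paper run:
# for key, samples in latencies.items():
#     print(key, f"p95={p95(samples):.2f}s avg={statistics.mean(samples):.2f}s")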

Output format drift across providers. Each provider has slightly different conventions for how it structures JSON, handles special characters, and deals with ambiguous instructions. Pydantic validation catches format violations, but the prompts that work cleanly with GPT-4o don’t always translate directly to Gemini without tweaking.
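
One concrete instance of that drift: Gemini often wraps JSON replies in markdown code fences even when asked for bare JSON. A small normalizer, run before Pydantic validation, can absorb that difference (a sketch, assuming the fence convention holds):

import json
import re
from pydantic import ValidationError

def parse_llm_json(raw: str, schema):
    """Strip markdown fences some providers add, then validate."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return schema.model_validate(json.loads(cleaned))
    except (json.JSONDecodeError, ValidationError) as exc:
        raise ValueError(f"Unparseable LLM output: {exc}") from exc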

API key sprawl. Two providers means two keys to manage, two rate limits to track, two billing dashboards. Not a technical problem, but an operational one that compounds as more models get added.

What’s Next

  • Benchmark each stage independently: freeze the prompt, run 20 questions through each candidate model, score outputs manually
  • Add fallback logic: if Gemini returns a non-parseable response after two retries, fall back to GPT-4o for that question (sketched after this list)
  • Explore Anthropic Claude for the teacher evaluation stage — its tendency toward careful, critical reasoning might suit it better than GPT-4o-mini
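
The fallback item might look something like this sketch, reusing the hypothetical parse_llm_json helper from above and the v1 model instances:

def generate_with_fallback(prompt: str, schema, retries: int = 2):
    # Prefer Gemini for quality; tolerate two unparseable replies.
    for _ in range(retries):
        raw = gemini_llm.invoke(prompt).content
        try:
            return parse_llm_json(raw, schema)
        except ValueError:
            continue
    # Gemini struck out: take GPT-4o's reliable-but-plainer answer.
    raw = openai_llm.invoke(prompt).content
    return parse_llm_json(raw, schema)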