
Developers run into this quickly: the "same" model, called through a different provider or routing layer, starts giving noticeably different answers.
This is not superstition. It's architecture.
Here are the real causes you can actually act on.
Routing platforms (example: OpenRouter) may route to different upstream providers based on cost/uptime/latency and can use fallbacks when errors occur. (OpenRouter)
OpenRouter documents this routing behavior explicitly.
If you don't lock routing during evaluation, you're not comparing models. You're comparing a moving target.
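A quick way to see whether routing is moving under you: send the identical request a few times and log what the router says it actually served. Below is a minimal sketch against OpenRouter's chat completions endpoint; the model slug is a placeholder, and the `model`/`provider` response fields are as OpenRouter documents them at the time of writing, so verify against the current docs.

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

payload = {
    # Placeholder model slug; substitute the model you are evaluating.
    "model": "meta-llama/llama-3.1-70b-instruct",
    "messages": [{"role": "user", "content": "Summarize the causes of output drift."}],
    "temperature": 0,
}

# Fire the identical request several times and record who actually served it.
for i in range(5):
    resp = requests.post(OPENROUTER_URL, headers=HEADERS, json=payload, timeout=60).json()
    served_model = resp.get("model")        # the concrete model slug that was used
    served_provider = resp.get("provider")  # the upstream provider, if the router reports it
    print(f"run {i}: model={served_model} provider={served_provider}")
```

If those fields change between runs, your evaluation numbers are measuring routing as much as they are measuring the model.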
Even when the checkpoint is "the same model family," providers can serve it with different inference stacks: different quantization levels, different runtimes and kernels, different batching behavior.
Quantization is a first-order lever for cost and throughput, and providers tune it. The same model label doesn't guarantee the same numerical execution.
If you're using OpenRouter specifically, its SDK docs include routing configuration and even mention filtering providers by quantization level as part of the provider selection parameters. (OpenRouter)
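If you want to hold numerical execution as constant as the router allows, you can restrict provider selection by quantization level. A hedged sketch: OpenRouter documents quantization filtering as a provider routing option, but the exact field name (`quantizations` below) and accepted values are assumptions to verify against the current docs, and the model slug is a placeholder.

```python
payload = {
    "model": "meta-llama/llama-3.1-70b-instruct",  # placeholder slug
    "messages": [{"role": "user", "content": "Test prompt"}],
    "provider": {
        # Only accept providers serving the listed quantization levels.
        # Field name and values assumed from OpenRouter's provider routing docs.
        "quantizations": ["fp16", "bf16"],
    },
}
```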
Even with identical model weights, providers often wrap your request with their own scaffolding: fixed system prompt prefixes, input filters, and moderation layers.
This can change style, refusals, verbosity, and even reasoning patterns.
Example: xAI's Grok model cards explicitly mention deploying with a fixed system prompt prefix and input filters in production API deployments. (data.x.ai)
So "same model" is not always "same prompt context."
There are two fallback types: routing to a different provider for the same model, and falling back to a different model entirely.
OpenRouter's model fallback guide explains that any error can trigger fallback, including context validation errors, moderation flags, rate limits, and downtime. The final model used is returned in the response. (OpenRouter)
This is great for reliability, but bad for consistency unless you control it.
If you can, disable fallbacks and explicitly specify provider order/only lists while evaluating, as in the sketch below. (OpenRouter)
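Here is a minimal sketch of that lockdown on OpenRouter. The `provider.order` and `provider.allow_fallbacks` fields follow OpenRouter's provider routing docs at the time of writing; the provider names and model slug are placeholders, so check both against the current docs.

```python
import os
import requests

payload = {
    "model": "meta-llama/llama-3.1-70b-instruct",   # placeholder slug
    "messages": [{"role": "user", "content": "Evaluation prompt goes here."}],
    "provider": {
        "order": ["ProviderA", "ProviderB"],  # placeholder provider names, tried in this order
        "allow_fallbacks": False,             # fail loudly instead of silently rerouting
    },
}

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json=payload,
    timeout=60,
).json()

# With fallbacks disabled, an error here is a signal worth keeping, not something to paper over.
print(resp.get("model"), resp.get("provider"))
```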
Fix temperature/top_p/max_tokens/etc. Small changes here can swamp provider differences.
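A sketch of what pinning looks like in practice: set every decoding parameter explicitly in the evaluation payload instead of inheriting per-provider defaults. The `seed` parameter is only honored where the upstream provider supports it.

```python
# Decoding parameters pinned for evaluation runs; never rely on provider defaults.
EVAL_DECODING_PARAMS = {
    "temperature": 0.0,
    "top_p": 1.0,
    "max_tokens": 1024,
    "seed": 1234,  # honored only where the upstream provider supports seeding
}

payload = {
    "model": "meta-llama/llama-3.1-70b-instruct",  # placeholder slug
    "messages": [{"role": "user", "content": "Evaluation prompt goes here."}],
    **EVAL_DECODING_PARAMS,
}
```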
Test with prompts that expose precision and wrapper differences: multi-step arithmetic, strict output formatting, long-context recall, and borderline-moderation phrasing.
If your routing layer supports it, use its request debugging to confirm the transformed request body that actually goes upstream (e.g., OpenRouter's debug option). (OpenRouter)
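One way to operationalize this is a small probe set that you re-run on every provider/routing configuration and diff. The prompts below are illustrative examples of those categories, not a validated benchmark, and `run_chat` is a placeholder for whatever client call you actually use.

```python
# Illustrative probes that tend to surface precision and wrapper differences.
PROBES = {
    "arithmetic": "Compute 48193 * 2917 and return only the final integer.",
    "strict_format": "Return exactly a JSON object with keys 'a' and 'b', no prose.",
    "long_context_recall": "Earlier in this conversation the codeword was 'lantern'. What was the codeword?",
    "borderline_moderation": "Describe, at a high level, how phishing emails are typically structured.",
}

def run_chat(prompt: str) -> str:
    """Placeholder: call your routing layer here and return the completion text."""
    raise NotImplementedError

def run_probes() -> dict[str, str]:
    # Collect outputs so two configurations can be diffed side by side.
    return {name: run_chat(prompt) for name, prompt in PROBES.items()}
```

Drift in the arithmetic or formatting probes between configurations is a hint to look at the serving stack, not your prompt.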
Quantization-related degradation usually shows up as subtle quality drift rather than obvious failures, which is exactly why targeted test prompts matter.
In production, "model quality" is not only weights + architecture.
It is also the routing layer, the quantization and inference stack, any injected prompt scaffolding, the fallback policy, and the sampling defaults sitting in front of those weights.
If you treat "model name" as the whole contract, you'll keep getting surprised. If you treat "serving stack" as part of the model, debugging becomes straightforward.
If you want consistency, your job is to reduce degrees of freedom: pin the provider, disable fallbacks, fix the sampling parameters, control or at least record the quantization level, and verify the model and provider actually returned in each response.