Three different quantization variants all scored 15/15 on automated quality tests. One couldn't translate a sentence into Spanish.
Introduction
When you quantize a large language model, the first question is: "did it collapse?" Model collapse, where the model spits out garbage, loops on repeated text, or returns nothing, is the most dramatic failure mode. It's easy to spot and obviously unacceptable.
The problem is that collapse detection has become the de facto quality bar for quantization. If the model doesn't collapse, it ships. We fell into this trap ourselves. It took manually reading individual responses to discover that one of our quantized models was hallucinating facts and mixing languages, failures that are arguably worse than collapse because they're so much harder to catch.
The Standard Collapse Test
Our collapse test suite (representative of what's commonly used in the community) sends 15 diverse prompts to the model and checks for:
- Minimum response length: is the response at least N characters?
- Keyword presence: does the response contain expected terms (e.g., "class" for a coding prompt)?
- Repetition detection: does the response contain repeated phrases?
- Empty response check: did the model return nothing?
The 15 prompts span coding, math, reasoning, creative writing, knowledge, multilingual, and general conversation.
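The four checks listed above can be sketched in a few lines of Python. This is an illustrative reconstruction, not our actual test harness: the function names, the 5-word repetition window, and the default thresholds are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CollapseResult:
    passed: bool
    reasons: list = field(default_factory=list)


def has_repeated_phrase(text, ngram=5, max_repeats=3):
    """True if any run of `ngram` consecutive words occurs more than `max_repeats` times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


def collapse_check(response, min_chars=50, keywords=()):
    """Hypothetical collapse check: empty, too short, missing keywords, or looping."""
    stripped = response.strip()
    if not stripped:
        return CollapseResult(False, ["empty response"])
    reasons = []
    if len(stripped) < min_chars:
        reasons.append(f"shorter than {min_chars} characters")
    lowered = stripped.lower()
    missing = [k for k in keywords if k.lower() not in lowered]
    if missing:
        reasons.append(f"missing keywords: {missing}")
    if has_repeated_phrase(stripped):
        reasons.append("repeated phrase")
    return CollapseResult(not reasons, reasons)
```

Note that nothing in `collapse_check` looks at what the response actually says beyond substring matches, which is the gap the rest of this article is about.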
The Results That Gave Us False Confidence
We tested three quantized variants of a large MoE model:
| Variant | Pass | Warn | Fail | Avg Time | Avg Length |
|---|---|---|---|---|---|
| Quantized variant A (custom kernels) | 15 | 0 | 0 | 47.3s | 1,111 chars |
| Quantized variant B (standard 4-bit) | 15 | 1 | 0 | 8.2s | 1,139 chars |
| Quantized variant C (with expert pruning) | 15 | 0 | 0 | 7.7s | 973 chars |
All three pass. The pruned variant actually has fewer warnings than the unpruned baseline. By every automated metric, it looks like the best one.
What Manual Inspection Revealed
When we actually read the responses, the picture changed completely.
Test 9: Spanish Translation
Prompt: "Translate the following to Spanish: 'The weather is beautiful today and I plan to go hiking in the mountains.'"
Unpruned baseline:
El clima está hermoso hoy y planeo ir de caminata a las montañas.
Correct, natural Spanish. ✓
Pruned quantized variant:
El clima está hermoso hoy y我 plane to go hiking in the mountains.
A Chinese character (我, meaning "I") appears mid-sentence. The second half isn't even translated. ✗
Why the test passed: The collapse test for this prompt only checks minimum response length. No content validation at all. Any response longer than a few characters passes, regardless of what it says.
Test 14: Functional Programming Concepts
Prompt: "Explain what a monad is in functional programming, with a practical example in Haskell or Python."
Unpruned baseline:
This is a great question about functional programming concepts. Let me explain monads clearly with practical examples. A monad is a design pattern in functional programming that allows for sequencing computations...
Correct explanation with examples. ✓
Pruned quantized variant:
Actually, there's no such term as "monad" in functional programming. You may be confusing it with one of the following similar concepts...
The model confidently denies the existence of one of the most fundamental concepts in FP. ✗
Why the test passed: No keyword verification was configured for this prompt. Any non-empty, non-repetitive response of sufficient length passes.
Test 5: River Crossing Puzzle
Prompt: "A farmer needs to cross a river with a fox, a chicken, and a bag of corn..."
Unpruned baseline:
1,119 characters with complete step-by-step solution
Pruned quantized variant:
206 characters, mostly restates the problem without solving it
Why the test passed: The keyword check looks for "chicken," which appears in the restated problem. The response clears the minimum length threshold. Neither check verifies the puzzle was actually solved.
The Taxonomy of Undetected Failures
These failures fall into categories that standard collapse tests simply don't cover:
1. Semantic Correctness
The model produces fluent, well-formatted text that is factually wrong. "There's no such term as 'monad'" is grammatically perfect and stylistically appropriate. It's just completely false.
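Detection requires: Domain-specific fact-checking, or at minimum, content checks that go beyond a single keyword. Checking that the response mentions "monad" would not have caught this, because the denial itself contains the word; a check for denial phrasing, or for the supporting concepts a real explanation would use, would have.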
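A minimal sketch of such a check follows. The denial quoted above contains the word "monad" itself, so the sketch pairs the keyword check with a denial-phrase check and a list of supporting terms. The phrase list, the function names, and the choice of supporting terms are assumptions for illustration, and a pattern list like this will miss denials worded differently.

```python
import re

# Hypothetical denial phrasings; extend for your own prompts.
DENIAL_PATTERNS = [
    r"\bno such (term|thing|concept)\b",
    r"\b(does not|doesn't|do not|don't) exist\b",
    r"\byou may be confusing\b",
]


def denies_concept(response):
    """True if the response matches any of the denial phrasings."""
    return any(re.search(p, response, re.IGNORECASE) for p in DENIAL_PATTERNS)


def concept_probe(response, required_terms, supporting_terms, min_supporting=1):
    """Pass only if the concept is named, not denied, and at least
    `min_supporting` of the supporting terms appear."""
    if denies_concept(response):
        return False
    text = response.lower()
    if not all(term.lower() in text for term in required_terms):
        return False
    hits = sum(1 for term in supporting_terms if term.lower() in text)
    return hits >= min_supporting
```

Run against the two Test 14 responses with `required_terms=["monad"]` and supporting terms such as "sequencing" or "bind", the baseline explanation passes and the denial fails.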
2. Language Contamination
The model mixes languages or scripts where it shouldn't. Chinese characters in a Spanish translation are obvious to a human reader but invisible to a length/repetition checker.
Detection requires: Script detection (checking that the response uses the expected character set) or reference-based translation quality metrics.
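A rough script check can be built from the standard library alone. The sketch below labels each alphabetic character by the first word of its Unicode name ("LATIN", "CJK", "CYRILLIC", and so on). That labeling is a heuristic rather than a proper Unicode script property lookup, and the function names are hypothetical.

```python
import unicodedata


def scripts_used(text):
    """Approximate set of scripts in `text`, using the first word of each
    alphabetic character's Unicode name as the script label."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def unexpected_scripts(text, allowed=frozenset({"LATIN"})):
    """Scripts present in `text` that are not in the allowed set."""
    return scripts_used(text) - allowed
```

For the Test 9 output, `unexpected_scripts` returns `{"CJK"}`, while the baseline translation returns an empty set. This catches the stray character but not the untranslated English half of the sentence, since English and Spanish share the Latin script; that part needs a language-identification or reference-based check.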
3. Task Abandonment
The model partially addresses the prompt but doesn't finish the job. Restating a puzzle without solving it is a subtle form of failure that length thresholds can't catch.
Detection requires: Task-specific completion checks (for a puzzle, verify the response contains a sequence of steps; for code, verify it compiles or runs).
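For a puzzle prompt, one crude completion check is to count enumerated steps. The sketch below assumes solutions are written as numbered lines ("1.", "2)") or "Step N" lines, which will not hold for every model or prompt; the pattern and the `min_steps` default are illustrative assumptions.

```python
import re

# Matches lines that start with "Step 3", "3." or "3)" (hypothetical convention).
STEP_PATTERN = re.compile(r"^[ \t]*(?:step\s*\d+|\d+[.)])", re.IGNORECASE | re.MULTILINE)


def count_steps(response):
    """Number of lines that look like an enumerated step."""
    return len(STEP_PATTERN.findall(response))


def looks_solved(response, min_steps=3):
    """Heuristic: a solved multi-step puzzle should enumerate at least `min_steps` steps."""
    return count_steps(response) >= min_steps
```

A response that only restates the river-crossing setup contains no enumerated steps and fails this check, even though it contains "chicken" and clears a small length threshold. The check says nothing about whether the steps are correct; it only separates an attempt at a solution from a restatement.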
4. Quality Degradation
Responses are shorter, less detailed, or less nuanced. The pruned variant averaged 973 characters vs the baseline's 1,139 characters, a 15% reduction. Each individual response passes the minimum length check, but the aggregate tells a story.
Detection requires: Statistical comparison of response distributions across model variants.
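As a minimal sketch of that comparison, the code below computes the relative change in mean length and flags individual prompts where the variant's response is much shorter than the baseline's. The function names and the 0.5 flag ratio are assumptions; a fuller analysis would compare whole distributions rather than a mean and a per-prompt ratio.

```python
from statistics import mean, median


def relative_change(baseline, variant):
    """Signed relative change of `variant` with respect to `baseline`."""
    return (variant - baseline) / baseline


def length_report(baseline_lengths, variant_lengths, flag_ratio=0.5):
    """Compare per-prompt response lengths (same prompt order in both lists)."""
    if len(baseline_lengths) != len(variant_lengths):
        raise ValueError("length lists must cover the same prompts")
    pairs = [(b, v) for b, v in zip(baseline_lengths, variant_lengths) if b > 0]
    if not pairs:
        raise ValueError("need at least one non-empty baseline response")
    ratios = [v / b for b, v in pairs]
    flagged = [
        i for i, (b, v) in enumerate(zip(baseline_lengths, variant_lengths))
        if b > 0 and v / b < flag_ratio
    ]
    return {
        "mean_change": relative_change(mean(baseline_lengths), mean(variant_lengths)),
        "median_ratio": median(ratios),
        "flagged_prompts": flagged,
    }
```

With the averages from the table, `relative_change(1139, 973)` is about -0.146, the 15% reduction quoted above. On the per-prompt side, the river-crossing pair (1,119 vs 206 characters) has a ratio of about 0.18 and would be flagged at the default threshold.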
Why This Matters for Quantization Research
MoE quantization papers typically rely on perplexity and downstream task accuracy to evaluate quality. But most report aggregate metrics: a single accuracy number across hundreds of test examples. Aggregate metrics can hide individual catastrophic failures.
Think about it: a model that confidently denies monads exist, produces Chinese in Spanish text, and can't solve logic puzzles can still score around 97% on a general benchmark, as long as those failures touch only a few percent of the test items. That looks fine in a paper. It's not fine in production.
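To make the aggregate-versus-breakdown point concrete, here is a small sketch that reports accuracy per category next to the overall number. The `(category, is_correct)` input format and the function name are assumptions for illustration, not the output format of any particular benchmark harness.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs.
    Returns (overall_accuracy, {category: accuracy})."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(bool(is_correct))
    n = sum(totals.values())
    if n == 0:
        raise ValueError("no results to score")
    overall = sum(correct.values()) / n
    per_category = {c: correct[c] / totals[c] for c in totals}
    return overall, per_category
```

If 97 of 100 items are correct and the three misses all sit in one small category, the overall score is 0.97 while that category scores 0.0. The single number and the breakdown describe the same run; only the breakdown shows a systematic loss.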
The Iceberg Problem
What collapse tests catch:
▓▓▓▓▓▓▓▓ Complete model collapse (garbage output)
▓▓▓▓▓▓▓ Repetitive text loops
▓▓▓▓▓▓ Empty responses
▓▓▓▓▓ Extremely short responses
What collapse tests miss:
░░░░░░░░ Factual hallucinations
░░░░░░░ Language contamination
░░░░░░ Task abandonment
░░░░░ Knowledge domain gaps
░░░░ Reasoning quality degradation
░░░ Nuance and detail reduction
░░ Style and tone changes
░ Subtle instruction following failures
What the tests catch is the tip of the iceberg. The submerged portion is much larger.
A Better Evaluation Protocol
Based on our experience, here's a minimum evaluation protocol for quantized models:
Level 1: Collapse Detection (automated, fast)
Standard collapse tests as a first pass. If the model collapses, nothing else matters.
- Response length thresholds
- Repetition detection
- Empty response detection
- Expected keyword presence
Time: ~2 minutes for 15 prompts
Level 2: Functional Probes (automated, medium)
Task-specific correctness checks that go beyond keyword matching. Verify that translations don't contain characters from unrelated scripts. Check that questions about real concepts get an explanation rather than a denial that the concept exists. Confirm that puzzle-solving prompts produce step-by-step solutions, not mere restatements.
Time: ~5 minutes for 20-30 probes
Level 3: Academic Benchmarks (automated, slow)
Standard benchmarks (MMLU-Pro, ARC-Challenge, GSM8K, HumanEval) with stratified sampling. Compare against the unpruned/unquantized baseline using an identical evaluation protocol.
Critical: Run the same benchmark on both quantized and unquantized models. Absolute scores mean little without knowing the delta.
Time: 30-90 minutes depending on model speed and sample count
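The baseline comparison can be sketched as a small delta report. The sketch assumes scores are percentages keyed by benchmark name and that both models were run with the same protocol; the function name and the one-point regression threshold are hypothetical choices, and a real threshold should account for sampling noise at your sample count.

```python
def score_deltas(baseline, quantized, max_drop_points=1.0):
    """Per-benchmark delta (quantized minus baseline, in percentage points).
    A benchmark is marked regressed if it drops by more than `max_drop_points`."""
    mismatched = set(baseline) ^ set(quantized)
    if mismatched:
        raise ValueError(f"benchmarks missing from one side: {sorted(mismatched)}")
    report = {}
    for name in sorted(baseline):
        delta = quantized[name] - baseline[name]
        report[name] = {
            "baseline": baseline[name],
            "quantized": quantized[name],
            "delta": delta,
            "regressed": delta < -max_drop_points,
        }
    return report
```

Raising on mismatched benchmark sets is deliberate: a benchmark that was run on only one of the two models produces an absolute score with no delta, which is the situation this level is meant to rule out.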
Level 4: Perplexity Evaluation (automated, slow)
Measure perplexity on diverse held-out text. Perplexity is more sensitive than downstream accuracy to quantization damage because it measures every token prediction, not just the final answer.
Use text from multiple domains:
- Wikipedia (general knowledge)
- Code repositories (programming)
- Multilingual text (language capability)
- Scientific papers (specialized knowledge)
Time: 30-60 minutes
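Perplexity is the exponential of the negative mean log-probability the model assigns to each token of the held-out text. The sketch below assumes you have already extracted per-token natural-log probabilities from your inference stack (how to do that depends on the framework and is not shown), and compares quantized against baseline per domain. The function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp(-mean(log p)) over natural-log token probabilities."""
    logprobs = list(token_logprobs)
    if not logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(logprobs) / len(logprobs))


def perplexity_ratio_by_domain(baseline, quantized):
    """Both arguments map domain name -> token log-probabilities on the same text.
    Returns quantized perplexity divided by baseline perplexity, per domain."""
    return {
        domain: perplexity(quantized[domain]) / perplexity(baseline[domain])
        for domain in baseline
    }
```

A ratio near 1.0 in every domain suggests the quantized model predicts the text about as well as the baseline. A ratio that is clearly higher in one domain, for example multilingual text but not Wikipedia, points at the domain to prioritize in manual spot-checks.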
Level 5: Manual Spot-Checks (human, slow)
Read 20-30 responses manually, focusing on:
- Prompts in the model's weakest domains (identified by Levels 2-4)
- Multilingual generation
- Niche domain knowledge
- Multi-step reasoning
This is the slowest step but also the most sensitive. It's how we caught the Chinese-in-Spanish and monad-denial failures.
Time: 1-2 hours
Recommendations
- Never ship a quantized model based on collapse tests alone. They're a necessary but grossly insufficient quality bar.
- Compare against the unquantized model, not absolute thresholds. A score of 74.7% on ARC-Challenge means nothing without knowing what the base model scores.
- Design adversarial probes for known quantization weaknesses. If you pruned experts, test multilingual capability. If you aggressively quantized attention layers, test long-range reasoning.
- Measure response distributions, not just pass/fail. The 15% drop in average response length (973 vs 1,139 chars) is a signal that collapse tests completely miss.
- Include at least one manual inspection round. It's slow, but it catches failures that your automated tests weren't written to look for. Our highest-impact discovery (Chinese characters in Spanish) came from a human reading the response.
- Publish individual failures, not just aggregate scores. A paper reporting "96% on ARC-Challenge" hides whether the 4% failures are random noise or systematic capability loss. Report the specific failure modes.
Conclusion
Our experience with three quantization variants of a large MoE model shows that standard collapse tests create a false sense of security. A model can:
- Score 15/15 on collapse tests with 0 warnings
- Produce Chinese characters in Spanish translations
- Confidently deny the existence of fundamental programming concepts
- Fail to solve simple logic puzzles
...all at the same time. Collapse tests verify that the model can produce something. They don't verify that the something is correct.
For anyone doing MoE quantization research, treat collapse tests as the floor, not the ceiling. Invest in functional probes, perplexity measurement, and manual spot-checks. The quality issues that collapse tests miss are exactly the ones that matter most in production.
This is the final article in the MoE quantization series. For the full technical details, code, and evaluation data, see our research repository.
Series Index
- Profiling Expert Activation Patterns in 512-Expert MoE Models
- Per-Expert Mixed-Bit Quantization via Mask-and-Combine Dispatch
- Expert Pruning in MoE Models, When Dead Experts Aren't Dead
- MLX Quantization on Apple Silicon, Engineering Pitfalls and Workarounds
- Layer-Level vs Expert-Level Granularity in MoE Quantization
- Why Collapse Tests Are Insufficient for Quantization Quality Assessment
Read the Full Paper
The full MoE expert quantization paper, covering expert activation profiling, per-expert mixed-bit allocation, and evaluation across 512-expert architectures, is available on our HuggingFace:
MoE Expert Quantization: Per-Expert Mixed-Precision for Mixture-of-Experts Models, Full Paper
huggingface.co/spaces/baa-ai/MoE-Expert-Quantization
Licensed under CC BY-NC-ND 4.0