Three different quantization variants all scored 15/15 on automated quality tests. One couldn't translate a sentence into Spanish.
Introduction
When you quantize a large language model, the first thing you check is: "did it collapse?" Model collapse — where the model produces garbage, repetitive text, or empty responses — is the most dramatic failure mode of quantization. It's easy to detect and obviously unacceptable.
The problem is that collapse detection has become the de facto quality bar for quantization. If the model doesn't collapse, it ships. We fell into this trap, and it took manual inspection of individual responses to discover that one of our quantized models was producing factual hallucinations and language contamination — failures that are arguably worse than collapse because they're harder to detect.
The Standard Collapse Test
Our collapse test suite (representative of what's commonly used in the community) sends 15 diverse prompts to the model and checks for:
- Minimum response length — is the response at least N characters?
- Keyword presence — does the response contain expected terms? (e.g., "class" for a coding prompt)
- Repetition detection — does the response contain repeated phrases?
- Empty response check — did the model return nothing?
The 15 prompts span coding, math, reasoning, creative writing, knowledge, multilingual, and general conversation:
```python
COLLAPSE_TESTS = [
    {"domain": "coding", "prompt": "Write a Python class for a thread-safe LRU cache...",
     "min_len": 50, "check": "class"},
    {"domain": "math", "prompt": "What is the integral of x^2 * e^x dx?...",
     "min_len": 50, "check": "e^x"},
    {"domain": "multilingual", "prompt": "Translate to Spanish: 'The weather is beautiful...'",
     "min_len": 10, "check": None},
    # ... 12 more prompts
]
```
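A minimal sketch of how such a suite might apply its four checks to a single response; the warn/fail split and the 20-character repetition heuristic are illustrative assumptions, not the exact implementation:

```python
def run_collapse_check(response, min_len, check):
    """Apply the four collapse checks to one response.

    Returns "fail", "warn", or "pass". Thresholds and the warn/fail
    split are assumptions for illustration.
    """
    # Empty response check
    if not response.strip():
        return "fail"
    # Minimum response length
    if len(response) < min_len:
        return "fail"
    # Repetition detection: any 20-char window occurring 3+ times
    for i in range(max(0, len(response) - 20)):
        if response.count(response[i:i + 20]) >= 3:
            return "warn"
    # Expected keyword presence
    if check is not None and check not in response:
        return "warn"
    return "pass"
```

With min_len=10 and check=None, any ten-character response passes this function, which is precisely the gap the contaminated Spanish translation slipped through.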
The Results That Gave Us False Confidence
We tested three quantized variants of Qwen3.5-397B-A17B:
| Variant | Pass | Warn | Fail | Avg Time | Avg Length |
|---|---|---|---|---|---|
| ExpertQuant (custom kernels) | 15 | 0 | 0 | 47.3s | 1,111 chars |
| Baseline (standard 4-bit) | 15 | 1 | 0 | 8.2s | 1,139 chars |
| Hybrid (with expert pruning) | 15 | 0 | 0 | 7.7s | 973 chars |
All three pass. The Hybrid model (with 18.1% of experts pruned) actually has fewer warnings than the unpruned baseline. By every automated metric, it's the best.
What Manual Inspection Revealed
When we actually read the responses, the story changed completely.
Test 9: Spanish Translation
Prompt: "Translate the following to Spanish: 'The weather is beautiful today and I plan to go hiking in the mountains.'"
Baseline (unpruned):
El clima está hermoso hoy y planeo ir de caminata a las montañas.
Correct, natural Spanish. ✓
Hybrid (pruned):
El clima está hermoso hoy y我 plane to go hiking in the mountains.
A Chinese character (我, meaning "I") appears mid-sentence. The second half isn't even translated. ✗
Why the test passed: The collapse test for this prompt has min_len: 10 and check: None. Any response longer than 10 characters passes, regardless of content.
Test 14: Functional Programming Concepts
Prompt: "Explain what a monad is in functional programming, with a practical example in Haskell or Python."
Baseline (unpruned):
This is a great question about functional programming concepts. Let me explain monads clearly with practical examples. A monad is a design pattern in functional programming that allows for sequencing computations...
Correct explanation with examples. ✓
Hybrid (pruned):
Actually, there no such term as "monad" in functional programming. You may be confusing it with one of the following similar concepts...
The model confidently denies the existence of one of the most fundamental concepts in FP. ✗
Why the test passed: check: None — no keyword verification for this prompt. Any non-empty, non-repetitive response of sufficient length passes.
Test 5: River Crossing Puzzle
Prompt: "A farmer needs to cross a river with a fox, a chicken, and a bag of corn..."
Baseline (unpruned):
1,119 characters with complete step-by-step solution
Hybrid (pruned):
206 characters, mostly restates the problem without solving it
Why the test passed: check: "chicken" — the word "chicken" appears in the restated problem. min_len: 80 — the 206-character response exceeds the threshold.
The Taxonomy of Undetected Failures
These failures fall into categories that standard collapse tests don't address:
1. Semantic Correctness
The model produces fluent, well-formatted text that is factually wrong. "There's no such term as 'monad'" is grammatically perfect and stylistically appropriate — it's just completely false.
Detection requires: Domain-specific fact-checking, or at minimum, keyword checks for expected concepts (e.g., check: "monad" would have caught this).
2. Language Contamination
The model mixes languages or scripts inappropriately. Chinese characters in a Spanish translation are obvious to a human reader but invisible to a length/repetition checker.
Detection requires: Script detection (checking that the response uses the expected character set) or reference-based translation quality metrics.
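As a sketch, a codepoint-range filter is enough to catch CJK characters in text that should be Spanish. The allowed ranges below are an assumption for illustration; a production check might use a script-aware library instead:

```python
# Latin blocks (enough for Spanish, including accented letters and inverted
# punctuation) plus General Punctuation (curly quotes, ellipsis).
ALLOWED_RANGES = ((0x0000, 0x024F), (0x2000, 0x206F))

def has_unexpected_script(text, allowed=ALLOWED_RANGES):
    """Return True if any character falls outside the allowed Unicode ranges."""
    return any(
        not any(lo <= ord(ch) <= hi for lo, hi in allowed)
        for ch in text
    )
```

The correct baseline translation passes this check, while the contaminated response containing 我 (codepoint 0x6211) returns True.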
3. Task Abandonment
The model partially addresses the prompt but doesn't complete the task. Restating a puzzle without solving it is a sophisticated form of failure that length thresholds can't catch.
Detection requires: Task-specific completion checks (e.g., for a puzzle, verify the response contains a sequence of steps; for code, verify it compiles/runs).
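Two hedged examples of such completion checks, assuming Python output for the code case; the step-marker heuristic for the puzzle is an illustrative assumption:

```python
import ast

def python_code_parses(code):
    """Weak completion check for code: does the generated Python parse?

    Parsing catches truncated or garbled code that keyword checks miss,
    though it says nothing about runtime correctness.
    """
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def puzzle_attempts_solution(response, min_markers=3):
    """Heuristic for the river-crossing puzzle: a real solution enumerates
    trips, while a restatement of the problem does not. The marker list
    is an illustrative assumption."""
    lowered = response.lower()
    markers = ("step", "trip", "take the", "return", "bring")
    return sum(lowered.count(m) for m in markers) >= min_markers
```

Against the responses above, the 206-character restatement contains none of the step markers and fails, even though it cleared the length threshold.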
4. Quality Degradation
Responses are shorter, less detailed, or less nuanced. The Hybrid model averaged 973 characters vs the baseline's 1,139 characters — a 15% reduction. Each individual response passes the minimum length check, but the aggregate tells a story.
Detection requires: Statistical comparison of response distributions across model variants.
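A minimal sketch of such a comparison, using only the shift in mean length; a fuller version might add a rank test or compare full histograms:

```python
from statistics import mean

def length_shift(baseline_lengths, variant_lengths):
    """Relative change in mean response length between two variants.

    A sustained drop is a degradation signal even when every individual
    response clears its minimum-length check.
    """
    base = mean(baseline_lengths)
    return (mean(variant_lengths) - base) / base
```

With the averages above, length_shift([1139], [973]) is about -0.146: the roughly 15% reduction that per-response thresholds never see.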
Why This Matters for Quantization Research
The MoE quantization literature relies heavily on perplexity and downstream task accuracy to evaluate quality. But most papers use aggregate metrics — a single accuracy number across hundreds of test examples. Aggregate metrics can hide individual catastrophic failures.
Consider: a model that confidently denies the existence of monads, produces Chinese in Spanish text, and can't solve logic puzzles might still answer 97-98% of a general benchmark correctly, because those systematic failures are buried among hundreds of test examples. That 97% looks fine in a paper. It's not fine in production.
The Iceberg Problem
What collapse tests catch:
▓▓▓▓▓▓▓▓ Complete model collapse (garbage output)
▓▓▓▓▓▓▓ Repetitive text loops
▓▓▓▓▓▓ Empty responses
▓▓▓▓▓ Extremely short responses
What collapse tests miss:
░░░░░░░░ Factual hallucinations
░░░░░░░ Language contamination
░░░░░░ Task abandonment
░░░░░ Knowledge domain gaps
░░░░ Reasoning quality degradation
░░░ Nuance and detail reduction
░░ Style and tone changes
░ Subtle instruction following failures
The visible tip of the iceberg (what tests catch) is small compared to the submerged portion (what tests miss).
A Better Evaluation Protocol
Based on our experience, here's a minimum evaluation protocol for quantized models:
Level 1: Collapse Detection (automated, fast)
Standard collapse tests as a first pass. If the model collapses, nothing else matters.
- Response length thresholds
- Repetition detection
- Empty response detection
- Expected keyword presence
Time: ~2 minutes for 15 prompts
Level 2: Functional Probes (automated, medium)
Task-specific correctness checks that go beyond keyword matching:
```python
FUNCTIONAL_PROBES = [
    {
        "prompt": "Translate to Spanish: 'Hello, how are you?'",
        "check": lambda r: all(ord(c) < 0x4E00 or ord(c) > 0x9FFF for c in r),
        "name": "no_cjk_in_spanish",
    },
    {
        "prompt": "What is a monad in functional programming?",
        "check": lambda r: "monad" in r.lower() and "no such" not in r.lower(),
        "name": "monad_acknowledged",
    },
    {
        "prompt": "Solve: farmer, fox, chicken, corn river crossing",
        "check": lambda r: len(r) > 500 and any(w in r.lower() for w in ["step", "trip", "cross"]),
        "name": "puzzle_solved",
    },
]
```
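A sketch of a runner for these probes; `generate` stands in for whatever inference call your stack exposes (its name and signature are assumptions):

```python
def run_probes(generate, probes):
    """Run each functional probe and collect pass/fail results.

    `generate` is assumed to be a callable mapping a prompt string to the
    model's response string. A crashing check counts as a probe failure,
    not a harness error.
    """
    results = {}
    for probe in probes:
        response = generate(probe["prompt"])
        try:
            results[probe["name"]] = bool(probe["check"](response))
        except Exception:
            results[probe["name"]] = False
    return results
```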
Time: ~5 minutes for 20-30 probes
Level 3: Academic Benchmarks (automated, slow)
Standard benchmarks (MMLU-Pro, ARC-Challenge, GSM8K, HumanEval) with stratified sampling. Compare against the unpruned/unquantized baseline using identical evaluation protocol.
Critical: Run the same benchmark on both quantized and unquantized models. Absolute scores are less meaningful than the delta.
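The delta-first comparison can be sketched as follows; the one-point regression tolerance is an illustrative assumption to tune per benchmark:

```python
def benchmark_deltas(baseline_scores, variant_scores, tolerance=0.01):
    """Per-benchmark score deltas between baseline and quantized variant.

    Scores are fractions in [0, 1]; a drop beyond `tolerance` is flagged
    as a regression. Tune the tolerance to each benchmark's noise level.
    """
    report = {}
    for name, base in baseline_scores.items():
        delta = variant_scores[name] - base
        report[name] = {"delta": round(delta, 4), "regression": delta < -tolerance}
    return report
```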
Time: 30-90 minutes depending on model speed and sample count
Level 4: Perplexity Evaluation (automated, slow)
Measure perplexity on diverse held-out text. Perplexity is more sensitive than downstream accuracy to quantization damage because it measures every token prediction, not just the final answer.
Use text from multiple domains:
- Wikipedia (general knowledge)
- Code repositories (programming)
- Multilingual text (language capability)
- Scientific papers (specialized knowledge)
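Given per-token log-probabilities from a scoring pass (natural log; obtaining them is model-stack-specific and assumed here), perplexity is just the exponentiated average negative log-likelihood:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A model assigning probability 0.5 to every token has perplexity exactly 2; quantization damage shows up as this number drifting upward relative to the unquantized baseline on the same held-out text.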
Time: 30-60 minutes
Level 5: Manual Spot-Checks (human, slow)
Read 20-30 responses manually, focusing on:
- Prompts in the model's weakest domains (identified by Levels 2-4)
- Multilingual generation
- Niche domain knowledge
- Multi-step reasoning
This is the most time-consuming but also the most sensitive evaluation. It's how we caught the Chinese-in-Spanish and monad-denial failures.
Time: 1-2 hours
Recommendations
- Never ship a quantized model based on collapse tests alone. They're a necessary but grossly insufficient quality bar.
- Compare against the unquantized model, not absolute thresholds. A score of 74.7% on ARC-Challenge means nothing without knowing what the base model scores.
- Design adversarial probes for known quantization weaknesses. If you pruned experts, test multilingual capability. If you used aggressive quantization on attention layers, test long-range reasoning.
- Measure response distributions, not just pass/fail. The 15% reduction in average response length (973 vs 1,139 chars) is a signal that collapse tests completely miss.
- Include at least one manual inspection round. It's slow, but it catches failures that no automated test will find. Our highest-impact discovery (Chinese characters in Spanish) came from a human reading the response.
- Publish individual failures, not just aggregate scores. A paper reporting "96% on ARC-Challenge" hides whether the 4% failures are random noise or systematic capability loss. Report the specific failure modes.
Conclusion
Our experience with three quantization variants of Qwen3.5-397B demonstrates that standard collapse tests create a false sense of quality assurance. A model can:
- Score 15/15 on collapse tests with 0 warnings
- Produce Chinese characters in Spanish translations
- Confidently deny the existence of fundamental programming concepts
- Fail to solve simple logic puzzles
...all at the same time. The collapse tests check for the model's ability to produce something. They don't check whether that something is correct.
For the MoE quantization research community, we recommend treating collapse tests as the floor, not the ceiling, of quality evaluation. Invest in functional probes, perplexity measurement, and manual spot-checks. The quality issues that collapse tests miss are exactly the ones that matter most in production.
This is the final article in the ExpertQuant series. For the full technical details, code, and evaluation data, see the ExpertQuant repository.
Series Index
- Profiling Expert Activation Patterns in 512-Expert MoE Models
- Per-Expert Mixed-Bit Quantization via Mask-and-Combine Dispatch
- Expert Pruning in MoE Models — When Dead Experts Aren't Dead
- MLX Quantization on Apple Silicon — Engineering Pitfalls and Workarounds
- Layer-Level vs Expert-Level Granularity in MoE Quantization
- Why Collapse Tests Are Insufficient for Quantization Quality Assessment