MATH-500 Boxed-Answer Extraction: Five Edge Cases That Tank Your Score
Eval Methodology

May 2026 · Black Sheep AI Research

The MATH benchmark scores answers by exact-match on the content of \boxed{...} in the model's output. A naive regex misses 5+ percentage points of correct answers because of LaTeX edge cases. Here are the five we trip on every time and the extractor that handles them.

Why extraction matters more than you think

MATH-500 is the standard 500-question subset of the MATH benchmark: competition-level math problems across 7 subjects (Algebra, Counting & Probability, Geometry, etc.) at 5 difficulty levels. Models are expected to write a worked solution and put the final answer inside \boxed{...}. The eval extracts that boxed content and compares it to the gold answer.

It sounds trivial. It is not. We measured a 5.4 pp gap between a naive \\boxed\{([^}]*)\} regex and a depth-aware extractor on the same set of outputs from a 27B model. The model's underlying math reasoning was identical; the score difference came entirely from extraction errors.

If you're benchmarking a quantized or fine-tuned model and seeing a 1-2 pp regression you can't explain, it might be the extractor.

Edge case 1: nested braces

The simplest LaTeX answer with a fraction:

\boxed{\frac{1}{2}}

A regex like \\boxed\{([^}]*)\} matches up to the first closing brace and returns \frac{1. That is wrong: the inner brace closes the match prematurely.
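The failure is easy to reproduce. A minimal sketch, using the naive pattern quoted above on a made-up model output (the variable names are illustrative only):

```python
import re

# The naive pattern: capture everything up to the FIRST closing brace.
NAIVE_BOXED = re.compile(r"\\boxed\{([^}]*)\}")

output = r"The answer is \boxed{\frac{1}{2}}."
naive = NAIVE_BOXED.search(output).group(1)
print(naive)  # \frac{1
```

The captured string is truncated at the brace that closes the numerator, so any comparison against the gold answer fails even though the model was right.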

The fix: depth-track. After matching \boxed{, scan forward counting { and } until depth returns to zero:

import re

def extract_boxed(text):
    # Find every \boxed{ opener; \s* tolerates "\boxed {".
    matches = list(re.finditer(r"\\boxed\s*\{", text))
    if not matches:
        return None
    m = matches[-1]  # take the LAST \boxed in case there are intermediates
    start = m.end()
    depth = 1
    i = start
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
        i += 1
    return None  # braces never balanced (e.g. truncated output)

Take the last \boxed{ because models often write intermediate boxed expressions while working ("we have so far \boxed{x^2 + 1} so squaring gives ..."). The final answer is the last one.
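To make the behavior concrete, here is a self-contained restatement of the depth-tracking idea run against four made-up outputs. The function name extract_last_boxed and the example strings are hypothetical, chosen only for this illustration.

```python
import re

def extract_last_boxed(text):
    """Return the balanced contents of the last \\boxed{...}, or None."""
    openers = list(re.finditer(r"\\boxed\s*\{", text))
    if not openers:
        return None
    start = openers[-1].end()
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
    return None  # braces never balanced, e.g. a truncated generation

examples = {
    "nested": r"so \boxed{\frac{1}{2}}",
    "intermediate": r"we have \boxed{x^2 + 1}, hence \boxed{\sqrt{5}}",
    "truncated": r"therefore \boxed{\frac{3}{4",
    "missing": "the answer is 7",
}
results = {name: extract_last_boxed(text) for name, text in examples.items()}
```

The nested case yields \frac{1}{2}, the intermediate case yields \sqrt{5} (the last box), and both the truncated and missing cases yield None, which a scorer would normally count as incorrect.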

Edge case 2: \frac vs \dfrac vs \tfrac

Models trained on different math corpora pick different LaTeX dialects:

gold:    \frac{1}{2}
model A: \frac{1}{2}    ✓ exact match
model B: \dfrac{1}{2}   ✗ exact match fails, but mathematically equivalent
model C: \tfrac{1}{2}   ✗ same problem

\dfrac is "display fraction" (large), \tfrac is "text fraction" (small). They render differently in PDF but represent the same mathematical object. Normalize all three to \frac before comparing:

def normalize_answer(s):
    if s is None:
        return None
    s = s.replace("\\dfrac", "\\frac")
    s = s.replace("\\tfrac", "\\frac")
    # ... continue with other normalizations
    return s

This single substitution recovers ~1.5 pp on most models we've tested.
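As a quick check, the three dialects can be folded with a single regex instead of two replace calls. This is a sketch of an equivalent alternative, not the shipped code; normalize_frac is a hypothetical name, and the lookahead is an added assumption that guards against touching a longer command name that merely starts with \dfrac or \tfrac.

```python
import re

def normalize_frac(s):
    # Map \dfrac and \tfrac to \frac. The replacement is written r"\\frac"
    # because a bare "\f" in a re.sub template would mean a form feed.
    return re.sub(r"\\[dt]frac(?![A-Za-z])", r"\\frac", s)

gold = normalize_frac(r"\frac{1}{2}")
model_b = normalize_frac(r"\dfrac{1}{2}")
model_c = normalize_frac(r"\tfrac{1}{2}")
print(gold == model_b == model_c)  # True
```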

Edge case 3: \text{} wrapping

Some models emit answers like \boxed{42 \text{ apples}} or \boxed{x = \text{undefined}}. Whether to count these as matches depends on the comparison rule, but the safe normalization is to strip \text{}:

s = re.sub(r"\\text\{(.+?)\}", r"\1", s)

After normalization, \boxed{42 \text{ apples}} becomes 42 apples, which can match a gold answer of 42 only if your downstream comparator does substring matching or unit-stripping. We don't recommend that: it produces too many false positives. Strip \text{} and require the remainder to match exactly.
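A small sketch of what the substitution does and does not do, using the same pattern as above (strip_text is a hypothetical wrapper name). One limitation worth knowing: the non-greedy match stops at the first closing brace, so braces nested inside \text{} are only partially unwrapped.

```python
import re

def strip_text(s):
    # Unwrap \text{...}, keeping the inner content.
    return re.sub(r"\\text\{(.+?)\}", r"\1", s)

units = strip_text(r"42 \text{ apples}")        # inner text is kept verbatim
undefined = strip_text(r"x = \text{undefined}")
nested = strip_text(r"\text{a {b} c}")          # stops at the first "}"

# Exact match after removing whitespace: the unit still blocks a match with "42".
matches_gold = units.replace(" ", "") == "42"
```

Here matches_gold is False, which is the strict behavior recommended above: an answer that carries extra words is not silently accepted.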

Edge case 4: dollar signs and inline math

Models sometimes wrap their boxed answer in inline math delimiters:

\boxed{$\frac{1}{2}$}

Strip $ from both ends:

s = s.strip("$ ")

Trivial fix, but easy to miss because most outputs don't have it. The ones that do are silently wrong if you skip this.
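For reference, str.strip with a character set removes any run of those characters from both ends and leaves the interior alone. A short sketch with made-up inputs:

```python
wrapped = r"$\frac{1}{2}$"
unwrapped = wrapped.strip("$ ")

padded = " $ 12 $ ".strip("$ ")      # mixed spaces and dollars at the ends
escaped = r"\$5".strip("$ ")         # an escaped currency sign is not at the edge
```

The escaped case is left as \$5 because the string starts with a backslash, so currency answers written that way need their own rule if you want them to match a bare number.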

Edge case 5: whitespace and \\

Models split long expressions across multiple lines:

\boxed{
    \frac{1}{2}
}

or insert \\ line breaks:

\boxed{x = 5 \\ y = 7}

Normalize whitespace:

s = re.sub(r"\s+", "", s)  # remove all whitespace
s = re.sub(r"\\!|\\,|\\;|\\:", "", s)  # latex spacing commands
s = re.sub(r"\\\\", "", s)  # line breaks

After normalization, \boxed{ \frac{1}{2} } and \boxed{\frac{1}{2}} both reduce to \frac{1}{2}.
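The three substitutions can be exercised on the two examples from this section. In this sketch squash is a hypothetical helper name wrapping them in the same order.

```python
import re

def squash(s):
    s = re.sub(r"\s+", "", s)              # remove all whitespace
    s = re.sub(r"\\!|\\,|\\;|\\:", "", s)  # LaTeX spacing commands
    s = re.sub(r"\\\\", "", s)             # \\ line breaks
    return s

multiline = squash("\n    \\frac{1}{2}\n")
single = squash(r"\frac{1}{2}")
two_part = squash(r"x = 5 \\ y = 7")
```

Note that the two-part answer collapses to x=5y=7. That is harmless for exact match as long as the gold answer goes through the same function, but the normalized string is not meant to be read as LaTeX.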

The combined extractor

Putting it together, this is what we ship as the MATH-500 extractor in our public eval harness:

import re

BOXED_RE = re.compile(r"\\boxed\s*\{")

def extract_boxed(text):
    matches = list(BOXED_RE.finditer(text))
    if not matches:
        return None
    m = matches[-1]
    start = m.end()
    depth = 1
    i = start
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
        i += 1
    return None

def normalize_answer(s):
    if s is None:
        return None
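    s = s.strip()
    s = re.sub(r"\\!|\\,|\\;|\\:", "", s)  # latex spacing commands
    s = re.sub(r"\\\\", "", s)  # line breaks
    s = re.sub(r"\\text\{(.+?)\}", r"\1", s)  # unwrap \text{...}
    s = re.sub(r"\s+", "", s)  # remove all whitespace
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.strip("$ ")  # inline-math delimiters
    return s.lower()  # case-insensitive; also folds LaTeX command names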

def is_correct(pred_boxed, gold_boxed):
    if pred_boxed is None or gold_boxed is None:
        return False
    return normalize_answer(pred_boxed) == normalize_answer(gold_boxed)

The shipped file is ~70 lines including the imports and a small CLI, which the listing above omits. The depth-tracker is the only nontrivial piece; the rest is normalization choices.
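One of those normalization choices deserves a second look: the final lowercasing also folds LaTeX command names, so \Delta and \delta compare equal. If that matters for your answer set, one possible variant (an assumption on our part, not what the shipped extractor does) lowercases only text outside backslash commands. The name lower_outside_commands is hypothetical.

```python
import re

def lower_outside_commands(s):
    # Group 1 is a LaTeX command (kept as-is); group 2 is a run of
    # uppercase letters outside any command (lowercased).
    return re.sub(
        r"(\\[A-Za-z]+)|([A-Z]+)",
        lambda m: m.group(1) or m.group(2).lower(),
        s,
    )

collides = r"\Delta".lower() == r"\delta".lower()   # plain .lower() conflates them
kept = lower_outside_commands(r"\Delta")
folded = lower_outside_commands(r"\text{Yes}")
```

With this variant, word answers such as Yes and yes still match, while distinct Greek letters stay distinct.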

Spot-check protocol

Before trusting a MATH-500 score, spot-check 20 outputs by hand. The extractor should match human judgment on each one.

If your extractor disagrees with the human grader on more than 1 out of 20, it has a bug. The 5+ pp gap we measured was on a model where the extractor was missing 13 out of 100 correct answers, a miss rate high enough that a 20-sample check will usually surface it.
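How reliably a 20-sample check trips the "more than 1 disagreement" rule can be estimated with a binomial model. The sketch below assumes independent samples and a fixed per-sample disagreement rate; which rate applies depends on how you sample (0.13 if you only inspect answers a human marks correct, 0.054 if you sample from all outputs). The function name prob_flagged is hypothetical.

```python
from math import comb

def prob_flagged(rate, n=20, allowed=1):
    """P(more than `allowed` disagreements among n independent samples)."""
    p_within = sum(
        comb(n, k) * rate**k * (1 - rate) ** (n - k)
        for k in range(allowed + 1)
    )
    return 1 - p_within

from_correct_only = prob_flagged(0.13)   # sampling answers a human marks correct
from_all_outputs = prob_flagged(0.054)   # sampling uniformly from all outputs
```

Under these assumptions the check fires about 75% of the time in the first case and about 29% in the second, so it pays to bias the hand-check toward outputs the extractor scored as wrong, or to check more than 20.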

Bigger picture

MATH-500 is the cleanest reasoning benchmark we have right now for quantization studies. Exact-match scoring means the only sampling noise is the n=500 binomial term (a standard error of about 2 pp, so roughly ±4 pp at 95% confidence for mid-range accuracies): no MC-shuffle noise like GPQA, and none of the temperature-dependent variance of open-ended generation, provided you decode greedily. But the cleanliness depends entirely on the extractor matching human judgment. Spend 20 minutes hand-checking before you trust a number.
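The binomial noise at n=500 is simple to compute for your own accuracy. A minimal sketch using the normal approximation (reasonable away from 0% and 100%); the function names are illustrative.

```python
from math import sqrt

def binomial_se(acc, n=500):
    # Standard error of an accuracy estimated from n independent questions.
    return sqrt(acc * (1 - acc) / n)

def normal_ci95(acc, n=500):
    # Normal-approximation 95% interval around the observed accuracy.
    half = 1.96 * binomial_se(acc, n)
    return acc - half, acc + half

se_mid = binomial_se(0.5)       # worst case: accuracy near 50%
low, high = normal_ci95(0.5)
```

At 50% accuracy the standard error is about 2.2 pp and the 95% interval spans roughly 45.6% to 54.4%, so a 1-2 pp difference between two runs is within noise unless the comparison is paired per question.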


Source: open-source eval_math500.py extractor in our paper-submission scripts. Tested against MATH-500 release.

Read more: GPQA-Diamond's 4 pp Noise Floor, Mean Perplexity Is Lying to You.