The MATH benchmark scores answers by exact-match on the content of \boxed{...} in the model's output. A naive regex misses 5+ percentage points of correct answers because of LaTeX edge cases. Here are the five we trip on every time and the extractor that handles them.
Why extraction matters more than you think
MATH-500 is the standard 500-question subset of the MATH benchmark: competition-level math problems across 7 subjects (Algebra, Counting & Probability, Geometry, etc.) at 5 difficulty levels. Models are expected to write a worked solution and put the final answer inside \boxed{...}. The eval extracts that boxed content and compares it to the gold answer.
It sounds trivial. It is not. We measured a 5.4 pp gap between a naive \\boxed\{([^}]*)\} regex and a depth-aware extractor on the same set of model outputs from a 27B model. The model's underlying math reasoning was identical. The score difference was entirely extraction errors.
If you're benchmarking a quantized or fine-tuned model and seeing a 1-2 pp regression you can't explain, it might be the extractor.
Edge case 1: nested braces
The simplest LaTeX answer with a fraction:
\boxed{\frac{1}{2}}
A regex like \\boxed\{([^}]*)\} matches only up to the first closing brace and returns \frac{1. That is wrong: the inner brace closes the match prematurely.
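To make the failure concrete, here is a minimal sketch that runs the naive pattern on the fraction example. The sample sentence and variable names are illustrative, not taken from any real model output.

```python
import re

# The naive pattern from the text: capture everything up to the first "}".
NAIVE_RE = re.compile(r"\\boxed\{([^}]*)\}")

output = r"The answer is \boxed{\frac{1}{2}}."
naive = NAIVE_RE.search(output).group(1)

# The capture ends at the brace that closes {1}, not the one that closes \boxed{.
print(naive)
```

The captured string is the truncated `\frac{1`, which can never equal the gold `\frac{1}{2}`, so a correct answer is scored as wrong.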
The fix: depth-track. After matching \boxed{, scan forward counting { and } until depth returns to zero:
import re

def extract_boxed(text):
    matches = list(re.finditer(r"\\boxed\s*\{", text))
    if not matches:
        return None
    m = matches[-1]  # take the LAST \boxed in case there are intermediates
    start = m.end()
    depth = 1
    i = start
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
        i += 1
    return None  # braces never balanced (e.g. truncated output)
Take the last \boxed{ because models often write intermediate boxed expressions while working ("we have so far \boxed{x^2 + 1} so squaring gives ..."). The final answer is the last one.
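The behavior described above can be checked on three small inputs: a nested fraction, an output with an intermediate box, and a truncated generation. This is a sketch only; `extract_last_boxed` is a compact, hypothetical restatement of the depth-tracking idea, repeated here so the snippet runs on its own, and the sample strings are made up.

```python
import re

def extract_last_boxed(text):
    """Depth-tracking extraction of the last \\boxed{...}; None if absent or unbalanced."""
    starts = [m.end() for m in re.finditer(r"\\boxed\s*\{", text)]
    if not starts:
        return None
    start = starts[-1]
    depth = 1
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
    return None  # ran off the end before the braces balanced

nested = extract_last_boxed(r"so the answer is \boxed{\frac{1}{2}}")
last = extract_last_boxed(r"we have \boxed{x^2 + 1} so far, hence \boxed{7}")
truncated = extract_last_boxed(r"the answer is \boxed{\frac{1}{2")
print(nested, last, truncated)
```

Note one consequence of always taking the last match: if the final \boxed{ is cut off mid-expression, the result is None even when an earlier, complete box exists. Whether to fall back to the earlier box is a scoring-policy choice.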
Edge case 2: \frac vs \dfrac vs \tfrac
Models trained on different math corpora pick different LaTeX dialects:
gold: \frac{1}{2}
model A: \frac{1}{2} ✓ exact match
model B: \dfrac{1}{2} ✗ exact match fails, but mathematically equivalent
model C: \tfrac{1}{2} ✗ same problem
\dfrac is "display fraction" (large), \tfrac is "text fraction" (small). They render differently in PDF but represent the same mathematical object. Normalize all three to \frac before comparing:
def normalize_answer(s):
    if s is None:
        return None
    s = s.replace("\\dfrac", "\\frac")
    s = s.replace("\\tfrac", "\\frac")
    # ... continue with other normalizations
    return s
This normalization alone recovers ~1.5 pp on most models we've tested.
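As a quick sanity check, the sketch below confirms that the three dialects compare equal after the substitution. The helper name `normalize_frac` is hypothetical and isolates just this one step.

```python
def normalize_frac(s):
    # Map the display (\dfrac) and text (\tfrac) variants onto plain \frac.
    return s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")

gold = r"\frac{1}{2}"
variants = [r"\frac{1}{2}", r"\dfrac{1}{2}", r"\tfrac{1}{2}"]
matches = [normalize_frac(v) == gold for v in variants]

# str.replace rewrites every occurrence, so nested fractions are covered too.
nested = normalize_frac(r"\dfrac{\tfrac{1}{2}}{3}")
print(matches, nested)
```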
Edge case 3: \text{} wrapping
Some models emit answers like \boxed{42 \text{ apples}} or \boxed{x = \text{undefined}}. Whether to count these as matches depends on the comparison rule, but the safe normalization is to strip \text{}:
s = re.sub(r"\\text\{(.+?)\}", r"\1", s)
After normalization, \boxed{42 \text{ apples}} becomes 42 apples, which can match a gold answer of 42 only if your downstream comparator does substring matching or unit-stripping. We don't recommend that: it produces too many false positives. Strip \text{} and require the remainder to match exactly.
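A small sketch of what that rule means in practice, using the same regex as above. The helper name `strip_text` is hypothetical. The unwrapped answer keeps its unit word, so it deliberately does not match a bare 42.

```python
import re

def strip_text(s):
    # Replace \text{...} with its contents. The non-greedy match stops at the
    # first "}", so this handles only \text{} bodies without nested braces.
    return re.sub(r"\\text\{(.+?)\}", r"\1", s)

units = strip_text(r"42 \text{ apples}")
symbolic = strip_text(r"x = \text{undefined}")

print(repr(units))    # the unit word survives the unwrapping
print(units == "42")  # strict comparison: not counted as a match
print(symbolic)
```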
Edge case 4: dollar signs and inline math
Models sometimes wrap their boxed answer in inline math delimiters:
\boxed{$\frac{1}{2}$}
Strip $ from both ends:
s = s.strip("$ ")
Trivial fix, but easy to miss because most outputs don't have it. The ones that do are silently wrong if you skip this.
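For reference, a short sketch of how `str.strip` behaves with a character set. It removes any run of `$` and space characters from both ends and leaves the interior untouched. The sample strings are made up.

```python
wrapped = r"$\frac{1}{2}$"
padded = " $ 42 $ "
two_parts = "$a$ or $b$"

clean_wrapped = wrapped.strip("$ ")
clean_padded = padded.strip("$ ")
clean_two_parts = two_parts.strip("$ ")  # interior "$" characters are kept

print(clean_wrapped, clean_padded, clean_two_parts)
```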
Edge case 5: whitespace and \\
Models split long expressions across multiple lines:
\boxed{
\frac{1}{2}
}
or insert \\ line breaks:
\boxed{x = 5 \\ y = 7}
Normalize whitespace:
s = re.sub(r"\s+", "", s)              # remove all whitespace, including newlines
s = re.sub(r"\\!|\\,|\\;|\\:", "", s)  # LaTeX spacing commands
s = re.sub(r"\\\\", "", s)             # \\ line breaks
After normalization, \boxed{ \frac{1}{2} } and \boxed{\frac{1}{2}} both reduce to \frac{1}{2}.
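The three substitutions can be exercised on the examples from this section. This is a sketch; the wrapper name `squash` is hypothetical and applies the steps in the order listed above.

```python
import re

def squash(s):
    s = re.sub(r"\s+", "", s)              # remove all whitespace, including newlines
    s = re.sub(r"\\!|\\,|\\;|\\:", "", s)  # LaTeX spacing commands
    s = re.sub(r"\\\\", "", s)             # \\ line breaks
    return s

multiline = squash("\n\\frac{1}{2}\n")
spaced = squash(r"x \, = \; 5")
broken = squash(r"x = 5 \\ y = 7")
print(multiline, spaced, broken)
```

Note that the two-line answer collapses into a single run, x=5y=7. That is fine for exact-match purposes as long as the gold answer goes through the same normalization.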
The combined extractor
Putting it together, this is what we ship as the MATH-500 extractor in our public eval harness:
import re

BOXED_RE = re.compile(r"\\boxed\s*\{")


def extract_boxed(text):
    # Find every \boxed{ and keep the last one (the final answer).
    matches = list(BOXED_RE.finditer(text))
    if not matches:
        return None
    m = matches[-1]
    start = m.end()
    depth = 1
    i = start
    # Scan forward, tracking brace depth, until the opening brace is closed.
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
        i += 1
    return None  # braces never balanced (e.g. truncated output)


def normalize_answer(s):
    if s is None:
        return None
    s = s.strip()
    s = re.sub(r"\\!|\\,|\\;|\\:", "", s)     # LaTeX spacing commands
    s = re.sub(r"\\\\", "", s)                # \\ line breaks
    s = re.sub(r"\\text\{(.+?)\}", r"\1", s)  # unwrap \text{...}
    s = re.sub(r"\s+", "", s)                 # remove all whitespace
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.strip("$ ")                         # inline-math delimiters
    return s.lower()                          # case-insensitive comparison


def is_correct(pred_boxed, gold_boxed):
    if pred_boxed is None or gold_boxed is None:
        return False
    return normalize_answer(pred_boxed) == normalize_answer(gold_boxed)
The shipped file is ~70 lines including the imports and a small CLI. The depth-tracker is the only nontrivial piece; the rest is normalization choices.
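A minimal end-to-end check, one case per edge case in this post, might look like the sketch below. The three functions are a condensed copy of the extractor above, repeated only so the snippet runs standalone. The model outputs and gold answers are made up for illustration.

```python
import re

def extract_boxed(text):
    starts = [m.end() for m in re.finditer(r"\\boxed\s*\{", text)]
    if not starts:
        return None
    start, depth = starts[-1], 1
    for i in range(start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i].strip()
    return None

def normalize_answer(s):
    s = s.strip()
    s = re.sub(r"\\!|\\,|\\;|\\:", "", s)
    s = re.sub(r"\\\\", "", s)
    s = re.sub(r"\\text\{(.+?)\}", r"\1", s)
    s = re.sub(r"\s+", "", s)
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    return s.strip("$ ").lower()

def is_correct(pred, gold):
    if pred is None or gold is None:
        return False
    return normalize_answer(pred) == normalize_answer(gold)

# (hypothetical model output, gold answer, verdict a human grader would give)
cases = [
    (r"Thus \boxed{\dfrac{1}{2}}", r"\frac{1}{2}", True),    # \dfrac dialect
    (r"\boxed{$\frac{1}{2}$}", r"\frac{1}{2}", True),        # inline-math wrapping
    ("\\boxed{\n  \\frac{1}{2}\n}", r"\frac{1}{2}", True),   # multi-line box
    (r"\boxed{42 \text{ apples}}", "42", False),             # units: strict mismatch
    ("no boxed answer here", "42", False),                   # nothing to extract
]
results = [is_correct(extract_boxed(out), gold) for out, gold, _ in cases]
expected = [want for _, _, want in cases]
print(results == expected)
```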
Spot-check protocol
Before trusting a MATH-500 score, spot-check 20 outputs by hand. The extractor should match human judgment for:
- 5 trivial cases (\boxed{42} style)
- 5 fraction cases (\frac, \dfrac)
- 5 with \text{} wrapping
- 5 with multi-line / whitespace mess
If your extractor disagrees with the human grader on more than 1 out of 20, it has a bug. The 5+ pp gap we measured was on a model where the extractor was missing 13 out of 100 correct answers. At that rate, a check of 20 correct outputs would be expected to surface 2 to 3 misses, above the 1-in-20 threshold.
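How likely is a 20-sample check to trip the threshold at that miss rate? The sketch below gives a rough estimate under two loud assumptions that are not from the measurements above: all 20 sampled outputs are ones a human would mark correct, and each is missed independently with probability 0.13. The stratified protocol is not a random sample, so treat the number as a guide only.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))

miss_rate = 0.13   # assumed per-sample miss probability (13 in 100)
n_samples = 20

expected_misses = miss_rate * n_samples
# "More than 1 out of 20" means at least 2 disagreements.
p_flagged = prob_at_least(2, n_samples, miss_rate)
print(f"expected misses: {expected_misses:.1f}, P(flagged): {p_flagged:.2f}")
```

Under these assumptions the check flags the extractor roughly three times out of four, so a clean result on one batch of 20 is encouraging but not proof. A second batch is cheap insurance.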
Bigger picture
MATH-500 is the cleanest reasoning benchmark we have right now for quantization studies. Exact-match scoring gives a plain n=500 binomial error (a standard error of ~2 pp, so roughly ±4 pp at 95% confidence), with no MC-shuffle noise like GPQA and no temperature-dependent variance like greedy decoding on open-ended generation. But the cleanliness depends entirely on the extractor matching human judgment. Spend 20 minutes hand-checking before you trust a number.
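The binomial figure can be reproduced with a few lines. This sketch uses the normal approximation and treats the 500 questions as independent pass/fail trials, which is an assumption. The accuracy values are illustrative, not measured results.

```python
from math import sqrt

def binomial_se(acc, n):
    """Standard error of an accuracy estimated from n independent pass/fail items."""
    return sqrt(acc * (1.0 - acc) / n)

n = 500
for acc in (0.5, 0.7, 0.9):
    se = binomial_se(acc, n)
    half_width = 1.96 * se  # normal-approximation 95% interval
    print(f"acc={acc:.0%}  SE={100 * se:.1f} pp  95% CI=+/-{100 * half_width:.1f} pp")
```

The standard error peaks at 50% accuracy and shrinks toward the extremes, so the same 1-2 pp difference is easier to trust for a model scoring near 90% than for one scoring near 50%.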
Source: open-source eval_math500.py extractor in our paper-submission scripts. Tested against MATH-500 release.
Read more: GPQA-Diamond's 4 pp Noise Floor, Mean Perplexity Is Lying to You.