First-step Evaluation Report
We evaluate models on four benchmarks: AIME 2025, GPQA Diamond, MMLU-Pro, and HLE (Humanity's Last Exam). Our evaluation methodology follows Artificial Analysis Intelligence Benchmarking.
1. Evaluation Methodology
1.1 AIME 2025 (American Invitational Mathematics Examination)
- Dataset: 30 questions from AIME 2025 I & II
- Repeats: 10 per question (300 total samples)
- Response format: Numerical answer in \boxed{}
- Scoring: pass@1, averaged over all 300 samples
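Under this repeated-sampling setup, pass@1 is simply the mean of the per-sample correctness indicators, with every attempt weighted equally. A minimal sketch (function name is ours, not from the harness):

```python
from statistics import mean

def pass_at_1(correct_flags: list[bool]) -> float:
    """pass@1 averaged over all samples: each question is attempted
    k times and every attempt counts equally toward the mean."""
    return mean(1.0 if c else 0.0 for c in correct_flags)

# 30 questions x 10 repeats = 300 binary outcomes
flags = [True] * 274 + [False] * 26
score = pass_at_1(flags)  # 274/300 ≈ 0.913
```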
Prompt template:
Solve the following math problem step by step. Put your answer inside \boxed{}.
{question}
Remember to put your answer inside \boxed{}.
Answer extraction & scoring (three stages):
- Regex extraction: extract the content of the last \boxed{} in the model output.
- Numeric comparison (fast path): if the extracted value parses as a number, compare it directly with the ground truth. This handles the majority of AIME answers (always integers 0--999).
- LLM-as-judge fallback: if numeric comparison fails (e.g. on symbolic expressions), call a judge model (default: gpt-4o-mini) to determine equivalence. The judge is prompted to perform only trivial simplifications and respond with "Yes" or "No".
Judge prompt (from OpenAI simple-evals):
Look at the following two expressions (answers to a math problem) and judge whether
they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
Yes
Expression 1: 3/2
Expression 2: 1.5
Yes
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
No
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
Yes
Expression 1: 3245/5
Expression 2: 649
No
(these are actually equal, don't mark them equivalent if you need to do nontrivial
simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
Yes
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
Yes
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
Yes
(give benefit of the doubt to units)
---
YOUR TASK
Respond with only "Yes" or "No" (without quotes). Do not include a rationale.
Expression 1: {target}
Expression 2: {extracted}
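Since the judge is instructed to reply with a bare "Yes" or "No", the harness side reduces to prompt formatting plus a prefix check. A sketch (names are ours; the few-shot examples are elided from the template, and the actual API call is omitted):

```python
JUDGE_TEMPLATE = (
    "Look at the following two expressions (answers to a math problem) and "
    "judge whether they are equivalent. Only perform trivial simplifications\n"
    "...\n"  # few-shot examples elided; see the full prompt above
    "Expression 1: {target}\n"
    "Expression 2: {extracted}\n"
)

def build_judge_prompt(target: str, extracted: str) -> str:
    return JUDGE_TEMPLATE.format(target=target, extracted=extracted)

def parse_judge_reply(reply: str) -> bool:
    """The judge is instructed to answer only 'Yes' or 'No'."""
    return reply.strip().lower().startswith("yes")
```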
1.2 GPQA Diamond (Graduate-Level Google-Proof Q&A)
- Dataset: 198 questions (Diamond subset), covering biology, physics, and chemistry
- Repeats: 5 per question (990 total samples)
- Response format: 4-option multiple choice (A--D)
- Scoring: pass@1, multi-stage regex extraction (no LLM judge)
Prompt template:
Answer the following multiple choice question. The last line of your response should
be in the following format: 'Answer: A/B/C/D' (e.g. 'Answer: A').
{question}
A) {A}
B) {B}
C) {C}
D) {D}
Answer extraction: Multi-stage regex pipeline. The primary pattern looks for the Answer: X format; if it fails, 8 fallback patterns are tried in sequence (LaTeX \boxed{}, "answer is X", a standalone letter at the end of the response, etc.). The last match is always taken, to account for self-correction. No LLM judge is needed -- scoring is purely regex-based.
Primary regex:
(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
1.3 MMLU-Pro (Multi-Task Language Understanding, Pro version)
- Dataset: 12,032 questions across diverse domains
- Repeats: 1 per question (12,032 total samples)
- Response format: Up to 10-option multiple choice (A--J)
- Scoring: pass@1, multi-stage regex extraction (no LLM judge)
Prompt template:
Answer the following multiple choice question. The last line of your response should
be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A').
{question}
A) {A}
B) {B}
C) {C}
D) {D}
E) {E}
F) {F}
G) {G}
H) {H}
I) {I}
J) {J}
Answer extraction: Same multi-stage regex pipeline as GPQA Diamond. Purely regex-based, no LLM judge.
1.4 HLE (Humanity's Last Exam)
- Dataset: 2,158 text-only questions from cais/hle (multimodal samples skipped)
- Repeats: 1
- Response format: Open-ended (exact answer)
- Scoring: LLM-as-judge (default: o3-mini), pass@1
Two-step pipeline:
Step 1 -- Model Prediction: The model generates answers with the following system prompt:
Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}
(For multiple-choice HLE questions, the system prompt says "your explanation for your answer choice"; for exact-answer questions, it says "your explanation for your final answer" with "Exact Answer" instead of "Answer".)
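The fixed Explanation/Answer/Confidence layout makes step 1's output mechanically parseable before it reaches the judge. A sketch (the regexes are ours, not from the harness; only the answer and confidence fields are extracted):

```python
import re

def parse_hle_response(text: str) -> dict:
    """Extract the answer and confidence fields from a response that
    follows the Explanation/Answer/Confidence template.
    Confidence defaults to 100 when absent, matching the judge prompt."""
    answer = re.search(r"(?im)^(?:Exact )?Answer:\s*(.+)$", text)
    conf = re.search(r"(?im)^Confidence:\s*(\d+)\s*%?", text)
    return {
        "answer": answer.group(1).strip() if answer else None,
        "confidence": int(conf.group(1)) if conf else 100,
    }
```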
Step 2 -- Judge Scoring: A judge model (default: o3-mini) evaluates correctness using structured output (response_format). The judge prompt follows the original HLE paper (Phan et al., 2025):
Judge whether the following [response] to [question] is correct or not based on the
precise and unambiguous [correct_answer] below.
[question]: {question}
[response]: {response}
Your judgement must be in the format and criteria specified below:
extracted_final_answer: The final exact answer extracted from the [response]. Put the
extracted answer as 'None' if there is no exact, final answer to extract from the
response.
[correct_answer]: {correct_answer}
reasoning: Explain why the extracted_final_answer is correct or incorrect based on
[correct_answer], focusing only on if there are meaningful differences between
[correct_answer] and the extracted_final_answer. Do not comment on any background to
the problem, do not attempt to solve the problem, do not argue for any answer
different than [correct_answer], focus only on whether the answers match.
correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given
above, or is within a small margin of error for numerical problems. Answer 'no'
otherwise, i.e. if there is any inconsistency, ambiguity, non-equivalency, or if
the extracted answer is incorrect.
confidence: The extracted confidence score between 0% and 100% from [response]. Put
100 if there is no confidence score available.
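The judge's structured output can be modeled as a small schema whose fields mirror the prompt above. A sketch of parsing the judge's JSON reply (the API call itself is omitted; field names follow the prompt, the rest is ours):

```python
import json
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    extracted_final_answer: str
    reasoning: str
    correct: str       # "yes" or "no"
    confidence: int    # 0-100, taken from the model response

def parse_verdict(raw_json: str) -> JudgeVerdict:
    """Parse the judge's structured (response_format) output."""
    data = json.loads(raw_json)
    return JudgeVerdict(
        extracted_final_answer=data["extracted_final_answer"],
        reasoning=data["reasoning"],
        correct=data["correct"].strip().lower(),
        confidence=int(data.get("confidence", 100)),
    )

def is_correct(verdict: JudgeVerdict) -> bool:
    return verdict.correct == "yes"
```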
2. Results
Each benchmark cell shows our run / Artificial Analysis. A — on the AA side indicates that Artificial Analysis has no corresponding community-uploaded result for that model and benchmark.
| Model | AIME 2025 | GPQA Diamond | MMLU-Pro | HLE |
|---|---|---|---|---|
| deepseek-reasoner | 0.913 / 0.92 | 0.840 / 0.840 | 0.852 / 0.862 | 0.2312 / 0.222 |
| Qwen3.5-4B | 0.857 / — | 0.761 / 0.77 | 0.794 / — | 0.1149 / 0.08 |
| DeepSeek-R1-Distill-Qwen-32B | 0.577 / 0.63 | 0.643 / 0.615 | 0.786 / 0.739 | 0.0538 / 0.055 |