First-step Evaluation Report
We evaluate models on four benchmarks: AIME 2025, GPQA Diamond, MMLU-Pro, and HLE (Humanity's Last Exam). Our evaluation methodology follows Artificial Analysis Intelligence Benchmarking.
1. Evaluation Methodology
1.1 AIME 2025 (American Invitational Mathematics Examination)
- Dataset: 30 questions from AIME 2025 I & II
- Repeats: 10 per question (300 total samples)
- Response format: Numerical answer in \boxed{}
- Scoring: pass@1, averaged over all 300 samples
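Under this repeated-sampling setup, pass@1 is simply the mean of the per-sample correctness indicators, with every attempt weighted equally. A minimal sketch (function name is ours, not from the harness):

```python
from statistics import mean

def pass_at_1(correct_flags: list[bool]) -> float:
    """pass@1 averaged over all samples: each question is attempted
    k times and every attempt counts equally toward the mean."""
    return mean(1.0 if c else 0.0 for c in correct_flags)

# 30 questions x 10 repeats = 300 binary outcomes
flags = [True] * 274 + [False] * 26
score = pass_at_1(flags)  # 274/300 ≈ 0.913
```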
Prompt template:
Solve the following math problem step by step. Put your answer inside \boxed{}.
{question}
Remember to put your answer inside \boxed{}.
Answer extraction & scoring (three stages):
- Regex extraction: extract the content of the last \boxed{} in the model output.
- Numeric comparison (fast path): if the extracted value parses as a number, compare it directly with the ground truth. This handles the majority of AIME answers (always integers 0--999).
- LLM-as-judge fallback: if numeric comparison fails (e.g. on symbolic expressions), call a judge model (default: gpt-4o-mini) to determine equivalence. The judge is prompted to perform only trivial simplifications and respond with "Yes" or "No".
Judge prompt (from OpenAI simple-evals):
Look at the following two expressions (answers to a math problem) and judge whether
they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
Yes
Expression 1: 3/2
Expression 2: 1.5
Yes
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
No
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
Yes
Expression 1: 3245/5
Expression 2: 649
No
(these are actually equal, don't mark them equivalent if you need to do nontrivial
simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
Yes
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
Yes
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
Yes
(give benefit of the doubt to units)
---
YOUR TASK
Respond with only "Yes" or "No" (without quotes). Do not include a rationale.
Expression 1: {target}
Expression 2: {extracted}
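Since the judge is instructed to reply with a bare "Yes" or "No", the harness side reduces to prompt formatting plus a prefix check. A sketch (names are ours; the few-shot examples are elided from the template, and the actual API call is omitted):

```python
JUDGE_TEMPLATE = (
    "Look at the following two expressions (answers to a math problem) and "
    "judge whether they are equivalent. Only perform trivial simplifications\n"
    "...\n"  # few-shot examples elided; see the full prompt above
    "Expression 1: {target}\n"
    "Expression 2: {extracted}\n"
)

def build_judge_prompt(target: str, extracted: str) -> str:
    return JUDGE_TEMPLATE.format(target=target, extracted=extracted)

def parse_judge_reply(reply: str) -> bool:
    """The judge is instructed to answer only 'Yes' or 'No'."""
    return reply.strip().lower().startswith("yes")
```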
1.2 GPQA Diamond (Graduate-Level Google-Proof Q&A)
- Dataset: 198 questions (Diamond subset), covering biology, physics, and chemistry
- Repeats: 5 per question (990 total samples)
- Response format: 4-option multiple choice (A--D)
- Scoring: pass@1, multi-stage regex extraction (no LLM judge)
Prompt template:
Answer the following multiple choice question. The last line of your response should
be in the following format: 'Answer: A/B/C/D' (e.g. 'Answer: A').
{question}
A) {A}
B) {B}
C) {C}
D) {D}
Answer extraction: Multi-stage regex pipeline. The primary pattern looks for the Answer: X format; if it fails, 8 fallback patterns are tried in sequence (LaTeX \boxed{}, "answer is X", a standalone letter at the end of the response, etc.). The last match is always taken, to account for self-correction. No LLM judge is needed -- scoring is purely regex-based.
Primary regex:
(?i)[\*\_]{0,2}Answer[\*\_]{0,2}\s*:[\s\*\_]{0,2}\s*([A-Z])(?![a-zA-Z0-9])
1.3 MMLU-Pro (Multi-Task Language Understanding, Pro version)
- Dataset: 12,032 questions across diverse domains
- Repeats: 1 per question (12,032 total samples)
- Response format: Up to 10-option multiple choice (A--J)
- Scoring: pass@1, multi-stage regex extraction (no LLM judge)
Prompt template:
Answer the following multiple choice question. The last line of your response should
be in the following format: 'Answer: A/B/C/D/E/F/G/H/I/J' (e.g. 'Answer: A').
{question}
A) {A}
B) {B}
C) {C}
D) {D}
E) {E}
F) {F}
G) {G}
H) {H}
I) {I}
J) {J}
Answer extraction: Same multi-stage regex pipeline as GPQA Diamond. Purely regex-based, no LLM judge.
1.4 HLE (Humanity's Last Exam)
- Dataset: 2,158 text-only questions from cais/hle (multimodal samples skipped)
- Repeats: 1
- Response format: Open-ended (exact answer)
- Scoring: LLM-as-judge (default: o3-mini), pass@1
Two-step pipeline:
Step 1 -- Model Prediction: The model generates answers with the following system prompt:
Your response should be in the following format:
Explanation: {your explanation for your answer choice}
Answer: {your chosen answer}
Confidence: {your confidence score between 0% and 100% for your answer}
(For multiple-choice HLE questions, the system prompt says "your explanation for your answer choice"; for exact-answer questions, it says "your explanation for your final answer" with "Exact Answer" instead of "Answer".)
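The fixed Explanation/Answer/Confidence layout makes step 1's output mechanically parseable before it reaches the judge. A sketch (the regexes are ours, not from the harness; only the answer and confidence fields are extracted):

```python
import re

def parse_hle_response(text: str) -> dict:
    """Extract the answer and confidence fields from a response that
    follows the Explanation/Answer/Confidence template.
    Confidence defaults to 100 when absent, matching the judge prompt."""
    answer = re.search(r"(?im)^(?:Exact )?Answer:\s*(.+)$", text)
    conf = re.search(r"(?im)^Confidence:\s*(\d+)\s*%?", text)
    return {
        "answer": answer.group(1).strip() if answer else None,
        "confidence": int(conf.group(1)) if conf else 100,
    }
```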
Step 2 -- Judge Scoring: A judge model (default: o3-mini) evaluates correctness using structured output (response_format). The judge prompt follows the original HLE paper (Phan et al., 2025):
Judge whether the following [response] to [question] is correct or not based on the
precise and unambiguous [correct_answer] below.
[question]: {question}
[response]: {response}
Your judgement must be in the format and criteria specified below:
extracted_final_answer: The final exact answer extracted from the [response]. Put the
extracted answer as 'None' if there is no exact, final answer to extract from the
response.
[correct_answer]: {correct_answer}
reasoning: Explain why the extracted_final_answer is correct or incorrect based on
[correct_answer], focusing only on if there are meaningful differences between
[correct_answer] and the extracted_final_answer. Do not comment on any background to
the problem, do not attempt to solve the problem, do not argue for any answer
different than [correct_answer], focus only on whether the answers match.
correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given
above, or is within a small margin of error for numerical problems. Answer 'no'
otherwise, i.e. if there is any inconsistency, ambiguity, non-equivalency, or if
the extracted answer is incorrect.
confidence: The extracted confidence score between 0% and 100% from [response]. Put
100 if there is no confidence score available.
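The judge's structured output can be modeled as a small schema whose fields mirror the prompt above. A sketch of parsing the judge's JSON reply (the API call itself is omitted; field names follow the prompt, the rest is ours):

```python
import json
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    extracted_final_answer: str
    reasoning: str
    correct: str       # "yes" or "no"
    confidence: int    # 0-100, taken from the model response

def parse_verdict(raw_json: str) -> JudgeVerdict:
    """Parse the judge's structured (response_format) output."""
    data = json.loads(raw_json)
    return JudgeVerdict(
        extracted_final_answer=data["extracted_final_answer"],
        reasoning=data["reasoning"],
        correct=data["correct"].strip().lower(),
        confidence=int(data.get("confidence", 100)),
    )

def is_correct(verdict: JudgeVerdict) -> bool:
    return verdict.correct == "yes"
```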
2. Results
Each benchmark cell shows our run / Artificial Analysis. A — on the AA side indicates that Artificial Analysis has no corresponding community-uploaded result for that model and benchmark.
| Model | AIME 2025 | GPQA Diamond | MMLU-Pro | HLE |
|---|---|---|---|---|
| deepseek-reasoner | 0.913 / 0.92 | 0.840 / 0.840 | 0.852 / 0.862 | 0.2312 / 0.222 |
| Qwen3.5-4B | 0.857 / — | 0.761 / 0.77 | 0.794 / — | 0.1149 / 0.08 |
| DeepSeek-R1-Distill-Qwen-32B | 0.577 / 0.63 | 0.643 / 0.615 | 0.786 / 0.739 | 0.0538 / 0.055 |