# STEM MC QA (stem_qa_mc) — Tasks, Datasets, Metrics
This document summarizes the STEM multiple‑choice QA task bundle (`stem_qa_mc`), the Hugging Face datasets it uses, and the metrics used to score it. It is self‑contained and focuses on the evaluation ideas rather than the harness implementation.
## 1) Tasks covered in stem_qa_mc

`stem_qa_mc` is a bundle of multiple‑choice STEM QA tasks. It includes:

- ARC (MC, xlarge)
  - ARC‑Easy (MC)
  - ARC‑Challenge (MC)
- MMLU STEM (MC)
  - 18 STEM subjects (see list below)
- MedMCQA (MC)
- MedQA‑English (MC)
- SciQ (MC, xlarge)
### MMLU STEM subjects (MC)

- abstract_algebra
- astronomy
- college_biology
- college_chemistry
- college_computer_science
- college_mathematics
- college_physics
- computer_security
- conceptual_physics
- electrical_engineering
- elementary_mathematics
- high_school_biology
- high_school_chemistry
- high_school_computer_science
- high_school_mathematics
- high_school_physics
- high_school_statistics
- machine_learning
## 2) Datasets used (Hugging Face dataset_path)

| Task group | Dataset (HF `dataset_path`) |
|---|---|
| ARC‑Easy / ARC‑Challenge (MC) | |
| MMLU STEM (MC) | |
| MedMCQA (MC) | |
| MedQA‑English (MC) | |
| SciQ (MC) | |
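As an illustration of how such datasets are typically accessed, the publicly hosted ARC release on the Hugging Face Hub can be loaded with the `datasets` library. The dataset ID and config shown here are assumptions for illustration; the exact `dataset_path` values configured in the bundle may differ.

```python
from datasets import load_dataset

# Illustrative only: the public ARC release on the Hugging Face Hub.
# The dataset_path/config actually configured for stem_qa_mc may differ.
arc_challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

example = arc_challenge[0]
print(example["question"])
print(example["choices"]["text"], "gold:", example["answerKey"])
```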
## 3) Metrics used and how they are calculated
All tasks in stem_qa_mc are multiple‑choice evaluations. Each question has N answer options. The model is scored by computing log‑likelihood for each option and comparing them.
### Per‑choice scores

For each choice i:

- `sum_logits_i` = sum of log‑probabilities for the choice continuation
- `num_tokens_i`, `num_chars_i`, `num_bytes_i` = length of the choice continuation in tokens, characters, and bytes
- `logits_per_token_i` = `sum_logits_i / num_tokens_i`
- `logits_per_char_i` = `sum_logits_i / num_chars_i`
- `bits_per_byte_i` = `-log2(e) * (sum_logits_i / num_bytes_i)`
Lower bits/byte is better (it corresponds to higher likelihood per byte).
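A minimal sketch of these per‑choice scores, assuming each choice continuation is available as its text plus per‑token natural‑log probabilities. The function name and inputs are illustrative, not the harness API.

```python
import math

def per_choice_scores(choice_text: str, token_logprobs: list[float]) -> dict:
    """Per-choice scores for one answer option.

    `token_logprobs` are assumed to be natural-log probabilities of the
    choice continuation's tokens under the model.
    """
    sum_logits = sum(token_logprobs)
    num_tokens = len(token_logprobs)
    num_chars = len(choice_text)
    num_bytes = len(choice_text.encode("utf-8"))
    return {
        "sum_logits": sum_logits,
        "logits_per_token": sum_logits / num_tokens,
        "logits_per_char": sum_logits / num_chars,
        # log2(e) converts nats to bits; the sign flip makes lower = better.
        "bits_per_byte": -math.log2(math.e) * (sum_logits / num_bytes),
    }
```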
### Accuracy metrics (per question)

Let `gold` be the correct option index.

- `acc_raw`: 1 if `argmax(sum_logits) == gold`, else 0
- `acc_per_token`: 1 if `argmax(logits_per_token) == gold`, else 0
- `acc_per_char`: 1 if `argmax(logits_per_char) == gold`, else 0
- `acc_per_byte`: 1 if `argmin(bits_per_byte) == gold`, else 0
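Continuing the sketch above, the per‑question accuracy flags reduce to an argmax/argmin over the per‑choice score dicts (again illustrative, not the harness code):

```python
def accuracy_metrics(choices: list[dict], gold: int) -> dict:
    """Per-question accuracy flags from one score dict per answer option
    (as produced by `per_choice_scores` above)."""
    def argmax(key: str) -> int:
        return max(range(len(choices)), key=lambda i: choices[i][key])

    def argmin(key: str) -> int:
        return min(range(len(choices)), key=lambda i: choices[i][key])

    return {
        "acc_raw": int(argmax("sum_logits") == gold),
        "acc_per_token": int(argmax("logits_per_token") == gold),
        "acc_per_char": int(argmax("logits_per_char") == gold),
        "acc_per_byte": int(argmin("bits_per_byte") == gold),  # lower bits/byte wins
    }
```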
### Optional unconditioned accuracy

Some tasks can also compute an unconditioned variant, which normalizes each choice by its likelihood without the question context (`sum_logits_uncond_i` is the log‑likelihood of the choice continuation scored on its own).

For each choice:

- `score_i = sum_logits_i - sum_logits_uncond_i`
- `acc_uncond`: 1 if `argmax(score_i) == gold`, else 0
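A sketch of this variant, assuming the unconditioned sums were obtained by scoring each choice continuation without the question prompt:

```python
def acc_uncond(sum_logits: list[float], sum_logits_uncond: list[float], gold: int) -> int:
    """1 if the choice with the largest conditioned-minus-unconditioned
    log-likelihood is the gold option, else 0."""
    scores = [c - u for c, u in zip(sum_logits, sum_logits_uncond)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return int(best == gold)
```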
### Correct‑choice log‑likelihood metrics

These are the log‑likelihood values for the gold option only:

- `sum_logits_corr` = `sum_logits_gold`
- `logits_per_token_corr` = `logits_per_token_gold`
- `logits_per_char_corr` = `logits_per_char_gold`
- `bits_per_byte_corr` = `bits_per_byte_gold`
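With the per‑choice score dicts from the earlier sketch, these are just the gold option's scores re‑keyed with a `_corr` suffix:

```python
def correct_choice_metrics(choices: list[dict], gold: int) -> dict:
    """Copy the gold option's per-choice scores under *_corr metric names."""
    return {f"{name}_corr": value for name, value in choices[gold].items()}
```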
### Aggregation to task and suite scores

- Per task: metrics are averaged across questions (mean).
- For the `stem_qa_mc` bundle: metrics are macro‑averaged across its tasks (each task contributes equally, regardless of size).

Note: Because multiple‑choice options can differ in length, the suite reports both raw accuracy and length‑normalized variants (per‑token/char/byte). The "primary" score for a specific task may differ (e.g., `acc_raw` vs `acc_per_char`), but the suite itself is aggregated by macro‑average across tasks.
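A sketch of the two aggregation steps, assuming per‑question metric dicts grouped by task (the names here are illustrative):

```python
from statistics import mean

def aggregate(per_question: dict[str, list[dict]]) -> dict:
    """Mean over questions within each task, then an unweighted (macro) mean
    over tasks so every task counts equally regardless of its size."""
    metric_names = next(iter(per_question.values()))[0].keys()
    per_task = {
        task: {m: mean(q[m] for q in rows) for m in metric_names}
        for task, rows in per_question.items()
    }
    suite = {m: mean(scores[m] for scores in per_task.values()) for m in metric_names}
    return {"per_task": per_task, "suite": suite}
```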
## 4) Prompt structure (MC STEM tasks)

All tasks in `stem_qa_mc` use the shared multiple‑choice prompt builder. The general shape is:

```
Question: {question}
A. {choice_1}
B. {choice_2}
C. {choice_3}
D. {choice_4}
Answer:
```

Some tasks may have 5 choices (A–E). The final `Answer:` is the point where the model is scored on the likelihood of each choice label.
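A sketch of this shared prompt shape; the function name is illustrative, and letters run A, B, C, ... so five‑choice questions get A–E automatically.

```python
def build_mc_prompt(question: str, choices: list[str]) -> str:
    """Render the shared MC layout: question, lettered options, trailing 'Answer:'."""
    lines = [f"Question: {question}"]
    for i, choice in enumerate(choices):
        lines.append(f"{chr(ord('A') + i)}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)
```

For example, `build_mc_prompt('What is 2+2?', ['3', '4', '5', '6'])` produces the four‑option layout shown above.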
### Per‑task notes

- **ARC‑Easy / ARC‑Challenge (MC)**: Uses the standard MC layout shown above; includes 4–5 choices depending on the question.
- **MMLU STEM (MC)**: Adds a subject‑specific intro line before the question, e.g. "The following are multiple choice questions (with answers) about {subject}."
- **MedMCQA (MC)**: Standard MC layout with 4 options (A–D).
- **MedQA‑English (MC)**: Standard MC layout, often 4 options but can be more.
- **SciQ (MC)**: Standard MC layout, but the answer choices are shuffled per example (deterministically by example index) so the correct answer is not always the same letter; see the sketch below.
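The exact shuffling scheme is not spelled out beyond being deterministic by index; one way to implement it is to seed a local RNG with the example index, as in this sketch:

```python
import random

def shuffle_choices(
    choices: list[str], answer_idx: int, example_idx: int
) -> tuple[list[str], int]:
    """Shuffle the options deterministically per example and track where
    the gold answer lands, so re-running the eval gives the same order."""
    rng = random.Random(example_idx)  # seeded per example => reproducible
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_idx)
```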
### Few‑shot formatting

When `num_shots > 0`, few‑shot examples are placed before the target question:

```
{example_1_question_and_choices}
{example_1_answer}
{example_2_question_and_choices}
{example_2_answer}
...
{current_question_and_choices}
Answer:
```
This keeps the target question in the same format while providing labeled examples above it.
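A sketch of this few‑shot assembly, reusing the illustrative `build_mc_prompt` from above and assuming each example record carries its question, choices, and gold answer index (field names are assumptions):

```python
def build_few_shot_prompt(examples: list[dict], target: dict, num_shots: int) -> str:
    """Prepend `num_shots` labeled examples before the target question, which
    keeps its usual MC layout and still ends at the scored 'Answer:' line."""
    blocks = []
    for ex in examples[:num_shots]:
        letter = chr(ord("A") + ex["answer_idx"])
        # Each example ends "Answer: X", supplying the gold letter inline.
        blocks.append(build_mc_prompt(ex["question"], ex["choices"]) + f" {letter}")
    blocks.append(build_mc_prompt(target["question"], target["choices"]))
    # Blank-line separation between examples is an assumption of this sketch.
    return "\n\n".join(blocks)
```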