Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating the capabilities of language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine. It is one of the most commonly used benchmarks for comparing the capabilities of large language models.[1]
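The scoring behind these comparisons is simple: a model selects one of four answer options per question, and the benchmark reports the fraction answered correctly. The sketch below illustrates that format and computation; the Item structure, sample questions, and accuracy helper are illustrative assumptions, not the official evaluation harness or actual dataset entries.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Items and field names are illustrative, not from the real dataset.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]  # four answer options
    answer: int         # index of the correct option

items = [
    Item("Which planet is known as the Red Planet?",
         ["Venus", "Mars", "Jupiter", "Saturn"], 1),
    Item("What is the derivative of x^2?",
         ["x", "2x", "x^2", "2"], 1),
]

def accuracy(predictions: list[int], items: list[Item]) -> float:
    """Fraction of items where the predicted option index is correct."""
    correct = sum(p == it.answer for p, it in zip(predictions, items))
    return correct / len(items)

# With four options per question, guessing uniformly at random yields
# an expected accuracy of about 25% -- the random-chance baseline.
print(accuracy([1, 0], items))  # 0.5
```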
The MMLU was released by Dan Hendrycks and a team of researchers in 2020[2] and was designed to be more challenging than then-existing benchmarks such as GLUE, on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy.[2] The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy.[2] As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve scores in the mid-80s.[3]