
MMLU information


Measuring Massive Multitask Language Understanding (MMLU) is a benchmark for evaluating language models. It consists of about 16,000 multiple-choice questions spanning 57 academic subjects, including mathematics, philosophy, law, and medicine, and is one of the most commonly used benchmarks for comparing large language models.[1]

The MMLU was released by Dan Hendrycks and a team of researchers in 2020[2] and was designed to be more challenging than then-existing benchmarks such as GLUE, on which new language models were achieving better-than-human accuracy. At the time of the MMLU's release, most existing language models performed at around the level of random chance (25%), with the best-performing GPT-3 model achieving 43.9% accuracy.[2] The developers of the MMLU estimate that human domain experts achieve around 89.8% accuracy.[2] As of 2024, some of the most powerful language models, such as Claude 3 and GPT-4, were reported to achieve accuracy scores in the mid-80s.[3]
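The 25% random-chance figure follows from each question offering four answer options, so a uniform guess is right one time in four. A minimal sketch of how accuracy on such a benchmark could be scored and compared against that baseline is given here. The answer key is synthetic, and the names `accuracy` and `random_guess_accuracy` are hypothetical helpers for illustration, not part of any official MMLU tooling.

```python
import random

# Assumption: four answer options per question, which is what a 25% chance level implies.
CHOICES = ("A", "B", "C", "D")


def accuracy(predictions, answers):
    """Return the fraction of predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def random_guess_accuracy(answers, seed=0):
    """Score a guesser that picks one of the four options uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice(CHOICES) for _ in answers]
    return accuracy(guesses, answers)


# Synthetic answer key of roughly the benchmark's size (NOT real MMLU data).
key_rng = random.Random(1)
answer_key = [key_rng.choice(CHOICES) for _ in range(16000)]

expected_chance = 1 / len(CHOICES)            # analytic baseline: 0.25
simulated = random_guess_accuracy(answer_key)  # empirical baseline, close to 0.25

print(f"analytic chance level: {expected_chance:.2%}")
print(f"simulated random guessing: {simulated:.2%}")
```

Read against this baseline, the reported 43.9% for GPT-3 sits well above chance, and well below the estimated 89.8% for human domain experts.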

  1. ^ Reference "nyt" (citation details missing in source).
  2. ^ a b c Reference "paper" (citation details missing in source).
  3. ^ Reference "claude3" (citation details missing in source).
