BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.[1] It was the main corpus used to train the initial GPT model by OpenAI,[2] and has been used as training data for other early large language models including Google's BERT.[3] The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy.[3]
The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been released by self-published ("indie") authors who priced them at free, and they were downloaded without the consent or permission of Smashwords or its authors and in violation of the Smashwords Terms of Service.[4][5] The dataset was initially hosted on a University of Toronto webpage.[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[1] Although the original 2015 paper did not document it, the site from which the corpus's books were scraped is now known to be Smashwords.[5][1]