BookCorpus information

BookCorpus (sometimes called the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.^[1] It was the main corpus used to train OpenAI's initial GPT model,^[2] and it has been used as training data for other early large language models, including Google's BERT.^[3] The dataset contains around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.^[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been published by self-published ("indie") authors who priced them at free. The books were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.^[4]^[5] The original 2015 paper did not document the site from which the books were scraped, but it is now known to be Smashwords.^[5]^[1] The dataset was initially hosted on a University of Toronto webpage.^[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.^[1]

^ ^a ^b ^c Cite error: The named reference debt was invoked but never defined (see the help page).
^ Cite error: The named reference gpt-1-paper was invoked but never defined (see the help page).
^ ^a ^b Cite error: The named reference bert-paper was invoked but never defined (see the help page).
^ Cite error: The named reference bookpaper was invoked but never defined (see the help page).
^ ^a ^b ^c Cite error: The named reference swallows was invoked but never defined (see the help page).

