
BookCorpus


BookCorpus (sometimes also called the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.[1] It was the main corpus used to train OpenAI's initial GPT model,[2] and it has been used as training data for other early large language models, including Google's BERT.[3] The dataset contains around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," but this description is inaccurate: the books had been published by self-published ("indie") authors who priced them at free. The books were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.[4][5] The dataset was initially hosted on a University of Toronto webpage.[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[1] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[5][1]

  1. ^ a b c Cite error: The named reference debt was invoked but never defined (see the help page).
  2. ^ Cite error: The named reference gpt-1-paper was invoked but never defined (see the help page).
  3. ^ a b Cite error: The named reference bert-paper was invoked but never defined (see the help page).
  4. ^ Cite error: The named reference bookpaper was invoked but never defined (see the help page).
  5. ^ a b c Cite error: The named reference swallows was invoked but never defined (see the help page).

