The Switchboard Telephone Speech Corpus is a corpus of spoken English consisting of almost 260 hours of speech. It was created in 1990 by Texas Instruments under a DARPA grant and released in 1992 by NIST. The corpus contains 2,400 telephone conversations among 543 US speakers (302 male, 241 female).[1][2][3] Participants did not know each other, and conversations were held on topics from a predetermined list.[4]
Switchboard-2 Phase II was collected in 1999 and includes "4,472 five-minute telephone conversations involving 679 participants".[5]
The corpus was used for the development of speech recognition algorithms.[6]
Text example:[7]
A: All right um well [laughter-uh] let's see i'm twenty
B: How old are you Lisa. Okay that i'm older
A: Yeah how old are you. Older [laughter]
B: Older than you [laughter-are]
A: [laughter-okay]
B: Okay we are supposed to talk about places we like to go so i'm gonna and where are you from where are you calling from?
A: I'm calling from uh Provo Utah but I'm from Plano Texas
B: Oh you are from Plano my sister lives in Plano yes her husband is the new Director of Admissions at uh University of Texas at Dallas
A: Oh really. Oh wow my dad used to work at UTD also
B: Yeah so I [vocalized-noise]. Anyway so where's your favorite place to go?
A: Um. Generally we just go on family vacations to Arizona my grandparents live there that's generally our usual summer vacation
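The excerpt above can be turned into plain word lists with a few regular expressions. The following is a minimal sketch, not the corpus's official tooling: the helper name `parse_turn` is hypothetical, and the reading of the markup (that `[laughter-word]` marks a word spoken while laughing, and that other bracketed tokens such as `[laughter]` and `[vocalized-noise]` are non-speech events) is an assumption inferred from the excerpt rather than taken from the transcription guidelines.

```python
import re

# Assumed markup, inferred from the excerpt: "[laughter-word]" is a word
# spoken while laughing; any other bracketed token is a non-speech event.
LAUGHED_WORD = re.compile(r"\[laughter-([^\]]+)\]")
NON_SPEECH = re.compile(r"\[[^\]]+\]")
TURN = re.compile(r"^([AB]):\s*(.*)$")


def parse_turn(line):
    """Split an 'A: text' line into (speaker, list of lowercase words)."""
    match = TURN.match(line.strip())
    if match is None:
        raise ValueError(f"not a speaker turn: {line!r}")
    speaker, text = match.groups()
    text = LAUGHED_WORD.sub(r"\1", text)  # keep the word, drop the laughter tag
    text = NON_SPEECH.sub(" ", text)      # drop the remaining event tags
    words = re.findall(r"[a-z']+", text.lower())  # also discards punctuation
    return speaker, words
```

For example, `parse_turn("A: [laughter-okay]")` yields the speaker `"A"` and the single word `"okay"`. Whether laughed words should be kept or discarded depends on the downstream task, so the two substitutions are kept separate.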
[1] "Switchboard-1 Release 2 - Linguistic Data Consortium". catalog.ldc.upenn.edu. Retrieved 26 January 2024.
[2] "Papers with Code - Switchboard-1 Corpus Dataset". paperswithcode.com. Retrieved 26 January 2024.
[3] Godfrey, John J.; Holliman, Edward C.; McDaniel, Jane (23 March 1992). "SWITCHBOARD: Telephone speech corpus for research and development". [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE Computer Society. pp. 517–520. doi:10.1109/ICASSP.1992.225858. ISBN 0-7803-0532-9. S2CID 61412708. Retrieved 26 January 2024.
[4] "NXT Swbd Overview". groups.inf.ed.ac.uk. Retrieved 26 January 2024.
[5] "Switchboard-2 Phase II - Linguistic Data Consortium". catalog.ldc.upenn.edu. Retrieved 26 January 2024.
[6] "Switchboard Transcription System". www1.icsi.berkeley.edu. Retrieved 26 January 2024.
[7] Soni, Mayank; Spillane, Brendan; Gilmartin, Emer; Saam, Christian; Cowan, Benjamin R.; Wade, Vincent (2021). "An Empirical Study of Topic Transition in Dialogue". arXiv:2111.14188 [cs.CL].