The Bijankhan corpus (Persian: پیکرهٔ بیجنخان) is a tagged corpus suitable for natural language processing (NLP) research on the Persian language. The collection was gathered from daily news and common texts. Its documents are categorized by subject, such as political and cultural, into about 4300 subject categories. The corpus contains about 2.6 million manually tagged words, annotated with a tag set of 550 Persian part-of-speech tags.
The Bijankhan corpus was created by the Database Research Group at the University of Tehran.[1] The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. It is named after Mahmood Bijankhan, a professor of linguistics at the University of Tehran, in recognition of his contributions in this area.
[1] "Database Research Group". Archived from the original on 2017-05-15. Retrieved 2016-12-25.