The Tehran Monolingual Corpus (TMC) is a large-scale monolingual corpus of Persian text. It is suited to language modeling and related research areas in natural language processing.
The corpus is extracted from the Hamshahri Corpus and the website of the ISNA news agency. The Hamshahri text was cleaned for language modeling through a series of tokenization and spell-checking steps.
TMC comprises more than 250 million words. The number of unique words that occur at least twice in the corpus is about 300 thousand, which is relatively good for a highly inflectional language like Persian.
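The two figures above (total word count, and unique words with a frequency of two or more) can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and toy transliterated input; the function name `vocabulary_stats` and the sample sentences are hypothetical and do not reflect the actual TMC tooling or data format.

```python
from collections import Counter


def vocabulary_stats(lines, min_freq=2):
    """Return (total token count, number of word types seen at least min_freq times).

    Assumption: tokens are separated by whitespace. A real Persian pipeline
    would need proper tokenization and normalization before counting.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    total_tokens = sum(counts.values())
    # Keep only word types that meet the frequency threshold.
    frequent_types = [word for word, freq in counts.items() if freq >= min_freq]
    return total_tokens, len(frequent_types)


# Toy, transliterated sample (hypothetical data, not taken from TMC).
sample = [
    "ketab khub ast",
    "ketab jadid ast",
    "in ketab ast",
]
total, unique = vocabulary_stats(sample)
print(total, unique)
```

Dropping words that occur only once is a common way to keep typos and other noise out of a vocabulary count, which is presumably why the figure is reported with that threshold.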
TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use, subject to permission from the corpus aggregator.