Tehran Monolingual Corpus information


The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited to language modeling and related research areas in natural language processing.

The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri Corpus text was improved for language modeling through a series of tokenization and spell-checking steps.

TMC comprises more than 250 million words. The corpus contains about 300 thousand unique words (counting only words with a frequency of two or more), which is relatively good for a highly inflected language like Persian.
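The vocabulary figure above counts only word types that occur at least twice. A minimal sketch of that kind of count is shown below, assuming the text has already been tokenized; the function name `vocabulary_size` and the toy English sample are hypothetical and are not taken from TMC or its tooling.

```python
from collections import Counter


def vocabulary_size(tokens, min_freq=2):
    """Count distinct word types that occur at least min_freq times."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy whitespace-tokenized sample (hypothetical; not drawn from TMC).
sample = "the cat sat on the mat and the dog sat too".split()

total_tokens = len(sample)                       # corpus size in running words
vocab = vocabulary_size(sample)                  # types seen two or more times
all_types = vocabulary_size(sample, min_freq=1)  # every distinct type

print(total_tokens, vocab, all_types)
```

Applying a minimum frequency of two drops words seen only once, many of which tend to be typos or rare forms, so the resulting count is lower than the raw number of distinct types.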

TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use after obtaining permission from the corpus aggregator.

Related articles for: Tehran Monolingual Corpus

Enron Corpus

The Enron Corpus is a database of over 600,000 emails generated by 158 employees of the Enron Corporation in the years leading up to the company's collapse...

Word Count : 712

British National Corpus

became available for commercial and academic research. The BNC is a monolingual corpus, as it records samples of language use in British English only, although...

Word Count : 3894

Brown Corpus

The Brown University Standard Corpus of Present-Day American English, better known as simply the Brown Corpus, is an electronic collection of text samples...

Word Count : 1056

Oxford English Corpus

The Oxford English Corpus (OEC) is a text corpus of 21st-century English, used by the makers of the Oxford English Dictionary and by Oxford University...

Word Count : 345

Cambridge English Corpus

The Cambridge International Corpus (CIC) is a collection of over 800 million words of real spoken and written English. The texts are stored in a database...

Word Count : 1016

COBUILD

electronic corpus of contemporary text, the Collins Corpus, later leading to the development of the Bank of English, and the production of the monolingual learner's...

Word Count : 175

TMC

Company, a defunct British automotive manufacturer; Tehran Monolingual Corpus, a Persian monolingual text corpus, Iran; Texas Medical Center, a medical institution...

Word Count : 388

Corpus of Contemporary American English

The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired...

Word Count : 1135

Bank of English

English (BoE) is a representative subset of the 4.5-billion-word COBUILD corpus, a collection of English texts. These are mainly British in origin, but...

Word Count : 147

Hamshahri Corpus

tasks). The corpus is available for download in XML format. See also: Bijankhan Corpus; Persian Today Corpus; Tehran Monolingual Corpus; Text corpus; Information retrieval...

Word Count : 333

Quranic Arabic Corpus

The Quranic Arabic Corpus (Arabic: المدونة القرآنية العربية, romanized: al-modwana al-Qurʾāni al-ʿArabiyya) is an annotated linguistic resource consisting...

Word Count : 599

Tatoeba

September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus — a public-domain compilation released in 2001 by Hyogo University professor...

Word Count : 2056

PropBank

is a corpus that is annotated with verbal propositions and their arguments—a "proposition bank". Although "PropBank" refers to a specific corpus produced...

Word Count : 377

Thesaurus Linguae Graecae

lemmatization of the Greek corpus (2006) – a substantial undertaking, given the highly inflected nature of Greek and the complexity of the corpus, covering more than...

Word Count : 596

Bijankhan Corpus

University of Tehran. The corpus is non-free in that it is not free for commercial use, although these restrictions vary by country. The Bijankhan corpus is named...

Word Count : 161

Sketch Engine

from monolingual or bilingual texts. The terminology extraction feature provides a list of relevant terms based on comparison with a large corpus of general...

Word Count : 1419

International Corpus of English

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups...

Word Count : 1229

Croatian Language Corpus

The Croatian Language Corpus (CLC) (Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and...

Word Count : 481

Switchboard Telephone Speech Corpus

The Switchboard Telephone Speech Corpus is a corpus of spoken English consisting of almost 260 hours of speech. It was created in 1990 by Texas...

Word Count : 453

TIMIT

TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects. Each transcribed element...

Word Count : 561

American National Corpus

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently...

Word Count : 605

Arabic Speech Corpus

The Arabic Speech Corpus is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions...

Word Count : 388

CHILDES

The Child Language Data Exchange System (CHILDES) is a corpus established in 1984 by Brian MacWhinney and Catherine Snow to serve as a central repository...

Word Count : 521

Spoken English Corpus

Spoken English Corpus (SEC) is a speech corpus collection of recordings of spoken British English compiled during 1984–1987. The corpus manual can be found...

Word Count : 1278

Slovenian National Corpus

The Slovenian National Corpus FidaPLUS is a 621-million-word (token) corpus of the Slovenian language, gathered from selected texts written in Slovenian...

Word Count : 168

National Corpus of Polish

The National Corpus of Polish (Polish: Narodowy Korpus Języka Polskiego, NKJP) is the biggest and the most important corpus of the Polish language. A linguistic...

Word Count : 462

Europarl Corpus

The Europarl Corpus is a corpus (set of documents) that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release...

Word Count : 800

TalkBank

TalkBank is a multilingual corpus established in 2002 and currently directed and maintained by Brian MacWhinney. The goal of TalkBank is to foster fundamental...

Word Count : 210
