Tehran Monolingual Corpus information

The Tehran Monolingual Corpus (TMC) is a large-scale Persian monolingual corpus. TMC is suited to language modeling and related research areas in natural language processing.

The corpus is extracted from the Hamshahri Corpus and the ISNA news agency website. The quality of the Hamshahri Corpus text was improved for language modeling purposes through a series of tokenization and spell-checking steps.

TMC comprises more than 250 million words. The number of unique words in the corpus, counting only words with a frequency of two or more, is about 300 thousand, which is considered relatively good for a highly inflectional language like Persian.
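
As an illustration of the vocabulary statistic quoted above, the following minimal sketch counts the unique words that occur at least twice in a list of tokens. The function name `vocabulary_size`, the `min_freq` parameter, and the whitespace-based splitting of the sample sentence are assumptions made for this example; they are not taken from TMC's own tokenization pipeline, which is not described here.

```python
from collections import Counter


def vocabulary_size(tokens, min_freq=2):
    """Count distinct tokens whose frequency is at least min_freq.

    Hypothetical helper for illustration only; TMC's actual counting
    procedure may differ (for example in how tokens are normalized).
    """
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= min_freq)


# Toy sample, split on whitespace (an assumption; real Persian text
# usually needs a proper tokenizer).
sample_tokens = "کتاب را خواندم و کتاب را دوست داشتم".split()

# Words seen at least twice versus all distinct words.
frequent_words = vocabulary_size(sample_tokens, min_freq=2)
all_words = vocabulary_size(sample_tokens, min_freq=1)
print(frequent_words, all_words)
```

Dropping words that occur only once is a common way to keep misspellings and other one-off noise out of a vocabulary count, which is presumably why the figure above is given with that threshold.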

TMC was created by the Natural Language Processing Lab of the University of Tehran. The corpus is free for research use once permission has been obtained from the corpus aggregator.
