Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming

Authors

  • Abdul Matin

  • Tasnim Haider Chaudhury

  • M.S. Hossain

Keywords:

corpus, POS, tagging, stemming, lexicon

Abstract

In this paper, we develop a monolingual Bengali news corpus from several widely read Bengali newspapers using a knowledge-based Artificial Intelligence (AI) technique. It is intended as a reference corpus for lexicon development, morphological analysis, and automatic part-of-speech detection. The corpus contains 74,698 word forms. The words in the lexicon are annotated with a combination of manual tags covering part of speech, stem, morphemes, and other grammatical features, which are important for almost all Natural Language Processing (NLP) applications. The lexicon contains around 14 thousand entries. We also present a statistical analysis of text from Prothom-Alo, Daily Janakantha, Daily Kalerkantho, and Amardesh online, which are the most popular Bengali newspapers in Bangladesh, covering the period from 1 January 2012 to 31 January 2012. Finally, we propose a user-friendly software interface for annotating a large existing Bengali word set as part of the lexicon-building process.

How to Cite

Abdul Matin, Tasnim Haider Chaudhury, & M.S. Hossain. (2017). Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming. Global Journals of Research in Engineering, 17(J1), 5–12. Retrieved from https://engineeringresearch.org/index.php/GJRE/article/view/1597

Published

2017-01-15