Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming
Keywords:
corpus, POS, tagging, stemming, lexicon
Abstract
In this paper, we develop a monolingual Bengali news corpus from several widely read Bengali newspapers using a knowledge-based AI (Artificial Intelligence) technique. The corpus is intended as a reference corpus for lexicon development, morphological analysis, and automatic part-of-speech detection. It contains 74,698 word forms. The words in the lexicon are annotated with a combination of manually assigned tags covering part of speech, stem, morphemes, and other grammatical features, which are important for almost all Natural Language Processing (NLP) applications. The lexicon contains around 14 thousand entries. We also present a statistical analysis of text from Prothom-Alo, Daily Janakantha, Daily Kalerkantho, and Amardesh online, four of the most popular Bengali newspapers in Bangladesh, covering 1st January 2012 to 31st January 2012. Finally, we propose a user-friendly software interface for annotating a large existing Bengali word set as part of the lexicon-building process.
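As a rough illustration of the two artifacts the abstract describes (a corpus counted by word form, and a lexicon whose entries carry part-of-speech, stem, and morpheme annotations), a minimal Python sketch might look as follows. The class name `LexiconEntry`, the function `count_word_forms`, the tag label `NOUN`, and the sample sentence are all hypothetical and are not taken from the paper; the whitespace tokenizer is a simplifying assumption, not the authors' method.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexiconEntry:
    """One annotated lexicon entry (field names are illustrative assumptions)."""
    word_form: str                      # surface form as it appears in the corpus
    pos: str                            # part-of-speech tag (hypothetical tag set)
    stem: str                           # stem after removing inflectional suffixes
    morphemes: List[str] = field(default_factory=list)


def count_word_forms(text: str) -> Counter:
    """Count word forms using naive whitespace tokenization.

    Punctuation, including the Bengali danda, is stripped from token edges.
    """
    tokens = (tok.strip("।,.;:!?\"'()") for tok in text.split())
    return Counter(tok for tok in tokens if tok)


# Hypothetical sample text, not drawn from the corpus described above.
sample = "আমি ভাত খাই। আমি বই কিনি।"
counts = count_word_forms(sample)

# A hand-annotated entry: "বইগুলো" (the books) = stem "বই" + plural suffix "গুলো".
entry = LexiconEntry(word_form="বইগুলো", pos="NOUN", stem="বই",
                     morphemes=["বই", "গুলো"])

print(len(counts), "distinct word forms")
print(entry)
```

In a sketch like this, the number of distinct keys in `counts` corresponds to the corpus's word-form count, and the list of `LexiconEntry` objects corresponds to the annotated lexicon.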
Published
2017-01-15
License
Copyright (c) 2017 Authors and Global Journals Private Limited
This work is licensed under a Creative Commons Attribution 4.0 International License.