1. Introduction

The significance of a large annotated corpus is widely recognized. It is an important tool for researchers in Machine Translation (MT), Information Retrieval (IR), Speech Processing, Knowledge-Based Computer Systems and Natural Language Processing (NLP). For the Bengali language, however, no large annotated corpus is available. The creation and distribution of corpora and other language resources, and their availability, are essential for enhancing language processing capabilities and research in this field [1]. A corpus is also an essential language resource for automatically creating a dictionary from a large collection of text [2]. It is the central repository of data for all language processing applications, and corpus development has become a major area of research. The works in [3, 4, 5] focus on automatic Bangla corpus creation by combining Bangla fonts. A corpus contains information for human consumption as well as for computer programs. The book "Corpus Linguistics and Language Technology" [6] is a storehouse of corpus-related studies with special attention to Bangla, and discusses almost every linguistic feature of the language. In this paper, we present a reference corpus and a new approach to language investigation, showing how a text corpus database can be used to obtain new results about a language and its properties. A Bangla corpus can be used for several purposes, including spell checkers and morphological analysis for the Bangla language.

A Bangla lexicon can be extracted systematically from a Bangla corpus, since the corpus is considered a source of all words, which is important for verifying Bangla sentence structure [7, 8]. This paper proposes another process to manually build up a corpus, which is essentially a list of all words in the language, and to tag the words with features such as word meaning, Part-of-Speech (POS) and all other grammatical features. All this information needs to be stored in a database and properly formatted before it is displayed to end users. The aim of the project is to formalize a procedure for a collaborative effort by different individuals or groups towards producing a tagged Bangla corpus. This requires a POS tagging interface, both web-based and standalone, that provides a common platform for different contributors to enter tag information as well as the semantic and other grammatical information that is available in a dictionary. We therefore used this large and rich source as the source of our corpus data.

3. Web as Corpus Source

The use of the web as a corpus for teaching and research on language has been proposed a number of times. The journal Computational Linguistics has devoted a special issue to the Web as Corpus, and several studies have used different methods to mine web data. Our aim was to create an annotated Bangla text corpus containing text from the most popular and widely read newspapers of Bangladesh, organized by date (1st January 2012 to 31st January 2012) and by category (Sports, Crime, Editorial, International, National, State, Business), so as to make it representative of the linguistic phenomena of Bangla. The project relied on the large amount of text available in electronic format. Because good Bangla OCR applications for collecting text from printed books, journals and newspapers are lacking, we restricted ourselves to collecting corpus text from the resources available to us, mainly the web. We selected four newspapers that are available online and used them to create a news corpus. This news corpus contains 74698 word tokens and 13550 distinct words.

5. Data Collection a) Collection of the Raw Text

Many newspapers in our country have their own online versions. Among them we chose four, Prothom-Alo, Daily Janakantha, Daily Kalerkantho and Amardesh online, mainly because they are widely read in Bangladesh and contain fewer spelling mistakes. The raw text for the corpus was collected by downloading all the news available on the web from these newspapers for 2012 (from 1st January to 31st January), including magazines and periodicals, all of which were in HTML format. Collecting all of this data manually took about one month. At this point we had news for thirty days, with each day having several text files that contained news of different genres. The corpus size is eighty megabytes.

6. b) Conversion to UTF-8 Format

We then manually converted all of these text files to UTF-8 format so that the data can be read correctly by our corpus creator program, which accepts only UTF-8 formatted text for processing.

7. c) Classification of Collected Data

For quick extraction of information about a UTF-8 formatted Bangla text file, we save the text files in an arranged folder structure in which the folder names give information such as the date of collection, the source and source type, and the genre of the data. Consider, for example, the path E:\08102011\news.prothom-alocom\crime\file_001.txt shown in Fig. 1.
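The folder convention described here can be read back mechanically. The following is a minimal Java sketch, assuming that the last four path components are always the collection date, the source, the genre and the file name. The class name `CorpusPath` and its fields are hypothetical and are not part of the original corpus creator program.

```java
final class CorpusPath {
    final String date;      // folder name holding the collection date, e.g. 08102011
    final String source;    // folder name holding source type and source
    final String genre;     // folder name holding the genre, e.g. crime
    final String fileName;  // name of the UTF-8 text file

    private CorpusPath(String date, String source, String genre, String fileName) {
        this.date = date;
        this.source = source;
        this.genre = genre;
        this.fileName = fileName;
    }

    // Assumption: the path ends with <date>\<source>\<genre>\<file>.
    static CorpusPath parse(String path) {
        // Accept both Windows and Unix separators.
        String[] parts = path.split("[\\\\/]");
        int n = parts.length;
        if (n < 4) {
            throw new IllegalArgumentException(
                "Expected <date>/<source>/<genre>/<file> but got: " + path);
        }
        return new CorpusPath(parts[n - 4], parts[n - 3], parts[n - 2], parts[n - 1]);
    }
}
```

Keeping the metadata in the path means that the tags of a news entry can be filled in without opening the file itself.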

8. d) Database/System Design

9. Text Pre-Processing

In this step, the following information is removed from the chosen data as pre-processing:

A. First, we replace English letters and numerals, in both Bangla and English, in the UTF-8 text data by a single space.

str = str.replaceAll("[[\\u0000-\\u007F][\\u09E6-\\u09EF]&&[^\\u0964!?'\\-.]]", " ");

B. Then we replace all punctuation marks (except Purnacched, Colon, Question Mark, Exclamation Mark, Apostrophe, Dash/Hyphen and Dot) by a single space.

str = str.replaceAll("[\\p{Punct}&&[^\\.?!?'-]]", " ");

C. We also replace every Dot by a single space, except those that are preceded by a letter and followed by a space.

D. We also replace every Dash/Hyphen by a single space, except those that are both preceded and followed by a letter.

str = str.replaceAll(" - | -|- ", " ");

????????-????,?? ??? ????????????????????? ?????? ? ????????? (here the Dash/Hyphen is not eliminated)

?????????????????????????????????????????? ?? ????????????????????????????????????? (here the Dash/Hyphen is eliminated)

E. We also replace every Apostrophe/Inverted Comma/Quotation Mark by a single space, except those that are both preceded and followed by a letter.

str = str.replaceAll(" '| ' |' ", " ");

'?????????? ??????,??????????? ? ??????' (here the quotation mark is eliminated)

?????? '????????????? ???????????????????? ???????????????????????????? (here the Apostrophe is eliminated)

F. We replace every sequence of one or more spaces by a single space.

str = str.replaceAll("( )+", " ");

G. The output file is then saved as "D:\PreProcessed_File.txt", and this pre-processed text is used for POS tagging, stemming and morphological analysis.

H. This pre-processed text file is referred to as our Raw Corpus. It is useful for evaluating the performance of the annotated text corpus and can also serve as input text for a machine translation system.
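Steps A to F can be combined into a single method. The following is a minimal Java sketch under stated assumptions, not the original program: the class name `BanglaPreprocessor` is hypothetical, step A follows the prose (only English letters, English digits and Bangla digits are replaced), the retained punctuation in step B follows the prose list, and "letter" is taken to mean a Unicode letter or combining mark so that Bangla vowel signs count as part of a word. Steps C, D and E use lookarounds, which is a swapped-in technique for the "preceded and followed by a letter" conditions.

```java
final class BanglaPreprocessor {
    // Assumption: a "letter" is a Unicode letter or combining mark,
    // because Bangla vowel signs are marks rather than letters.
    private static final String ALPHA = "[\\p{L}\\p{M}]";

    static String preprocess(String text) {
        String s = text;
        // A. English letters, English digits and Bangla digits (U+09E6 to U+09EF).
        s = s.replaceAll("[A-Za-z0-9\\u09E6-\\u09EF]", " ");
        // B. ASCII punctuation except . ? ! ' - and colon.
        //    The Purnacched (U+0964) is not ASCII punctuation and is left untouched.
        s = s.replaceAll("[\\p{Punct}&&[^.?!'\\-:]]", " ");
        // C. Keep a dot only when it is preceded by a letter and followed by a space.
        s = s.replaceAll("(?<!" + ALPHA + ")\\.|\\.(?! )", " ");
        // D. Keep a hyphen only when it stands between two letters.
        s = s.replaceAll("(?<!" + ALPHA + ")-|-(?!" + ALPHA + ")", " ");
        // E. Keep an apostrophe only when it stands between two letters.
        s = s.replaceAll("(?<!" + ALPHA + ")'|'(?!" + ALPHA + ")", " ");
        // F. Collapse runs of spaces.
        s = s.replaceAll(" +", " ");
        return s.trim();
    }
}
```

The final `trim()` is an addition for convenience; the steps in the text do not mention trimming leading or trailing spaces.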

We propose a user-friendly software interface with which the user can annotate a large existing Bangla word set for the lexicon build-up process. The effort will be significant progress towards the development of a properly annotated lexicon. The user interface has two distinct parts: one for building the corpus and showing text information such as source, source type, category, date, title and news content, and another for manually assigning Parts-of-Speech, stems, morphemes, clitics, ambiguous conditions and other grammatical features.

A supervised machine learning method has been used for lexicon development from the Bengali news corpus. No extensive knowledge of the language is required beyond knowledge of the different inflections that can appear with different words in Bengali. To annotate each word form properly, we accompany it with its POS tag, stem form, suffix, prefix, ambiguous condition (if any) and statistical count. Initially, all the words (inflected and uninflected) are extracted from the pre-processed text and added to a database with the proper POS, stem, prefix and suffix. The system retrieves the words from the pre-processed text and creates a database of distinct, fully annotated word forms. The structure of the lexicon development process and some sample word forms are shown in Fig. 2.

12. Fig.2: Structure of Lexicon development

Part-of-Speech (POS) tagging is the process of assigning an appropriate part-of-speech tag to each word of a text. POS tags often signify the morphological [9], phonological and contextual properties of a word, and also provide information about neighboring words. In Bengali there are five different POS, namely noun, pronoun, verb, adjective and indeclinable (prepositions, conjunctions and interjections). Noun, verb and adjective belong to the open class of POS in Bengali. In this lexicon analysis we use seven parts of speech, derived from the five main parts of speech of the Bangla language. Since many proper nouns are used in Bangla, we keep the proper noun distinct from other nouns to obtain more detail about word forms in the noun range. To handle clitics, which are one of the most common sources of ambiguity in Natural Language Processing (NLP), we define a new POS form named clitic. If a word form is added to our lexicon database without a POS, our corpus creator software automatically sets its POS to UNKNOWN. Noun and verb words are tagged by looking at their inflections. Some inflections may be common to several word forms, and in these cases more than one POS may be generated for a few word forms. Here, however, we set up the mechanism for only one POS per word form.

We only suggest a procedure to handle this ambiguity at an initial level, where POS ambiguity is resolved by checking the number of occurrences of the possible root words along with the POS tags derived from the same word forms. Pronouns and indeclinables are basically closed classes of POS [10] in Bengali, and these are added to the lexicon manually. It has been observed that adjectives in Bengali generally occur in four different forms based on the suffixes attached. To simplify counting and detecting sentences, we propose a user-defined POS named EOL (end of line). We assign the POS EOL to the Purnacched, the Question Mark and the Exclamation Mark. A short description of the POS categories is given in Table I.

Stemming is an operation that splits a word into its constituent root part and affix without performing a complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. Terms with a common stem tend to have similar meanings, so stemming can drastically reduce the dictionary size used in various NLP applications, especially for highly inflected languages. We handle the stemming process manually, like the POS tagging process described earlier. After tokenizing the pre-processed text into individual word forms, we manually set the root of each word form by removing its prefix and suffix. This stem form is then stored in the STEM field of the lexicon database.

In our process, we first strip the suffix part from a Bengali word, depending on the type of suffix. We then check the validity of the suffix-stripped word as a root word using a Bengali dictionary. If this is not sufficient, we strip the affix part from the remaining part of the word form. This brings the set of words with the same root form together in a series, making them easier to study. We can retrieve closely related words by retrieving the words that have the same root/stem form.

A morpheme is the smallest meaningful linguistic unit: a word or a word element that cannot be divided into smaller meaningful parts. When we set the stem for a word form, we also store the stripped parts of the word form as morphemes. First, we split off the suffix part of the word and store it in our lexicon database. Then we check whether the remaining part of the word form has a prefix. If it does, we strip it from the remaining part and store it in the database, entirely manually.
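Although stemming is done manually in this work, the order of operations described here (strip a suffix, validate the remainder against a dictionary, then try a prefix) can be illustrated in code. The following Java sketch is only an illustration under assumptions: the class `AffixStripper`, the `Result` type, and the suffix, prefix and dictionary collections passed to the constructor are all hypothetical and would have to be supplied by an annotator.

```java
import java.util.List;
import java.util.Set;

final class AffixStripper {
    // Outcome of one analysis: the stripped parts and whether the stem was found in the dictionary.
    static final class Result {
        final String prefix;
        final String stem;
        final String suffix;
        final boolean validated;

        Result(String prefix, String stem, String suffix, boolean validated) {
            this.prefix = prefix;
            this.stem = stem;
            this.suffix = suffix;
            this.validated = validated;
        }
    }

    private final Set<String> dictionary;  // known root words (assumed input)
    private final List<String> suffixes;   // candidate suffixes, tried in the given order
    private final List<String> prefixes;   // candidate prefixes, tried in the given order

    AffixStripper(Set<String> dictionary, List<String> suffixes, List<String> prefixes) {
        this.dictionary = dictionary;
        this.suffixes = suffixes;
        this.prefixes = prefixes;
    }

    Result analyze(String word) {
        if (dictionary.contains(word)) {
            return new Result("", word, "", true);
        }
        for (String suffix : suffixes) {
            if (word.length() <= suffix.length() || !word.endsWith(suffix)) {
                continue;
            }
            // Step 1: strip the suffix and validate the remainder as a root word.
            String rest = word.substring(0, word.length() - suffix.length());
            if (dictionary.contains(rest)) {
                return new Result("", rest, suffix, true);
            }
            // Step 2: if that is not sufficient, try stripping a prefix from the remainder.
            for (String prefix : prefixes) {
                if (rest.length() > prefix.length() && rest.startsWith(prefix)) {
                    String core = rest.substring(prefix.length());
                    if (dictionary.contains(core)) {
                        return new Result(prefix, core, suffix, true);
                    }
                }
            }
        }
        // No validated split was found: leave the word form for manual annotation.
        return new Result("", word, "", false);
    }
}
```

A result with `validated` set to false corresponds to a word form that still needs manual attention; the stripped prefix and suffix correspond to the morphemes stored alongside the stem.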

From a linguistic point of view, a corpus is defined as a collection of transcribed speech or written text compiled mainly to support linguistic research. The key resource for any linguistic research is a trained, annotated corpus, which can elevate language processing capabilities such as automatic part-of-speech tagging, machine translation, question answering, stemming, etc.

We design and develop a view of the annotated corpus that is based mainly on knowledge-based representation (a knowledge-based AI technique). We use our lexicon as the knowledge reference for our corpus. Our lexicon is a collection of fully annotated word forms, where each word form is accompanied by its part of speech, stem, morphemes (suffix, prefix) and statistical count. When we add a word form to the corpus, we bring all the morphological and grammatical information from the lexicon and attach it to that word form. The corpus procedure and flow are shown in Fig. 3.

1. First, we take the pre-processed text as the input of our corpus creation.

2. We then tokenize this text into word forms. All these word forms are stored in a list that can be iterated.

3. The list is traversed so that each word form is obtained in the order in which it occurred in the pre-processed text. For each word, the POS and stem are fetched from the lexicon database and appended to the word form, separated by a forward slash (/). We then make a small change to the lexicon: we increase the count value of the word form by 1 each time we find it. This gives us the number of occurrences of the word form.

4. To mark the end of a sentence, we use EOL as the word form and EOL as the POS.

• If the sentence is an Assertive sentence, we use AS as the stem.

Example: ???????/UNK/UNK

6. Before adding the tokenized word forms of the pre-processed text to the corpus, we add all the information associated with the new text to the raw corpus, using some predefined tags. The tag format of our news corpus is given in Table II.
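The annotation loop in the steps above can be sketched in a few lines of Java. This is a minimal illustration under assumptions, not the original program: the lexicon is an in-memory map instead of a database, the class `CorpusAnnotator` and its methods are hypothetical, and a word form missing from the lexicon is written with UNK for both stem and POS instead of prompting the user.

```java
import java.util.HashMap;
import java.util.Map;

final class CorpusAnnotator {
    // One lexicon record: stem, POS tag and running occurrence count.
    private static final class Entry {
        final String stem;
        final String pos;
        int count;

        Entry(String stem, String pos) {
            this.stem = stem;
            this.pos = pos;
        }
    }

    // Assumption: an in-memory map stands in for the lexicon database.
    private final Map<String, Entry> lexicon = new HashMap<>();

    void addEntry(String word, String stem, String pos) {
        lexicon.put(word, new Entry(stem, pos));
    }

    int countOf(String word) {
        Entry e = lexicon.get(word);
        return e == null ? 0 : e.count;
    }

    // Writes every token as WORD/STEM/POS, joined by forward slashes.
    String annotate(String preprocessedText) {
        StringBuilder out = new StringBuilder();
        for (String token : preprocessedText.trim().split(" +")) {
            if (token.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append('/');
            }
            Entry e = lexicon.get(token);
            if (e == null) {
                // Unknown word form: tagged UNK until an annotator supplies stem and POS.
                out.append(token).append("/UNK/UNK");
            } else {
                e.count++;  // one more occurrence of this word form
                out.append(token).append('/').append(e.stem).append('/').append(e.pos);
            }
        }
        return out.toString();
    }
}
```

Under this sketch, end-of-sentence markers would simply be lexicon entries whose POS is EOL, so they need no special handling in the loop.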

13. a) Statistical Analysis

Regardless of its size, a corpus may be subjected to both qualitative and quantitative analysis using various statistical methods. The two types of corpus analysis have different perspectives: quantitative analysis focuses on classifying different linguistic properties, whereas qualitative analysis aims to give a complete and detailed description of the observed phenomena. We focus on some simple quantitative analysis using the U-Gram model.
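As an illustration of this kind of simple quantitative analysis, the following Java sketch counts how often each word form occurs and expresses a count as a percentage of all tokens. It assumes that "U-Gram" refers to a unigram (single-word) frequency model; the class `UnigramStats` and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class UnigramStats {
    // Counts occurrences of each token, keeping first-seen order.
    static Map<String, Integer> counts(String[] tokens) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String token : tokens) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);
            }
        }
        return freq;
    }

    // Share of the corpus taken by one word form, in percent.
    static double percentage(int count, int totalTokens) {
        if (totalTokens <= 0) {
            throw new IllegalArgumentException("totalTokens must be positive");
        }
        return 100.0 * count / totalTokens;
    }
}
```

The size of the returned map is the number of distinct word forms, and the sum of its values is the number of word tokens.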

We developed our corpus development program so that researchers can easily obtain many of the most common and most relevant statistical outputs without any further processing. There is also a user-defined output generator with which users can obtain output that matches their requirements.

We divide our statistical output generator procedure into two distinct parts:

• One for automated query-based information.

• Another for user-defined query-based information.

As the result of the automated query-based statistical output, we provide twelve statistical counts. This type of statistical counting will be very helpful for linguistic analysis, machine translation, morphological analysis, spelling variations, morphological structure and word sense analysis. These statistical counts are:

• Number of sources from which the corpus data were collected, and their list.

• Number of source types, and their list, for these sources of data.

A corpus is considered a basic resource for language analysis and research for many foreign languages. This reflects both ideological and technological change in the area of language research. The effort will be significant progress towards the development of a properly annotated lexicon. The outcome of the research will be significantly helpful for future analyzers in the processes of Morphological Analysis, Automatic Grammar Extraction and Machine Translation for Bangla.

Global Journal of Researches in Engineering ( J ) Volume XVII Issue I Version I, Year 2017

Annotated corpus format
Structure: WORD/STEM/POS
Example: ????/???/ADJ/???????/?????/NN/????/???/NN/????/????/PRO/???? /???? /ADJ/????/???/RB/????/???/ADV/?????/?????/NN/?????? ,?????/ADJ/???-???/??? ???/NN/??????/?????/NN/??????????/???????/NN/??????/??????/ADJ/???????/??????/NN/????/????/PRO/?? ?/?? ?/NN/???/???/VRB/????/???/VRB/???? / ???? /ADJ/??/??/VRB/EOL/AS/EOL

Corpus creation flow (flowchart labels): Preprocessed Text; Individual Word form; Write "WORD/"; Get POS; Database; Is UNK?; Yes; No; Write "POS/"; Get POS from User & Add to Database; Get Stem form; Is UNK?; Write "STEM/"; Get Stem from User & Add to Database; Increase count by 1.

Table I: POS categories
Tag Description	Tag Label
Noun (Except Proper Noun)	NN
Proper Noun	PN
Adjective	ADJ
Adverb	ADV
Verb	VRB
Pronoun	PRO
Indeclinable (Preposition, Conjunction & Interjection)	IND

		Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming
Table II: Tag format of the news corpus
Tag Name	Tag Description / Purpose
<ENTRY>	To define the start of a new news information/data.
<SOURCE></SOURCE>	Source of data (www.prothom-alo.com).
<TYPE></TYPE>	Source type of data (news, blog).
<DATE></DATE>	Date of collection of data (11-01-12).
<CATAGORY></CATAGORY>	Genre of the data (sports, crime).
<TITLE></TITLE>	Title of the news/data.
<CONTENT></CONTENT>	Main content of the news.
</ENTRY>	To define the end of this news information/data.

VII. Experiment and Data Analysis

Statistical counting of our annotated Bangla text corpus is shown in Table III. Our corpus program also provides some qualitative analysis, which aims to give a complete and detailed description of the observed phenomena, including word-level frequency analysis, the behavior of Bangla words, the use of non-Bangla words, etc. This type of information can be obtained through the user-defined query-based interface of the annotated text corpus program.

b) Word frequency Analysis

Study of frequency calculation can provide important information about the usage of words in a

© 2017 Global Journals Inc. (US)

Table III
Serial No	Information	Count
1	Number of sources	4
2	Number of source types	1
3	Number of fields/genres	19
4	Number of raw words/numbers	74698
5	Number of unique words	13550
6	Number of unique stem words	1423
7	Total number of sentences	5472
8	Number of assertive sentences	5377
9	Number of interrogative sentences	72
10	Number of exclamatory sentences	23
11	Number of clitics	3
12	Number of occurrences of clitics	136

Table IV
Word	Percentage	Word	Percentage
à¦?"	1.78	?	0.39
??	1.34	??	0.30
???	1.21	???	0.30
??	1.15	????	0.26
???	0.95	??	0.17
???	0.52	???	0.08
??	0.47	? ????	0.069

Table V
Word	Percentage	Word	Percentage
???	0.4	???	0.16
??	0.25	??	0.15
???	0.23	??	0.13
????	0.20	??	0.13
????	0.18	??	0.13
????	0.16	???	0.13
???	0.16	??	0.10
??	0.16	???	0.10

Table VI
POS Name	Percentage
NN	56.43
VRB	20.53
ADJ	16.41
PN	13.71
ADV	5.94
PRO	3.39
IND	1.98
CLK	0.104
UNK	1.35

Table VII
Prefix	Percentage	Suffix	Percentage
?	9.30	?	15.70
??	4.07	??	15.43
??	2.23	?	8.53
???	2.23	?	4.72
?	1.74	?	4.63
