# Introduction

The significance of a large annotated corpus is widely recognized. It is an important tool for researchers in Machine Translation (MT), Information Retrieval (IR), Speech Processing, Knowledge Based Computer Systems and Natural Language Processing (NLP). For the Bengali language, however, no large annotated corpus is available. Corpus creation, together with the distribution and availability of language resources, is essential for enhancing language processing capabilities and research in this field [1]. A corpus is also an essential language resource for automatically creating a dictionary from a huge collection of language text [2]. It is the central repository of data for all language processing applications, and it contains information for human consumption as well as for computer programs. Corpus building is an active research area: the work in [3, 4, 5] focuses on automatic Bangla corpus creation through the combination of Bangla fonts. The book "Corpus Linguistics and Language Technology" [6] is a storehouse of corpus-related studies with special attention to Bangla, and it discusses almost every linguistic feature of the language. In this paper, we present a reference corpus and a new approach to language investigation, to show how a text corpus database can be used to obtain new results about a language or its properties. A Bangla corpus can be used for several purposes, including spell checkers and morphological analysis for the Bangla language.

A Bangla word list can be extracted systematically from a Bangla corpus, since the corpus is considered a source of all words, which is important for verifying Bangla sentence structure [7, 8]. This paper proposes another process for manually building up a corpus, which is essentially a list of all words in the language, and for tagging the words with features such as word meaning, Parts-Of-Speech (POS) and other grammatical features. All of this information needs to be stored in a database and properly formatted before it is displayed to end users. The aim of the project is to formalize a procedure for a collaborative effort by different individuals or groups towards producing a tagged Bangla corpus. This requires a POS tagging interface, both web-based and standalone, that provides a common platform on which different contributors can enter tag information, semantic information and the other grammatical information that is available in a dictionary. We therefore used the web, a huge source of text, as the source of our corpus data.


# Web as Corpus Source

The use of the web as a corpus for teaching and research on language has been proposed a number of times, and the Computational Linguistics journal has published a special issue on the Web as Corpus. Several studies have used different methods to mine web data. Our aim was to create an annotated Bangla text corpus containing text from the most popular and widely read newspapers of Bangladesh, organized by date (1 January 2012 to 31 January 2012) and by category (Sports, Crime, Editorial, International, National, State, Business), so as to make it representative of the linguistic phenomena of Bangla. The project relies on the large amount of text available in electronic format. Because good Bangla OCR applications for collecting text from printed books, journals and newspapers are lacking, we restricted ourselves to collecting corpus text from the resources available to us, mainly the web. We selected four newspapers that are available online and used them to create a news corpus. This news corpus contains 74698 word tokens and 13550 distinct words.


# Data Collection

# a) Collection of the Raw Text

Many newspapers in our country have their own online versions. We chose four of them, Prothom-Alo, Daily Janakantha, Daily Kalerkantho and Amardesh online, mainly because they are widely read in Bangladesh and contain fewer spelling mistakes. The raw text for the corpus was collected by downloading all the news available from these newspapers for 1 January to 31 January 2012, including magazines and periodicals, all of which were in HTML format. Collecting all of this data manually took about one month. At this point we had news for thirty days, with each day having several text files that contained news of different genres. The corpus size is eighty megabytes.


# b) Conversion to UTF-8 Format

We then manually converted all of these text files to UTF-8 format, so that the data can be read correctly by our corpus creator program, which accepts only UTF-8 formatted text for processing.


# c) Classification of Collected Data

For quick extraction of information about a UTF-8 formatted Bangla text file, we save the text files in an arranged folder structure in which the folder names give information such as the date of collection, the source and source type, and the genre of the data. An example is E:\08102011\news.prothom-alocom\crime\file_001.txt, as shown in Fig. 1.
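As a minimal sketch of how such folder names could be read back into metadata, the following Java snippet splits a path like the example above into its parts. The class name CorpusPathSketch, the map keys, and the reading of the third folder as "source type, dot, source" are assumptions made for illustration; they are not part of our corpus creator program.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives metadata from the arranged folder names.
final class CorpusPathSketch {

    // Expects a Windows-style path ending in <date>\<type.source>\<genre>\<file>.
    static Map<String, String> describe(String path) {
        String[] parts = path.split("\\\\"); // regex matching one backslash
        if (parts.length < 4) {
            throw new IllegalArgumentException("path has too few folders: " + path);
        }
        int n = parts.length;
        String sourceFolder = parts[n - 3];
        int dot = sourceFolder.indexOf('.');

        Map<String, String> info = new LinkedHashMap<>();
        info.put("date", parts[n - 4]); // kept as text; the date layout is not assumed
        // Assumption: the text before the first dot is the source type.
        info.put("sourceType", dot < 0 ? "" : sourceFolder.substring(0, dot));
        info.put("source", dot < 0 ? sourceFolder : sourceFolder.substring(dot + 1));
        info.put("genre", parts[n - 2]);
        info.put("file", parts[n - 1]);
        return info;
    }
}
```

Keeping the date as plain text avoids guessing the day/month order that the folder name encodes.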


# d) Database/System Design

We  


# Text Pre-Processing

In this step, we remove the following information from the chosen data as pre-processing.

A. First, we replace English letters and digits (both Bangla and English) in the UTF-8 text data by a single space.

str=str.replaceAll("[[\u0000-\u007F][\u09E6-\u09EF]?""&&[^?!?''-\\.]]"," ");

B. Then we replace all punctuation marks (except Purnacched, Colon, Question Mark, Exclamation Mark, Apostrophe, Dash/Hyphen and Dot) by a single space.

str=str.replaceAll("[\\p{Punct}&&[^\\.?!?'-]]"," ");

C. We also replace every Dot by a single space, except those that are preceded by a letter and followed by a space.

D. We also replace every Dash/Hyphen by a single space, except those that are both preceded and followed by a letter. A hyphen inside a compound word is therefore kept, while a hyphen that stands apart from the surrounding words is eliminated.

str = str.replaceAll(" -| - |- "," ");

E. We also replace every Apostrophe/Inverted Comma/Quotation Mark by a single space, except those that are both preceded and followed by a letter. Quotation marks that enclose a phrase are therefore eliminated, as is an apostrophe at the start or end of a word.

str = str.replaceAll(" '| ' |' "," ");

F. We replace every sequence of one or more spaces by a single space.

str = str.replaceAll("( )+", " ");

G. The output file is then saved as "D:\PreProcessed_File.txt", and this pre-processed text is used for POS tagging, stemming and morphological analysis.

H. This pre-processed text file is referred to as our Raw Corpus. It will be useful for evaluating the performance of the annotated text corpus and as input text for a machine translation system.
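Steps D to F can also be sketched with lookaround patterns, which state the "preceded and followed by a letter" condition directly. This is an illustrative alternative, not the set of patterns used in our program. The class name PreprocessSketch is hypothetical, and treating Unicode letters and combining marks (\p{L} and \p{M}) as "alphabet" is an assumption; marks are included because Bangla vowel signs are combining marks rather than letters.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of steps D-F using lookarounds.
final class PreprocessSketch {

    // A hyphen that lacks a letter (or vowel sign) on at least one side.
    private static final Pattern LOOSE_HYPHEN =
        Pattern.compile("(?<![\\p{L}\\p{M}])-|-(?![\\p{L}\\p{M}])");

    // An apostrophe that lacks a letter (or vowel sign) on at least one side.
    private static final Pattern LOOSE_APOSTROPHE =
        Pattern.compile("(?<![\\p{L}\\p{M}])'|'(?![\\p{L}\\p{M}])");

    private static final Pattern SPACES = Pattern.compile(" +");

    static String clean(String text) {
        String s = LOOSE_HYPHEN.matcher(text).replaceAll(" ");   // step D
        s = LOOSE_APOSTROPHE.matcher(s).replaceAll(" ");         // step E
        s = SPACES.matcher(s).replaceAll(" ");                   // step F
        return s.trim(); // trimming the ends is an extra convenience
    }
}
```

With this sketch, a call such as PreprocessSketch.clean("well-known text - here") keeps the hyphen inside "well-known" and drops the free-standing one.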

We propose a user-friendly software interface with which users annotate a large existing Bangla word set as part of the lexicon build-up process. This effort is a significant step towards the development of a properly annotated lexicon. The user interface has two distinct parts: one for building the corpus and showing text information such as source, source type, category, date, title and news content, and another for manually assigning Parts-of-Speech, stems, morphemes, clitics, ambiguous conditions and other grammatical features.

A supervised machine learning method has been used for lexicon development from the Bengali news corpus. No extensive knowledge of the language is required, apart from knowledge of the different inflections that can appear with different words in Bengali. To annotate each word form properly, we accompany it with its POS tag, stem form, suffix, prefix, ambiguous condition (if any) and statistical count. Initially, all the words (inflected and uninflected) are extracted from the pre-processed text and added to a database with the proper POS, stem, prefix and suffix. The system retrieves the words from the pre-processed text and creates a database of distinct, fully annotated word forms. The structure of the lexicon development process and some sample word forms are shown in Fig. 2.


# Fig.2: Structure of Lexicon development

Part-of-Speech (POS) tagging is the process of assigning an appropriate part-of-speech tag to each word of a text. POS tags often signify the morphological [9], phonological and contextual properties of a word, and also provide information about neighboring words. In Bengali there are five different POS, namely noun, pronoun, verb, adjective and indeclinable (prepositions, conjunctions and interjections). Noun, verb and adjective belong to the open class of POS in Bengali. In this lexicon analysis we use seven parts of speech, derived from the five main parts of speech of the Bangla language. Because many proper nouns are used in Bangla, we keep proper nouns distinct from other nouns, which gives more detail about the word forms in the noun range. To handle clitics, one of the most common sources of ambiguity in Natural Language Processing (NLP), we define a new POS named clitic. If a word form is added to our lexicon database without any POS, our corpus creator software automatically sets its POS to UNKNOWN. Nouns and verbs are tagged by looking at their inflections. Some inflections are common to several word forms, and in these cases more than one POS may be generated for a word form. Here, however, the mechanism assigns only one POS to a word form.

We only suggest a procedure to handle this ambiguity at an initial level, where POS ambiguity is resolved by checking the number of occurrences of the possible root words along with the POS tags derived from the same word forms. Pronouns and indeclinables are basically closed classes of POS [10] in Bengali, and these are added to the lexicon manually. It has been observed that adjectives in Bengali generally occur in four different forms, depending on the suffixes attached. To simplify counting and detecting sentences, we propose a user-defined POS named EOL (end of line). We assign the POS EOL to the Purnacched, Question Mark and Exclamation Mark. A short description of the POS categories is given in Table I.
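The count-based choice described above could be sketched as follows, assuming the candidate POS tags of a word form and their occurrence counts are already available in a map. The class and method names are hypothetical, and falling back to UNK when there is no candidate is an assumption made for the sketch.

```java
import java.util.Map;

// Hypothetical sketch: pick the candidate POS with the highest occurrence count.
final class PosChoiceSketch {

    static String mostFrequentPos(Map<String, Integer> countsByPos) {
        String best = "UNK"; // assumed fallback when no candidate has been seen
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : countsByPos.entrySet()) {
            // Ties keep the first candidate met in the map's iteration order.
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

This only covers the initial-level resolution; it does not look at the context of the word in the sentence.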

Stemming is an operation that splits a word into its constituent root part and affix without doing a complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. Terms with common stems tend to have similar meanings, so stemming can drastically reduce the dictionary size used in various NLP applications, especially for highly inflected languages. We handle stemming manually, as with the POS tagging process described above. After tokenizing the pre-processed text into individual word forms, we manually set the root of each word form by removing its prefix and suffix. This stem form is then stored in the STEM field of the lexicon database.

In our process, we first strip the suffix from a Bengali word, depending on the type of suffix. We then check the validity of the suffix-stripped word as a root word using a Bengali dictionary. If this is not sufficient, we strip the prefix from the remaining part of the word form. Stemming brings word forms with the same root together in a series, which makes them easier to study, and retrieving the words that share a root/stem form yields nearly similar words.
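In our work this stripping is done manually. Purely as an illustration, an automated approximation of the same order of checks (suffix first, dictionary validation, then prefix) might look like the sketch below. The class name StemSketch, the affix lists and the dictionary set are hypothetical inputs supplied by the caller.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of suffix-then-prefix stripping with dictionary validation.
final class StemSketch {

    // Returns {prefix, stem, suffix}; prefix and suffix are "" when nothing is stripped.
    static String[] split(String word, List<String> suffixes,
                          List<String> prefixes, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return new String[] {"", word, ""};
        }
        for (String suf : suffixes) {
            if (suf.isEmpty() || !word.endsWith(suf) || word.length() <= suf.length()) {
                continue;
            }
            String rest = word.substring(0, word.length() - suf.length());
            if (dictionary.contains(rest)) {
                return new String[] {"", rest, suf}; // suffix stripping was sufficient
            }
            for (String pre : prefixes) {
                if (pre.isEmpty() || !rest.startsWith(pre) || rest.length() <= pre.length()) {
                    continue;
                }
                String core = rest.substring(pre.length());
                if (dictionary.contains(core)) {
                    return new String[] {pre, core, suf};
                }
            }
        }
        return new String[] {"", word, ""}; // no valid root found; keep the word whole
    }
}
```

A word that cannot be validated against the dictionary is returned unchanged, which leaves the final decision to the annotator.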

A morpheme is the smallest meaningful linguistic unit, consisting of a word or a word element that cannot be divided into smaller meaningful parts. When we set the stem for a word form, we also store the stripped parts of the word form as morphemes. First, we split off the suffix of the word and store it in our lexicon database. Then we check whether the remaining part of the word form has a prefix. If it does, we strip it from the remaining part and store it in the database, entirely manually.

From a linguistic point of view, a corpus is defined as a collection of transcribed speech or written text compiled mainly to enhance linguistic research. The key resource for any linguistic research is a trained, annotated corpus, which can improve language processing capabilities such as automatic part-of-speech tagging, machine translation, question answering and stemming.

We design and develop a view of the annotated corpus that is mainly based on knowledge-based representation (a knowledge-based AI technique). We use our lexicon as the knowledge reference for our corpus. The lexicon is the collection of fully annotated word forms, where each word form is accompanied by its part of speech, stem, morphemes (suffix, prefix) and statistical count. When we add a word form to the corpus, we fetch all of its morphological and grammatical information from the lexicon and attach it to the word form. The corpus procedure and flow are shown in Fig. 3.

1. First, we take the pre-processed text as the input for corpus creation.
2. We then tokenize this text into word forms and store all of them in an iterable list.
3. We loop over this list and get each word form in the order in which it appeared in the pre-processed text. For each word, the POS and stem are fetched from the lexicon database and appended to the word form, separated by a forward slash (/). We also make a small change to the lexicon: each time a word form is found, its count value in the lexicon is increased by 1. This gives the number of occurrences of each word form.
4. To mark the end of a sentence, we use EOL as the word form and EOL as the POS. If the sentence is an assertive sentence, we use AS as the stem.
5. A word form whose stem and POS are unknown is written with UNK in both fields, in the form word/UNK/UNK.
6. Before adding the tokenized word forms of the pre-processed text to the corpus, we add all the information associated with the new text to the raw corpus using some predefined tags. The tag format of our news corpus is given in Table II.
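The word-level part of this procedure can be sketched as below. It is a simplified illustration: the class AnnotateSketch, its Entry type and the in-memory map standing in for the lexicon database are all assumptions, sentence-end (EOL) handling and the predefined tags are left out, and unknown words are simply written as UNK instead of being sent to the user.

```java
import java.util.Map;

// Hypothetical sketch: annotate tokens as WORD/STEM/POS/ using an in-memory lexicon.
final class AnnotateSketch {

    // One lexicon record; count tracks how often the word form has been seen.
    static final class Entry {
        final String stem;
        final String pos;
        int count;

        Entry(String stem, String pos) {
            this.stem = stem;
            this.pos = pos;
        }
    }

    static String annotate(String preprocessedText, Map<String, Entry> lexicon) {
        StringBuilder out = new StringBuilder();
        for (String word : preprocessedText.trim().split(" +")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            Entry e = lexicon.get(word);
            if (e == null) {
                out.append(word).append("/UNK/UNK/");
            } else {
                e.count++; // occurrence count kept in the lexicon
                out.append(word).append('/').append(e.stem)
                   .append('/').append(e.pos).append('/');
            }
        }
        return out.toString();
    }
}
```

Because the count lives in the lexicon entry, the number of occurrences of a word form is available as soon as the whole text has been processed.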


# a) Statistical Analysis

Regardless of its size, a corpus may be subjected to both qualitative and quantitative analysis using various statistical methods. These two types of corpus analysis have different perspectives: quantitative analysis focuses on classifying different linguistic properties, whereas qualitative analysis aims to give a complete and detailed description of the observed phenomena. We focus on some simple quantitative analysis using a unigram (U-Gram) model.
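A unigram count of this kind can be sketched in a few lines. The class name UnigramSketch and the space-separated input are assumptions for illustration; the sketch computes each word form's share of all tokens as a percentage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: unigram counts and percentage of total tokens.
final class UnigramSketch {

    static Map<String, Integer> counts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split(" +")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Percentage of all tokens that are the given word form.
    static double percentage(Map<String, Integer> counts, String word) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        return 100.0 * counts.getOrDefault(word, 0) / total;
    }
}
```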

We developed our corpus development program in such a way that researchers can easily obtain many of the most commonly needed statistical outputs without any further processing. It also includes a user-defined output generator, with which users can get output that matches their requirements.

We divide the statistical output generator into two distinct parts:

* One for automated query-based information.
* Another for user-defined query-based information.

As the result of the automated query-based statistical output, we provide twelve statistical counts. These counts will be very helpful for linguistic analysis, machine translation, morphological analysis, and the analysis of spelling variations, morphological structure and word senses. The statistical counts include:

* The number of sources from which the corpus data were collected, and their list.
* The number of source types for these sources, and their list.

# Conclusion

A corpus is considered a basic resource for language analysis and research in many foreign languages. This reflects both ideological and technological change in the area of language research. This effort is a significant step towards the development of a properly annotated lexicon. The outcome of this research will be significantly helpful for future work on morphological analysis, automatic grammar extraction and machine translation for Bangla.


Fig. 3: Corpus procedure and flow. Output structure: WORD/STEM/POS. Flowchart labels: Preprocessed Text; Individual Word form; Write "WORD/"; Get Stem form; Is UNK?; Get Stem from User & Add to Database; Write "STEM/"; Get POS; Is UNK?; Get POS from User & Add to Database; Write "POS/"; Increase Database count by 1.

Table I: POS categories

| Tag Description | Tag Label |
|---|---|
| Noun (except Proper Noun) | NN |
| Proper Noun | PN |
| Adjective | ADJ |
| Adverb | ADV |
| Verb | VRB |
| Pronoun | PRO |
| Indeclinable (Preposition, Conjunction & Interjection) | IND |

Table II: Tag format of the news corpus

| Tag Name | Tag Description / Purpose |
|---|---|
| <ENTRY> | To define the start of a new news item/data |
| <SOURCE></SOURCE> | Source of data (www.prothom-alo.com) |
| <TYPE></TYPE> | Source type of data (news, blog) |
| <DATE></DATE> | Date of collection of data (11-01-12) |
| <CATAGORY></CATAGORY> | Genre of the data (sports, crime) |
| <TITLE></TITLE> | Title of news/data |
| <CONTENT></CONTENT> | Main content of the news |
| </ENTRY> | To define the end of this news item/data |

# Experiment and Data Analysis

The statistical counting of our annotated Bangla text corpus is shown in Table III. Our corpus program also provides some qualitative analysis, which aims to give a complete and detailed description of the observed phenomena, including word-level frequency analysis, the behavior of Bangla words and the use of non-Bangla words. This type of information can be obtained through the user-defined, query-based interface of the annotated text corpus program.

Table III: Statistical counting of the annotated Bangla text corpus

| Serial No | Information | Count |
|---|---|---|
| 1 | Number of sources | 4 |
| 2 | Number of source types | 1 |
| 3 | Number of fields/genres | 19 |
| 4 | Number of raw words | 74698 |
| 5 | Number of unique words | 13550 |
| 6 | Number of unique stem words | 1423 |
| 7 | Total number of sentences | 5472 |
| 8 | Number of assertive sentences | 5377 |
| 9 | Number of interrogative sentences | 72 |
| 10 | Number of exclamatory sentences | 23 |
| 11 | Number of clitics | 3 |
| 12 | Number of occurrences of clitics | 136 |

# b) Word Frequency Analysis

The study of frequency calculation can provide important information about the usage of words in a text.

Tables IV and V: Word frequency (columns: Word, Percentage).

Table VI: POS name and percentage

| POS Name | Percentage |
|---|---|
| NN | 56.43 |
| VRB | 20.53 |
| ADJ | 16.41 |
| PN | 13.71 |
| ADV | 5.94 |
| PRO | 3.39 |
| IND | 1.98 |
| CLK | 0.104 |
| UNK | 1.35 |

Table VII: Prefix and suffix percentages, in the order listed

| Prefix Percentage | Suffix Percentage |
|---|---|
| 9.30 | 15.70 |
| 4.07 | 15.43 |
| 2.23 | 8.53 |
| 2.23 | 4.72 |
| 1.74 | 4.63 |

Annotated Bangla News Corpus and Lexicon Development with POS Tagging and Stemming. Global Journal of Researches in Engineering (J), Volume XVII, Issue I, Version I, Year 2017. © 2017 Global Journals Inc. (US)
		
		
# References

[1] C. Cieri and M. Liberman, "Issues in Corpus Creation and Distribution: The Evolution of the Linguistic Data Consortium," University of Pennsylvania and Linguistic Data Consortium, Philadelphia, Pennsylvania, USA.

[2] J. Hasan, "Automatic dictionary construction from large collections of text," Master's thesis, School of Computer Science and Information Technology, RMIT University, 2001.

[3] Asif Iqbal Sarkar, Dewan Shahriar Hossain Pavel, and Mumit Khan, "Automatic Bangla Corpus Creation," BRAC University, Dhaka, Bangladesh.

[4] Khair Md. Yeasir Arafat Majumder, Md. Islam, Naushad UzZaman, and Mumit Khan, "Analysis of and Observations from a Bangla News Corpus," Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh.

[5] A. Bharati, R. Sangal, and S. M. Bendre, "Some Observations Regarding Corpora of Some Indian Languages," Proc. Intl. Conf. Knowledge Based Computer Systems (KBCS98), NCST, Mumbai, 17-19 Dec. 1998.

[6] N. S. Dash, "Corpus Linguistics and Language Technology," Mittal, New Delhi, 2005.

[7] Md. Nur Hossain Khan, Md. Khan, Md. Islam, Habibur Rahman, and Bappa Sarker, "Verification of Bangla Sentence Structure using N-Gram," Global Journal of Computer Science and Technology, Volume 14, Issue 1, Version 1.0, 2014.

[8] Md. Hanif Seddiqui, A. K. M. S. Rana, Abdullah Al Mahmud, and Taufique Sayeed, "Parts of speech tagging using morphological analysis in Bangla," Proceedings of the 6th International Conference on Computer and Information Technology (ICCIT), Bangladesh.

[9] M. M. Asaduzzaman and Muhammad Masroor Ali, "Morphological Analysis of Bangla Words for Automatic Machine Translation," International Conference on Computer and Information Technology (ICCIT) 2003, Jahangirnagar University, Dhaka, Bangladesh, 2003.

[10] Kristina Toutanova and Colin Cherry, "A global model for joint lemmatization and part-of-speech prediction," Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (ACL '09), Volume 1.