Abstract: In optical character recognition (OCR), line, word, and character segmentation are among the most important stages. Imperfect segmentation causes most recognition systems to produce poor recognition rates. In this paper we discuss approaches for line, word, and character segmentation of printed Assamese documents. Some work has been done on optical character recognition for other Indian scripts; for the Assamese language, however, such work is almost negligible. This work is therefore an attempt to report on the segmentation of documents containing Assamese script. We first describe the structure of the Assamese script, then discuss ideas for segmenting lines, words, and characters from degraded Assamese documents, and finally review various existing recognition techniques.
Keywords: Histogram, line segmentation, word segmentation, character segmentation, HMM, SVM.
I. INTRODUCTION
Optical character recognition (OCR) is a technique for reading optical documents and converting the text in these documents into machine-recognizable codes. For proper optical character recognition, the start and end of each line in the document must be identified and segmented correctly. If a line is not segmented correctly, the words and characters in it are also segmented improperly, which lowers their recognition rate. Segmentation of machine-printed documents in Assamese, Bangla, and Devanagari (Hindi) script is a crucial and difficult task. This is because the characters are joined by a headline (matra), and there is considerable overlapping and joining between vowels and consonants, which makes segmentation complex.
Several techniques for segmenting printed text are detailed in [1, 3, 5, 6, 12, 23]. Based on these techniques, an OCR system for the Assamese language has been developed. Although some work on Assamese numerals has already been done, we believe this is the first attempt for the alphabet.
The rest of the paper is organized as follows: Section II discusses a few properties of Assamese script, and Section III describes the segmentation process. Recognition approaches are discussed in Section IV. Finally, the conclusion is provided in Section V.
Fig. 1. Assamese vowels.
Fig. 2. Assamese consonants.
Fig. 3. Assamese conjunct characters.
II. ASSAMESE SCRIPT PROPERTIES
Assamese is one of the 22 languages recognized by the Indian constitution and is spoken by 20 million people in the eastern region of India. The structure of the Assamese language is closely related to Bangla and Oriya. Assamese script consists of 41 consonants and 11 vowels. Like most Indian scripts, Assamese script is highly cursive. It is written from left to right and has no distinction between small and capital letters. Although the basic character set has only 52 symbols, Assamese script has a huge number of composite characters, which makes automatic recognition difficult.
Fig. 4. Original printed document.
Fig. 5. Original printed document after binarization.
III. SEGMENTATION PROCESS
For recognition, some pre-processing operations must be performed on the scanned documents. Pre-processing involves the following operations: thresholding (converting a gray-scale image to a binary image), noise removal, line segmentation (separating individual lines of text), word segmentation (separating individual words from a text line), and character segmentation (separating individual characters).
A. Thresholding
To increase processing speed and reduce storage space, thresholding is applied to the color image (Fig. 4) or gray-scale image. The main purpose of thresholding is to separate the foreground from the background using the information in the histogram of the image. The histogram of a gray-scale image typically shows two peaks, one at high gray values and one at low gray values. The higher values correspond to the white background and the lower values represent the foreground (Fig. 5).
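The histogram-based separation described above can be sketched as follows. The text does not name a specific thresholding algorithm; this sketch assumes Otsu's method, one common way of choosing a threshold between the two histogram peaks. The function names `histogram`, `otsu_threshold`, and `binarize`, and the 8-bit gray range, are illustrative assumptions.

```python
def histogram(gray, levels=256):
    """Count how many pixels take each gray value (0 .. levels-1)."""
    hist = [0] * levels
    for row in gray:
        for value in row:
            hist[value] += 1
    return hist


def otsu_threshold(gray, levels=256):
    """Return the gray value t that maximizes the between-class variance
    of the two classes {pixels <= t} and {pixels > t} (Otsu's method,
    assumed here; the paper only says the histogram is used)."""
    hist = histogram(gray, levels)
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


# Tiny illustrative image: light background with two dark pixels.
gray = [[250, 250, 20],
        [240, 30, 245]]
t = otsu_threshold(gray)
binary = binarize(gray, t)
```

The sketches of later stages assume the same convention as `binarize`: foreground pixels are 1 and background pixels are 0.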
B. Noise Removal
The noise present in typed or machine-printed documents is generally introduced by writing instruments or optical scanning devices, and it may cause disconnected line segments, gaps within lines, etc. Several noise removal techniques are available; they can be categorized as follows.
Fig. 6. Horizontal histogram for printed document.
Fig. 7. Line segmentation for printed document.
1) Filtering: Median filtering is a popular technique for removing noise such as salt-and-pepper noise. The filter slides a 3×3 window over the image and assigns each pixel a value computed as a function of the gray values of its neighboring pixels. Filtering can be used for smoothing, thresholding, sharpening, etc.
2) Morphological Operations: Morphological operations are a class of algorithms that transform the geometric structure of an image. They can be used for operations such as edge detection and restoration, and can be designed to connect broken strokes, decompose connected strokes, etc.
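As an illustration of the filtering step, the following is a minimal sketch of a 3×3 median filter on an image stored as a list of rows. The function name `median_filter_3x3` and the border handling (the window is simply clipped at the image edges) are assumptions made for this example, not details given in the text.

```python
def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighborhood.
    At the borders the window is clipped to the image (an assumption)."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(height, y + 2))
                      for xx in range(max(0, x - 1), min(width, x + 2))]
            window.sort()
            out[y][x] = window[len(window) // 2]
    return out


# A white page (255) with a single isolated black speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter_3x3(noisy)
```

An isolated speck is outvoted by its neighbors and disappears, which is why the median filter suits salt-and-pepper noise.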
C. Line Segmentation
A white space separates a text line from the previous and following ones. Using the horizontal projection method, we compute the horizontal histogram of the image (Fig. 6); based on the high and low points of the histogram, each text row of the document image can be separated individually (Fig. 7).
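A minimal sketch of this projection-based line segmentation is given below. It assumes a binary image in which 1 marks ink and 0 marks background; the names `horizontal_projection` and `segment_lines` and the `min_ink` parameter are hypothetical.

```python
def horizontal_projection(binary):
    """Number of ink pixels in each row (the horizontal histogram)."""
    return [sum(row) for row in binary]


def segment_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges of text lines, bottom exclusive.
    A row belongs to a text line when it has at least min_ink ink pixels."""
    profile = horizontal_projection(binary)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y                    # entering a text line
        elif count < min_ink and start is not None:
            lines.append((start, y))     # leaving a text line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


page = [[0, 0, 0],
        [1, 1, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 0],
        [1, 0, 1],
        [0, 0, 0]]
lines = segment_lines(page)
```

Raising `min_ink` above 1 is one simple way to tolerate a few residual noise pixels in the white space between lines.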
D. Word Segmentation
The word segmentation method assumes that the gaps between words are always larger than the gaps between characters. We therefore perform a vertical projection to compute the vertical histogram of a text line (Fig. 8) and determine the words (Fig. 10) from the distance between consecutive zero-valued breaks of the histogram (Fig. 9): breaks that reach the threshold distance (D) are treated as word boundaries, while shorter breaks are treated as gaps between characters.
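The thresholded gap rule can be sketched as follows for one text line stored as a binary image (1 for ink). The names `vertical_projection` and `segment_words` are hypothetical, and treating a gap of at least `d` empty columns as a word boundary is an assumption; the text does not say whether the comparison is strict.

```python
def vertical_projection(line_img):
    """Number of ink pixels in each column (the vertical histogram)."""
    width = len(line_img[0])
    return [sum(row[x] for row in line_img) for x in range(width)]


def segment_words(line_img, d):
    """Return (left, right) column ranges of words, right exclusive.
    A run of at least d empty columns separates two words (assumed rule)."""
    profile = vertical_projection(line_img)
    words = []
    start = None      # first column of the current word
    last_ink = None   # most recent column that contained ink
    for x, count in enumerate(profile):
        if count == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= d:
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


line = [[1, 1, 0, 1, 0, 0, 0, 1, 1]]
words_wide = segment_words(line, d=2)
words_narrow = segment_words(line, d=1)
```

With `d=2` the one-column gap is kept inside the first word; with `d=1` every gap splits, which shows how sensitive the result is to the choice of D.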
E. Character Segmentation
Several techniques have been proposed for character segmentation. One of the methods is based on detecting the headline (matra) present in Assamese or Bangla script. The horizontal projection of the word image box is first computed. The row with the highest number of black pixels is the headline (Fig. 11).
Fig. 8. Vertical histogram.
Fig. 9. Threshold distance (D) between two corresponding words.
Fig. 10. Word segmentation.
Once the headline has been detected (Fig. 12), it can be erased by setting all black pixels in the headline region to white (Fig. 13). The columns that contain no black pixels are then treated as boundaries for extracting the image boxes corresponding to characters. Fig. 14 shows the character segmentation.
However, headline detection is sometimes not an easy task, especially for printed text lines with leading and trailing ligatures. In such cases the normalization of subsequent feature values is affected. To overcome this problem, simple heuristics are used to detect the headline first, and the features are computed afterwards.
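The headline-based procedure can be sketched as below for a single word image (1 for ink, 0 for background). The function names and the `thickness` parameter are assumptions for illustration; in particular, the text does not say how many rows make up the headline region, so a single row is assumed by default.

```python
def find_headline_row(word_img):
    """Row with the most ink pixels, taken to be the headline (matra)."""
    profile = [sum(row) for row in word_img]
    return profile.index(max(profile))


def erase_headline(word_img, row, thickness=1):
    """Return a copy with the headline rows set to background (0)."""
    out = [r[:] for r in word_img]
    for y in range(row, min(len(out), row + thickness)):
        out[y] = [0] * len(out[y])
    return out


def segment_characters(word_img, thickness=1):
    """Erase the headline, then cut at columns that contain no ink.
    Returns the headline row and (left, right) boxes, right exclusive."""
    row = find_headline_row(word_img)
    cleaned = erase_headline(word_img, row, thickness)
    width = len(cleaned[0])
    col_ink = [sum(r[x] for r in cleaned) for x in range(width)]
    boxes, start = [], None
    for x, count in enumerate(col_ink):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            boxes.append((start, x))
            start = None
    if start is not None:
        boxes.append((start, width))
    return row, boxes


word = [[0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1],   # headline joining the characters
        [1, 0, 1, 0, 0, 1, 0],
        [1, 1, 1, 0, 1, 1, 0]]
headline, boxes = segment_characters(word)
```

Without erasing the headline, every column of `word` would contain ink and no character boundary would be found.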
IV. RECOGNITION APPROACHES
Recognition is the most critical phase of an optical character recognition (OCR) system, in which the character recognition system uses classification methodologies to assign a test sample to one of the known classes. A few approaches to pattern recognition are discussed below.
A. Template Matching
This approach is the most basic method of recognizing characters. The main idea is to match stored training templates against the features extracted from the test character. Matching techniques can be divided into three classes.
1) Direct Matching: In direct matching, the test character is compared directly with the training template for that character.
2) Deformable Template and Elastic Matching: In this technique, a database of known images is created in advance, and an image deformation is applied in order to match the unknown image.
3) Relaxation Matching: This is a symbolic-level image matching technique that uses a feature-based description of the character image.
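Of the three classes above, direct matching is the simplest to illustrate. The sketch below assumes that the test glyph and all templates are binary images of identical size and uses the count of differing pixels as the dissimilarity; the names `mismatch` and `classify_by_template` and the template labels are hypothetical.

```python
def mismatch(a, b):
    """Number of pixel positions where two same-sized binary images differ."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify_by_template(glyph, templates):
    """Return the label of the stored template closest to the glyph."""
    return min(templates, key=lambda label: mismatch(glyph, templates[label]))


# Two toy 3x3 templates with made-up labels.
templates = {
    "bar":  [[0, 1, 0],
             [0, 1, 0],
             [0, 1, 0]],
    "ring": [[1, 1, 1],
             [1, 0, 1],
             [1, 1, 1]],
}
glyph = [[0, 1, 0],
         [0, 1, 0],
         [1, 1, 0]]   # a "bar" with one noisy pixel
label = classify_by_template(glyph, templates)
```

Direct matching of this kind is sensitive to shifts and size changes, which is one motivation for the deformable and relaxation variants.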
Fig. 11. The highest horizontal histogram profile always represents the headline.
Fig. 12. Headlines detected according to the histogram profile.
B. Syntactic or Structural Matching
In the syntactic approach, a comparison is made between the syntax of a language and the structure of the pattern. The structural matching technique describes the construction of a pattern from its primitives. A character classified by the CR system can be represented as a union of its structural primitives.
In , a 'structural and syntactic approach' has been employed for the recognition of Bangla words. The matra is detected using the structural approach, and feature-based recognition is then applied to recognize the residual character.
C. Neural Network
In a neural network (NN), the interconnected 'neural' processors operate in parallel; owing to this computing architecture, an NN can run faster than classical techniques. Neural networks are also adaptive: they adapt easily to changes in the data and to the characteristics of the input signal.
However, neural networks have some limitations. The data and parameters used in an NN must all be defined at the beginning. If additional data that was not present at the beginning arrives during processing, it cannot be incorporated, and developers must start the whole process again. This makes NNs unsuitable for systems in which data are collected as a time series.
Fig. 13. After the headline is deleted from each text line.
Fig. 14. Character segmentation on the basis of vertical projection.
D. Statistical Classifier
In the statistical technique, decision boundaries between pattern classes are established using concepts from statistical decision theory. The method identifies which of a set of categories a new observation belongs to, on the basis of a training set of observations whose category membership is known.
Statistical methods are of two types: parametric and non-parametric. Parametric methods categorize handwriting samples using a set of parameters selected on the basis of the training data. Non-parametric methods are based on direct estimation from the training data.
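As a small example of a non-parametric method, the sketch below classifies a feature vector by majority vote among its k nearest training samples (k-NN). The two-dimensional feature vectors, the class labels, and the function name `knn_classify` are invented for illustration and are not taken from any experiment in this paper.

```python
from collections import Counter


def knn_classify(sample, training, k=3):
    """Label a sample by majority vote among its k nearest training items.
    training is a list of (feature_vector, label) pairs; squared Euclidean
    distance is used, which gives the same neighbor ranking as Euclidean."""
    def sq_dist(item):
        features, _ = item
        return sum((a - b) ** 2 for a, b in zip(sample, features))

    nearest = sorted(training, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two features per character image.
training = [
    ((0.10, 0.20), "ka"),
    ((0.20, 0.10), "ka"),
    ((0.15, 0.15), "ka"),
    ((0.90, 0.80), "kha"),
    ((0.80, 0.90), "kha"),
]
first = knn_classify((0.12, 0.18), training)
second = knn_classify((0.85, 0.85), training)
```

No model parameters are fitted here: the decision is estimated directly from the stored training samples, which is what distinguishes this family from parametric methods.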
Hidden Markov Models (HMM), k-Nearest Neighbors (k-NN), the Bayes decision rule, and Support Vector Machines (SVM) are a few important examples of statistical recognition techniques.
1) HMM: Among these techniques, the HMM is the most widely used in handwriting recognition. HMMs can also solve the segmentation problem implicitly. Furthermore, the recognition accuracy of HMMs is very high compared with other techniques, and an HMM can represent other knowledge sources with the same structure. Because an HMM is a stochastic model, it copes more easily with noise and distortions in handwriting. A stroke-based recognition approach, in which strokes are identified using HMMs, is employed in . The concept of HMM is also used in  for online handwritten word recognition.
2) SVM: The SVM is a relatively new classifier that is used in many pattern recognition applications with good generalization performance. It has been used in recent years as an alternative to popular methods such as neural networks. The advantage of the SVM is that it takes into account both experimental data and structural behavior for better generalization capability, based on the principle of structural risk minimization (SRM). In  a recognition technique based on SVM has been discussed.
Apart from single classifiers, hybrid classifiers are also used for recognition to obtain better results. A hybrid HMM-MLP approach, which combines a Hidden Markov Model with a multilayer perceptron classifier, is used in .
V. CONCLUSION AND FUTURE WORK
In this work, we discussed the basic phases and approaches of segmentation and recognition of printed documents and how they could be implemented for Assamese script. Each approach has its own advantages and disadvantages. To the best of our knowledge, this is one of the first works on segmentation of machine-printed documents in Assamese script. In future, we will try to develop a convenient recognition system for Assamese script.
REFERENCES
[1] K. Wong, R. Casey, and F. Wahl, "Document Analysis System," IBM J. Res. Dev., vol. 26, no. 6, pp. 647-656, 1982.
[2] S. K. Parui, K. Guin, U. Bhattacharya, and B. B. Chaudhuri, "Online Handwritten Bangla Character Recognition using HMM," Proc. of 19th Int. Conf. on Patt. Recog., 2008.
[3] F. Hones and J. Lichter, "Layout extraction of mixed mode documents," Machine Vision and Applications, vol. 7, pp. 237-246, 1994.
[4] Ashok Bandyopadhyay and Basabi Chakrabati, "Development of Online Handwriting Recognition System: A case study with Handwritten Bangla Character," IEEE Trans, World Congress on Nature & Biologically Inspired Computing, pp. 514-519, 2009.
[5] G. Nagy, S. Seth, and M. Viswanathan, "A prototype document image analysis system for technical journals," Computer, vol. 25, pp. 10-22, 1992.
[6] Vijay Kumar and Pankaj K. Senegar, "Segmentation of Printed Text in Devnagari Script and Gurmukhi Script," IJCA: International Journal of Computer Applications, vol. 3, pp. 24-29, 2010.
[7] Ahmed Shah Mashiyat, Ahmed Shah Mehadi, and Kamrul Hasan Talukder, "Bangla off-line Handwritten Character Recognition Using Superimposed Matrices," 7th ICCIT, 2004.
[8] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, and D. K. Basu, "Segmentation of Offline Handwritten Bengali Script," Proc. of 28th IEEE ACE, pp. 171-174, Dec. 2002, Science City, Kolkata.
[9] U. Bhattacharya, M. Shridhar, and S. K. Parui, "On Recognition of Handwritten Bangla Characters," in P. Kalra and S. Peleg (Eds.): ICVGIP 2006, LNCS 4338, pp. 817-828, Springer-Verlag Berlin Heidelberg, 2006.
[10] Réjean Plamondon and Sargur N. Srihari, "On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, Jan. 2000.
[11] Nafiz Arica and Fatos T. Yarman-Vural, "An Overview of Character Recognition Focused on Off-Line Handwriting," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 31, no. 2, May 2001.
[12] U. Pal and Sagarika Datta, "Segmentation of Bangla Unconstrained Handwritten Text," Proc. 7th Int. Conf. on Document Analysis and Recognition, pp. 1128-1132, 2003.
[13] U. Mahadevan and R. C. Nagabhushanam, "Gap Metrics for Word Separation in Handwritten Lines," Proc. Third Int'l Conf. Document Analysis and Recognition (ICDAR '95), pp. 124-127, Montreal, Aug. 1995.
[14] G. Seni and E. Cohen, "External Word Segmentation of Off-Line Handwritten Text Lines," Pattern Recognition, vol. 27, no. 1, pp. 41-52, 1994.
[15] N. Joshi, G. Sita, A. G. Ramakrishnan, V. Deepu, and S. Madhvanath, "Machine Recognition of Online Handwritten Devanagari Characters," Proc. of the Eighth International Conference of Document Analysis and Recognition, vol. 2, pp. 1156-1160, 2005.
[16] A. R. Ahmad, M. Khalid, C. V. Gaudin, and E. Poisson, "Online Handwriting Recognition using Support Vector Machine," IEEE Region 10 Conference, vol. 1, pp. 311-314, 2004.
[17] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical Pattern Recognition: A Review," IEEE Trans Pattern Analysis and Machine Intelligence, 20(1), 4-38, 2000.
[18] F. Wang, L. Vuurpijl, and L. Schomaker, "Support Vector Machines for the classification of western handwritten capitals," Proc. 7th Int. Workshop on Frontiers in Handwriting Recognition, Amsterdam, Netherlands, pp. 167-176, 2000.
[19] X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition. Edinburgh University Press, 1990.
[20] A. Kumar and S. Bhattacharya, "Online Devanagari Isolated Character Recognition for the iPhone using Hidden Markov Model," Proceedings of the IEEE Students' Technology Symposium, pp. 300-304, 2010.
[21] Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt, "On-line Handwriting Recognition with Support Vector Machines: A Kernel Approach," Proc. of IWFHR, vol. 02, 2002.
[22] Ahmad Sanmorino and Setiadi Yazid, "A survey for Handwritten Signature Verification," IEEE Trans Int. Conf. on Uncertainty Reasoning and Knowledge Engineering, pp. 54-57, 2012.
[23] Nallapareddy Priyanka, Srikanta Pal, and Ranju Mandal, "Line and Word Segmentation Approach for Printed Documents," IJCA Special Issue on "Recent Trends in Image Processing and Pattern Recognition" RTIPPR, pp. 31-33, 2010.