Handwritten Optical Character Extraction and Recognition from Catalogue Sheets
DOI: https://doi.org/10.14419/ijet.v7i4.5.20005
Published: 2018-09-22
Keywords: Character Extraction and Recognition, CNN, Erode, k-means, OCR
Abstract
The dataset consists of 20,000 scanned catalogue sheets of fossils and other artifacts compiled by the Geological Sciences Department. Each image resembles a scanned form filled in with a blue ballpoint pen. Character extraction and identification form the first phase of the research; in the second phase we plan to use a hidden Markov model (HMM) to extract the entire text from each form and store it in a digitized format. We used various image processing and computer vision techniques to extract characters from the 20,000 handwritten catalogues, including erosion (Erode), dilation (Dilate), MorphologyEx, Canny edge detection, contour finding (findContours), and contour area (contourArea). We used Histogram of Oriented Gradients (HOG) to extract features from the character images and applied k-means and agglomerative clustering for unsupervised learning, which would allow us to prepare a labelled training dataset for the second phase. We also tried converting images from RGB to CMYK to improve k-means clustering performance. In addition, we used thresholding to extract the blue ink characters after converting the image to the HSV color space, but the background noise was significant and the results were not promising. We are researching a more robust method that extracts characters without deforming them and takes alignment into consideration.
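The HSV thresholding and erosion steps mentioned in the abstract can be illustrated with a minimal sketch. It uses only the Python standard library so that the logic is visible; in practice the equivalent OpenCV operations would be applied to whole images. The hue, saturation, and value bounds, and the helper names `is_blue_ink`, `ink_mask`, and `erode`, are assumptions made for illustration and are not taken from the paper.

```python
import colorsys

# Assumed HSV bounds for blue ballpoint ink (hue in degrees, s and v in 0..1).
# These numbers are illustrative guesses, not values reported by the authors.
HUE_RANGE = (200.0, 260.0)
MIN_SATURATION = 0.35
MIN_VALUE = 0.20


def is_blue_ink(rgb):
    """Return True if an (r, g, b) pixel with 0..255 channels is in the assumed ink range."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue_deg = h * 360.0
    in_hue = HUE_RANGE[0] <= hue_deg <= HUE_RANGE[1]
    return in_hue and s >= MIN_SATURATION and v >= MIN_VALUE


def ink_mask(image):
    """Threshold a row-major grid of RGB pixels into a binary mask (1 = ink)."""
    return [[1 if is_blue_ink(px) else 0 for px in row] for row in image]


def erode(mask):
    """3x3 binary erosion: a pixel survives only if all nine neighbours are ink.

    Pixels on the image border are always cleared, because part of their
    neighbourhood lies outside the image and is treated as background.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if all(mask[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out
```

Erosion of this kind removes isolated specks of noise, but it also thins every stroke by one pixel on each side, which is one way the character deformation noted in the abstract can arise.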
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
Accepted 2018-09-21
Published 2018-09-22