The client is a data-driven career discovery platform that helps students identify their dream career and develop the skills required to pursue it.
At the heart of the platform lies a well-researched scientific framework comprising various skills, interests and abilities, together with a deep understanding of which of these are required to perform well in a given profession. Taking unconventional assessments and specialised courses enables students to benchmark their strengths and develop skills or knowledge in a specific domain. Interacting with accomplished mentors from various walks of life, on the other hand, gives students first-hand exposure to what different career options are like and how to pursue them.
The client's sales team visits schools to enrol students on the platform. As part of enrolment, students are required to fill out an application form that captures their personal information, including demographic details, contact details and career interests. These application forms are filled in by hand by the students.
The forms are then brought to the client's central office, where a data entry team creates an entry in the platform for each application. As the scale of enrolment grew, this manual data entry constrained timely registration and increased data quality issues. It also became difficult to hire reliable data entry operators and to ensure standardized data capture.
The client therefore wanted an automated data extraction engine that could process scanned application forms and create a digital application record for each one.
ICR (Intelligent Character Recognition), powered by deep learning and image processing, was the underlying technology for digitizing the application forms. Our Machine Learning team worked closely with the client for the first couple of weeks to work out the solution framework. Our approach had to account for several potential issues, the key ones being:
- Handwritten text filled in by lakhs of students across the country varies widely in writing style and contains noise, for example where a character overlaps the edges of its box.
- Scanned forms can vary in resolution, which could affect the accuracy of the algorithms.
- Training datasets had to be created both for recognition of characters written in boxes and for recognition of free text.
Our team built the ICR platform using OpenCV to process the scanned application forms and TensorFlow-based deep learning to predict the characters in the images extracted from them. Each application form required its own OpenCV code to identify content zones and to extract individual character images (one image per character of a field). These character block images were then passed to deep-learning-based character recognition algorithms, which assigned a character label to each image.
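The extraction of character block images from a boxed field can be sketched as follows. This is a minimal, dependency-free illustration, not the project's actual OpenCV code: it assumes the field's position and box size are already known, that all boxes in a field have the same size, and that trimming a small margin is enough to drop the box edges. The names `crop`, `split_field_into_cells` and `margin` are hypothetical.

```python
from typing import List

# A grayscale image as rows of pixel values (0 = black, 255 = white).
Image = List[List[int]]


def crop(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the height x width sub-image whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def split_field_into_cells(
    image: Image,
    top: int,
    left: int,
    cell_h: int,
    cell_w: int,
    n_cells: int,
    margin: int = 1,
) -> List[Image]:
    """Cut a row of equally sized boxes into one image per character.

    `margin` pixels are trimmed from every side of each box so that the
    printed box edges are not passed on to the character classifier.
    """
    cells = []
    for i in range(n_cells):
        cell_left = left + i * cell_w  # left edge of the i-th box
        cells.append(
            crop(
                image,
                top + margin,
                cell_left + margin,
                cell_h - 2 * margin,
                cell_w - 2 * margin,
            )
        )
    return cells
```

With OpenCV, a loaded image is a NumPy array, so the same crops would normally be written as array slices; plain lists are used here only to keep the sketch self-contained.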
To train these character recognition algorithms, our team created labelled training sets both artificially and through crowdsourced initiatives that asked individuals to provide hundreds of different versions of each character. Capturing these variations allowed us to improve classification accuracy on an ongoing basis. We also agreed upon a minimum resolution for scanning application forms, along with other guidelines for keeping scanned images as noise-free as possible.
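The text does not say how the artificial training samples were produced. One common approach, shown here purely as an assumption, is to generate shifted copies of each labelled character image so the classifier sees characters that are not perfectly centred in their box. The function names `shift` and `augment` and the `max_shift` parameter are hypothetical.

```python
from typing import List, Tuple

# A grayscale image as rows of pixel values (0 = black, 255 = white).
Image = List[List[int]]


def shift(image: Image, dx: int, dy: int, fill: int = 255) -> Image:
    """Move the image dx pixels right and dy pixels down, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + dy, c + dx
            # Pixels pushed outside the frame are dropped.
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = image[r][c]
    return shifted


def augment(image: Image, label: str, max_shift: int = 1) -> List[Tuple[Image, str]]:
    """Return every shift of the image within +/- max_shift, each with its label."""
    offsets = range(-max_shift, max_shift + 1)
    return [(shift(image, dx, dy), label) for dy in offsets for dx in offsets]
```

For example, `augment(image, "A", max_shift=1)` yields nine labelled samples: the original plus eight one-pixel shifts.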
The first version of the solution was developed over six months of extensive research and trials. In the last iteration, the character recognition algorithms achieved 99.5 percent accuracy.
Although the solution is not yet fully automated, the platform has reduced dependence on data entry operators, who now review a sample of application forms for quality checks rather than entering every form manually.
The data entry team can now process 10 times more forms per hour than before. We continue to work on improving accuracy and on adding form-level prediction confidence measures, so that a certain percentage of forms can be processed straight through without manual review.
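A form-level confidence measure matters because per-character accuracy compounds across a form. As an illustration only, if a form held 100 characters and errors were independent, a 99.5 percent per-character accuracy would give a fully correct form with probability of about 0.995^100 ≈ 0.61. The sketch below shows one possible way to route forms, assuming the classifier returns a confidence between 0 and 1 for each character. Scoring a form by its least confident character and the 0.98 threshold are both assumptions for illustration, not the measures used in the project.

```python
from typing import Sequence

# Hypothetical cut-off; in practice it would be tuned on reviewed forms.
REVIEW_THRESHOLD = 0.98


def form_confidence(char_confidences: Sequence[float]) -> float:
    """Score a form by its least certain character; an empty form scores 0.0."""
    return min(char_confidences, default=0.0)


def route_form(
    char_confidences: Sequence[float],
    threshold: float = REVIEW_THRESHOLD,
) -> str:
    """Send confident forms straight through and everything else to review."""
    if form_confidence(char_confidences) >= threshold:
        return "straight_through"
    return "manual_review"
```

Taking the minimum is a deliberately conservative choice: a single doubtful character is enough to send the whole form to an operator.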