Challenges
- Wide variety of document formats and internal data structures (e.g. PDF, Excel, JSON) across different service providers
- Frequent changes to document data structures
- Obtaining data of sufficient quality and quantity for model training
- Anonymizing data before it is used for model training
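
As an illustration of the anonymization challenge, the following is a minimal sketch of masking personal data before text is used for training. It assumes that only email addresses and phone-like numbers need masking; the patterns, the placeholder tokens, and the `anonymize` function are hypothetical and are not taken from the project itself.

```python
import re

# Hypothetical patterns for illustration; a real pipeline would need
# patterns (or an NER model) tuned to the documents it processes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 (555) 123-4567 for invoice 42."
    print(anonymize(sample))
```

Short numbers such as the invoice number are left alone because the phone pattern requires at least nine characters that start and end with a digit.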
Solutions
- A custom tool for extracted data review, approval, and pushing to destination systems
- A data lake and data warehouse for collecting and managing the data extracted from documents
- Clustering of documents by vendor and document type
- A user interface for the team to label data for ML model training
- A tool for document classification and named entity recognition
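
The clustering item above can be pictured with a small sketch. It assumes each document has already been reduced to the set of field names extracted from it, and it groups documents whose field sets overlap strongly. This is a simple greedy threshold clustering chosen for illustration, not necessarily the method used in the project; `jaccard`, `cluster_by_fields`, and the sample field names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_fields(
    docs: Dict[str, Set[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedily assign each document to the first cluster whose
    representative field set is at least `threshold` similar;
    otherwise start a new cluster with the document as representative."""
    clusters: List[Tuple[Set[str], List[str]]] = []
    for doc_id, fields in docs.items():
        for representative, members in clusters:
            if jaccard(fields, representative) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((set(fields), [doc_id]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    docs = {
        "a": {"invoice_no", "total", "vat"},
        "b": {"invoice_no", "total", "due_date"},
        "c": {"tracking_id", "weight"},
    }
    print(cluster_by_fields(docs))
```

In this sample, documents "a" and "b" share two of four distinct fields (similarity 0.5) and land in one cluster, while "c" shares none and forms its own. The result depends on input order and on the threshold, which is the usual trade-off of a greedy approach.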
Results
- Ability to process documents from industry-leading service providers
- Manual effort associated with data extraction decreased by 60%
- Automated document processing