Tech Mahindra

Deep Knowledge Discovery & Information Retrieval From Unstructured Data

Posted by: Hareesh Kumar P S on June 12, 2018, 04:21 PM

Experts around the world estimate that about 80 to 90 percent of the data in any organization is unstructured (it can be in any format: documents, emails, web pages, digital archives, etc.). More importantly, the amount of unstructured data in enterprises is growing significantly day by day, often many times faster than structured databases.

Here’s what leading analysts say about unstructured data growth:

  • IDG: Unstructured data is growing at the rate of 62% per year.
  • IDG: By 2022, 93% of all data in the digital universe will be unstructured.
  • Gartner: Data volume is set to grow 800% over the next 5 years and 80% of it will reside as unstructured data.

Source: Reference 1

Many organizations believe that their unstructured data stores hold untapped information that could help them make better business decisions. Unfortunately, unstructured data is often very difficult to analyse because it comes in high volume and with high variance.

The high-priority problems that most companies expect to be addressed on unstructured data include Knowledge Extraction, Information Retrieval, Custom Entity Extraction, Domain-Specific Language Modelling, etc. Until now, companies had to either assign highly skilled staff to go through the documents manually and collate the relevant information, or write a rule-based engine to extract it for them. The drawbacks of such an approach include wasted expert hours, long completion times and a high head count.

Thanks to Machine Learning, Deep Learning (particularly Recurrent Neural Networks (RNNs)), Natural Language Processing (NLP), Natural Language Understanding (NLU) and other cognitive tools, including Gensim and Apache Lucene, we now have trained systems that do such high-volume, high-precision jobs better than humans. This helps companies glean actionable information that can help a business succeed in a competitive environment. The benefits include cost and time savings, accurate and high-quality results, and better utilization of resources.

A report valued the global deep learning market at $272 million in 2016. The market has grown continuously since 2014, and the latest report states that it will reach $10.2 billion by the end of 2025.

We foresee a similar pace in the real-world application of the latest deep learning technologies and techniques: RNNs (specifically LSTM) for custom entity extraction and language modelling, natural language processing for knowledge extraction, and Gensim and Lucene for information retrieval, in any vertical or domain. One such implementation of deep learning by Tech Mahindra, for an engineering customer, is discussed below:

Implementation # 1: Aerospace Supplier Shortlisting system using Deep Knowledge Discovery & Information Retrieval

Problem Statement: The current supplier shortlisting procedure involves an expert or a team of experts manually reading, line by line, the supplier assessment documents that analysts capture in an unstructured format for each supplier (the structure of the assessment documents varies across analysts and suppliers). This process identifies targeted meta details such as Strengths, Recommendations, Weakness, Observations, Engineering Functional, Capabilities, etc., which help the company in decision-making (shortlisting top suppliers). The experts assigned to the job need deep domain knowledge to collate the data manually with maximum precision. The other approach currently in use is rule-based detection using micro automation jobs, but this works only if the structure or pattern is consistent across the document corpus. The problem the company faced was that both the volume of data to be processed and the variance in document structure were high. The company had to invest heavily in resources, and it took a long time to get the desired outcome. Because the manual effort required to collate the data was huge, the output was prone to errors.
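To illustrate why rule-based detection depends on a consistent document structure, here is a minimal sketch in Python. The pattern, the section labels it looks for and the two sample documents are assumptions made up for this example; they are not taken from the actual micro automation jobs described above.

```python
import re

# Hypothetical rule: a section is a line shaped exactly like "Label: content".
# The labels are borrowed from the meta details named in the problem statement.
SECTION_PATTERN = re.compile(
    r"^(Strengths|Weakness|Recommendations|Observations):\s*(.+)$",
    re.MULTILINE,
)

def extract_sections(text):
    """Return {label: content} for every line that matches the rule."""
    return {label: content.strip() for label, content in SECTION_PATTERN.findall(text)}

# Invented sample assessments, for illustration only.
consistent_doc = (
    "Strengths: On-time delivery record\n"
    "Weakness: Limited test capacity\n"
)
varied_doc = (
    "The supplier's main strength is its on-time delivery record.\n"
    "WEAKNESSES - limited test capacity\n"
)

print(extract_sections(consistent_doc))  # both sections are found
print(extract_sections(varied_doc))      # nothing matches the rule
```

The second document carries the same information, but because another analyst phrased and laid it out differently, the rule finds nothing. Each new layout would need another hand-written rule.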

Solution: Implement a cognitive system with multiple modules, each addressing a different area of the problem statement. The modules include:

  • Knowledge discovery/extraction: identifying, annotating and building domain-specific models using deep learning techniques, specifically a Recurrent Neural Network (Long Short-Term Memory), and extracting the targeted entities used for shortlisting suppliers based on the training done.
  • Information retrieval from the collated data through user-defined natural language queries, using Gensim and Apache Lucene.
  • Data pre-processing and extraction using NLTK.
  • Opinion mining using machine learning.

Using built-in deep learning capabilities and a custom orchestrator built in Python, the system was able to process a huge volume of high-variance data in reduced time and with zero manual intervention.
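The idea behind retrieving information with a natural language query can be sketched as follows. This is not the system described above: it swaps Gensim and Apache Lucene for a small TF-IDF and cosine-similarity ranking written with the Python standard library only, and the function names and sample assessment snippets are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Compute a TF-IDF vector (dict of term -> weight) for each document."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    # Smoothed IDF so terms present in every document keep a small positive weight.
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return idf, vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs, idf, vectors):
    """Return (score, document) pairs ranked by similarity to the query."""
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: c * idf[t] for t, c in q_counts.items()}
    scored = [(cosine(q_vec, v), d) for v, d in zip(vectors, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Invented assessment snippets, for illustration only.
assessments = [
    "Supplier A has strong composite manufacturing capabilities.",
    "Supplier B shows weakness in delivery schedules.",
    "Supplier C is recommended for avionics testing.",
]
idf, vectors = build_index(assessments)
results = search("which supplier has composite manufacturing strength",
                 assessments, idf, vectors)
print(results[0][1])  # the Supplier A snippet ranks first
```

A production setup would more likely rely on an inverted index and a tuned scoring function, as provided by libraries such as Lucene, rather than comparing the query against every document vector.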

Business Benefits:

  • Better management of the information captured in the supplier assessment document corpus.
  • Fast and easy searching through the assessments for knowledge discovery.
  • Faster processing and exploration of high-volume, high-variety data, saving considerable effort and time.
  • Faster decision making with the help of data exploration modules (Compare, Predict, and Search).


About Author

Hareesh Kumar P S, Principal Solution Architect – AI COE

Hareesh is a technocrat and a Desiring Innovation evangelist at Tech Mahindra with over 15 years of diverse experience in the IT industry. In his career he has conceptualized and built many ahead-of-the-curve, business-value-driven solutions using emerging technologies. Hareesh is a core Principal Solution Architect in Tech Mahindra’s AI competency and supports multiple verticals and clients across all regions. He is experienced and specialized in conceptualizing, architecting and delivering solutions across diverse streams, including Artificial Intelligence (Deep Learning), Machine Learning, Automation, Big Data Analytics, Data Mining, Data Aggregation and Enterprise Applications.

Tags: Connected Platforms & Solutions