February 8, 2017

Executive Summary @ a Finger Click – A Technocrat View

Text summarization is the process of automatically creating a compressed version of a given document while preserving its information content and intent. A summary is a short piece of text, produced from one or more paragraphs or documents, that conveys the important information in the original. It is generally no more than half the length of the original, and ideally much shorter. A summary should retain the important information and be controllable, succinct and short. The link between a text element in the summary and its position in the original text should also be easy to trace.

For example, the Ramayana, which contains 24,000 verses (Shlokas) in 500 chapters (Sargas) organized as 7 volumes (Kandas), can be summarized as:

  • Built – Rama built the bridge.
  • Hit – Rama killed Ravana.
  • Brought – Rama brought back Sita.

Summarization Approaches

The three approaches to automatic summarization are extraction, abstraction and aided summarization.

Extraction: Extractive summarization methods simplify the problem of summarization by selecting a representative subset of the sentences in the original documents.

Abstraction: Abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might write. Such a summary might contain words not explicitly present in the original.

Aided Summarization: Machine learning techniques from closely related fields such as information retrieval and text mining have been successfully adapted to support automatic summarization, guided by human suggestions and corrections.

The process of text summarization can be decomposed into ‘Analysis’ - analyzing the input text/documents and selecting a few salient features, ‘Transformation’ - transforming the results of the analysis into a summary representation, and ‘Synthesis’ - taking the summary representation, and producing an appropriate summary aligned to users’ specifications.
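The three stages above can be sketched as three small functions. This is only an illustrative sketch, not the engine described in this post: the function names, the word-frequency feature and the `max_sentences` parameter are all assumptions made for the example. Each selected sentence is returned with its position in the original text, so the link back to the source stays traceable.

```python
import re
from collections import Counter

def analyze(text):
    """Analysis: split the text into sentences and count word frequencies."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))
    return sentences, freqs

def transform(sentences, freqs):
    """Transformation: score each sentence, keeping its original position."""
    scored = []
    for pos, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency is an assumed, deliberately simple feature.
        score = sum(freqs[t] for t in tokens) / len(tokens) if tokens else 0.0
        scored.append((pos, score, sentence))
    return scored

def synthesize(scored, max_sentences=1):
    """Synthesis: keep the best-scoring sentences, restored to document order."""
    best = sorted(scored, key=lambda item: item[1], reverse=True)[:max_sentences]
    return [(pos, sentence) for pos, _, sentence in sorted(best)]

def summarize(text, max_sentences=1):
    sentences, freqs = analyze(text)
    return synthesize(transform(sentences, freqs), max_sentences)
```

Calling `summarize(text, max_sentences=2)` returns a list of `(position, sentence)` pairs. The user-facing control here is just the number of sentences; a real tool would expose more options at the synthesis stage.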

Modules of Summarizing

A text summarization tool must have three components corresponding to the three stages above: a Text Preprocessor, a Text Analyzer and a Machine Language engine. Numerous algorithms are applicable to each of these components.

We have implemented a text summarization engine with algorithms such as ‘Term Frequency’, ‘K-Means’, ‘Self-Organizing Map’ and ‘Unsupervised Forward Selection’ (UFS). Below are a sample paragraph and the summarization results from these models with a threshold of 0.4.
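As a rough illustration of the term-frequency approach, the sketch below scores each sentence by the summed frequency of its words and keeps the top-scoring fraction. It rests on assumptions: the post does not define the threshold, so it is read here as the fraction of sentences to keep, and the stop-word list, the function names and the scoring rule are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "that"}

def tokenize(text):
    """Lowercase the text and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tf_summary(text, threshold=0.4):
    """Keep the top `threshold` fraction of sentences, ranked by term frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    freqs = Counter(tokenize(text))
    # Sentence score = sum of document-level frequencies of its words.
    scores = [sum(freqs[w] for w in tokenize(s)) for s in sentences]
    keep = max(1, int(len(sentences) * threshold))  # assumed meaning of threshold
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Under this reading, a threshold of 0.4 applied to a paragraph of five or six sentences keeps two of them. Summing frequencies favours long sentences, so dividing the score by sentence length is a common variation.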

Sample Data

Intelligent Sustenance Engineering is a service level unit that caters to ERS customers who are looking for smarter ways of maintenance, product management and sustenance.ISE which is a short for of Intelligent Sustenance Engineering comprises of several use case scenarios related to development, testing and software product support that are implemented using state of the art machine learning algorithms. Machine learning algorithms are typically data driven and rely on historical data repositories.Defect localization, test case recommendation, similar issue identification, and defect forecast are some of the popular use cases that are developed based on machine learning algorithms such as random forest, call graph, collaborative filtering and single value decomposition and ARIMA. R is the popular machine learning algorithm implantation language.

Python, Scala, Weka, Knime and Matlab are few other popular language tools for machine learning implementation. 

Result/Output: For the sample text above, the summaries below are the results from each model.

Term Frequency

ISE which is a short for of Intelligent Sustenance Engineering comprises of several use case scenarios related to development, testing and software product support that are implemented using state of the art machine learning algorithms.

Defect localization, test case recommendation, similar issue identification, and defect forecast are some of the popular use cases that are developed based on machine learning algorithms such as random forest, call graph, collaborative filtering and single value decomposition and ARIMA

K-Means

ISE which is a short for of Intelligent Sustenance Engineering comprises of several use case scenarios related to development, testing and software product support that are implemented using state of the art machine learning algorithms.

Defect localization, test case recommendation, similar issue identification, and defect forecast are some of the popular use cases that are developed based on machine learning algorithms such as random forest, call graph, collaborative filtering and single value decomposition and ARIMA

Self-Organizing Map

Defect localization, test case recommendation, similar issue identification, and defect forecast are some of the popular use cases that are developed based on machine learning algorithms such as random forest, call graph, collaborative filtering and single value decomposition and ARIMA.

R is the popular machine learning algorithm implantation language

UFS with weighted TF

ISE which is a short for of Intelligent Sustenance Engineering comprises of several use case scenarios related to development, testing and software product support that are implemented using state of the art machine learning algorithms.

Defect localization, test case recommendation, similar issue identification, and defect forecast are some of the popular use cases that are developed based on machine learning algorithms such as random forest, call graph, collaborative filtering and single value decomposition and ARIMA

We shall share more implementation details and experiences related to running time, summary output comparison, various technology platforms/tools (R, Python…), and the suitability of deep learning strategies in the next blog.

References:

  1. https://en.wikipedia.org/wiki/Automatic_summarization