With large amounts of data (especially unstructured text) collected every day, a significant challenge is to organize, search, and understand these vast quantities of text. This document collection analysis scenario demonstrates an efficient and automated end-to-end workflow for analyzing a large document collection and enabling downstream NLP tasks.
The key elements delivered by this scenario are:
1. Learning salient multi-word phrases from documents, as sketched in code after this list.
2. Discovering the underlying topics present in the document collection.
3. Representing each document by its topical distribution.
4. Presenting methods for organizing, searching, and summarizing documents based on the topical content.
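As a rough sketch of items 1–3, the snippet below uses gensim's `Phrases` and `LdaModel` as stand-ins; the scenario's walk-through implements its own phrase learning and topic modeling, so the library choice, the toy corpus, and all parameter values here are illustrative assumptions rather than the scenario's actual settings.

```python
# Illustrative only: gensim stands in for the scenario's own implementation.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.phrases import Phrases

# Toy corpus: each document is a list of lowercase tokens.
docs = [
    ["machine", "learning", "models", "for", "text", "analysis"],
    ["machine", "learning", "applied", "to", "document", "collections"],
    ["supreme", "court", "ruling", "on", "the", "new", "legislation"],
    ["congress", "passed", "the", "new", "legislation", "last", "week"],
]

# 1. Learn salient multi-word phrases (e.g. "machine_learning").
phrases = Phrases(docs, min_count=1, threshold=1.0)
phrased_docs = [phrases[doc] for doc in docs]

# 2. Discover latent topics in the collection with LDA.
dictionary = Dictionary(phrased_docs)
corpus = [dictionary.doc2bow(doc) for doc in phrased_docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=20, random_state=0)

# 3. Represent each document by its distribution over topics.
for i, bow in enumerate(corpus):
    print(f"doc {i}:", lda.get_document_topics(bow, minimum_probability=0.0))
```

The per-document topic vectors produced in the last step are what the organizing, searching, and summarizing methods of item 4 operate on.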
The methods presented in this scenario can enable a variety of critical industrial workloads, such as discovering anomalies in topic trends, summarizing a document collection, and searching for similar documents. They can be applied to many different types of documents, such as government legislation, news stories, product reviews, customer feedback, and scientific research articles.
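As one concrete example, once documents are represented by topic distributions, similar document search reduces to a nearest-neighbor lookup in topic space. The sketch below uses cosine similarity over made-up topic vectors; both the metric and the numbers are illustrative assumptions, not the scenario's implementation.

```python
# Hypothetical similar-document search over topic-distribution vectors,
# e.g. the per-document LDA outputs from the previous sketch.
import numpy as np

# Each row is one document's distribution over 3 topics (rows sum to 1).
doc_topics = np.array([
    [0.80, 0.15, 0.05],  # doc 0: mostly topic 0
    [0.75, 0.20, 0.05],  # doc 1: mostly topic 0 (similar to doc 0)
    [0.05, 0.10, 0.85],  # doc 2: mostly topic 2
])

def most_similar(query_idx, vectors, top_n=2):
    """Return (doc_index, cosine_similarity) pairs, most similar first."""
    q = vectors[query_idx]
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(q)
    sims = vectors @ q / np.maximum(norms, 1e-12)
    ranked = [i for i in np.argsort(-sims) if i != query_idx]
    return [(int(i), float(sims[i])) for i in ranked[:top_n]]

print(most_similar(0, doc_topics))  # doc 1 ranks well above doc 2
```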
The machine learning techniques/algorithms used in this scenario include:
1. Text processing and cleaning (a minimal sketch follows this list)
2. Phrase Learning
3. Topic modeling
4. Corpus summarization
5. Topical trends and anomaly detection
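To give a flavor of step 1, here is a minimal, hypothetical cleaning function (lowercasing, keeping alphabetic tokens, dropping stop words); the preprocessing in the actual walk-through is more thorough.

```python
import re

# A tiny, hypothetical stop-word list; real pipelines use a much fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "on", "for", "and", "in"}

def clean_and_tokenize(text):
    """Lowercase, keep alphabetic tokens of 2+ characters, drop stop words."""
    tokens = re.findall(r"[a-z]{2,}", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_and_tokenize("The Supreme Court's ruling on the new legislation."))
# -> ['supreme', 'court', 'ruling', 'new', 'legislation']
```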
* The **detailed documentation** for this document collection analysis scenario includes a step-by-step walk-through: https://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-document-collection-analysis.
* For code samples, click the **View Project** icon on the right and visit the project GitHub repository.
* Key components needed to run this example:
1. An [Azure account](https://azure.microsoft.com) (free trials are available).
1. An installation of Azure Machine Learning Workbench with a workspace created.
1. This example can be run on any compute context. However, it is recommended to run it on a multi-core machine with at least 16 GB of memory and 5 GB of disk space.