LDA with topterms

May 6, 2019

Using the built-in Latent Dirichlet Allocation module to cluster text
In LDA, each document in the corpus is represented as a multinomial distribution over topics, and each topic is represented as a multinomial distribution over words. Judging by these probabilities, only a small number of words are important for any given topic. The model learns the topics and the per-document topic mixtures simultaneously by iteratively sampling a topic assignment for every word in every document (in other words, by estimating a distribution over distributions), using the Gibbs sampling update.

Because every topic is a multinomial distribution over terms, the standard way of interpreting a topic is to extract its top terms: the terms with the highest marginal probability (the probability that a term belongs to the given topic). However, for tasks where the topic distributions are presented to humans as a first-order output, it may be difficult to interpret the rich statistical information encoded in the topics.

In R there is the excellent tm package (already pre-installed on the AML virtual machine), which, together with the topicmodels package, provides the LDA facility. It allows you to see the topics as this multinomial distribution, as in the following image (taken from D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 3, 2003).

![LDA top terms][1]

Another solution is the Vowpal Wabbit based module, which is memory-friendly and very easy to use. According to Microsoft Docs (https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/latent-dirichlet-allocation), this module takes a column of text and generates these outputs:

- The source text, together with a score for each category
- A feature matrix, containing extracted terms and coefficients for each category
- A transformation, which you can save and reapply to new text used as input

Because this module uses the Vowpal Wabbit library, it is very fast.

After you have followed all the steps, the module output represents all the documents with their most relevant topics and all the terms with their topics. However, this output does not look at all like a classic top terms output. Since it is saved as a dataframe, we can apply a transformation to it that converts it into our usual top terms matrix.

[1]: https://static.wixstatic.com/media/749f52_72125bcf917e4edeb721a0ca8b8ea524~mv2.png/v1/fill/w_624,h_487,al_c,lg_1,q_90/749f52_72125bcf917e4edeb721a0ca8b8ea524~mv2.webp
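
As an illustration of the kind of transformation meant here, the following is a minimal sketch in Python with pandas, not the exact script used with the module. It assumes the feature matrix has one row per term and one score column per topic. The column names `Feature`, `Topic 1` and `Topic 2`, the helper name `top_terms`, and the toy scores are all hypothetical, so adjust them to whatever your module output actually contains.

```python
import pandas as pd


def top_terms(feature_topic: pd.DataFrame, term_col: str = "Feature", n: int = 5) -> pd.DataFrame:
    """Build a top terms table: one column per topic, rows ranked by score.

    Assumes `feature_topic` has one row per term, a term column named
    `term_col`, and every other column holding that term's score for a topic.
    """
    topic_cols = [c for c in feature_topic.columns if c != term_col]
    columns = {}
    for topic in topic_cols:
        # Keep the n terms with the highest score for this topic, best first.
        best = feature_topic.nlargest(n, topic)[term_col]
        # Drop the original row labels so every topic column is indexed 0..n-1.
        columns[topic] = best.reset_index(drop=True)
    return pd.DataFrame(columns)


# Toy stand-in for the module's feature matrix (made-up terms and scores).
toy = pd.DataFrame({
    "Feature": ["gene", "dna", "brain", "neuron", "data", "computer"],
    "Topic 1": [0.40, 0.35, 0.02, 0.01, 0.12, 0.10],
    "Topic 2": [0.01, 0.02, 0.45, 0.38, 0.09, 0.05],
})

result = top_terms(toy, n=3)
print(result)
```

Each column of `result` then lists the highest-scoring terms of one topic from top to bottom, which is the layout of the classic top terms figure. Ranking each topic independently means the same term can appear under several topics, as it can in a real LDA model.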