Latent Dirichlet Allocation

What is it?

In general terms, LDA is an unsupervised method for clustering documents. It models (cleaned) documents as bags of words. It also assumes that each word (and each document) is a mixture of topics, i.e. each word (and each document) belongs to each topic with some probability. It takes the number of topics in the corpus as input and then simply assigns each word in each document a random topic. It then tries to improve on those random assignments.
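As a rough illustration of the two ideas above (bag of words and random topic assignment), here is a minimal Python sketch. The toy corpus, the value of K, and the variable names are all made up for this example and are not part of any particular LDA implementation.

```python
import random
from collections import Counter

# Hypothetical toy corpus, already cleaned and tokenized.
docs = [
    "apple banana apple fruit".split(),
    "goal match team goal".split(),
    "fruit team banana match".split(),
]
K = 2  # number of topics, supplied by the user (a guess)

# Bag of words: only the counts matter, word order is discarded.
bags = [Counter(doc) for doc in docs]

# Random initialization: one topic id in [0, K) per word occurrence.
rng = random.Random(0)
assignments = [[rng.randrange(K) for _ in doc] for doc in docs]
```

Here `bags[0]` records that "apple" occurs twice in the first document, and `assignments` holds one random topic id for every word occurrence.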

That was a very general description of LDA.

How does it work?

LDA relies on the bag-of-words model of documents. First, the number of topics K is an input to LDA (a guess!). There are D documents in total and V distinct words in the vocabulary of the document set. The generative process is:

  1. For k = 1 … K:
    1. φ(k) ~ Dirichlet(β)
  2. For each document d = 1 … D:
    1. θd ~ Dirichlet(α)
    2. For each word wi in d:
      1. zi ~ Discrete(θd)
      2. wi ~ Discrete(φ(zi))
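The generative process above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy values: the sizes K, V, D, the fixed document length, and the symmetric priors are choices made for this example. It generates synthetic documents and does not perform inference.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 3, 8, 4          # topics, vocabulary size, documents (toy values)
doc_len = 10               # words per document, fixed here for simplicity
alpha = np.full(K, 0.5)    # assumed symmetric Dirichlet prior over topics
beta = np.full(V, 0.5)     # assumed symmetric Dirichlet prior over words

# Step 1: one word distribution per topic, phi[k] ~ Dirichlet(beta).
phi = rng.dirichlet(beta, size=K)          # shape (K, V)

theta = np.empty((D, K))
docs = []
for d in range(D):
    # Step 2.1: topic proportions for document d, theta[d] ~ Dirichlet(alpha).
    theta[d] = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta[d])      # step 2.2.1: pick a topic
        w = rng.choice(V, p=phi[z])        # step 2.2.2: pick a word id from that topic
        words.append(int(w))
    docs.append(words)
```

Each entry of `docs` is a list of word ids in the range 0 … V-1. Fitting LDA to real data means going the other way: recovering `phi` and `theta` from the observed words.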

That is the whole process. But what does it mean?


In simple terms, the Dirichlet is a probability distribution with K concentration parameters. Each parameter (αk) is a real number greater than zero (αk > 0). A draw from it is a vector of K non-negative numbers that sum to one, so it can be read as a probability distribution over the K topics. As an example, consider a Dirichlet distribution for 20 documents with 4 topics, with parameters α1 = 10, α2 = 5, α3 = 3, and α4 = 20.
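The example can be reproduced numerically with NumPy. This is a minimal sketch: the seed and variable names are arbitrary, and the actual sampled values depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([10.0, 5.0, 3.0, 20.0])  # concentration parameters from the text

# 20 draws, one per document; each row is a distribution over the 4 topics.
samples = rng.dirichlet(alpha, size=20)    # shape (20, 4)

# The mean of component k is alpha_k / sum(alpha).
expected = alpha / alpha.sum()

print(samples[:3])   # first three documents
print(expected)      # theoretical mean of each topic proportion
```

Every row of `samples` sums to one. Since α4 is the largest parameter, the fourth topic gets the largest share on average, and α3, the smallest, gets the least.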



φ(k) is drawn from a Dirichlet distribution for the kth topic. Taken together, φ is a K×V matrix in which each element is the probability of the vth word under the kth topic, so each row sums to one.


Similarly, θd is drawn from a Dirichlet distribution for document d. It gives the degree to which the document belongs to each of the topics.
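To see how φ and θd work together, the probability of seeing word v in document d is the topic mixture Σk θd[k] · φ[k, v]. The sketch below uses small hand-picked numbers (K = 2, V = 3) that are purely hypothetical.

```python
import numpy as np

# Hypothetical values: K = 2 topics, V = 3 vocabulary words.
phi = np.array([
    [0.7, 0.2, 0.1],   # topic 0: word probabilities, row sums to 1
    [0.1, 0.3, 0.6],   # topic 1: word probabilities, row sums to 1
])
theta_d = np.array([0.25, 0.75])  # document d: 25% topic 0, 75% topic 1

# p(word v | document d) = sum over k of theta_d[k] * phi[k, v]
word_probs = theta_d @ phi

print(word_probs)        # approximately [0.25, 0.275, 0.475]
print(word_probs.sum())  # approximately 1.0
```

Because the document leans towards topic 1, the word that topic 1 favours (the third one) ends up the most likely word in the document.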


In simple words, the process is as follows:

  1. For each topic:
    1. Randomly initialize the probability of each word in the vocabulary belonging to the topic.
  2. For each document:
    1. Randomly initialize the probability of the current document belonging to each topic.
    2. For each word:
      1. Choose a topic zi from θd.
      2. Randomly choose a word from φ(k), where k is the topic selected in the previous step.

The last step helps us find words similar to the currently chosen one, so that they end up in the same cluster.

In the next post, I will explore the mathematics behind LDA. Any comments?
