## What is Latent Dirichlet Allocation?

At a high level, LDA is an unsupervised method for clustering documents. It models (*preprocessed*) documents as bags of words, and it assumes each word (and each document) follows a mixture model of topics, i.e. each word (and document) may belong to every topic with some probability. It takes the number of topics in the corpus as input, initially assigns each word in each document a random topic, and then iteratively refines these assignments so that words that tend to co-occur end up in the same topics.

That was a very general description of LDA.

### How does it work?

LDA builds on the bag-of-words model of documents. First of all, there are K topics, and K is an input to LDA (a guess!). We have D documents in total and V distinct words in the vocabulary of the document set. The generative process is:

- For k = 1 … K:
  - φ^(k) ∼ Dirichlet(β)

- For each document d:
  - θ_d ∼ Dirichlet(α)
  - For each word w_i in d:
    - z_i ∼ Discrete(θ_d)
    - w_i ∼ Discrete(φ^(z_i))
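To make the notation concrete, here is a minimal sketch of this generative process in Python with NumPy. The sizes, hyperparameters, and document length below are made-up values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, D = 3, 50, 5      # topics, vocabulary size, documents (made up)
alpha, beta = 0.5, 0.1  # symmetric Dirichlet hyperparameters (made up)
doc_len = 8             # words per document, fixed for simplicity

# For each topic k: phi^(k) ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)   # K x V

documents = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))  # theta_d ~ Dirichlet(alpha)
    words = []
    for _ in range(doc_len):
        z_i = rng.choice(K, p=theta_d)          # z_i ~ Discrete(theta_d)
        w_i = rng.choice(V, p=phi[z_i])         # w_i ~ Discrete(phi^(z_i))
        words.append(w_i)
    documents.append(words)

print(documents[0])  # word ids of the first generated document
```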

That is the whole generative process. But what does it mean?

### Dirichlet

In simple words, the Dirichlet is a probability distribution over probability vectors that has K concentration parameters. Each parameter α_k is a positive real number (α_k > 0). Below is an example of drawing from a Dirichlet distribution for 20 documents with 4 topics, with parameters α_1 = 10, α_2 = 5, α_3 = 3, and α_4 = 20.
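Here is a minimal sketch of that draw in Python with NumPy (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Concentration parameters for the 4 topics, from the example above.
alpha = [10, 5, 3, 20]

# Draw topic proportions for 20 documents; each row sums to 1.
theta = rng.dirichlet(alpha, size=20)

print(theta.shape)        # (20, 4)
print(theta[0])           # one document's mixture over the 4 topics
print(theta.sum(axis=1))  # every row sums to ~1.0
```

Note how topics with a larger α_k (here the fourth, with α_4 = 20) tend to receive a larger share of each document's probability mass.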

#### φ^(k)

This is the word distribution for the k-th topic, drawn from a Dirichlet. Stacked together, φ is a K×V matrix in which each element is the probability that the v-th word belongs to the k-th topic.

#### θ_d

Similarly, θ_d is a Dirichlet draw for document d. It gives the probability that the document belongs to each of the K topics.
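To make the shapes concrete, here is a small sketch (NumPy again; the sizes and the symmetric α and β values are made up) showing how the two matrices relate:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D = 4, 1000, 20                      # illustrative sizes
phi = rng.dirichlet(np.full(V, 0.1), K)    # K x V topic-word matrix
theta = rng.dirichlet(np.full(K, 0.5), D)  # D x K document-topic matrix

print(phi[2, 17])         # probability of word 17 under topic 2
print(theta[5, 2])        # probability that document 5 uses topic 2
print(phi.sum(axis=1))    # each topic's word probabilities sum to 1
print(theta.sum(axis=1))  # each document's topic mixture sums to 1
```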

## Finally

In simple words, the process is as follows:

- For each topic:
  - Randomly initialize the probability of each vocabulary word belonging to the topic.

- For each document:
  - Randomly initialize the probability of the document belonging to each topic.
  - For each word:
    - Choose a topic z_i from θ_d.
    - Choose a new word from φ^(k), where k is the topic selected in the previous step.

The last step helps us place words similar to the chosen one in the same cluster.
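As a rough sketch of what the random-initialization step could look like in code (this is only the setup, not the full inference, which the next post will cover; the toy corpus and all names here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy corpus: each document is a list of word ids (made up).
docs = [[0, 3, 3, 7], [1, 4, 7, 7, 2], [5, 6, 0, 3]]
K, V = 2, 8  # number of topics (guessed) and vocabulary size

# Randomly assign every word in every document to a topic,
# and keep the counts that later updates would work with.
doc_topic = np.zeros((len(docs), K), dtype=int)  # documents x topics
topic_word = np.zeros((K, V), dtype=int)         # topics x words
assignments = []

for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        z = rng.integers(K)  # random initial topic for this word
        doc_topic[d, z] += 1
        topic_word[z, w] += 1
        z_d.append(z)
    assignments.append(z_d)

print(doc_topic)   # how many words of each document sit in each topic
print(topic_word)  # how often each word is assigned to each topic
```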

In the next post, I will explore the mathematics behind LDA. Any comments?