Latent Dirichlet allocation

A way of thinking about topic modeling is as a matrix factorization problem. For M documents, V vocabulary words, and K topics:
M×K (topic assignments per document) times K×V (affinity of words to topics, i.e. a distribution over words for each topic) = M×V document-word matrix

If you do this factorization with an SVD, the method is called LSA (latent semantic analysis), which is very popular for information retrieval.
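The shape bookkeeping above can be sketched in numpy. This is a toy sketch with assumed sizes and random distributions, just to show that (M×K) times (K×V) yields an M×V matrix of word probabilities per document:

```python
import numpy as np

M, K, V = 5, 3, 10  # documents, topics, vocabulary size (toy numbers)

rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(np.ones(K), size=M)   # M x K: topic mix per document
topic_word = rng.dirichlet(np.ones(V), size=K)  # K x V: word distribution per topic

doc_word = doc_topic @ topic_word               # M x V: expected word distribution per document
print(doc_word.shape)                           # (5, 10)
```

Since each factor is row-stochastic, every row of the product is itself a distribution over the vocabulary.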

Dirichlet distribution:
It is a multivariate generalization of the beta distribution (the conjugate prior for the Bernoulli, binomial, negative binomial, and geometric distributions), hence its alternative name, the multivariate beta distribution (MBD).

Dirichlet distributions are commonly used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution.
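The conjugacy can be demonstrated with numpy (a small sketch with assumed prior and counts): a Dirichlet draw is a probability vector, and after observing multinomial counts the posterior is again a Dirichlet whose parameter is the prior plus the counts.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet prior over 3 categories
theta = rng.dirichlet(alpha)        # one draw: a probability vector summing to 1

# Conjugacy: observe multinomial counts, and the posterior is again Dirichlet
counts = np.array([5, 2, 0])        # assumed observed category counts
posterior_alpha = alpha + counts    # posterior is Dir(alpha + counts)
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_mean)               # mass shifts toward frequently observed categories
```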

Each topic is a distribution over words, drawn from a Dirichlet distribution (with parameter λ).

A single document is a distribution over topics, drawn from a Dirichlet distribution (with parameter α).

Generative model of LDA:
For every document we have a Dirichlet distribution over all the possible topics it could use;
from it, the document draws the mix of topics it's going to talk about.
To generate the words in a document, each word position first selects a topic, and this comes from the document's multinomial distribution over topics. Say the first word chooses the entertainment topic: you then go to that topic, which is itself a multinomial distribution over words, and select some word to use.

Start with a Dirichlet distribution parametrized by λ.
Take K draws (one for each topic);
this gives us the distributions over words β (one per topic).

This shows that each topic is represented as a distribution over words.

Then draw each document's distribution over topics θ from a Dirichlet distribution with parameter α.

K = number of topics
V = vocabulary size
M = number of documents

α ∈ ℝ^K
for i ∈ {1, …, M}:
    θ_i ~ Dir(α), θ_i ∈ ℝ^K (topic proportions of document i)
    for each word position j in document i:
        z_ij ~ Multinomial(θ_i) (topic assignment), then w_ij ~ Multinomial(β_{z_ij})
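The generative process described above can be simulated end to end with numpy. This is a toy sketch: sizes, the fixed document length N, and the prior values α and λ are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

K, V, M = 3, 8, 4        # topics, vocabulary size, documents
N = 20                   # words per document (fixed here for simplicity)
alpha = np.full(K, 0.5)  # Dirichlet prior over topics per document
lam = np.full(V, 0.1)    # Dirichlet prior over words per topic (lambda in the notes)

beta = rng.dirichlet(lam, size=K)        # K draws -> K x V topic-word distributions

docs = []
for i in range(M):
    theta = rng.dirichlet(alpha)         # theta_i: topic mixture for document i
    z = rng.choice(K, size=N, p=theta)   # topic assignment for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # word drawn from its topic
    docs.append(w)

print(len(docs), docs[0].shape)          # 4 documents, 20 word ids each
```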

Fitting LDA with Gibbs sampling

Say you have K topics.
For each document, assign each word a topic at random.
This gives a random representation of all the documents and word distributions of all the topics.
Then iterate:
For each document d
For each word w
For each topic t
Compute two things:
1) p(topic t | document d) = proportion of words in document d currently assigned to topic t
2) p(word w | topic t) = proportion of all words assigned to topic t, across all documents, that are the word w
choose topic t with probability p(topic t | document d) * p(word w | topic t) for the word. (According to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability.)

We’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
Resampling each word's topic from that conditional probability is called Gibbs sampling.
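The loop above can be written out as a collapsed Gibbs sampler. This is a minimal sketch: the function name, the smoothing hyperparameters alpha and eta (which play the role of the Dirichlet priors α and λ above), and the toy documents are assumptions, not part of the notes.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of 1-D arrays of word ids; K topics, V vocabulary size.
    Returns per-word topic assignments and the topic-word count matrix.
    """
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))   # words in document d assigned to topic k
    nkw = np.zeros((K, V))           # occurrences of word w assigned to topic k
    nk = np.zeros(K)                 # total words assigned to topic k
    z = []                           # current topic of every word position

    # Step 1: assign each word a topic at random
    for d, doc in enumerate(docs):
        zd = rng.integers(0, K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1
            nkw[t, w] += 1
            nk[t] += 1

    # Step 2: repeatedly resample each word's topic from
    # p(t | rest) proportional to p(topic t | document d) * p(word w | topic t)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                t = z[d][n]
                # remove the current assignment from the counts
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # smoothed conditional probability for each topic
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                p /= p.sum()
                t = rng.choice(K, p=p)
                z[d][n] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return z, nkw

# Tiny usage example: two documents over a 4-word vocabulary
z, nkw = gibbs_lda([np.array([0, 1, 0, 1]), np.array([2, 3, 2, 3])], K=2, V=4)
```

Subtracting the current word's assignment before computing the conditional is what makes this "assume everything else is correct, then update the current word".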

Create a matrix X where rows are terms and columns are documents. Each entry is the count of the term in that document (or its tf-idf weight).
Take the SVD (singular value decomposition) of the matrix.
SVD gives you X = UΣVᵀ, where U and V are orthogonal matrices and Σ is a diagonal matrix.
The entries of the Σ matrix are the singular values (square roots of the eigenvalues of XᵀX), and the columns of U and V are the left and right singular vectors, corresponding to term and document vectors.
This is simply a re-representation of the X matrix using orthogonal indexing dimensions.
LSA uses a truncated SVD, keeping only the k largest singular values and their associated vectors, so X ≈ U_k Σ_k V_kᵀ.
The rows of U_k are the term vectors in LSA space and the rows of V_k are the document vectors.
Since both passages and terms are represented as vectors, it is straightforward to compute passage-passage, term-term, and term-passage similarity. In addition, terms and/or passages can be combined to create new vectors in the space; the process by which new vectors are added to an existing LSA space is called folding-in.

The cosine between vectors is used as the measure of their similarity for many applications because of its relation to the dot-product criterion and has been found effective in practice, although other metrics, such as Euclidean or city-block distances, are sometimes used.

Accurately updating the SVD to take into account new term frequencies and/or new terms requires considerable computation; minor perturbations to the original term-by-document matrix X can produce different term and passage vectors (and therefore affect precision and recall for query matching).
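The whole LSA pipeline fits in a few lines of numpy. This is a toy sketch with an assumed 4-term, 4-document count matrix; documents 0 and 2 share terms, so their vectors should end up close in the truncated space.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (raw counts)
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                          # keep the k largest singular values
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

term_vecs = Uk * sk            # rows: term vectors in LSA space
doc_vecs = Vtk.T * sk          # rows: document vectors in LSA space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 2 share vocabulary, documents 0 and 1 do not
print(cosine(doc_vecs[0], doc_vecs[2]))  # close to 1
print(cosine(doc_vecs[0], doc_vecs[1]))  # close to 0
```

Scaling the singular vectors by sk keeps the dot products in the truncated space consistent with the rank-k approximation of X.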