Topic modeling is becoming ever more relevant, especially since most of the data being generated is unstructured. So I thought I would briefly explain what topic modeling is. I hope it gives you a starting point to explore more.

Topic Modeling

Topic modeling is a technique that automatically identifies groups of words that tend to occur together in a large collection of documents. A “topic model” assigns every word in every document to one of a given number of topics. Every document is modeled as a mixture of topics in different proportions. A topic is a distribution over words: a measure of how likely given words are to co-occur in a document.

One approach to finding documents on a topic is keyword-based search, but it has its limitations.

Keyword-based Search Limitations

  • A typical Information Retrieval problem:

Example: Suppose we search for the keyword computers in a document collection. We may miss documents that do not contain computers but do contain PC, laptop, desktop, etc.

  • Synonymy: words or phrases that have similar meanings e.g., car & automobile, hood & bonnet
  • Polysemy: words that have more than one distinct meaning e.g., the occurrence of chair in “the chair of the board” and “the chair maker”
  • When we search for documents we usually look for concepts or topics, not keywords.

Because of these drawbacks, we explore probabilistic techniques, in which we treat text as features.

Vector Space Modeling (VSM)

A simple approach: treat a document as a bag of words, ignoring any word ordering in the document. The document's features are the appearance frequencies of its unique words.

  1. A typical solution for Keyword Search
  2. Converts a corpus into a term-document matrix, each cell represents a term’s relative frequency
  3. Translates a document or keyword query into a vector in vector space
  4. Measures similarity between documents by cosine scores—small angle ≡ large cosine ≡ similar
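
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production retrieval system: the helper names `bag_of_words` and `cosine_similarity` and the three toy documents are made up for this example, and raw counts are used instead of relative frequencies (the cosine score is unaffected by rescaling a document vector).

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word (word order is ignored)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document or query is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy corpus and query, for illustration only.
docs = [
    "the laptop has a fast processor",
    "the desktop has a fast processor",
    "the chair maker sells wooden chairs",
]
query = bag_of_words("fast laptop")
scores = [cosine_similarity(query, bag_of_words(d)) for d in docs]
```

The first document shares both query words and scores highest, the second shares only one, and the third shares none and scores zero, even though a reader might consider a desktop document just as relevant as a laptop one.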

Term-Frequency Inverse-Document-Frequency (TF-IDF)


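tf-idf(t, d) = tf(t, d) × log(D / df(t)), d = 1, 2, ..., D; t = 1, 2, ..., V

where tf(t, d) = the frequency of term t in document d, df(t) = the number of documents in which term t appears, D = the number of documents, and V = the number of unique terms (the vocabulary size)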
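
As a rough sketch of the formula, the function below computes tf × log(D / df) for a small corpus. The function name `tf_idf`, the whitespace tokenizer, the natural logarithm, and the three toy documents are all assumptions made for illustration; real libraries often use smoothed or normalized variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(D / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(tokenized)  # D in the formula
    # df[t] = number of documents in which term t appears at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(num_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, for illustration only.
docs = ["the car is fast", "the automobile is fast", "the chair is wooden"]
weights = tf_idf(docs)
```

A term such as "the", which appears in every document, gets weight log(3/3) = 0, while a term that appears in only one document, such as "car", gets the largest weight.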

Advantages:

  • Reflects how important a word is to a document in the collection
  • Down-weights terms that are common across the corpus


Disadvantages:

  • Synonymy may cause a small cosine similarity between documents that are in fact related, which leads to poor recall (the fraction of all relevant documents that are retrieved)
  • Polysemy may cause a large cosine similarity between documents that are in fact unrelated, which leads to poor precision (the fraction of retrieved documents that are relevant)
  • The dimensions, D and V, can be very large

Latent Semantic Analysis (LSA)

  • Typically works on the TF-IDF matrix X
  • An approximation based on Singular Value Decomposition:

X ≈ X' = T' × S' × D'^T

T' contains selected eigenvectors of X X^T

S' contains selected singular values of X

D' contains selected eigenvectors of X^T X
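
A minimal sketch of this approximation using NumPy's `numpy.linalg.svd` is shown below. The helper names `truncated_svd` and `cosine`, the choice k = 2, and the toy term-document matrix are assumptions made for illustration; raw counts are used for brevity, whereas LSA typically starts from TF-IDF weights.

```python
import numpy as np

def truncated_svd(X, k):
    """Return T', S', D'^T built from the k largest singular values of X."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)  # singular values come sorted, largest first
    return T[:, :k], np.diag(s[:k]), Dt[:k, :]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-document matrix: rows are terms, columns are documents.
X = np.array([
    [1.0, 0.0, 0.0, 0.0],  # car
    [0.0, 1.0, 0.0, 0.0],  # automobile
    [1.0, 1.0, 0.0, 0.0],  # engine
    [0.0, 0.0, 1.0, 1.0],  # chair
    [0.0, 0.0, 1.0, 0.0],  # table
])

T_k, S_k, Dt_k = truncated_svd(X, k=2)
X_approx = T_k @ S_k @ Dt_k   # X' = T' S' D'^T
doc_coords = S_k @ Dt_k       # each column is a document in the 2-D latent space

raw_sim = cosine(X[:, 0], X[:, 1])
latent_sim = cosine(doc_coords[:, 0], doc_coords[:, 1])
```

In this toy matrix the first two documents share only the word "engine" (one says "car", the other "automobile"), so their raw cosine is 0.5. In the two-dimensional latent space they point in the same direction, which is the sense in which LSA can soften the synonymy problem.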


Advantages:

  • Identifies a linear subspace in the space of TF-IDF features, giving significant compression in large collections
  • Can achieve decent retrieval results and handles synonymy problems to some extent
  • Easy to implement

Disadvantages:

  • It’s a linear model, so it may not find nonlinear dependencies between words or documents
  • It’s not a probabilistic model
  • Not very readable, because it produces dense features

Latent Dirichlet Allocation (LDA)

A generative probabilistic model that assumes

  • Document as a bag of words
  • Topic as a distribution over a fixed vocabulary
  • Words are generated from document specific topic distributions
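
The generative assumptions above can be sketched with the standard library alone. This is a simplified illustration: the function names, the two hand-written topics, and the Dirichlet parameter `alpha` are made up for the example, and the topics are fixed by hand here, whereas full LDA also draws each topic's word distribution from a Dirichlet prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one bag-of-words document under the LDA assumptions.

    topics: list of {word: probability} dicts over a fixed vocabulary.
    Returns the document's topic proportions and its (word, topic) pairs.
    """
    theta = sample_dirichlet(alpha, rng)  # document-specific topic proportions
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic for this word
        vocab, probs = zip(*topics[z].items())
        w = rng.choices(vocab, weights=probs)[0]               # pick a word from that topic
        words.append((w, z))
    return theta, words

# Hypothetical topics over a four-word vocabulary, for illustration only.
topics = [
    {"car": 0.5, "engine": 0.4, "chair": 0.05, "wood": 0.05},
    {"car": 0.05, "engine": 0.05, "chair": 0.5, "wood": 0.4},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=20, rng=rng)
```

Each generated word carries its own topic label, and `theta` records how the document mixes the two topics.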


  • Given the corpus, infer the hidden structures:
    • Per-word topic assignment
    • Per-document topic proportions—dimensionality reduction
    • Per-corpus topic distributions
  • Use them to perform information retrieval, document clustering, exploration, etc.
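
One way to infer these hidden structures is collapsed Gibbs sampling, sketched below in plain Python. This is a swapped-in technique chosen for brevity, not necessarily the inference method any particular LDA library uses. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total words per topic
    assign = []

    def update(d, w, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][w] += delta
        topic_total[z] += delta

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(num_topics) for _ in doc])
        for w, z in zip(doc, assign[d]):
            update(d, w, z, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # remove the current assignment
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                assign[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)

    # Smoothed estimates of per-document topic proportions and per-topic word distributions.
    theta = [[(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
           for k in range(num_topics)]
    return assign, theta, phi

# Hypothetical corpus: word ids index into this vocabulary.
vocab = ["car", "engine", "chair", "wood"]
docs = [[0, 1, 0, 1], [1, 0, 0], [2, 3, 2], [3, 2, 3, 2]]
assign, theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
```

The three return values correspond to the three hidden structures listed above: `assign` holds the per-word topic assignments, `theta` the per-document topic proportions, and `phi` the per-corpus topic distributions over the vocabulary.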

Once the domain of the documents to be analyzed is defined, a dictionary needs to be created for the likely topics. For example

Hybrid Model: Neural Network-Latent Topic Model

A hybrid model combines a neural network with a latent topic model. The neural network provides a low dimensional embedding for the input data, whose subsequent distribution is captured by the topic model. The neural network thus acts as a trainable feature extractor while the topic model captures the group structure of the data. Following an initial pre-training phase to separately initialize each part of the model, a unified training scheme is introduced that allows for discriminative training of the entire model.

The hybrid model combines a probabilistic model, the hierarchical topic model (HTM), with a neural network (NN).

The hierarchical topic model is combined with the neural network to form a hybrid model by treating the output x of the network as the bottom nodes of the hierarchy.

Neural Network: A neural network with two hidden layers, as shown in Fig. 1(a). The first hidden layer is a sigmoid layer that maps the input features v into a binary representation h via a sigmoid function, i.e. h = σ(w1 v + b1), where σ(t) = 1/(1 + exp(−t)) and w1, b1 are the parameters of this layer. The second hidden layer performs a linear dimension reduction x = h w2 + b2, with w2, b2 being its parameters. The output x of these units corresponds to the transformation f_w(v) provided by the whole network. An arbitrary number of extra hidden layers could be inserted between these two layers if a more complex transformation is preferred. Let w = {w1, b1, w2, b2} denote all parameters of the network. The network is trained by backpropagation on w. The initialization of w is obtained by learning a Restricted Boltzmann Machine (RBM) with the same structure as the network.
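
The forward pass of this two-layer network can be sketched as follows. This covers only the computation of h and x; training by backpropagation and the RBM initialization are not shown. The function names, the list-of-lists weight layout, and the example weights are assumptions made for illustration.

```python
import math

def sigmoid(t):
    """Logistic function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def forward(v, w1, b1, w2, b2):
    """Compute x = f_w(v): a sigmoid hidden layer followed by a linear layer.

    w1 has one row per hidden unit (with len(v) columns), so h = sigma(w1 v + b1).
    w2 has one row per hidden unit and one column per output unit, so x = h w2 + b2.
    """
    h = [sigmoid(sum(wij * vj for wij, vj in zip(row, v)) + bi)
         for row, bi in zip(w1, b1)]
    x = [sum(h[i] * w2[i][j] for i in range(len(h))) + b2[j]
         for j in range(len(b2))]
    return h, x

# Hypothetical input and parameters: 3 input features, 2 hidden units, 1 output unit.
v = [1.0, 0.0, 2.0]
w1 = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
b1 = [0.0, 0.1]
w2 = [[1.0], [-2.0]]
b2 = [0.5]
h, x = forward(v, w1, b1, w2, b2)
```

Here `x` is the low-dimensional embedding that would be passed on to the topic model as the bottom nodes of the hierarchy.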


  • The resulting model combines the strengths of the two approaches: the deep belief network provides a powerful non-linear feature transformation for the domain-appropriate topic model.
  • The fully trained hybrid model further improves the classification accuracy to 70.1%, which is significantly better than HTM.
