Topic modeling has ever-growing relevance, especially since most of the data being generated today is unstructured. In this post I briefly explain what topic modeling is; I hope it gives you a starting point to explore further.

**Topic Modeling**

Topic modeling is a technique that automatically identifies groups of words that tend to occur together in a large collection of documents. A “topic model” assigns every word in every document to one of a given number of topics. Every document is modeled as a mixture of topics in different proportions. A topic is a distribution over words—a measure of how likely given words are to *co-occur* in a document.

One common approach is keyword-based search, but it has its limitations.

**Keyword-based Search Limitations**

- A typical Information Retrieval problem:

Example: Suppose we search for the keyword *computers* in a document collection. We may miss documents that do not contain *computers* but do contain *PC*, *laptop*, *desktop*, etc.

- **Synonymy**: words or phrases that have similar meanings, e.g., car & automobile, hood & bonnet
- **Polysemy**: words that have more than one distinct meaning, e.g., the occurrence of chair in “the chair of the board” and “the chair maker”
- When we search for documents we usually look for concepts or topics, not keywords.
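The synonymy problem is easy to see in code. Below is a minimal sketch of a naive keyword search over a toy corpus (the documents are illustrative, not from the original text): a relevant document that uses a synonym is simply never retrieved.

```python
# Naive keyword search: exact substring matching only.
docs = [
    "I bought a new computer yesterday",
    "This laptop has a fast processor",   # relevant, but never says "computer"
]

# Retrieve every document containing the literal keyword.
hits = [d for d in docs if "computer" in d.lower()]
# Only the first document is retrieved; the laptop document is missed.
```

Topic models address this by grouping *computer*, *laptop*, *PC*, etc. into the same topic, so retrieval can work at the concept level.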

As a result of these drawbacks, we explore probabilistic techniques, where we treat text as features.

**Vector Space Modeling (VSM)**

A simple approach: treat a document as a bag of words—ignoring any word ordering in the document—and take the document’s features to be the appearance frequencies of its unique words.

- A typical solution for Keyword Search
- Converts a corpus into a term-document matrix, each cell represents a term’s relative frequency
- Translates a document or keyword query into a vector in vector space
- Measures similarity between documents by cosine scores—small angle ≡ large cosine ≡ similar
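The steps above can be sketched with scikit-learn. The toy corpus and query below are illustrative assumptions, not from the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

# Build the term-document matrix (documents as rows, terms as columns).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Translate a keyword query into the same vector space.
query = vectorizer.transform(["cat on a mat"])

# Cosine scores: small angle ≡ large cosine ≡ similar.
scores = cosine_similarity(query, X)[0]
best = scores.argmax()   # index of the most similar document
```

Here the first document shares the most terms with the query, so it gets the highest cosine score, while the unrelated third document scores lowest.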

**Term-Frequency Inverse-Document-Frequency (TF-IDF)**

tf-idf(t, d) = tf(t, d) × log(D / df(t)), d = 1, 2, . . . , D; t = 1, 2, . . . , V

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents where term t appears, D is the number of documents, and V is the vocabulary size.
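The formula can be computed directly with numpy. This is a minimal sketch using the raw tf × log(D/df) weighting given above (library implementations such as scikit-learn use a smoothed variant); the toy counts are illustrative:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
tf = np.array([
    [3, 0, 1],   # term appears in 2 of the 3 documents
    [1, 1, 1],   # term appears in every document
    [0, 4, 0],   # term appears in only 1 document
])

D = tf.shape[1]                      # number of documents
df = np.count_nonzero(tf, axis=1)    # document frequency of each term
idf = np.log(D / df)                 # log(D / df), as in the formula above
tfidf = tf * idf[:, np.newaxis]      # tf-idf weight for each (term, document) cell
```

Note how the second term, which appears in every document, gets idf = log(1) = 0 and is zeroed out—this is exactly how tf-idf down-weights common terms in the corpus.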

**Advantages:**

- Reflects how important a word is to a document in the collection
- Helps to down-weight common terms in the corpus

**Limitations:**

- Synonymy may cause small cosine similarity between documents that are in fact related, leading to poor Recall—the fraction of all relevant documents that are retrieved
- Polysemy may cause large cosine similarity between documents that are in fact unrelated, leading to poor Precision—the fraction of retrieved documents that are relevant
- Dimensionality, D and V , can be very large

**Latent Semantic Analysis (LSA)**

- Typically works on the TF-IDF matrix X

- An approximation based on Singular Value Decomposition:

X ≈ X’ = T’ × S’ × D’ᵀ

where

- T’ contains selected eigenvectors of XXᵀ
- S’ contains selected singular values of X
- D’ contains selected eigenvectors of XᵀX
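In scikit-learn, LSA is exactly a truncated SVD applied to the TF-IDF matrix. A minimal sketch with an illustrative toy corpus (the documents and the choice of 2 latent dimensions are assumptions for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "car engine repair",
    "automobile motor maintenance",
    "stock market prices",
    "share prices fell",
]

# TF-IDF matrix X (documents as rows, terms as columns).
X = TfidfVectorizer().fit_transform(corpus)

# Truncated SVD keeps only the top singular values/vectors.
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsa = svd.fit_transform(X)   # documents projected into the 2-D latent subspace
```

Each row of `X_lsa` is a dense 2-dimensional representation of a document—this is the compression (and the loss of readability) described above.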

**Advantages**

- Identifies a linear subspace in the space of TF-IDF features—significant compression in large collections
- Can achieve decent retrieval results—handles Synonymy problems to some extent
- Easy to implement

**Limitations:**

- It’s a linear model—may not find nonlinear dependencies between words or documents
- It’s not a probabilistic model
- Not very readable—produces dense features.

**Latent Dirichlet Allocation (LDA)**

A generative probabilistic model that assumes

- Document as a bag of words
- Topic as a distribution over a fixed vocabulary
- Words are generated from document-specific topic distributions

**Advantages:**

- Given the corpus, infer the hidden structures:
  - Per-word topic assignments
  - Per-document topic proportions—dimensionality reduction
  - Per-corpus topic distributions
- Use them to perform information retrieval, document clustering, exploration, etc.

Once the domain of the documents to be analyzed is defined, a dictionary needs to be created for the likely topics.
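A minimal LDA sketch using scikit-learn, with an illustrative toy corpus of two obvious themes (hardware and sports); the corpus and the choice of 2 topics are assumptions for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "cpu gpu memory chip",
    "chip memory processor cpu",
    "soccer goal match team",
    "team match player goal",
]

# LDA works on raw word counts (bag of words), not TF-IDF.
counts = CountVectorizer().fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic proportions
# lda.components_ holds the per-topic word weights (unnormalized distributions)
```

Each row of `doc_topics` is a document’s mixture over the two topics and sums to 1—these are the per-document topic proportions the model infers.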

**Hybrid Model: Neural Network-Latent Topic Model**

A hybrid model combines a neural network with a latent topic model. The neural network provides a low dimensional embedding for the input data, whose subsequent distribution is captured by the topic model. The neural network thus acts as a trainable feature extractor while the topic model captures the group structure of the data. Following an initial pre-training phase to separately initialize each part of the model, a unified training scheme is introduced that allows for discriminative training of the entire model.

The hybrid model is a combination of a probabilistic model, hierarchical topic model (HTM), and a neural network (NN).

The hierarchical topic model is combined with the neural network to form a hybrid model by treating the output x of the network as the bottom nodes of the hierarchy.

Neural Network: A neural network with two hidden layers, as shown in Fig. 1(a). The first hidden layer is a sigmoid layer which maps the input features v into a binary representation h via a sigmoid function, i.e., h = σ(w1v + b1), where σ(t) = 1/(1 + exp(−t)) and w1, b1 are the parameters of this layer. The second hidden layer performs linear dimension reduction, x = hw2 + b2, with w2, b2 being parameters. The output units x correspond to the transformation fw(v) provided by the whole network. An arbitrary number of extra hidden layers could be inserted between these two layers if a more complex transformation is preferred.

Let w = {w1, b1, w2, b2} denote all parameters of the network. Training the network is performed by backpropagation on w. The initialization of w is obtained by learning a Restricted Boltzmann Machine (RBM) with the same structure as the network.
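The forward pass of this two-hidden-layer network can be sketched in numpy. The dimensions (10 input features, 8 sigmoid units, 2 output units) and the random weights are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 10 inputs -> 8 sigmoid units -> 2 linear outputs.
w1, b1 = rng.normal(size=(8, 10)), np.zeros(8)
w2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_w(v):
    """Forward pass of the two-hidden-layer network described above."""
    h = sigmoid(w1 @ v + b1)   # sigmoid layer: h = sigma(w1 v + b1)
    x = h @ w2 + b2            # linear dimension reduction: x = h w2 + b2
    return x

v = rng.normal(size=10)
x = f_w(v)   # low-dimensional embedding x, the bottom nodes of the topic hierarchy
```

In the hybrid model, this output x is what the hierarchical topic model treats as its observed bottom-level data.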

**Advantages:**

- The resulting model combines the strengths of the two approaches: the deep belief network provides a powerful non-linear feature transformation for the domain-appropriate topic model.
- The fully trained hybrid model further improves the classification accuracy to 70.1%, which is significantly better than the HTM alone.