Client Background

The client is one of China’s premier financial services portals, providing stock market news, personal finance advice and the latest happenings in the financial world across the globe. Its website attracts millions of Chinese visitors per month who engage with its content. Subscribed visitors are also served content as email digests and browser notifications.


Business Objective

For a digital content provider, it is essential to keep visitors engaged by serving them meaningful and relevant content at the right time. Millions of articles covering various aspects of finance, stock markets and news need to be analyzed and recommended. Until this project, a rule-based approach with text mining had mostly been used for content discovery and for recommending articles similar to a given article. This yielded results to an extent but was not expected to scale. The client wanted to adopt a more sophisticated approach with Natural Language Processing and Machine Learning as the foundation.



Our team suggested narrowing the scope of the solution to suggesting relevant articles for any page visit. This had to be done in real time, with the current page visit available as context. Integrating the algorithms into the current platform and scaling them to serve real-time recommendations would be handled by the client’s technology team.


Our approach involved the following steps:


  1. Convert every article into a feature vector using n-grams.
  2. Score each article against the other articles to determine a similarity score. Several techniques exist, so we would use more than one and evaluate the results.
  3. Overlay additional business rules on top of the similarity score. These rules cover trending articles, popular articles, date of publication and geography preferences.
  4. Limit the initial exercise to a sample dataset. Once validated by the client, scale the solution in collaboration with the client’s technology team.
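
Steps 1 and 2 can be illustrated with a minimal sketch. It assumes word-level unigrams and bigrams as the n-gram features and cosine similarity as one of the candidate scoring techniques; the case study does not state which n-gram sizes or similarity measures were actually used, and the function names here are hypothetical.

```python
import math
from collections import Counter


def ngram_vector(text, n_max=2):
    """Count word n-grams of length 1..n_max (assumed sizes, not from the case study)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, top_k=3):
    """Return the top_k (article_id, score) pairs most similar to query_id."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

On a real corpus the raw counts would typically be replaced by a weighted scheme such as TF-IDF, and the pairwise scoring would be distributed rather than computed in a single process; this sketch only shows the shape of the computation.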


The client provided us with a sample article corpus in an Amazon S3 bucket. We created a Spark cluster on AWS for processing. Every article was converted into a feature vector and scored against the others for a similarity rating. We also created business rules for computing popularity scores and identifying trending articles, factoring in time decay and visits.
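
One way to factor time decay and visits into a trending score is sketched below. The exponential form, the 24-hour half-life and the function name are assumptions made for illustration; the actual business rules are not described in this case study.

```python
def trending_score(visit_ages_hours, half_life_hours=24.0):
    """Sum of visits, each weighted by an exponential decay on its age.

    A visit that just happened counts as 1.0; a visit that is one
    half-life old counts as 0.5, two half-lives old as 0.25, and so on.
    The half-life value is a hypothetical tuning parameter.
    """
    return sum(0.5 ** (age / half_life_hours) for age in visit_ages_hours)
```

With this rule an article with a few very recent visits can outrank one with many old visits, which is the behaviour usually wanted from a "trending" signal as opposed to an all-time "popular" one.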


The working solution was demonstrated through a simulated incoming article request, which was used to look up the feature vector of the current article and discover articles with high similarity scores. The resulting set of articles was further segmented by popularity, trending status and time decay factor, along with other visitor preferences.
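
The request flow described here, where similar articles are retrieved and then adjusted by business signals, could look roughly like the sketch below. The weighted blend, the weights, the region boost and all names are hypothetical; the case study does not specify how the signals were combined.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    article_id: str
    similarity: float  # content similarity to the current article, 0..1
    trending: float    # time-decayed visit score, normalised to 0..1
    region: str        # region the article is targeted at


def rank_candidates(candidates, visitor_region,
                    w_similarity=0.7, w_trending=0.3,
                    region_boost=0.1, top_k=5):
    """Blend similarity with business signals and return the top_k candidates.

    All weights are illustrative assumptions, not values from the project.
    """
    def score(candidate):
        base = (w_similarity * candidate.similarity
                + w_trending * candidate.trending)
        # Small bonus when the article matches the visitor's region.
        bonus = region_boost if candidate.region == visitor_region else 0.0
        return base + bonus

    return sorted(candidates, key=score, reverse=True)[:top_k]
```

Keeping the business rules as a separate re-ranking step over a precomputed similarity list means the rules can be tuned without recomputing the similarity scores.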


  • Removed the need for manual classification through automated tagging at around 80% accuracy. Accuracy was verified manually by the client’s team.
  • A controlled pilot on a sample of visitors increased visitor engagement by up to 20%.