Authors: Shashank Raj (Business Analyst), Kavita Yadav (HR)

Compensation is one of the key culture-definers for any organization. Every organization strives to set the best compensation for its employees, both during talent acquisition and during the internal appraisal cycle. This is crucial: an underpaid employee is more likely to leave, while overpaying can hurt the company’s profitability.

But determining the “right” compensation can be tricky, because a number of factors play a role in setting compensation rates that are both fair and competitive.

Problem Statement:

In an age of fierce competition, companies increasingly need to benchmark their compensation structures periodically. This has lately been a challenge, and companies mainly rely on external benchmarking agencies such as Mercer, Michael Page, Kelly Services, Glassdoor and PayScale.

Business Need:

The key objectives of the solution are to:

  • Build a data crawling engine to extract industry-standard compensation data from open public sources
  • Standardise the data for business consumption
  • Develop a Machine Learning (ML) based compensation estimation model, with compensation cuts across Experience, Skills, Domain and Education

Proposed Solution:

Platforms like Glassdoor are crowdsourced and hold a database of more than a million salaries and reviews. Other global and country-specific reports are published by different agencies and give a true picture of the market compensation structure under prevailing economic conditions. For example, the U.S. Bureau of Labor Statistics and PayScale’s report for India provide comprehensive statistics on salary structures across sectors.

Instead of relying only on the old indicators of age and tenure to estimate compensation, ML-based algorithms take into account many additional factors, such as recent changes in role, pay level, rates of change in pay and incentive eligibility, to refine the prediction. This allows companies to manage the compensation of their employees more effectively.

Data Crawling Engine: What, Why, How?

Data crawling is a data extraction process: it collects data from the World Wide Web or, more broadly, from any document or file. For this project, we intend to extract compensation data from public data sources and published reports. The data points we look for are:

Job level: Executive, Manager, Individual Contributors 

Job description: Organisation, Size of the organization, Location 

Sector: IT, Engineering, Health Care, Life Sciences, Automobile, Banking and Financial Services etc.

Department: Operations, Sales, Marketing, Technology, HRM etc.

Compensation Structure: Fixed Pay, Variable Pay, Retention Bonus, ESOPs, Rewards and Recognition, Other Benefits (food coupons, insurance, employee discounts, performance bonus, child care policies, vacation and time-offs)

The approach would look something like this: 

  1. The crawler goes to your predefined target (a website or report)
  2. Discovers the salary pages
  3. Finds the salary details (industry, department, description, CTC)
  4. Scrapes the information
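
Steps 3 and 4 above can be sketched in Python as follows. This is a minimal illustration, not the crawler used in the project: the page markup, the `salary-row` class and the `data-field` attributes are hypothetical, and an inline string stands in for a page that a real crawler would first download and discover (steps 1 and 2).

```python
from html.parser import HTMLParser

# Hypothetical markup: real salary pages will differ, so the class name
# "salary-row" and the data-field attributes are assumptions for illustration.
SAMPLE_PAGE = """
<table>
  <tr class="salary-row">
    <td data-field="industry">IT</td>
    <td data-field="department">Technology</td>
    <td data-field="description">Data Analyst</td>
    <td data-field="ctc">850000</td>
  </tr>
  <tr class="salary-row">
    <td data-field="industry">Banking</td>
    <td data-field="department">Operations</td>
    <td data-field="description">Branch Manager</td>
    <td data-field="ctc">1200000</td>
  </tr>
</table>
"""


class SalaryRowParser(HTMLParser):
    """Collects one dict per <tr class="salary-row"> element."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished records
        self._row = None     # record being built, or None outside a row
        self._field = None   # name of the cell currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr" and attrs.get("class") == "salary-row":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._field = attrs.get("data-field")

    def handle_data(self, data):
        # Only keep text that sits inside a labelled cell of a salary row.
        if self._row is not None and self._field:
            self._row[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_salaries(html):
    """Find the salary details in a page and return them as records."""
    parser = SalaryRowParser()
    parser.feed(html)
    for row in parser.rows:
        row["ctc"] = int(row["ctc"])  # store CTC as an integer amount
    return parser.rows


records = scrape_salaries(SAMPLE_PAGE)
print(records)
```

Each record comes out as a plain dictionary keyed by industry, department, description and CTC, which keeps the later standardisation step independent of how any particular site lays out its pages.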

Compensation Estimation Model:

When benchmarking the compensation structure, four main groups of features determine compensation: attributes of the candidate, attributes of the organization, attributes of the job, and external factors.

Employee attributes: Age, Professional experience, Previous compensation, Education, Department, Skill Set, Training certifications, Job tenure

Organisation attributes: Sector of operation, Size of the organization

Job attributes: Location, Requirement of the job, Occupational group

External data: GDP growth, Inflation, Asset growth, Job growth, Unemployment rate, CSO data


The model is built in the following steps:

  • Data collection: the Python-based data crawler described above parses the target websites and reports and gathers the necessary information
  • Data cleaning: records with missing values are removed and conflicts in the data format (e.g. text encoding) are fixed
  • Feature engineering: irrelevant features are discarded and the others are standardized (e.g. converted into numerical features) using domain knowledge
  • Model training and validation: the selected models are trained and cross-validated in order to find the models that best describe the data and predict the output variable with the highest accuracy
  • Model comparison and selection: each model is compared to the others with respect to accuracy, and the best-performing champion model is selected
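
The cleaning, feature engineering, cross-validation and selection steps can be sketched as below. This is a simplified illustration under several assumptions: the records are made up (not real market data), experience is the only feature, the two candidate models (a mean baseline and a one-feature least-squares line) are stand-ins chosen for brevity, and mean absolute error is used as the accuracy measure.

```python
from statistics import mean

# Illustrative, made-up records (not real market data).
RAW = [
    {"experience": "1 yr", "salary": 400000},
    {"experience": "2 yrs", "salary": 520000},
    {"experience": "3 yrs", "salary": None},   # missing value, dropped
    {"experience": "4 yrs", "salary": 690000},
    {"experience": "5 yrs", "salary": 810000},
    {"experience": "6 yrs", "salary": 880000},
    {"experience": None, "salary": 650000},    # missing value, dropped
    {"experience": "8 yrs", "salary": 1100000},
]


def clean(records):
    """Data cleaning: keep only records with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def to_years(text):
    """Feature engineering: turn a string such as '5 yrs' into the number 5."""
    return float(text.split()[0])


def fit_mean(xs, ys):
    """Baseline model: always predicts the mean training salary."""
    avg = mean(ys)
    return lambda x: avg


def fit_linear(xs, ys):
    """One-feature least-squares line: salary = a + b * experience."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def cross_val_mae(fit, xs, ys, k=3):
    """k-fold cross-validation returning the mean absolute error."""
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))  # every k-th record is held out
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors += [abs(model(xs[i]) - ys[i]) for i in test_idx]
    return mean(errors)


cleaned = clean(RAW)
xs = [to_years(r["experience"]) for r in cleaned]
ys = [r["salary"] for r in cleaned]

candidates = {"mean baseline": fit_mean, "linear regression": fit_linear}
scores = {name: cross_val_mae(fit, xs, ys) for name, fit in candidates.items()}
champion = min(scores, key=scores.get)  # lowest error wins
print(scores, champion)
```

Scoring every candidate on held-out folds rather than on the training data is what makes the comparison fair: a model that merely memorises its training records gains no advantage.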

Models tested: K-means clustering, randomized linear regression, logistic regression.

Modelling Methodologies: 

  1. K-means clustering: the model estimates the salary by finding the group of jobs containing similar profiles. The other models estimate the salary directly from the features used.
  2. Regression models: since salary is a continuous variable, regression models are well suited to estimating it. The model is integrated with the HRM system: when an employee is added to the system, the model evaluates the employee’s key attributes and returns the estimated compensation.
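
The cluster-based estimate in item 1 can be sketched as follows. This is a minimal illustration with made-up profiles: the two features (years of experience and a numeric job level code), the deterministic choice of starting centroids and the use of the cluster's mean salary as the estimate are all assumptions made for the example.

```python
from math import dist
from statistics import mean

# Made-up profiles: (years of experience, job level code) with annual salary.
PROFILES = [(1, 1), (2, 1), (1.5, 1), (8, 3), (9, 3), (10, 3)]
SALARIES = [400000, 450000, 420000, 1500000, 1600000, 1700000]


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start (evenly spaced points)."""
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(map(mean, zip(*members))) if members
                           else centroids[c])
        if updated == centroids:  # assignments have stopped changing
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


def estimate_salary(profile, centroids, labels, salaries):
    """Estimated salary = mean salary of the cluster the profile falls into."""
    cluster = nearest(profile, centroids)
    return mean([s for s, label in zip(salaries, labels) if label == cluster])


centroids, labels = kmeans(PROFILES, k=2)
senior_estimate = estimate_salary((8.5, 3), centroids, labels, SALARIES)
print(labels, senior_estimate)
```

Because k-means relies on distances, features measured on very different scales would normally be rescaled first so that no single feature dominates the grouping.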


Compensation benchmarking is becoming indispensable for every organisation. Whether you are a small firm or a well-established large organisation, it is important to benchmark your pay structures so that pay stays externally competitive and internally equitable over time.

HRM teams, who are in charge of recruiting, hiring and retaining talent for the company, know the challenge of competing against other organizations to attract the right employees. By using data-driven compensation benchmarking models, you can protect the interests of current and future employees while supporting the company’s growth.
