With the advent of “big data”, data scientists have to deal with information overload, and dimension reduction has become a vital part of their analysis process. So what is dimension reduction, and why is it important in predictive modelling? Let’s take a look.
In machine learning and statistics, dimensionality reduction (or dimension reduction) is the process of reducing the number of features under consideration by obtaining a set of “uncorrelated” principal features. In other words, if we have a data set with a large number of variables, we try to keep only the significant variables and drop the unwanted ones, so that similar information can be represented more concisely.
A large amount of information can sometimes produce poorly performing models. So dimension reduction is important as it:
- Takes care of multicollinearity to improve the model’s performance.
- Removes redundancy in data.
- Speeds up the algorithm by reducing computation time.
- Reduces space used by data.
- Makes data visualization easier, thereby making data more understandable.
and so on.
There are many techniques for dimension reduction. Some of them are listed below:
- Missing values ratio:
If a data column contains too many missing values, it is better to remove it, since it will not provide much information. We can set a threshold for missing values: if the percentage of missing values in a variable is above that threshold, the variable can be eliminated.
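As a rough illustration of the thresholding idea, here is a minimal sketch using NumPy. The function name `drop_sparse_columns`, the 50% threshold and the toy matrix are assumptions made for this example, and missing values are assumed to be encoded as NaN.

```python
import numpy as np

def drop_sparse_columns(X, threshold=0.5):
    """Drop columns whose fraction of missing (NaN) values exceeds threshold."""
    missing_ratio = np.isnan(X).mean(axis=0)  # fraction of NaN per column
    keep = missing_ratio <= threshold         # boolean mask of columns to keep
    return X[:, keep], keep

# Toy data: the middle column is 75% missing, the last one 25% missing.
X = np.array([
    [1.0, np.nan, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 5.0, np.nan],
    [4.0, np.nan, 2.0],
])
X_reduced, kept = drop_sparse_columns(X, threshold=0.5)
print(kept)             # mask of retained columns
print(X_reduced.shape)  # shape after dropping sparse columns
```

The right threshold depends on the data set, so the value used here is only a placeholder.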
- Low variance:
Measuring the variance of a column is another way of knowing how much information a variable can offer. If a data column has constant values, its variance is 0, and such a variable cannot explain any variation in the target variable. Because variance depends on a column's scale, columns should be normalized before a common threshold is applied to them.
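The variance filter can be sketched in a few lines of NumPy. The function name `drop_low_variance_columns`, the default threshold of 0 and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def drop_low_variance_columns(X, threshold=0.0):
    """Drop columns whose variance is not strictly greater than threshold."""
    variances = X.var(axis=0)     # per-column variance
    keep = variances > threshold  # constant columns have variance 0
    return X[:, keep], keep

# Toy data: the middle column is constant.
X = np.array([
    [1.0, 7.0, 0.2],
    [2.0, 7.0, 0.1],
    [3.0, 7.0, 0.4],
])
X_reduced, kept = drop_low_variance_columns(X)
print(kept)
print(X_reduced.shape)
```

With a threshold above 0, the result depends on the units of each column, which is why scaling the columns first is usually a sensible step.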
- High Correlation:
Data columns that depend on each other and carry similar information add redundancy. It suffices to keep only one such column. Highly correlated columns can be identified using correlation coefficients such as Pearson’s product-moment coefficient for numerical columns and Pearson’s chi-square value for nominal columns.
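For numerical columns, one possible way to apply this is a greedy filter over the Pearson correlation matrix. The sketch below is an assumption-laden example: the function name `drop_correlated_columns`, the 0.9 cutoff and the toy data are made up here, and keeping the first column of each correlated group is just one of several reasonable choices.

```python
import numpy as np

def drop_correlated_columns(X, threshold=0.9):
    """Greedily keep columns that are not highly correlated with any kept column."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute Pearson correlations
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is weakly correlated with every kept column.
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = 2 * a + 1                                  # linear copy of a
c = np.array([2.0, -1.0, 4.0, 0.0, 3.0, 1.0])  # nearly unrelated to a
X = np.column_stack([a, b, c])

X_reduced, kept = drop_correlated_columns(X)
print(kept)  # indices of the retained columns
```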
- Decision Trees and Random Forests:
These can be used to tackle missing values and outliers and to identify significant variables. They are used for feature selection to find the most informative subset of features.
- Principal Component Analysis (PCA):
PCA is a statistical procedure in which variables are transformed into a new set of variables, which are linear combinations of the original variables. It reduces the dimensionality of the data set by finding a new, smaller set of m variables, m < n, that retains most of the information in the data, i.e. the variation in the data.
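To make the transformation concrete, here is a minimal PCA sketch based on the singular value decomposition of the centered data. The function name `pca`, the synthetic data and the choice m = 1 are assumptions for this example; in practice a library implementation would normally be used.

```python
import numpy as np

def pca(X, m):
    """Project X onto its first m principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:m]
    explained_variance = S ** 2 / (X.shape[0] - 1)
    explained_ratio = explained_variance[:m] / explained_variance.sum()
    Z = X_centered @ components.T  # new variables: linear combinations of the originals
    return Z, components, explained_ratio

# Synthetic data: three columns driven by one underlying signal plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    -t + 0.1 * rng.normal(size=200),
])

Z, components, explained_ratio = pca(X, m=1)
print(Z.shape)          # n = 3 original variables reduced to m = 1
print(explained_ratio)  # share of total variance retained
```

Here a single component retains almost all of the variation because the three columns are nearly linear functions of the same signal.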
- Backward Feature Elimination:
In this technique, a dimension reduction loop is run for a particular algorithm. If there are n input features, the selected classification algorithm is first trained on all n input features. Then one input feature is removed at a time and the same model is trained on the remaining n-1 input features, n times. After this first iteration, we calculate the error rate of each of these models, check which feature's removal had the least effect on the model's performance (or improved it), and delete that variable, leaving n-1 input features. The same process is then repeated with n-2 features, n-1 times, and so on, until removing variables brings no further improvement. Thus, each iteration k produces a model trained on n-k features and an error rate e(k), and the features whose removal produces the smallest increase in error rate, that is, the least significant ones, are removed. Finally, a maximum tolerable error rate is chosen, and the model is selected according to that value of e(k). In this way, we determine the smallest number of features necessary for good classification performance with the selected machine learning algorithm.
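The loop described above can be sketched as follows. To keep the example self-contained, it swaps the classification algorithm for ordinary least squares regression and uses held-out mean squared error in place of the error rate; the function names, the synthetic data and the `max_error` value are all assumptions for illustration.

```python
import numpy as np

def validation_error(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the given feature indices and return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, features]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, features]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def backward_elimination(X_train, y_train, X_val, y_val, max_error):
    """Remove the least useful feature per round while error stays tolerable."""
    features = list(range(X_train.shape[1]))
    while len(features) > 1:
        # Train one model per candidate removal and record its error.
        candidates = []
        for drop in features:
            subset = [f for f in features if f != drop]
            error = validation_error(X_train, y_train, X_val, y_val, subset)
            candidates.append((error, drop))
        best_error, drop = min(candidates)
        if best_error > max_error:  # any further removal is too costly
            break
        features.remove(drop)
    return features

# Synthetic data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

selected = backward_elimination(X_train, y_train, X_val, y_val, max_error=0.05)
print(sorted(selected))  # indices of the features that survive
```

Each round trains one model per remaining feature, so the cost grows quickly with n; this is the main practical drawback of the approach.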
- Forward Feature Construction:
This is the reverse of the above process. We start with the single most significant feature and analyse the model's performance as we progressively add more features, one at a time, until no more significant features are left. In other words, we successively add the variables that improve the model's performance most until the model can no longer be improved. Thus, in this case, we do not select all of the input features, only those that bring a larger improvement in model performance. Features that are not significant are automatically left out.
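A forward pass can be sketched in the same spirit. As an assumption for this example, the model is ordinary least squares regression scored by held-out mean squared error, and a feature is added only if it lowers the error by at least `min_gain`; the function names, the data and that cutoff are hypothetical.

```python
import numpy as np

def validation_error(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the given feature indices and return validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, features]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, features]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val, min_gain=0.01):
    """Add the most helpful feature per round until the gain becomes negligible."""
    remaining = list(range(X_train.shape[1]))
    selected = []
    # Baseline: intercept-only model with no features.
    best_error = validation_error(X_train, y_train, X_val, y_val, selected)
    while remaining:
        candidates = [
            (validation_error(X_train, y_train, X_val, y_val, selected + [f]), f)
            for f in remaining
        ]
        error, feature = min(candidates)
        if best_error - error < min_gain:  # no remaining feature helps enough
            break
        selected.append(feature)
        remaining.remove(feature)
        best_error = error
    return selected

# Synthetic data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

selected = forward_selection(X_train, y_train, X_val, y_val)
print(selected)  # features in the order they were added
```

The order of the result reflects how much each feature helped at the time it was added, which gives a rough ranking of significance.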
Here I have given the main points regarding various techniques for dimensionality reduction. This concept is gaining popularity for optimizing machine learning models.
A wise man once said that “having too much information can be as dangerous as having too little.” I wonder if he had heard about dimension reduction.