Essay on data science blog post


7 Popular Feature Selection Routines in Machine Learning

Harini Subhasri Iragavarapu

Introduction and Personal Motivation:

This report focuses on a blog post that explores the best ways to perform feature selection. A practical dataset usually contains many unnecessary features, which in turn hurt the performance of the model. The post's discussion of how to build or choose the best features for training a robust ML model motivated me to choose it.

Domain knowledge, which a Data Scientist or Machine Learning Engineer brings to a particular problem, helps in choosing the best features and the right set of variables. Datasets generally contain missing values, which may occur because of a failure to record or data corruption. Various techniques can be used to impute missing values, but the imputed values are only estimates and may not match the real data. So, a model trained on features with many missing values may not perform well.
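One way to act on the missing-value point is to drop features whose share of missing entries is too high. The snippet below is a minimal sketch under stated assumptions: the helper name drop_sparse_features, the 0.5 threshold, and the toy data are all made up for illustration and do not come from the blog post.

```python
import numpy as np
import pandas as pd


def drop_sparse_features(df: pd.DataFrame, max_missing_ratio: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is within the threshold.

    Hypothetical helper; the threshold is an assumption to tune per dataset.
    """
    missing_ratio = df.isna().mean()  # fraction of missing values per column
    keep = missing_ratio[missing_ratio <= max_missing_ratio].index
    return df[keep]


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [25, 32, np.nan, 41],                 # 25% missing
    "income": [np.nan, np.nan, np.nan, 52000],   # 75% missing
    "city": ["A", "B", "B", None],               # 25% missing
})

reduced = drop_sparse_features(toy, max_missing_ratio=0.5)
print(list(reduced.columns))
```

Columns that survive the filter can still be imputed afterwards; the sketch only removes features that are mostly empty.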

Correlation with the target label and correlation between the features can be measured with techniques such as Pearson, Spearman, and Kendall. In pandas, df.corr() returns the pairwise correlation coefficients between the columns. Variables that are highly correlated with the target class are likely to be key features, while features that show little correlation with the target are unlikely to contribute much to model performance. When two features are highly correlated with each other, they carry largely the same information, so one of them can usually be dropped.
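The two uses of correlation can be illustrated with df.corr() on synthetic data. This is a sketch only: the column names x1, x2, x3 and target, and the way the data is generated, are assumptions chosen so that the expected pattern is easy to see.

```python
import numpy as np
import pandas as pd

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)                          # unrelated noise
target = 3.0 * x1 + rng.normal(scale=0.1, size=n)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "target": target})

# Correlation of each feature with the target (Pearson is the default method).
target_corr = df.corr(method="pearson")["target"].drop("target")

# Pairwise correlation between the features themselves (Spearman rank correlation).
feature_corr = df.drop(columns="target").corr(method="spearman")

print(target_corr.round(2))
print(feature_corr.round(2))
```

Here x1 and x2 both correlate strongly with the target and with each other, so keeping only one of them would be a reasonable choice, while x3 shows little correlation with anything.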

Principal Component Analysis (PCA) is a dimensionality reduction method that extracts a smaller set of new features from the dataset. It uses matrix factorization to project the data into a lower-dimensional space. PCA is typically used when the dimensionality of the data is high.
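A short sketch of PCA with scikit-learn follows. The synthetic data (ten observed columns generated from two hidden factors) and the choice of two components are assumptions made for illustration; on a real dataset the number of components would have to be chosen by inspecting the explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 observed features driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + rng.normal(scale=0.05, size=(100, 10))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
# Share of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()
print(round(float(explained), 3))
```

Note that the two resulting columns are combinations of all ten original features, not a subset of them, which is why PCA is usually described as feature extraction.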

Forward or backward feature selection helps find the best-performing subset of features for the ML model. When there are n features, variables are selected step by step based on inferences from the previous result. Forward feature selection proceeds as follows: train a model with each candidate feature and evaluate its performance, finalize the variable or set of features that gives the best result, and repeat until the desired number of features is obtained.

Figure: Forward feature selection
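The forward selection loop can be sketched in a few lines. This is an illustrative version under assumptions, not the blog's own code: the function name forward_select is hypothetical, linear regression with 5-fold cross-validated R² is an arbitrary choice of model and score, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def forward_select(X, y, n_features, cv=5):
    """Greedy forward selection (hypothetical helper).

    At each step, try adding every remaining feature to the current set,
    score the resulting model, and keep the feature that scores best.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            # Mean cross-validated R^2 of a model trained on the candidate set.
            scores[j] = cross_val_score(LinearRegression(), X[:, cols], y, cv=cv).mean()
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic data: only columns 1 and 4 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 4.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X, y, n_features=2)
print(chosen)
```

Backward selection would run the same loop in reverse: start from all features and, at each step, drop the one whose removal hurts the score least. scikit-learn also ships a SequentialFeatureSelector class that covers both directions.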

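Feature importance gives a score for each variable, indicating how much it contributes to the model. It is generally reported as a ranked list of features, and it is available as a built-in attribute of many scikit-learn models, such as feature_importances_ on tree-based estimators.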
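As a rough illustration of feature importance scores, the sketch below fits a random forest and reads its feature_importances_ attribute. The choice of a random forest and the synthetic data, in which only column 2 determines the label, are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 2] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One score per feature; for tree ensembles the scores sum to 1.
importances = model.feature_importances_

# Feature indices ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The top-ranked features can then be kept and the low-scoring ones dropped, although the cut-off is a judgment call that depends on the dataset.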

Conclusion:

People with data or domain knowledge help in selecting the best features. As for missing values, a model trained on features with many missing values may not yield good performance even after imputation techniques are applied. When the correlation between features is considered, a change in one variable is accompanied by a change in the other, so highly correlated features are partly redundant. PCA reduces the dataset from its many original variables to the desired number of features, but because each new feature combines the original variables, identifying and removing specific redundant variables is a tough task. In forward feature selection, the best-performing variable is added at each step; in backward selection, all the variables are chosen first and the most redundant feature is removed at each step. Feature importance scores help identify the best subset of features. Through a comparative analysis of these seven techniques, one can more easily develop a data science model with good performance.

Reference:

https://www.analyticsvidhya.com/blog/2021/03/7-popular-feature-selection-routines-in-machine-learning/

Word Count: 493