Mercedes Benz Greener Challenge

Xnishit · 11 min read · Nov 13, 2020
[Cover image. Ref: Google Images]

Business Problem:

Overview:

  • Daimler, popularly known as Mercedes-Benz, is a leading premium automobile manufacturer from Germany. The company applies for nearly 2,000 patents a year, making it the European leader among premium car makers. With a huge selection of features and options, customers can configure a customized car of their choice.
  • Mercedes wants to ensure the safety and reliability of every unique car configuration before it is launched on the market. Daimler's engineers have developed a robust testing system, but optimizing the speed of testing across so many feature combinations is complex and very time-consuming.
  • So, Mercedes wants an algorithm that predicts the time taken to pass the testing phase for each car configuration, given different permutations of car features, without reducing its standards.

Problem Statement:

Given a car configuration and the tests it has gone through, predict the time taken to complete the testing phase.

Business Constraints:

  • No Latency Constraints
  • Some level of Interpretability

Machine Learning Problem:

Type of Machine Learning Problem:

It is a regression problem: given the car configuration and testing features, we need to predict the time taken to complete the testing phase.

Data:

Source:https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/data

  • The data has two CSV files: train.csv and test.csv
  • train.csv has 4209 rows and 378 columns
  • There are 8 categorical columns, 368 binary columns, 1 ID column and the target variable
  • test.csv has 4209 rows and 377 columns

Performance Metrics:

Source:https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/overview/evaluation

  • As we are predicting the time taken to test each car configuration, it is a regression task, and the evaluation metric used, as per the source, is R² (the coefficient of determination); a quick illustration follows.
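
R² compares the model's squared error against that of a constant predictor that always outputs the mean of the true values. A minimal sketch using scikit-learn's r2_score (the values below are purely illustrative, not from the competition data):

```python
from sklearn.metrics import r2_score

# R² = 1 - SS_res / SS_tot, where SS_tot is the squared error of a
# constant predictor that always outputs the mean of y_true.
y_true = [88.5, 102.3, 95.0, 130.2]   # hypothetical testing times (seconds)
y_pred = [90.1, 100.0, 97.4, 125.8]   # hypothetical model predictions

print(r2_score(y_true, y_pred))       # 1.0 is perfect, 0.0 is no better than the mean
```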

Data Analysis:

Train data Info:

There are 8 categorical features, 369 integer features (the ID column plus the binary features), and one float-dtype feature, which is the target variable.

Checking for any null values:

[Figure: null-value counts for the train and test data]

There are no null values in the data.
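
A minimal sketch of these checks with pandas, assuming the two CSVs from the Kaggle page have been downloaded and are loaded as train_df and test_df:

```python
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Column counts by dtype: 8 object (categorical), 369 int64 (ID + binary), 1 float64 (target y)
print(train_df.dtypes.value_counts())

# Null-value check for both files; both totals should be 0
print(train_df.isnull().sum().sum())
print(test_df.isnull().sum().sum())
```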

Target Variable Analysis:

Distribution plot of target variable:

From the distribution plot of the target variable, most values lie between 75 and 150, and some points are beyond 250. To get a clearer view, let's plot a box plot.

Box Plot of Target Variable:

From the box plot we get a much better understanding of the distribution: there are only a few points beyond 250.

Checking percentiles gives a much better view of the distribution.

Calculating 99 to 100 percentiles of the target variable

From the percentiles, we can observe that 99.9% of points lie below 160.

Calculating 99.90 to 100 percentile values:

These percentile values make it clear that only 0.02% of points have values above 200.

These points can be considered outliers, but let's try training both with and without them and compare the model's performance.
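
A quick sketch of the percentile check with NumPy, assuming the target column is train_df["y"]:

```python
import numpy as np

y = train_df["y"].values

# 99th to 100th percentile in steps of 0.1 to see how the right tail behaves
for q in np.arange(99.0, 100.01, 0.1):
    print(f"{q:.1f}th percentile: {np.percentile(y, q):.2f}")
```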

Univariate Analysis:

There are 8 categorical features → X0, X1, X2, X3, X4, X5, X6 and X8.

Plotting function to plot the distribution of categories and the relation between the categories and the target variable for each feature (a sketch of such a helper follows).
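
A minimal sketch of such a plotting helper using seaborn and matplotlib; it shows a count of each category next to a box plot of the category against the target y (column names are taken from the data description above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_categorical_feature(df, col, target="y"):
    """Plot the category distribution of `col` and its relation to the target."""
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))

    # Left: how often each category occurs
    order = df[col].value_counts().index
    sns.countplot(x=col, data=df, order=order, ax=axes[0])
    axes[0].set_title(f"Distribution of categories in {col}")

    # Right: spread of the target within each category
    sns.boxplot(x=col, y=target, data=df, order=order, ax=axes[1])
    axes[1].set_title(f"{col} vs {target}")

    plt.tight_layout()
    plt.show()

# Example usage: plot_categorical_feature(train_df, "X0")
```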

Feature X0 :

  • There are 47 unique categories in X0
  • Distribution of X0 falls sharply
  • From the box plot, each category has a different range with respect to y, so it may be useful in predicting y
  • Category ‘aa’ has the highest mean compared to the other categories

Feature X1:

  • There are 27 unique categories in X1
  • Only a few categories have a count of more than 200
  • From the box plot, many categories overlap, so this feature may not have much impact on prediction

Feature X2:

  • There are 44 unique categories in X2
  • Only 5 categories comprise most of the values in X2
  • Category ‘at’ contributes more than 40% of the values
  • Each category has a somewhat different range with respect to y, so this feature may be useful in prediction

Feature X3:

  • There are seven unique categories in X3
  • Categories ‘a’ and ‘c’ alone contribute more than 70% of the values
  • From the box plot, the categories have almost similar values with respect to y, so this feature may not have much impact on prediction

Feature X4:

  • There are only 4 unique categories in X4
  • Category ‘d’ accounts for more than 99.9% of the values
  • This feature may not be useful for prediction

Feature X5:

  • There are 27 unique categories in X5
  • The distribution is almost uniform among the categories, except for a few
  • Some categories have different y values; these categories may help in prediction

Feature X6:

  • There are 12 unique categories in X6
  • All of the categories have almost similar y values, so this feature may not affect the prediction of the target variable.

Feature X8:

  • There are 25 unique categories in X8
  • Each category has more than 100 values
  • From the box plot, all categories have similar values with respect to y

Conclusion:

  • Features X0, X1 and X2 have more variance among their categories with respect to the target variable, so these features may have more effect on prediction than the other features.
  • Feature X4 has almost zero variance, so we can try removing this feature.
  • Features X3, X4, X6 and X8 have similar values across their categories with respect to y.
  • We can try different encoding techniques and choose the one which performs best.

Binary feature Analysis:

There are 368 binary columns in the train and test data.

[Plot: count of zeros and ones in each binary feature]

Summary:

  • Some columns have very little variance; we can try removing those features while training (see the sketch below)
  • The remaining features may have some impact on the target variable.
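
A sketch of how near-constant binary columns could be identified; the 99% threshold on the majority value is an arbitrary assumption:

```python
# Binary columns are everything except the ID, target and the 8 categorical features
cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]
binary_cols = [c for c in train_df.columns if c not in ["ID", "y"] + cat_cols]

# Columns where one value (0 or 1) occupies almost all rows carry little information
low_variance = [
    c for c in binary_cols
    if train_df[c].value_counts(normalize=True).max() > 0.99
]
print(len(low_variance), "near-constant binary columns, e.g.:", low_variance[:10])
```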

Correlation between the features using Phi_K:

Ref:https://phik.readthedocs.io/en/latest/
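
A minimal sketch of computing the Phi_K correlation matrix for the categorical columns plus the target, assuming the phik package is installed (interval_cols marks y as the only numeric interval variable):

```python
import phik  # noqa: F401  (registers the .phik_matrix() accessor on DataFrames)
import seaborn as sns
import matplotlib.pyplot as plt

cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]
corr = train_df[cat_cols + ["y"]].phik_matrix(interval_cols=["y"])

# Heat map of the Phi_K correlations
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues")
plt.show()

# Correlation of each categorical feature with the target
print(corr["y"].sort_values(ascending=False))
```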

Heat map of the correlation of the categorical features

Correlation of the categorical features with respect to the target variable:

From the correlations, we can observe that features X0 and X2 are strongly correlated with the target variable, while X4 has almost no impact, so we can drop X4.

Top 10 correlated features with respect to the target variable:

Conclusion:

  • There are no labels for the columns, so we don't know what they mean, which makes feature engineering a bit tricky.
  • We can treat the number of top features correlated with the target variable as a hyper-parameter when training the models
  • More than 75 features have almost no relation to the target variable, so we can try training the models with those features removed
  • We can try adding features using PCA and SVD (see the sketch after this list).
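
As an illustration of the last point, a sketch of adding decomposition components as extra features using scikit-learn's PCA and TruncatedSVD; the number of components is an arbitrary choice here, and the categorical columns are assumed to have already been encoded as numbers:

```python
from sklearn.decomposition import PCA, TruncatedSVD

n_comp = 12  # arbitrary; can be tuned
feature_cols = [c for c in train_df.columns if c not in ["ID", "y"]]

pca = PCA(n_components=n_comp, random_state=42)
svd = TruncatedSVD(n_components=n_comp, random_state=42)

# Requires numeric input, so encode the categorical columns first (label/target encoding)
pca_feats = pca.fit_transform(train_df[feature_cols])
svd_feats = svd.fit_transform(train_df[feature_cols])

# Append the components as new columns (the same fitted transforms would be applied to test data)
for i in range(n_comp):
    train_df[f"pca_{i}"] = pca_feats[:, i]
    train_df[f"svd_{i}"] = svd_feats[:, i]
```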

First Cut Approach:

Split the train data into train and validation sets to check the model's performance on unseen data.
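
A minimal sketch with scikit-learn's train_test_split; the 80/20 split ratio is an assumption:

```python
from sklearn.model_selection import train_test_split

X = train_df.drop(columns=["ID", "y"])
y = train_df["y"]

# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```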

Target Encoding:

Target encoding of categorical features replaces each category of a feature with the mean of the target variable for that category.

[Illustration of target encoding. Ref: Google Images]

Using target encoding on Categorical Features:
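
A simple sketch of target encoding, computed on the training split only to avoid leaking validation information (it reuses the X_train/X_val/y_train split from the sketch above):

```python
cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]

for col in cat_cols:
    # Mean of y for each category, learned from the training split only
    means = y_train.groupby(X_train[col]).mean()
    global_mean = y_train.mean()

    X_train[col] = X_train[col].map(means)
    # Categories unseen in training fall back to the global mean
    X_val[col] = X_val[col].map(means).fillna(global_mean)
```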

Training using different regression models:

  • Linear Regression on top 270 features data:
  • Linear Regression on top 270 features with target values less than 160:
  • Linear Regression on full data:
  • Linear Regression on full data with target values less than 160:
  • SVR on top 270 features data:
  • SVR on Full data:
  • Decision tree on top 270 features data:
  • Decision tree on top 270 features with target values less than 160:
  • Decision tree on full data:
  • Decision tree on full data with target values less than 160:
  • Xgboost on top 270 features:
  • Xgboost on top 270 features with target values less than 160:
  • Xgboost on full data:
  • Xgboost on full data with target values less than 160:

Using Label encoding on categorical features:

Label Encoding:

Label encoding of categorical features assigns a distinct integer to each category.
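
A sketch with scikit-learn's LabelEncoder; fitting on the combined train and test categories (so unseen test categories don't break the transform) is an assumption about how it was done here:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_cols = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X8"]

for col in cat_cols:
    le = LabelEncoder()
    # Fit on all categories seen in either file, then transform both
    le.fit(pd.concat([train_df[col], test_df[col]]))
    train_df[col] = le.transform(train_df[col])
    test_df[col] = le.transform(test_df[col])
```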

  • Linear Regression on top selected features:
  • Linear regression on top selected features with target values less than 160
  • Linear Regression on Full data:
  • Linear regression on full data with target values less than 160:
  • Decision tree on top selected features
  • Decision tree on top features with target values less than 160
  • Decision tree on full data
  • Decision tree on full data with target values less than 160:
  • Xgboost on top selected features:
  • Xgboost on top features with target values less than 160
  • Xgboost on full data
  • Xgboost on full data with target values less than 160:

Results Tabulated:

  • The XGBoost model with label-encoded features, the top selected features, and outliers retained has the highest R² score
  • The decision tree and linear regression models have almost similar performance
  • SVR with an RBF kernel does not fit the data, so SVR is ignored.
  • Some more hyper-parameter tuning of XGBoost, and stacked models, may increase the R² score.

Hyper-Parameter Tuning Xgboost:

  • Using grid search to find the best parameters (a sketch of the setup follows this list)
  • Training models on both label- and target-encoded features
  • Treating the number of top features related to the target variable as a hyper-parameter
  • Training on data with target values less than 160 and on the full data
  • Using the best model from grid search, submitting each set of predictions to Kaggle for a leaderboard position.
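
A sketch of the grid search setup with XGBoost; the parameter grid values below are illustrative assumptions, not the exact grid used, and X_train/X_val come from the split sketched earlier:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.8, 1.0],
}

grid = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid=param_grid,
    scoring="r2",          # the competition metric
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("Validation R²:", grid.best_estimator_.score(X_val, y_val))
```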

Tabulated scores of different models:

[Table: results of XGB with grid search]

Summary:

  • Target-encoded features with the top 100 correlated features perform better on the data
  • Label-encoded data with the top 250 to 270 features performs much better.
  • The best model, label-encoded with the top 260 features and trained on target values less than 175, has the highest R² score of 0.55341, which is roughly the top 1% on the leaderboard.

Training with stacked models:

  • Using different combinations of regressors in a stacked regression, with XGB as the meta-regressor for each stack (one possible setup is sketched after this list).
  • Finding the best model for each stack using grid search
  • Predicting the test data with the best model from each stack.
  • Submitting to Kaggle to check the leaderboard position.
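
The article doesn't state which stacking library was used, so here is one possible setup with scikit-learn's StackingRegressor; the base models match the best-performing combination reported below, and the hyper-parameters are placeholders:

```python
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("lr", LinearRegression()),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
        ("lasso", Lasso(alpha=0.1)),
    ],
    # XGB as the meta-regressor that combines the base models' predictions
    final_estimator=XGBRegressor(objective="reg:squarederror", random_state=42),
    cv=5,
)
stack.fit(X_train, y_train)
print("Validation R²:", stack.score(X_val, y_val))
```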

Tabulated results of stacked models:

The stack with Linear Regression, Random Forest and Lasso, with XGB as the meta-regressor, has the best R² score of 0.54997, but XGB with grid search has a better score of 0.55341.

Best model:

  • XGB with the top 260 features and without outliers has the best score.
  • A suggestion from the first-place winner of this competition is to add more feature interactions in XGB and to try dropping the “ID” feature.
  • As there are no labels for the features and we don't know what they mean, choosing features for feature interaction and feature engineering is a bit tricky for this data.
  • Training the best model after removing the “ID” feature and with some combinations of feature interactions.
  • With the “ID” feature removed and some feature interactions added, the model's performance slightly decreased.
  • So the earlier best model is kept as the final model.

[Screenshot: best model submission]

[Screenshot: Kaggle leaderboard position]

The final model has a score of 0.55341, which gives 34th position on the leaderboard, roughly the top 1% among the submissions.

Final model: https://github.com/Nishit330/Mercedes_benz_case_study/blob/main/final2.ipynb

Future Work:

  • Try adding features using PCA and truncated SVD
  • Train a neural network with appropriate layers and hyper-parameter tuning
  • Try different encoders and verify the performance
  • Try feature engineering, such as adding new features, but this is a bit hard as there are no labels and we don't know what the features mean.
