We will now use the code below to train the random forest model.
# Train the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x, train_y)
pred_y = rf_model.predict(test_x)

# Evaluate with a confusion matrix and accuracy
cm = confusion_matrix(test_y, pred_y)
print(cm)
print(accuracy_score(test_y, pred_y))
The output of the random forest model is given below:
The random forest model has a slightly better accuracy, at ~50%, with (13 + 12) targets classified correctly and (14 + 11) mis-classified: 14 false positives and 11 false negatives.
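As a quick sanity check, the ~50% figure follows directly from those counts. The matrix below simply restates the numbers reported above (the exact row/column layout of sklearn's confusion matrix depends on label ordering, so treat this as illustrative):

```python
import numpy as np

# Counts from the article: 13 + 12 correct (diagonal),
# 14 false positives and 11 false negatives (off-diagonal).
cm = np.array([[13, 14],
               [11, 12]])

correct = np.trace(cm)   # sum of the diagonal = 25
total = cm.sum()         # all predictions = 50
accuracy = correct / total
print(accuracy)          # 0.5, i.e. ~50%
```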
We will now look at the most influential variables in both models and how they affect accuracy. We will use ‘PermutationImportance’ from the ‘eli5’ library for this purpose. This takes only a few lines of code, as given below:
# Import PermutationImportance from the eli5 library
import eli5
from eli5.sklearn import PermutationImportance

# Influential variables for the decision tree model
perm = PermutationImportance(dt_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm, feature_names=test_x.columns.tolist())
The influential variables in the decision tree model are:
The most influential variables in the decision tree model are ‘1st Goal’, ‘Distance covered’ and ‘Yellow Card’, among others. There are also variables that influence the accuracy negatively, such as ‘Ball possession %’ and ‘Pass accuracy %’. Some variables, such as ‘Red Card’ and ‘Goal scored’, have no influence on the accuracy of the model.
The influential variables in the random forest model are:
The most influential variables in the random forest model are ‘Ball possession %’, ‘Free Kicks’, ‘Yellow Card’ and ‘Own Goals’, among others. There are also variables that influence the accuracy negatively, such as ‘Red Card’ and ‘Offsides’; we can drop these variables from the model to increase the accuracy.
The weights indicate by what percentage the model's accuracy changes when a variable's values are randomly re-shuffled. For example, the feature ‘Ball possession %’ contributes about 5.20% (± 5.99%) to the model's accuracy.
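Under the hood, permutation importance is easy to sketch by hand: shuffle one feature column, re-score the model, and record the drop in accuracy. The snippet below illustrates the idea on synthetic data (the dataset and feature indices here are made up for illustration; the article itself uses eli5 on the World Cup features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data, purely for illustration
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
base_acc = accuracy_score(y, model.predict(X))

rng = np.random.RandomState(1)
drops = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])  # break the link between this feature and the target
    drops.append(base_acc - accuracy_score(y, model.predict(X_shuffled)))
    print(f"feature {col}: accuracy drop {drops[-1]:.4f}")
```

A large drop means the model relied heavily on that feature; a drop near zero (or negative) means shuffling it barely hurt, which is exactly what the eli5 weights summarise.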
As you can observe, there are significant differences in the variables that influence the two models, and for the same variable, say ‘Yellow Card’, the percentage change in accuracy also differs.
Let us now take one variable, say ‘Yellow Card’, that influences both models and try to find the threshold at which its effect changes. We can do this easily with partial dependence plots (PDPs).
A partial dependence (PD) plot depicts the functional relationship between input variables and predictions. It shows how the predictions partially depend on values of the input variables.
For example, we can create a partial dependence plot of the variable ‘Yellow Card’ to understand how changes in its values affect the model's predictions.
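The PDP computation itself is simple to sketch: fix the chosen feature at each value on a grid for every row, predict, and average the predictions. A minimal hand-rolled version on synthetic data (the feature index and data here are illustrative, not the actual ‘Yellow Card’ column):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, purely for illustration
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

feature_idx = 0  # the feature whose partial dependence we plot
grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 20)

pd_values = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature_idx] = value  # fix the feature at this grid value for every row
    pd_values.append(model.predict_proba(X_mod)[:, 1].mean())

for v, p in zip(grid, pd_values):
    print(f"{v:7.2f} -> {p:.3f}")
```

Plotting `pd_values` against `grid` gives the PD curve; the pdpbox library used below does this (plus centering and distribution plots) for us.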
We will start with the decision tree model first –
# Import the libraries
from matplotlib import pyplot as plt
from pdpbox import pdp

# Select the variable/feature to plot
feature_to_plot = 'Yellow Card'
features_input = test_x.columns.tolist()
print(features_input)

# PDP for the decision tree model
pdp_yl = pdp.PDPIsolate(model=dt_model, df=test_x,
                        model_features=features_input,
                        feature=feature_to_plot, feature_name=feature_to_plot)
fig, axes = pdp_yl.plot(center=True, plot_lines=False, plot_pts_dist=True,
                        to_bins=False, engine='matplotlib')
fig.set_figheight(6)
plt.show()
The plot will look like this:
If the number of yellow cards is more than 3, that negatively impacts the prediction of ‘Man of the Match’, but fewer than 3 yellow cards does not influence the model. Also, beyond 5 yellow cards there is no further significant effect on the model.
The PDP (partial dependence plot) thus provides insight into the threshold values at which features influence the model.
Now we can use the same code for the random forest model and look at the plot:
For both the decision tree model and the random forest model, the plot looks similar, with the performance of the model changing within the range of 3 to 5; beyond that, the variable ‘Yellow Card’ has little or no influence on the model, as shown by the flat line.
This is how we can use simple PDP plots to understand the behaviour of influential variables in a model. This information not only gives insight into the variables that impact the model, but is especially helpful when training models and selecting the right features. The thresholds can also help to create bins that sub-set the features and further enhance the accuracy of the model. In turn, this helps make the model results explainable to the business.
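As a sketch of that binning idea, the thresholds of 3 and 5 from the PDP discussion could be turned into categorical bins with `pandas.cut` (the column values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical yellow-card counts, binned at the PDP thresholds of 3 and 5
df = pd.DataFrame({"Yellow Card": [0, 1, 2, 3, 4, 5, 6, 7]})
df["Yellow Card Bin"] = pd.cut(df["Yellow Card"],
                               bins=[-1, 2, 5, 100],      # (-1,2], (2,5], (5,100]
                               labels=["0-2", "3-5", "6+"])
print(df)
```

The resulting categorical column can then be fed to the model in place of the raw count, encoding the thresholds the PDP revealed.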
Please refer to this link on Github for the dataset and the full code.
I can be reached on Medium, LinkedIn or Twitter in case of any questions/comments.
You can also subscribe to my email list here, so that you don’t miss out on my latest articles.