We will now use the code below to train the random forest model.
# Train the random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=1).fit(train_x, train_y)
pred_y = rf_model.predict(test_x)

# Evaluate with a confusion matrix and accuracy
cm = confusion_matrix(test_y, pred_y)
print(cm)
print(accuracy_score(test_y, pred_y))
The output of the random forest model is given below:
The random forest model has a slightly better accuracy, at ~50%, with (13 + 12) targets classified correctly and (14 + 11) mis-classified: 14 false positives and 11 false negatives.
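As a quick sanity check, the ~50% figure follows directly from those counts. The matrix below simply restates the numbers reported above (the exact row/column layout of sklearn's confusion matrix depends on label ordering, so treat this as illustrative):

```python
import numpy as np

# Counts from the article: 13 + 12 correct (diagonal),
# 14 false positives and 11 false negatives (off-diagonal).
cm = np.array([[13, 14],
               [11, 12]])

correct = np.trace(cm)   # sum of the diagonal = 25
total = cm.sum()         # all predictions = 50
accuracy = correct / total
print(accuracy)          # 0.5, i.e. ~50%
```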
We will now look at the most influential variables in both models and how they affect accuracy. We will use ‘PermutationImportance’ from the ‘eli5’ library for this purpose. This takes only a few lines of code, as given below:
# Import PermutationImportance from the eli5 library
import eli5
from eli5.sklearn import PermutationImportance

# Influential variables for the decision tree model
perm = PermutationImportance(dt_model, random_state=1).fit(test_x, test_y)
eli5.show_weights(perm, feature_names=test_x.columns.tolist())
The influential variables in the decision tree model are:
The most influential variables in the decision tree model are ‘1st Goal’, ‘Distance covered’ and ‘Yellow Card’, among others. There are also variables that influence the accuracy negatively, such as ‘Ball possession %’ and ‘Pass accuracy %’. Some variables, such as ‘Red Card’ and ‘Goal scored’, have no influence on the accuracy of the model.
The influential variables in the random forest model are:
The most influential variables in the random forest model are ‘Ball possession %’, ‘Free Kicks’, ‘Yellow Card’ and ‘Own Goals’, among others. There are also variables that influence the accuracy negatively, such as ‘Red Card’ and ‘Offsides’; we can drop these variables from the model to increase the accuracy.
The weights indicate by what percentage the model's accuracy changes when a variable's values are randomly re-shuffled. For example, the feature ‘Ball possession %’ contributes about 5.20% (± 5.99%) to the model's accuracy.
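Under the hood, permutation importance is easy to sketch by hand: shuffle one feature column, re-score the model, and record the drop in accuracy. The snippet below illustrates the idea on synthetic data (the dataset and feature indices here are made up for illustration; the article itself uses eli5 on the World Cup features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data, purely for illustration
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
base_acc = accuracy_score(y, model.predict(X))

rng = np.random.RandomState(1)
drops = []
for col in range(X.shape[1]):
    X_shuffled = X.copy()
    rng.shuffle(X_shuffled[:, col])  # break the link between this feature and the target
    drops.append(base_acc - accuracy_score(y, model.predict(X_shuffled)))
    print(f"feature {col}: accuracy drop {drops[-1]:.4f}")
```

A large drop means the model relied heavily on that feature; a drop near zero (or negative) means shuffling it barely hurt, which is exactly what the eli5 weights summarise.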
As you can observe, there are significant differences in the variables that influence the two models, and for the same variable, say ‘Yellow Card’, the percentage change in accuracy also differs.
Let us now take one variable, say ‘Yellow Card’, that influences both models and try to find the threshold at which its effect changes. We can do this easily with partial dependence plots (PDPs).
A partial dependence (PD) plot depicts the functional relationship between input variables and predictions. It shows how the predictions partially depend on values of the input variables.
For example, we can create a partial dependence plot of the variable ‘Yellow Card’ to understand how changes in its values affect the model's predictions.
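The PDP computation itself is simple to sketch: fix the chosen feature at each value on a grid for every row, predict, and average the predictions. A minimal hand-rolled version on synthetic data (the feature index and data here are illustrative, not the actual ‘Yellow Card’ column):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, purely for illustration
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

feature_idx = 0  # the feature whose partial dependence we plot
grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 20)

pd_values = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature_idx] = value  # fix the feature at this grid value for every row
    pd_values.append(model.predict_proba(X_mod)[:, 1].mean())

for v, p in zip(grid, pd_values):
    print(f"{v:7.2f} -> {p:.3f}")
```

Plotting `pd_values` against `grid` gives the PD curve; the pdpbox library used below does this (plus centering and distribution plots) for us.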
We will start with the decision tree model first –
# Import the libraries
from matplotlib import pyplot as plt
from pdpbox import pdp

# Select the variable/feature to plot
feature_to_plot = 'Yellow Card'
features_input = test_x.columns.tolist()
print(features_input)

# PDP for the decision tree model
pdp_yl = pdp.PDPIsolate(model=dt_model, df=test_x,
                        model_features=features_input,
                        feature=feature_to_plot, feature_name=feature_to_plot)
fig, axes = pdp_yl.plot(center=True, plot_lines=False, plot_pts_dist=True,
                        to_bins=False, engine='matplotlib')
fig.set_figheight(6)
plt.show()
The plot will look like this:
If the number of yellow cards is more than 3, that negatively impacts the prediction of ‘Man of the Match’, but fewer than 3 yellow cards does not influence the model. Also, beyond 5 yellow cards there is no further significant effect on the model.
The PDP (partial dependence plot) thus provides insight into the threshold values at which features influence the model.
Now we can use the same code for the random forest model and look at the plot:
For both the decision tree model and the random forest model, the plot looks similar, with the performance of the model changing within the range of 3 to 5; beyond that, the variable ‘Yellow Card’ has little or no influence on the model, as shown by the flat line.
This is how we can use simple PDP plots to understand the behaviour of influential variables in a model. This information not only gives insight into the variables that impact the model, but is especially helpful when training models and selecting the right features. The thresholds can also help to create bins that sub-set the features and further enhance the accuracy of the model. In turn, this helps make the model results explainable to the business.
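As a sketch of that binning idea, the thresholds of 3 and 5 from the PDP discussion could be turned into categorical bins with `pandas.cut` (the column values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical yellow-card counts, binned at the PDP thresholds of 3 and 5
df = pd.DataFrame({"Yellow Card": [0, 1, 2, 3, 4, 5, 6, 7]})
df["Yellow Card Bin"] = pd.cut(df["Yellow Card"],
                               bins=[-1, 2, 5, 100],      # (-1,2], (2,5], (5,100]
                               labels=["0-2", "3-5", "6+"])
print(df)
```

The resulting categorical column can then be fed to the model in place of the raw count, encoding the thresholds the PDP revealed.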
Please refer to this link on Github for the dataset and the full code.
I can be reached on Medium, LinkedIn or Twitter in case of any questions/comments.
You can also subscribe to my email list here, so that you don’t miss out on my latest articles.