Techniques to improve algorithm fairness and interpretability


Artificial intelligence tells the authorities John Doe should not be granted bail, as there is a high chance he will re-offend.

Why was this prediction made? How was it made? And was it “fair”? These are important questions for the authorities and even more important for the defendant. That’s where algorithm fairness and interpretability come in.


The way we do business is changing every day. AI is becoming more embedded in our lives, working behind the scenes in a range of scenarios, from optimizing production yield to recommending products and more. This shift has increased awareness of the implications of using AI in various processes, and it has created more demand to make AI-powered decisions fairer and more interpretable.

This blog covers some of the ways to check and improve fairness, along with some of the techniques available to make “black box” AI models more interpretable.

The blog is divided into two parts.

  • The first part will address the importance of fairness, followed by the steps to improve fairness
  • The second part will cover interpretability: What it is, why it is important, and an overview of the model interpretability methods available


Fairness is about whether a model treats different groups equitably, and unfairness usually traces back to bias in the dataset. If a model is unfair, it may favor a certain class, group, or set of characteristics over others because of this bias, which then skews its results and accuracy. Data science relies on data, and there is always a possibility that bias is already present in it. This bias could stem from limitations in the data collection process or from existing societal inequalities.

There have been a number of cases of data science models showing bias towards a specific class, and some of them adversely affect society. One example is a model for predicting future criminals. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is an algorithm used in state court systems throughout the United States to predict the likelihood of a defendant reoffending. While the company that developed the system disputes the bias, ProPublica found that the algorithm rated African American defendants as posing a higher risk of reoffending than they did in reality, meaning that African American defendants were more likely to be incorrectly flagged as high-risk. White defendants, in turn, were more likely to be incorrectly labelled as low-risk for reoffending.

In low-stakes scenarios, this type of bias produces misleading insights. In high-stakes ones, the results of such algorithms can have devastating effects.
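The kind of disparity described above, where one group is wrongly flagged as high-risk more often than another, can be expressed as a gap in false positive rates between groups. The sketch below is only an illustration: the function name, the group labels "A" and "B", and the records are all made up, and it is not how ProPublica or any particular tool performs this analysis.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, actual, predicted) with 0/1 labels.

    Returns {group: FP / (FP + TN)} for every group that has at least
    one actual negative.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, actual, predicted in records:
        if actual == 0:  # only actual negatives can become false positives
            negatives[group] += 1
            if predicted == 1:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}

# Made-up toy data: (group, reoffended, flagged_high_risk)
toy_records = [
    ("A", 0, 1), ("A", 0, 1), ("A", 0, 0), ("A", 1, 1),
    ("B", 0, 0), ("B", 0, 0), ("B", 0, 1), ("B", 0, 0), ("B", 1, 0),
]
fpr = false_positive_rate_by_group(toy_records)
for group, rate in sorted(fpr.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

A large gap between the groups' rates is a signal to investigate further, not a verdict on its own; which fairness metric matters depends on the use case.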

How to improve fairness in AI models?

Before implementing any machine learning model or using any dataset, the Exploratory Data Analysis stage should include a dedicated bias check step. If bias is found, it should be thoroughly scrutinized by the data scientists, the organization using the data, and/or the stakeholders building the model to determine how it may affect the model. There has been a lot of progress in this field, and there is a lot more to learn about tackling this problem as we move ahead.

To start, here are some of the steps to improve algorithm fairness:

  1. Make sure that your data includes every segment of the relevant population for the targeted use case. Depending on the use case, this could mean race or ethnicity, gender, education level, job, location, etc. It’s also important to ensure that there is a sufficient representation of each segment or every class in the target variable.
  2. Analyze how the data can grow and change over time
  3. Gather more data to fulfill the above two points. If you are unable to collect additional data to ensure your training data set is appropriately representative, try to balance the data using various techniques:
    • A simple technique could be to delete some of the data from the overrepresented group
    • More complex techniques include oversampling the underrepresented group, for example by generating synthetic samples, to increase its share of the data.
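The two balancing techniques in step 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `rebalance` helper and the toy rows are hypothetical, and the oversampling here simply duplicates existing rows at random rather than generating synthetic samples as more advanced methods do.

```python
import random
from collections import Counter, defaultdict

def rebalance(rows, key, mode="undersample", seed=0):
    """Return a copy of `rows` in which every group has the same size.

    mode="undersample": randomly drop rows from the larger groups.
    mode="oversample":  randomly duplicate rows in the smaller groups.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sizes = [len(members) for members in groups.values()]
    target = min(sizes) if mode == "undersample" else max(sizes)
    balanced = []
    for members in groups.values():
        if len(members) >= target:
            # Group is large enough: keep a random subset of `target` rows.
            balanced.extend(rng.sample(members, target))
        else:
            # Group is too small: keep all rows and add random duplicates.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            balanced.extend(members + extra)
    return balanced

# Made-up toy data: 8 rows from one group, 2 from another
rows = [("group_x", i) for i in range(8)] + [("group_y", i) for i in range(2)]
under = rebalance(rows, key=lambda r: r[0], mode="undersample")
over = rebalance(rows, key=lambda r: r[0], mode="oversample")
print(Counter(r[0] for r in under))
print(Counter(r[0] for r in over))
```

Undersampling throws information away and oversampling by duplication can encourage overfitting, so either result is worth validating on held-out data.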

As mentioned previously, it’s important to dedicate time to analyze algorithm fairness. It is also essential to have some flexibility on accuracy goals if there is a tradeoff between accuracy and fairness.

For example, an attribute or feature may adversely affect fairness while also leading to higher model accuracy. Suppose an organization’s data shows that male employees took more leadership roles over the past 20 years. An algorithm trained on this data to predict the next promotable candidates would include the feature “gender” and give it significant importance, which introduces a bias. Tested against the organization’s historical data, the algorithm will appear more accurate because male employees have historically taken more leadership roles. But if the organization’s goal is to promote a more gender-diverse leadership team, the solution may be to remove the “gender” feature from the model altogether.

It’s only once the model fairness has been fully evaluated that AI practitioners can decide whether a particular feature, such as “gender” in the above example, should be used or not in the final model.
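One simple way to put a number on such an evaluation is to compare how often each group is selected by a candidate model. The sketch below assumes we already have each model's promote/don't-promote outputs; the helper names and both sets of outputs are invented for illustration and do not come from any real organization or model.

```python
def selection_rates(predictions):
    """predictions: iterable of (group, selected) pairs, selected in {0, 1}."""
    totals, selected = {}, {}
    for group, chosen in predictions:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + chosen
    return {g: selected[g] / totals[g] for g in totals}

def selection_rate_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0

# Hypothetical outputs of a model that uses the "gender" feature
with_gender = ([("male", 1)] * 6 + [("male", 0)] * 4
               + [("female", 1)] * 2 + [("female", 0)] * 8)
# Hypothetical outputs of a model trained without it
without_gender = ([("male", 1)] * 4 + [("male", 0)] * 6
                  + [("female", 1)] * 3 + [("female", 0)] * 7)

ratio_with = selection_rate_ratio(selection_rates(with_gender))
ratio_without = selection_rate_ratio(selection_rates(without_gender))
print(f"with gender: {ratio_with:.2f}, without gender: {ratio_without:.2f}")
```

Removing a sensitive feature does not guarantee equal selection rates, since other features can act as proxies for it, which is why measuring the outcome is useful.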


If you can’t explain it simply, you don’t understand it well enough. — Albert Einstein

One of the most important tasks of data science is to forecast trends. To do this, we deploy a machine learning model. Often, these models are difficult to interpret, even for data scientists. Such models are referred to as black-box models.

Interpretable models are deployed on top of these models to give insights into how they work. These interpretable models are data science frameworks that either use simple, self-explanatory models or use more complex models that describe the inner workings of the main model through straightforward visualizations.

Why is interpretability important?

Going back to the example of using AI models in criminal justice processes, it’s critical to know how these models work in order to trust their decisions. The difference could mean denying someone a fair chance at bail, or incorrectly labeling someone as low-risk. What’s more, a black-box model makes it harder to understand whether or not an AI-powered decision was accurate and fair.

Interpretability helps us determine why this prediction was made by the model, how it was made, whether it was fair, and so on.

If we can’t interpret a model, we don’t know why it’s working – or not working. This leaves more room for errors and inaccuracies.

There are a few other reasons to include interpretability:

  • Humans are curious and interested to know and to learn why a particular prediction was made, especially about unexpected events
  • By default, machine learning models pick up biases from the training data, which can result in discriminatory predictions over time. Interpretability can be used to detect this bias and to intervene to improve model fairness
  • It can be used to debug and audit the model
  • It improves users’ confidence and trust in the model
  • It enables data scientists to confirm that only causal relationships are picked up

One interesting example of interpretability includes a method deployed over a classifier that classifies whether the animal in the picture is a Husky or a Wolf.

The prediction of this model was wrong in some cases. After applying an explainability method, it became clear that the model associated snow in the background with Wolf, which meant that whenever a picture had snow in the background, it would be classified as a Wolf. With these explanations, data scientists can then appropriately correct the model.

Wolf vs Husky identification model

Image source: “Why should I trust you?” Explaining the predictions of any classifier (Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin)

Various Interpretability techniques available

Though feature importance and some knowledge of algorithms could be used to shine some light on the processes a model followed to get a result, there are other more advanced interpretability models built for this purpose.

There are broadly three categories of interpretability models:

  • Interpretable models
  • Model agnostic methods
  • Example-based explanations

table with different interpretability techniques for algorithms

Interpretable models

Interpretable models include the easily understandable machine learning models. Conceptual knowledge of these, along with some functionality such as feature importance that gives weights associated with each feature (especially in the case of Linear Regression), can give us some useful insights into why a specific prediction was made.

Many of these models are linear or monotonic. Monotonicity ensures that the relationship between a feature and the target always goes in the same direction over the whole dataset. Interactions between features are one area that, if assessed manually in these models, can offer more insight into the model’s decision.

These attributes of the interpretable model make it possible to explain the model as it is, without using a separate framework on top of the original model. These models generally include Linear Regression, Logistic Regression, Decision Tree, etc.
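As a small illustration of a model that explains itself, the sketch below fits a one-feature linear regression by ordinary least squares and reads the explanation straight from the fitted weights. The `fit_line` helper and the data are made up for this example; a real project would more likely use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: hypothetical feature values and targets
feature = [1.0, 2.0, 3.0, 4.0]
target = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(feature, target)
# The model is its own explanation: each extra unit of the feature
# changes the prediction by `slope`.
explanation = f"prediction = {intercept:.1f} + {slope:.1f} * feature"
print(explanation)
```

No separate framework is needed here: the weight itself states the direction and size of the feature's effect.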


Model agnostic methods

This class of methods is the most widely used for model explainability. The main methods in this class are ALE plots, LIME, and SHAP values.

There is a tradeoff between interpretability and prediction accuracy: a simple linear model is easy to interpret but often less accurate. Some businesses use simple models that are not very accurate but give more insight into why a specific prediction was made. While some businesses can work with low accuracy, most require machine learning models that provide very high accuracy along with good interpretability.

Model agnostic methods come into play in such cases. These methods can be used with any model because they are deployed on top of it, and the original model’s predictions are kept completely separate from those of the interpretable models.

Partial dependence plots (PDP), Individual Conditional Expectation Plots (ICE) and Accumulated Local Effects plots (ALE)

These are some of the basic explainability methods that give us insights into the effects of individual features. PDP shows the average dependence of the model’s output on a specific feature, obtained by varying that feature across the instances and averaging the predictions, with one graph per feature. ICE plots show this dependence for each instance separately, with one curve per feature and instance. ALE plots do the same job as PDP but are more robust to feature correlation.
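The relationship between ICE and PDP can be sketched directly: an ICE curve sweeps one feature for a single instance, and the PDP is the average of those curves. The code below is a minimal sketch in which `black_box`, the feature names, and the data are hypothetical stand-ins; ALE is not implemented here.

```python
def ice_curve(model, instance, feature, grid):
    """Predictions for one instance as `feature` is swept over `grid`."""
    curve = []
    for value in grid:
        modified = dict(instance)   # copy so the original row is untouched
        modified[feature] = value
        curve.append(model(modified))
    return curve

def partial_dependence(model, dataset, feature, grid):
    """Point-by-point average of the ICE curves of all rows in `dataset`."""
    curves = [ice_curve(model, row, feature, grid) for row in dataset]
    return [sum(column) / len(curves) for column in zip(*curves)]

def black_box(row):
    # Hypothetical stand-in for any trained model's predict function
    return 3.0 * row["age"] + row["age"] * row["income"]

data = [{"age": 1.0, "income": 0.0}, {"age": 2.0, "income": 2.0}]
grid = [0.0, 1.0, 2.0]

ice = [ice_curve(black_box, row, "age", grid) for row in data]
pdp = partial_dependence(black_box, data, "age", grid)
print("ICE:", ice)
print("PDP:", pdp)
```

In this toy case the two ICE curves have different slopes because of the interaction between the features, a detail the averaged PDP curve hides.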

Global Surrogate Models

This method applies simple models in a novel way. We train a simple, interpretable machine learning algorithm on the complex model’s inputs and its predictions. We then use explanations of the simple model to draw conclusions about the black-box model.
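A minimal sketch of the idea follows, using a one-rule decision stump as the simple model. Everything here is an assumption made for illustration: the `black_box` function stands in for a complex model, the feature names and data are invented, and a real surrogate would usually be a small tree or linear model from a library.

```python
def fit_stump(rows, labels):
    """Find the single-feature threshold rule that best reproduces `labels`.

    Returns (feature, threshold, label_if_above, fidelity), where fidelity
    is the share of rows on which the rule agrees with `labels`.
    """
    best = None
    for feature in rows[0]:
        for threshold in sorted({row[feature] for row in rows}):
            for label_if_above in (0, 1):
                predicted = [
                    label_if_above if row[feature] > threshold else 1 - label_if_above
                    for row in rows
                ]
                agreement = sum(p == y for p, y in zip(predicted, labels)) / len(rows)
                if best is None or agreement > best[3]:
                    best = (feature, threshold, label_if_above, agreement)
    return best

def black_box(row):
    # Hypothetical stand-in for a complex model we cannot inspect
    return 1 if 0.8 * row["income"] - 0.1 * row["debt"] > 40 else 0

rows = [{"income": i, "debt": d} for i in (20, 40, 60, 80) for d in (0, 50)]
# The surrogate is trained on the black box's predictions, not the true labels
black_box_predictions = [black_box(row) for row in rows]

feature, threshold, label_if_above, fidelity = fit_stump(rows, black_box_predictions)
print(f"surrogate rule: predict {label_if_above} if {feature} > {threshold}")
print(f"fidelity to black box: {fidelity:.2f}")
```

The fidelity score matters as much as the rule: a surrogate that agrees poorly with the black box explains little about it.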

Local Interpretable Model-agnostic Explanations (LIME), Shapley values and SHAP

These are some of the most advanced interpretability methods, and they include strong visualizations for better insights. LIME focuses on local interpretability: it fits a simple interpretable model in the local region around a specific instance and shows the feature importance for that instance. Shapley values, a concept from cooperative game theory, show how much each feature contributes to moving a prediction away from the average prediction, and SHAP builds on them to estimate and visualize these contributions.
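For a model with only a few features, Shapley values can be computed exactly by averaging each feature's marginal contribution over every order in which features could be added. The sketch below makes two simplifying assumptions that should be read as such: the `model` function is invented, and a single baseline instance stands in for the average prediction. It is not how the SHAP library estimates these values.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance` relative to `baseline`.

    Features not yet added to the coalition keep their baseline value.
    The cost grows factorially with the number of features.
    """
    features = list(instance)
    contributions = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        current = dict(baseline)
        previous = model(current)
        for feature in ordering:
            current[feature] = instance[feature]   # add feature to coalition
            value = model(current)
            contributions[feature] += value - previous
            previous = value
    return {f: total / len(orderings) for f, total in contributions.items()}

def model(row):
    # Hypothetical model with an interaction between its two features
    return 2.0 * row["a"] + row["b"] + row["a"] * row["b"]

instance = {"a": 1.0, "b": 2.0}
baseline = {"a": 0.0, "b": 0.0}

phi = shapley_values(model, instance, baseline)
print(phi)
# The contributions add up to the gap between the two predictions
print(sum(phi.values()), model(instance) - model(baseline))
```

The last line illustrates the property that makes Shapley values attractive: the contributions sum exactly to the difference between the prediction and the baseline prediction.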

Example-based explanations

These methods select specific instances to explain the workings of the model and are deployed more to check the model than to understand it. These methods are also model agnostic. The main difference between the model agnostic methods and the example-based ones is that while the former rely on creating summaries of features to explain the model, the latter rely on selecting particular instances and representing them in human-interpretable ways.

Adversarial examples

An adversarial example is an instance with small, deliberate changes to its features that flip the model’s prediction in an adverse way. One such example is an adversarial patch. When an adversarial patch is attached to a picture, the model predicts only the class shown in the patch and ignores the rest of the image.
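To make the idea of a small change flipping a prediction concrete, the sketch below perturbs the input of a toy linear classifier. It follows the spirit of a fast-gradient-sign step, which for a linear model means nudging every feature slightly against the sign of its weight. The weights, bias, input, and step size are all made up, and this is not the adversarial patch method itself.

```python
def predict(weights, bias, x):
    """Linear classifier: class 1 if the weighted sum is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def sign(value):
    return (value > 0) - (value < 0)

def lower_score(weights, x, epsilon):
    """Move each feature by at most `epsilon` in the direction that
    reduces the classifier's score."""
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical classifier and input
weights = [2.0, -1.0, 0.5]
bias = -0.5
original = [0.5, 0.2, 0.4]

adversarial = lower_score(weights, original, epsilon=0.2)
largest_change = max(abs(a - o) for a, o in zip(adversarial, original))

print("original prediction:   ", predict(weights, bias, original))
print("adversarial prediction:", predict(weights, bias, adversarial))
print("largest feature change:", round(largest_change, 3))
```

No feature moves by more than the chosen step size, yet the predicted class changes, which is what makes such examples useful for probing a model's weaknesses.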

adversarial patch example image

Image source: Adversarial Patch (Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, Justin Gilmer)


As data science and AI continue to grow and become embedded in new use cases, it becomes increasingly important to understand how these models make predictions. Likewise, it is our responsibility in this field to check the fairness of our technologies and make them more explainable. Fairness and interpretability help create more accurate models, which in turn provide businesses with better insights and more confidence in their AI-powered decisions.

Recommended reading

The Interpretable Machine Learning book by Christoph Molnar was used as a resource to write this blog. I would encourage everyone to read that book as well as check out other resources such as: