AI / Machine Learning
September 6, 2023

How to prevent failures and bias through Machine Learning Observability

AI observability provides ongoing insights into the performance of a machine learning model in production. This information is used to conduct a comprehensive review of the system's performance and implement changes if necessary. In this article, we will explore why AI observability is important, the challenges of implementing it, and the key components of an effective observability strategy.

The Importance of AI Observability

AI observability is a critical aspect of Responsible AI that helps prevent machine learning models from failing and presenting bias. It enables teams to monitor the performance of models in production, identify problems early, and take corrective action. By providing ongoing insights into model performance, observability can prevent expensive failures and help ensure that models achieve their intended outcomes.

The Challenges of AI Observability

The volume, structure, and dimensionality of machine learning data make it impractical to rely on traditional DevOps monitoring tools alone for AI observability. To address this challenge, the machine learning system itself should log information during training, so that these metrics are available and predictions can be explained. Another challenge is that observability should be infrastructure-agnostic to support the dynamic nature of machine learning ecosystems.

Monitoring vs Observability

While monitoring tells us “what” happened, observability aims to tell us “why” it happened, so we need both. Monitoring refers to the visualization of available log and metric data; observability adds further information that helps to gain more insight into a problem.

Monitoring is centered around metrics, usually aggregated ones. Observability, on the other hand, provides finer-grained metrics, so that you know not only when performance degrades but also what the inputs and outputs were during that time. To achieve this, we need detailed logs and traceability across the whole end-to-end pipeline. In addition, observability includes interpretability: while interpretability helps to understand how an ML model works, observability helps to understand how the whole ML system works.

Key Components of an Observability Strategy

ML Observability with MELT

An effective observability strategy should include the “MELT” capabilities: metrics, events, logs, and traces.

All of these signals can provide useful insights to diagnose a problem. Like any other system, a machine learning system should log information that helps with diagnosis, and it is important to handle exceptions during both training and inference. Such logs can provide insights about overfitting and performance issues, or capture information needed to detect bias in the algorithm. During inference, each log entry should also carry an identifier that allows inputs to be traced to their outputs.
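As an illustration of the last point, here is a minimal sketch of inference logging with a trace identifier, using only the Python standard library. The function name predict_with_trace, the logger name "inference", and the JSON field names are assumptions made for this example; the model is assumed to be any callable.

```python
import json
import logging
import uuid

logger = logging.getLogger("inference")  # hypothetical logger name


def predict_with_trace(model, features):
    """Run one prediction and log its input and output under a shared trace ID."""
    trace_id = str(uuid.uuid4())
    try:
        prediction = model(features)
    except Exception:
        # Record the failing input together with the traceback, then re-raise
        # so the caller still sees the original error.
        logger.exception(json.dumps(
            {"trace_id": trace_id, "event": "inference_error", "input": features},
            default=str,
        ))
        raise
    logger.info(json.dumps(
        {"trace_id": trace_id, "event": "prediction",
         "input": features, "output": prediction},
        default=str,
    ))
    return trace_id, prediction
```

Returning the trace ID to the caller means it can be attached to the response, so a problematic prediction reported later can be matched to its log entry.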

Data Observability: controlling the flow of training and incoming data

In addition to machine learning observability, data observability is also important. Teams should record and monitor data downtime to ensure early detection of problems. They should also keep track of data distribution, structure, and volume, as well as data lineage, which means tracking the flow of data over time through the pipeline. Lineage provides a clear understanding of where the data originated, which transformations were applied to it, and its current status in the data pipeline.
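The checks on volume, structure, and distribution can be sketched as a small validation function run on each incoming batch. Everything here is illustrative: the column names, the minimum row count, and the 25% tolerance on the mean are assumptions, and a real pipeline would likely use richer statistics than a mean comparison.

```python
from statistics import mean

EXPECTED_COLUMNS = {"age", "income"}  # hypothetical schema
MIN_ROWS = 3                          # hypothetical volume threshold


def check_batch(rows, reference_means, tolerance=0.25):
    """Return a list of human-readable issues found in a batch of dict rows."""
    issues = []
    # Volume: too few rows may indicate an upstream outage.
    if len(rows) < MIN_ROWS:
        issues.append(f"volume: got {len(rows)} rows, expected at least {MIN_ROWS}")
    # Structure: every row should have exactly the expected columns.
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            issues.append(f"structure: row {i} has columns {sorted(row)}")
    # Distribution: compare each column mean against a reference value.
    for column, ref in reference_means.items():
        values = [row[column] for row in rows if column in row]
        if values and abs(mean(values) - ref) > tolerance * abs(ref):
            issues.append(
                f"distribution: mean of {column} moved from {ref} to {mean(values):.2f}"
            )
    return issues
```

An empty list means the batch passed all three checks; any returned message can be logged or raised as an alert.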

What Can Go Wrong?

We need to assume that everything can go wrong!

So, we should monitor changes in the data and in the input and output variables, since these can signal that something is wrong. Potential problems that can arise in machine learning systems include model drift, training-serving skew, changing data distributions, and messy data.

Here are some of the common issues that might be present.

  • Model drift occurs when the model's outputs degrade over time due to changes in the concept reflected in the data and in the relationship between the model's inputs and outputs. In other words, the problem that the model was originally designed to address may no longer match the problem at hand.
  • Model drift is also related to “training-serving skew”, which refers to the difference between performance during training and performance during serving (deployment). One cause of this skew is an inconsistency between the transformations applied to the training data and those applied in the serving pipeline.
  • In general, when developing ML algorithms we assume that training and test data come from the same distribution. Therefore, if unseen data comes from a different distribution, the model might not generalize well. This effect is called data distribution shift, and it happens because real-world data is not stationary. For instance, when COVID-19 arrived, e-commerce buying patterns changed, and recommendation algorithms probably needed to change as well to keep serving clients well.
  • Another issue is related to outliers in the data, which can lead to edge cases and unexpected model outputs. During training, it is sometimes better to remove outliers to improve the generalization and performance of the model. In production, however, you cannot remove outliers; they just happen! So, you need to decide in advance how to handle these cases.
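One way to detect the distribution shift described above is to compare a reference sample (for example, training data) with recent production data. The sketch below uses the Population Stability Index (PSI) as one possible measure; the article does not prescribe a specific one. The equal-width binning, the number of bins, and the 1e-6 floor for empty bins are assumptions of this example.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the range is zero

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the threshold should be tuned per feature and use case rather than taken as a fixed rule.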

The Need for Reproducibility

An effective machine learning strategy should ensure reproducibility and replicability of results. Reproducibility means that, given the code and data, one can obtain the same results as reported in the technical report. Replicability means that if the algorithm is applied to different data, the model still works as expected.
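Two habits that support reproducibility are fixing random seeds and recording exactly which data a run used. The following is a toy sketch under those assumptions: train is a hypothetical stand-in for a real training routine, and data_fingerprint is an illustrative helper that assumes the rows are JSON-serializable.

```python
import hashlib
import json
import random


def data_fingerprint(rows):
    """Stable SHA-256 hash of the rows, recording which data a run used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def train(rows, seed):
    """Toy stand-in for training: a seeded shuffle, then a deterministic 'fit'."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    shuffled = rows[:]
    rng.shuffle(shuffled)
    split = int(0.8 * len(shuffled))
    train_rows = shuffled[:split]
    return {
        "seed": seed,
        "data_sha256": data_fingerprint(rows),
        "train_mean": sum(train_rows) / len(train_rows),
    }
```

Running train twice with the same rows and seed returns identical records, and storing the seed and data hash next to the reported results lets someone else verify that they are re-running the same experiment.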


An effective observability strategy should include four key components, known as the pillars of observability: performance analysis, drift detection, data quality, and explainability. Performance analysis involves ensuring that the system's performance has not significantly degraded from when it was first deployed. Drift detection is used to detect changes in data distribution over the lifetime of a model. Data quality ensures that high-quality inputs and outputs are maintained. Explainability allows for the analysis and comprehension of why a model produced a specific prediction. By integrating these pillars into an observability plan, machine learning models can be monitored more effectively, lowering the chances of errors or bias, and giving us a better understanding of why they arise.
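The performance analysis pillar can be sketched as a rolling accuracy tracker that compares recent predictions against the accuracy measured at deployment. The class name RollingAccuracy, the window size, and the allowed drop of 0.05 are assumptions for illustration; the sketch also assumes that ground-truth labels eventually become available for production predictions.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent predictions and flag degradation."""

    def __init__(self, baseline, window=100, max_drop=0.05):
        self.baseline = baseline      # accuracy measured at deployment time
        self.max_drop = max_drop      # tolerated drop before flagging
        self.outcomes = deque(maxlen=window)  # old outcomes fall out automatically

    def record(self, prediction, label):
        self.outcomes.append(prediction == label)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing recorded yet
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.max_drop
```

A monitoring job could call record as labels arrive and raise an alert whenever degraded returns True, which is the point at which logs and traces help explain why the drop happened.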

In conclusion, AI observability is critical to prevent machine learning models from failing and presenting bias.