A Guide to Monitoring Machine Learning Models in Production

Machine learning models are increasingly used to make important real-world decisions, from identifying fraudulent activity to applying automatic brakes in a car. 

The job of a machine learning practitioner is far from over once a model is deployed to production. You must monitor your models to ensure they continue to perform as expected in the face of real-world activity. However, monitoring machine learning systems the same way you would monitor traditional software is insufficient. 

How, then, can machine learning models in production be monitored effectively? What specific metrics need to be monitored? What tools are most effective? This post will answer these critical questions for machine learning practitioners. 

Importance of monitoring machine learning models

In the context of machine learning, monitoring refers to the process of tracking the behavior of a deployed model to analyze performance. Monitoring a machine learning model after deployment is vital, as models can break and degrade in production. Deployment is not a one-time action that you do and forget about. 

To determine the right time to update a model in production, stakeholders need a real-time view of the model’s performance in the live environment. This helps ensure that your model is performing as expected. You need as much visibility as possible into your deployed model to detect issues and their source before they cause a negative business impact. 

Gaining visibility may sound straightforward, but it is not. Monitoring machine learning models is a difficult task. The following section looks more closely at the challenges of monitoring machine learning models. 

Why is machine learning system monitoring hard?

Software developers have been putting traditional software into production for years, so it is a good starting point to evaluate the difficulty of doing the same with machine learning models. 

It is vital to acknowledge that any discussion of machine learning models in production is akin to discussing machine learning systems. Machine learning systems have the challenges of traditional software and several challenges specific to machine learning. To learn more about these challenges, see Hidden Technical Debt in Machine Learning Systems.

Machine learning system behavior

When building machine learning systems, practitioners are mainly keen on tracking the system’s behavior. Three components determine the system’s behavior: 

  • The data (ML specific): A machine learning system’s behavior depends on the dataset on which the model was trained, as well as the data streaming into the system while in production.
  • The model (ML specific): The model is the output of a machine learning algorithm trained on data. It represents what was learned by the algorithm. It is better to think of the model as a pipeline, as it typically includes all the steps that orchestrate the flow of data into the model and of output from it. 
  • The code: Code is required to build the machine learning pipeline and define the model configurations to train, test, and evaluate models.  
    As Christopher Samiullah says in Deployment of Machine Learning Models, “Without a way to understand and track these data (and hence model) changes, you cannot understand your system.”

    Rules specified in the code, the model, and the data impact the overall system’s behavior. Recall that data comes from a never-ending source, “the real world,” which is ever-changing and therefore unpredictable.

    Challenges in machine learning systems

    It is not as simple as saying “we have two additional dimensions to consider” when building a machine learning system. Code and configuration also take on more complexity and sensitivity in a machine learning system due to the following challenges: 

    Entanglements: Any change in the input data distributions will influence the approximation of the target function, which may affect the predictions made by the model. In other words, changing anything changes everything. Therefore, any feature engineering and selection code must be carefully tested.  

    Configurations: A flaw in the configuration of a model (for example, hyperparameters, versions, and features) can radically alter the system’s behavior and will not be caught with traditional software tests. In other words, a machine learning system can predict an incorrect but valid output without raising an exception. 

    These factors combine to make monitoring machine learning systems extremely difficult compared to traditional software systems, which are governed by the rules specified in the code. Another factor to consider is the number of stakeholders involved in developing a machine learning system. This is known as the responsibility challenge.

    The responsibility challenge

    Often, having multiple stakeholders on a project may be extremely beneficial. Each stakeholder can provide insight into requirements and constraints based on their expertise, enabling the team to reduce and uncover risks on the project. 

    However, each stakeholder may have a completely different understanding of the meaning of “monitoring” based on business areas and responsibilities. An example distinction could be made between data scientists and engineers. 

    A data scientist’s perspective

    Data scientists are most concerned with functional objectives, which relate to the input data, the model, and the predictions made by the model. Monitoring functional objectives requires visibility into the data passing into the model, metrics from the model itself, and an understanding of the predictions made by the model. 

    A data scientist may be more concerned with the model’s accuracy in the production environment. To achieve such insight, it would be ideal if true labels were available in real time, which is only sometimes the case. Thus, data scientists often use proxy values to gain visibility into their models. 

    An engineer’s perspective

    On the other hand, engineers are often responsible for achieving operational objectives that ensure the resources for the machine learning system are healthy. This requires monitoring the same application metrics used in traditional software development. Examples include: 

    • Latency
    • IO/memory/disk use
    • System reliability (uptime)
    • Auditability

    Despite the differences in stakeholder goals and responsibilities, adequate monitoring of machine learning systems takes both perspectives into account. This requires a shared understanding across the team, so it is vital that all stakeholders come together to define terms clearly and speak the same language. 

    What needs to be monitored in production? 

    Monitoring is divided into two levels: functional and operational. 

    Functional level monitoring

    At the functional level, the data scientist (and/or machine learning engineer) will monitor three distinct categories: the input data, the model, and the output predictions. Monitoring each category provides data scientists with better insight into the model’s performance. 

    Input data 

    Models depend on the data received as input. If a model receives an input it does not expect, the model may break. Monitoring the input data is the first step to detecting functional performance problems and resolving them before they impact the performance of the machine learning system. Items to monitor from an input data perspective include: 

    Data quality: To maintain data integrity, you must validate production data before it reaches the machine learning model, using metrics based on data properties. For example, check that the data types and schema of incoming data match what the model expects. Several factors may compromise your data integrity, such as a change in the source data schema or data being lost. Such issues change the data pipeline so that the model no longer receives the expected inputs. 
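A data quality check of this kind can be sketched in a few lines of plain Python. The schema, feature names, and ranges below are hypothetical and stand in for whatever your model actually expects. The check is deliberately strict (an integer passed for a float field is flagged), which may or may not suit your pipeline.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Expected type and optional numeric range for one input feature."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hypothetical schema for illustration; replace with your model's features.
EXPECTED_SCHEMA = {
    "amount": FieldSpec(float, min_value=0.0),
    "merchant_id": FieldSpec(str),
    "hour_of_day": FieldSpec(int, min_value=0, max_value=23),
}


def validate_record(record: Dict[str, Any], schema: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for name, spec in schema.items():
        # A missing key and an explicit None are both treated as lost data.
        if record.get(name) is None:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, spec.dtype):
            problems.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
            continue
        if spec.min_value is not None and value < spec.min_value:
            problems.append(f"{name}: {value} below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            problems.append(f"{name}: {value} above maximum {spec.max_value}")
    # Fields the model was never trained on often signal an upstream schema change.
    for name in sorted(record.keys() - schema.keys()):
        problems.append(f"{name}: unexpected field")
    return problems
```

Running `validate_record` on each incoming request (or on a sample of them) and counting the failures per field gives a simple data quality metric to track over time.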

    Data drift: Changes in distribution between the training data and production data can be monitored to check for drift. This is done by detecting changes in the statistical properties of feature values over time. Data comes from a never-ending, ever-changing source called the real world. As people’s behavior changes, the landscape and context around the business case you’re solving may change. When the drift becomes significant, it is time to update your machine learning model.
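One common way to quantify drift in a single numeric feature is the Population Stability Index (PSI). The post does not prescribe a specific statistic, so treat the following as one possible sketch. The equal-width binning and the `eps` floor for empty bins are assumptions made here for simplicity.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    production: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference (training) sample and a production sample.

    Bins are equal-width and derived from the reference sample only.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range are clamped into the edge bins.
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    prod = bin_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A frequently quoted rule of thumb reads a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees, so calibrate them against your own data.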

    The model     

    At the heart of your machine learning system lies your machine learning model. For the system to drive business value, the model must maintain a performance level above a threshold. To achieve this goal, you must monitor the factors that could degrade the model’s performance, such as model drift and versions.

    Model drift: Model drift is the decay of a model’s predictive power due to alterations in the real-world environment. Statistical tests should be used to detect drift, and predictive performance should be monitored to evaluate the model’s performance over time. 
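Tracking predictive performance over time can be as simple as keeping a rolling window of labelled outcomes and comparing it with the accuracy measured at deployment. The class name, window size, and tolerated drop below are illustrative assumptions, and accuracy is used only because it is the easiest metric to show.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent `window` labelled predictions."""

    def __init__(self, baseline_accuracy, window=500, max_drop=0.05):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        # deque(maxlen=...) silently discards the oldest outcome when full.
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, label):
        """Store whether one prediction matched its true label."""
        self.outcomes.append(prediction == label)

    @property
    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        """True once a full window falls more than max_drop below baseline."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.accuracy < self.baseline - self.max_drop
```

For example, with `RollingAccuracyMonitor(baseline_accuracy=0.9, window=500)`, the monitor reports degradation once the accuracy over the last 500 labelled predictions falls below 0.85.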

    Versions: Always ensure the correct model is running in production. Version history and predictions should be tracked.  
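A lightweight way to track versions alongside predictions is to write the model version into every prediction log entry. The field names and the JSON-lines format below are assumptions for illustration, not a required schema.

```python
import json
import time


def prediction_log_entry(request_id, model_version, features, prediction, timestamp=None):
    """Serialize one prediction, tagged with the model version that produced it."""
    return json.dumps(
        {
            "request_id": request_id,
            "model_version": model_version,
            "timestamp": time.time() if timestamp is None else timestamp,
            "features": features,
            "prediction": prediction,
        },
        sort_keys=True,
    )


def predictions_by_version(log_lines):
    """Count logged predictions per model version."""
    counts = {}
    for line in log_lines:
        version = json.loads(line)["model_version"]
        counts[version] = counts.get(version, 0) + 1
    return counts
```

If `predictions_by_version` ever reports a version you did not expect to be serving traffic, the wrong model is running somewhere.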

    The output

    To understand how the model performs, you must also understand the predictions the model outputs in the production environment. A machine learning model is put into production to solve a problem. Thus, monitoring the model’s output is a valuable way to ensure it performs according to the metrics used as KPIs. For example: 

    Ground truth: For some problems, you can acquire ground truth labels. For example, if a model recommends personalized ads to users (you are predicting whether a user will click the ad), a click implies the ad is relevant, so you can acquire the ground truth almost immediately. In such scenarios, an aggregation of a model’s predictions can be evaluated against the actual outcomes to determine how well the model performs. However, evaluating model predictions against ground truth labels is difficult in most machine learning use cases, and an alternative method is required.  
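When labels do arrive, even with a delay, evaluation comes down to joining them with the logged predictions. The sketch below assumes binary predictions and labels keyed by a request ID; the function name and the returned fields are hypothetical.

```python
from typing import Dict, Optional


def evaluate_with_delayed_labels(
    predictions: Dict[str, int], labels: Dict[str, int]
) -> Dict[str, Optional[float]]:
    """Join predictions with the labels observed so far and score the overlap.

    Both dicts map request_id -> 0 or 1. Requests without a label yet are skipped.
    """
    matched = [(pred, labels[rid]) for rid, pred in predictions.items() if rid in labels]
    tp = sum(1 for pred, label in matched if pred == 1 and label == 1)
    fp = sum(1 for pred, label in matched if pred == 1 and label == 0)
    fn = sum(1 for pred, label in matched if pred == 0 and label == 1)
    return {
        # Share of predictions that have a label; low coverage makes the
        # metrics below less trustworthy.
        "label_coverage": len(matched) / len(predictions) if predictions else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

Reporting label coverage next to precision and recall makes it clear how much of the traffic the metrics actually describe.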

    Prediction drift: When it is not possible to acquire ground truth labels, predictions must be monitored. If there is a drastic change in the distribution of predictions, something has potentially gone wrong. For example, if you are using a model to predict fraudulent credit card transactions and suddenly the proportion of transactions identified as fraud shoots up, then something has changed. Perhaps input data structure has been altered, some other microservice in the system is misbehaving, or maybe there is just more fraud in the world.
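For a binary classifier such as the fraud model, one simple check compares the share of positive predictions in a recent window against the rate seen during validation. The sketch below uses a one-sample z-score for a proportion. The choice of statistic and any alert threshold (for example, an absolute z-score above 3) are assumptions, not recommendations from the tools discussed in this post.

```python
import math
from typing import Sequence


def positive_rate_z_score(baseline_rate: float, window_predictions: Sequence[int]) -> float:
    """Z-score of the positive-prediction rate in a window against a baseline.

    `window_predictions` holds 0/1 predictions from a recent time window.
    """
    n = len(window_predictions)
    if n == 0:
        raise ValueError("window_predictions must not be empty")
    if not 0.0 < baseline_rate < 1.0:
        raise ValueError("baseline_rate must be strictly between 0 and 1")
    observed_rate = sum(window_predictions) / n
    # Standard error of a proportion under the baseline rate.
    std_err = math.sqrt(baseline_rate * (1.0 - baseline_rate) / n)
    return (observed_rate - baseline_rate) / std_err
```

A large positive score says that the model is flagging far more transactions than usual. It does not say why, so the follow-up is still to inspect the inputs, the upstream services, and the possibility that fraud has really increased.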

    Operational level monitoring

    At the operational level, the operations engineers are concerned with ensuring the resources for the machine learning system are healthy. The engineers are responsible for acting when the resources are not healthy. They will also monitor the machine learning application across three categories: the system, the pipelines, and the costs. 

    The ML system performance

    The idea is to be informed constantly about how the machine learning model performs alongside the rest of the application stack. Issues in this arena will impact the entire system. System performance metrics that would provide insight into the model performance include: 

    • Memory use 
    • Latency
    • CPU/GPU use
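Latency is the easiest of these metrics to collect from inside the application. The sketch below wraps a prediction function with a timing decorator and summarizes the samples as percentiles. The names `track_latency` and `latency_summary` are hypothetical, and a production system would usually export these values to a monitoring backend rather than keep them in a list.

```python
import time
from functools import wraps
from statistics import quantiles


def track_latency(samples):
    """Decorator factory that appends each call's duration in seconds to `samples`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if the wrapped function raises.
                samples.append(time.perf_counter() - start)
        return wrapper
    return decorator


def latency_summary(samples):
    """Median and tail latencies; needs at least two samples."""
    cuts = quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tail percentiles such as p95 and p99 are usually more informative than the mean, because a small share of slow requests can breach a latency target while the average still looks healthy.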

    The pipelines

    Two crucial pipelines should be monitored: the data pipeline and the model pipeline. Failure to monitor the data pipeline may allow data quality issues to go unnoticed until they break the system. Regarding the model, you want to track and monitor the factors that may lead to the model failing in production, such as the model dependencies. 

    Costs

    From data storage to model training and more, there are financial costs involved in machine learning. While machine learning systems can generate lots of value for a business, it is also possible for machine learning to become extremely expensive. Constantly monitoring how much your machine learning application costs your organization is a responsible step toward keeping costs under control. 

    For example, you can set budgets using a cloud vendor such as AWS or GCP since their services track your bills and spending. The cloud provider will send alerts to inform the team when spending reaches the budget. 

    If you are hosting the machine learning application on-premises, monitoring the system usage and cost could provide greater insight into what component of the application is the most costly and whether or not you can make certain compromises to cut costs. 

    Tools for monitoring machine learning models

    Getting started with machine learning model monitoring is easier now than ever. Several businesses have produced tools to simplify the process of monitoring machine learning systems in production. Reinventing the wheel is not necessary. 

    The tooling to leverage for monitoring your system depends on the specific items you want to monitor. It is worth browsing to find what best works for you before finalizing your decision. A few solutions you may wish to start with are listed below. 

    Prometheus and Grafana

    Prometheus is an open-source system used for event monitoring and alerting. It works by scraping real-time metrics from instrumented jobs and storing the scraped samples locally as time-series data.

    Grafana is an open-source analytics and interactive visualization web application that can be used in collaboration with Prometheus to visualize the collected data.

    Put simply, you can combine the power of Prometheus and Grafana to create dashboards that allow you to track your machine learning system in production. You can also use these dashboards to set up alerts that notify you when an unexpected event occurs.

    If you are using NVIDIA Triton Inference Server to deploy, run, and scale AI models in production, you can leverage the operational metrics that NVIDIA Triton exports in Prometheus format. You can use NVIDIA Triton to collect GPU/CPU use, memory and latency metrics from the system in which it is running inference. These metrics are useful to scale and load balance the requests so application SLAs are met.

    Learn more about Prometheus and Grafana.

    Evidently AI

    Evidently AI is an open-source Python tool used to analyze, monitor, and debug machine learning models in a production environment. The co-founders, Emeli Dral and Elena Samuylova, have written informative articles regarding model monitoring, including:

    • Monitoring Machine Learning Models in Production
    • Machine Learning Monitoring: What It Is, and What We Are Missing

    To learn more, see the Evidently AI documentation.

    Amazon SageMaker Model Monitor

    At a glance, Amazon SageMaker Model Monitor can alert you of any deviations in model quality so that corrective actions, such as retraining, auditing upstream systems, or fixing quality issues, can occur. Developers can leverage no-code monitoring capabilities or conduct a custom analysis by coding. For more details, see the Amazon SageMaker documentation.

    Best practices for machine learning model monitoring

    Deploying a model is only part of your responsibilities as a machine learning practitioner. The other parts of your work involve ensuring that the model performs as expected in the live environment, which requires monitoring the machine learning system. Some general best practices to follow when monitoring your machine learning system include: 

    Monitoring does not start in the deployment phase

    Building a machine learning model often involves several iterations to arrive at an acceptable design. Therefore, tracking and monitoring metrics and logs constitute an important part of model development and should start as soon as you begin experimenting.  

    Major degradation is a red flag and requires investigation

    Degradation of your model’s performance should be expected. However, sudden major dips are a cause for concern and should be investigated immediately. 

    Create a troubleshooting framework

    Teams should be encouraged to document their troubleshooting framework. A system to take teams from alert to troubleshooting is effective for model maintenance. 

    Create a plan of action

    In the inevitable event that there is a break in your machine learning system, a framework should be in place to respond. Once the team is alerted to the issue, the framework should move the team from the alert into action and then eventually to debugging the issue to ensure the model is maintained effectively.  

    Use proxies when ground truth is unavailable

    It is vital to be constantly aware of your machine learning model’s performance in the production environment. If it is impossible to evaluate a model against ground truth, proxies such as prediction drift can be used instead. 

    Did I miss any? Leave a comment in the NVIDIA Developer Forums.

    What’s next?

    Monitoring machine learning systems is a difficult yet essential part of the machine learning lifecycle. Models sometimes perform differently than expected when in production. Thus, proper monitoring is required to detect issues before they can cause significant damage. 

    A weak monitoring system may lead to 1) models with poor performance left in production without supervision, 2) businesses owning models that no longer deliver business value, or 3) uncaught bugs that blow up over time. 

    If you have existing machine learning models in production, ask the following questions:

    • What metrics are being monitored? 
    • Do the monitored metrics serve as a good indicator of success? 
    • How soon can bugs be detected in a model?
    • Will discrepancies between data in the development and production environment be caught? 

    Source: NVIDIA