As businesses see value in Artificial Intelligence (AI) and Machine Learning (ML) technologies, adoption is gathering pace, and with it come new challenges. With a plethora of AI and ML options to choose from, organizations struggle to compose a tool stack that works for them. Choosing a tool stack that fulfils an organization's ML and operational requirements warrants a deep understanding of its key business challenges.
To help organizations choose a tool stack, it is critical to understand the key challenges and how they can be overcome with the right set of tools to support machine learning and operations.
Below are some effective strategies for building a robust MLOps Platform:
Create a Scalable Development Environment
-
Problem
Data scientists and ML engineers need an environment where they can easily write and run machine learning models. Typically, the environment is shared across users, creating package versioning and scalability issues. As more users start using the platform, code runs slower, which invariably reduces team productivity.
-
Resolution
Create an enterprise platform that is elastic and scales with increasing user demand. Platform users should have their own environment where they can install the packages necessary to run their experiments. In addition, they should be able to scale up and run larger algorithms using the elastic nature of the platform. An ideal solution is to use cloud offerings such as AWS SageMaker, Azure ML, or GCP notebooks. The notebook instances can be scaled up and down based on user density while giving each user a collaborative environment.
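As a rough illustration, the sketch below provisions and resizes per-user notebook instances with the boto3 SageMaker client; the instance types, role ARN, and naming convention are assumptions for illustration, not a prescribed setup.

```python
# Hypothetical sketch: per-user SageMaker notebook instances managed with boto3.
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

def create_user_notebook(user_name: str, instance_type: str = "ml.t3.medium"):
    """Create an isolated notebook instance for a single data scientist."""
    return sagemaker.create_notebook_instance(
        NotebookInstanceName=f"ds-{user_name}",   # one instance per user
        InstanceType=instance_type,               # sized per workload
        RoleArn="arn:aws:iam::123456789012:role/NotebookExecutionRole",  # placeholder
        VolumeSizeInGB=50,                        # room for per-user packages
    )

def resize_user_notebook(user_name: str, new_instance_type: str):
    """Stop, resize, and restart an instance when a larger algorithm must run."""
    name = f"ds-{user_name}"
    sagemaker.stop_notebook_instance(NotebookInstanceName=name)
    # Wait until the instance is fully stopped before changing its type.
    sagemaker.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName=name)
    sagemaker.update_notebook_instance(NotebookInstanceName=name,
                                       InstanceType=new_instance_type)
    sagemaker.start_notebook_instance(NotebookInstanceName=name)
```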
Establish Code and Model Versioning Repositories
-
Problem
Data scientists create models that they iteratively tweak to improve performance. Since they perform many iterations, it becomes difficult to keep a record of the various parameters used to train each model. Storing the data science code in a safe, shared location is also a priority so users can collaborate and work on new model-improvement features.
-
Resolution
Data science code should be checked into Git repositories such as Bitbucket and GitLab to ensure that a working copy of the code is stored somewhere safe. For any new development, a data scientist can pull the code, create a feature branch, and follow Git principles of pull and merge to push new code to the repository. This also ensures that multiple data scientists can work together on the same data science model and build individual features that can be bundled together. Tools like MLflow facilitate model versioning as well as the ability to track parameters and metrics, as sketched below. SageMaker Experiments and Azure ML experiments are other alternatives that data scientists can use to experiment and track model metrics and parameters.
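A minimal MLflow sketch, assuming a scikit-learn classifier and placeholder experiment and parameter values, showing how parameters, metrics, and the model artefact can be tracked per run:

```python
# Minimal experiment-tracking sketch with MLflow; dataset, model, and parameter
# values are placeholders for illustration.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-model")   # groups related runs together (assumed name)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)                          # parameters for this iteration
    mlflow.log_metric("accuracy",
                      accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")           # versioned model artefact
```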
Deploy Machine Learning Models at Scale
-
Problem
Data scientists work on different models in their own environments, installing the packages needed to run them. As soon as a data science model is complete, it needs to be pushed to production for its intended use. Without a platform that makes it easy to deploy machine learning code, model deployment carries heavy overhead. The key challenges are ensuring that the environment has all the relevant packages, running and scheduling the code, and overcoming data access limitations. Together, these create bottlenecks and delays in time to market.
-
Resolution
Create a machine learning pipeline that can be used to push code to production. To run models, containers or cloud providers' built-in ML solutions can be leveraged. This ensures that the relevant packages are deployed, the model is run, and the model artefacts are stored. The entire pipeline can be automated so that for each data science code push, the pipeline automatically runs the code in the target environment and reports status. Containerization and orchestration tools like Docker and Kubernetes can be used to run the models. Such tools give all teams a standard process for moving machine learning models to production. Moreover, they enable data scientists to productionize their models immediately rather than spend months setting up a production pipeline.
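One possible shape for such a pipeline is a container entrypoint that CI runs on every code push; the training step, S3 bucket name, and artefact path below are hypothetical placeholders.

```python
# Illustrative container entrypoint: train the model, persist the artefact,
# and report status back to the pipeline. Bucket and paths are placeholders.
import json
import sys

import boto3
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def build_and_train_model():
    """Placeholder training step; in practice this calls the pushed repo's code."""
    X, y = make_classification(n_samples=500, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return model, {"train_accuracy": model.score(X, y)}

def main() -> int:
    try:
        model, metrics = build_and_train_model()        # runs inside the container
        joblib.dump(model, "/tmp/model.joblib")          # serialize the trained model
        # Store the artefact so downstream deployment steps can pick it up.
        boto3.client("s3").upload_file("/tmp/model.joblib",
                                       "example-model-artifacts",   # placeholder bucket
                                       "churn-model/model.joblib")
        print(json.dumps({"status": "success", "metrics": metrics}))
        return 0
    except Exception as exc:                             # surface failure to the pipeline
        print(json.dumps({"status": "failed", "error": str(exc)}))
        return 1

if __name__ == "__main__":
    sys.exit(main())
```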
Train and Retrain Models
-
Problem
Over time, both the data and the business problem change, leading to two types of model drift: data drift and concept drift. Data drift occurs when the distribution of data changes over time, causing the model to perform poorly. Concept drift occurs when the business problem changes, rendering the patterns identified by the model inaccurate. For example, a fashion trend identified by a prediction engine may not hold true for the same time next year. Training and retraining models from time to time is essential to keep results accurate.
-
Resolution
One option is to train the model daily, but this creates a huge drag on available resources; if the underlying data does not change rapidly, daily training is not advisable. A better approach is to watch for data drift and trigger retraining automatically when data patterns change. Attention should also be given to establishing a retraining window and deciding how much historical data to train on. Similar attention should be given to concept drift to ensure that models work as expected and are retrained when necessary.
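A minimal drift-check sketch, assuming tabular numeric features and a two-sample Kolmogorov-Smirnov test with an illustrative significance threshold; the retraining hook is a placeholder:

```python
# Compare the training (reference) distribution with recent production data and
# flag columns whose distribution has shifted; threshold is an assumption.
import pandas as pd
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.05   # assumed significance threshold; tune per use case

def detect_drift(reference: pd.DataFrame, current: pd.DataFrame) -> list:
    """Return the numeric columns whose distribution appears to have drifted."""
    drifted = []
    for col in reference.select_dtypes(include="number").columns:
        _, p_value = ks_2samp(reference[col], current[col])
        if p_value < DRIFT_P_VALUE:
            drifted.append(col)
    return drifted

def maybe_retrain(reference: pd.DataFrame, current: pd.DataFrame) -> None:
    drifted = detect_drift(reference, current)
    if drifted:
        print(f"Data drift detected in {drifted}; triggering retraining job")
        # trigger_retraining_pipeline()   # hypothetical hook into Airflow/Kubeflow
    else:
        print("No significant drift; skipping retraining")
```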
Evaluate Model Hosting and Inference Options
-
Problem
Data science models provide prediction outputs. The models need to be hosted and scaled based on usage. Performance issues occur if the scalability of the platform and its infrastructure is not taken into consideration. For example, if the model needs a GPU for inference but the hosting infrastructure does not provide one, the model will not work as expected.
-
Resolution
While building machine learning models, it is important to have a clear objective. If models need to be hosted, then the infrastructure that supports the platform needs to be considered during model building. It is also important that the hosted model responds at the expected QPS (queries per second) when providing inference; this ensures sufficient scalability is built into the platform. Kubernetes or cloud providers' managed model-hosting solutions can be used. In cases such as a recommendation engine, where real-time inference is not needed, we can store the recommendations in a NoSQL database like DynamoDB or Cosmos DB and use an API to expose the results to end applications, as sketched below.
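For the batch-recommendation case, here is a sketch using boto3 against an assumed DynamoDB table; the table name, key schema, and payload format are illustrative.

```python
# A batch scoring job writes precomputed recommendations to DynamoDB; a thin
# API layer reads them back for end applications. Names are placeholders.
import boto3

table = boto3.resource("dynamodb").Table("product-recommendations")  # placeholder table

def store_recommendations(customer_id: str, product_ids: list) -> None:
    """Called from the batch scoring job after the model has run."""
    table.put_item(Item={"customer_id": customer_id,
                         "recommendations": product_ids})

def get_recommendations(customer_id: str) -> list:
    """Called from the API that exposes results to end applications."""
    response = table.get_item(Key={"customer_id": customer_id})
    return response.get("Item", {}).get("recommendations", [])
```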
Focus on Model Pipeline Execution
-
Problem
A data science model requires data loading, data validation, and model building capabilities. The model's output then needs to be stored, or, if real-time inference is required, hosted. There are a lot of pieces and dependencies. Often, these dependencies are checked and executed manually, resulting in errors and a loss of the data science team's productivity.
-
Resolution
It is essential to have an orchestration tool that solves the end-to-end model run lifecycle. Tools like Airflow have built-in sensors that can be leveraged to encode dependencies for each stage of the ML lifecycle. Kubeflow is another powerful platform for setting up ML pipelines and executing them with various integrated dependencies.
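An illustrative Airflow DAG wiring the pipeline stages with explicit dependencies; the task callables and the daily schedule are placeholders, not a prescribed design.

```python
# Sketch of an Airflow DAG for the model pipeline; task bodies are stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_data():      ...   # placeholder: pull data from the source systems
def validate_data():  ...   # placeholder: schema and quality checks
def train_model():    ...   # placeholder: fit and evaluate the model
def publish_model():  ...   # placeholder: store artefacts / update the endpoint

with DAG(
    dag_id="ml_model_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",      # assumed cadence
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_data", python_callable=load_data)
    validate = PythonOperator(task_id="validate_data", python_callable=validate_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    publish = PythonOperator(task_id="publish_model", python_callable=publish_model)

    # Each stage runs only after its upstream dependency succeeds.
    load >> validate >> train >> publish
```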
Conduct A/B testing
-
Problem
Data science models are created iteratively, and new models are pushed to production. Consequently, the need to check user acceptance and the performance of new models is often overlooked. This can cause the model to provide incorrect predictions or hamper the user experience.
-
Resolution
Data science models need to have A/B testing defined both for user experience and for model version upgrades. It ensures that we have separate groups of users to gauge acceptance of the new model's predictions. Moreover, it enables data scientists to fully roll out new model versions only when their predictions are on par with existing, tried and tested models.
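A minimal sketch of a deterministic traffic split for such a test; the 10% candidate share is an illustrative assumption, and the serving layer would use the returned variant to pick which model version answers the request.

```python
# Deterministic A/B assignment: hash the user ID into 100 buckets and route a
# fixed share of users to the candidate model version.
import hashlib

CANDIDATE_TRAFFIC_SHARE = 0.10   # expose the new model version to 10% of users first

def assign_variant(user_id: str) -> str:
    """Map a user to the control or candidate model version, stably across requests."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANDIDATE_TRAFFIC_SHARE * 100 else "control"

if __name__ == "__main__":
    # Example usage: the serving layer calls assign_variant per request.
    for user in ["user-1", "user-2", "user-3"]:
        print(user, "->", assign_variant(user))
```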
Enable Real-time Tracking and Alerting
-
Problem
Models run in production, but without a way to measure model performance, we will not know if the model is performing optimally or if it needs retraining and improvement. Models are often pushed to production without model tracking, which leads to low ROI. If the infrastructure is not monitored and incidents do not trigger alerts, there will be delays, and external APIs may stop receiving predictions.
-
Resolution
Build a framework where model runs are stored with appropriate metrics for tracking. MLflow or cloud providers' experiment frameworks can be used to track model performance. They can be integrated with alerting tools to trigger alerts, or to retrain the model, when a threshold is breached. To monitor the entire system's health, tools like Grafana can provide a real-time dashboard of system health and status KPIs, with alerts configured on system health parameters.
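As a sketch, a periodic job could query recent runs from MLflow and raise an alert when a metric breaches a threshold; the experiment name, metric, threshold, and alert hook below are assumptions.

```python
# Periodic model-health check against MLflow run metrics; values are placeholders.
import mlflow

ACCURACY_THRESHOLD = 0.80   # assumed alerting threshold

def send_alert(message: str) -> None:
    """Placeholder for an integration with PagerDuty/Slack/email."""
    print("ALERT:", message)

def check_latest_model_health(experiment_name: str = "churn-model") -> None:
    runs = mlflow.search_runs(experiment_names=[experiment_name],
                              order_by=["start_time DESC"],
                              max_results=1)
    if runs.empty:
        return
    accuracy = runs.iloc[0]["metrics.accuracy"]
    if accuracy < ACCURACY_THRESHOLD:
        send_alert(f"Model accuracy dropped to {accuracy:.2f}; consider retraining")
```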
Track Machine Learning Costs and ROI
-
Problem
Data scientists spend a lot of time and effort fine-tuning their models to fit the training data and ensuring that metrics such as accuracy, precision, and recall are good on validation sets. Yet even after considerable time and effort on modelling, when the models go to production, the data can change, leading to low model performance. ML projects typically fail when the business objective is not clearly defined and cost is not aligned with ROI.
-
Resolution
Building a data science model is an iterative process. The key to building a successful model is to understand the business problem and define the solution accordingly. The focus should be on faster time to market with a model that delivers good performance, rather than obsessing over a perfect model. A platform that enables data scientists to push their models to production and then iterate on and optimize them accelerates time to market. The ROI also needs to be tracked to ensure that the ML models deliver the desired benefits. For example, if you are providing product recommendations, you can track how many customers buy AI-recommended versus non-AI-recommended products to clearly demonstrate the model's business benefit. Solutions should focus on business priorities while taking cost and performance into consideration.
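A toy version of that uplift comparison is sketched below; the column names and sample data are placeholders for illustration only.

```python
# Compare purchase conversion for customers who saw AI recommendations vs. those
# who did not; the order data here is a small illustrative placeholder.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "saw_ai_recommendation": [True, True, True, False, False, False],
    "purchased": [True, True, False, True, False, False],
})

conversion = orders.groupby("saw_ai_recommendation")["purchased"].mean()
uplift = conversion[True] - conversion[False]
print(f"Conversion with AI recommendations:    {conversion[True]:.0%}")
print(f"Conversion without AI recommendations: {conversion[False]:.0%}")
print(f"Uplift attributable to the model:      {uplift:.0%}")
```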
Organizations should not start with a state-of-the-art MLOps platform. Based on the business use case, start with the most important stages of the AI/ML lifecycle and their tool stack, and iteratively build out the platform as AI/ML maturity grows in the organization. It is also important to keep an eye on cost, ROI, and the future roadmap while building the platform.