Deploying machine learning models is hard! If you don’t believe me, ask any ML engineer or data team that has been asked to put their models into production. To further back up this claim, Algorithmia’s “2021 State of Enterprise ML” reports that the time required for organizations to deploy a machine learning model is increasing, with 64% of all organizations taking a month or longer. The same report states that 38% of organizations spend more than 50% of their data scientists’ time on deploying machine learning models to production – and it only gets worse with scale.
With MLOps still a nascent field, it’s hard to find established best practices and model deployment examples for operationalizing machine learning solutions, because solutions can vary depending on:
1. Type of business use case
2. Technologies used
3. The talent involved
4. Organizational scale and structure
5. The resources available
Whatever your model deployment pipeline looks like, in this article you will learn about the model deployment challenges faced by a number of ML engineers and their teams, and the workarounds they applied to get ahead of these challenges. The purpose of this article is to give you a perspective on such challenges across diverse industries, organizational scales, and use cases, and hopefully to serve as a good starting point if you are facing similar problems in your own deployment scenarios.
NB: These challenges were reviewed and approved by the engineers interviewed before publishing. If you have any concerns you want addressed, feel free to reach out to me on LinkedIn.
Without further ado, here are 6 difficult things about model deployment as told by ML engineers:
Challenge 1: Choosing the right production requirements for machine learning solutions
Team size: No dedicated ML team
Industry: Media and entertainment
The Netflix content recommendation problem is a well-known use case for machine learning. The business question here is: how can users be served personalized, accurate, and on-demand content recommendations? And how can they, in turn, get a quality streaming experience for the recommended content?
Thanks to an ex-software engineer (who prefers to remain anonymous) from Netflix for granting me an interview and reviewing this piece before it was published.
Netflix content recommendation problem
Deploying a recommendation service turned out to be a hard challenge for the engineering team at Netflix. The content recommendation service posed some interesting challenges of which providing highly available and personalized recommendations for users and downstream services was the major one. As a former Netflix engineer pointed out:
“The business objectives of the streams and recommendations are that every single time any individual logs on to Netflix, we need to be able to present the recommendations. So the availability of the server that is generating the recommendations has to be really high.“
Ex-Software Engineer at Netflix
Providing on-demand recommendations also directly influences the availability of content for users when they want to watch them:
“Let’s say I recommend you House of Cards, as a show that you need to watch, and if you end up clicking on it and playing that show, then we also need to guarantee that we are able to stream to you in a very reliable manner. And as a result of that, we cannot stream all of this content from our data centers to your device because if we do this, the amount of bandwidth that Netflix will require to operate would crush the internet infrastructure in many countries.”
Ex-Software Engineer at Netflix
When you stream your recommended shows, for example, to ensure a quality streaming experience, Netflix has to select the recommended titles from thousands of popular content proactively cached in their global network of thousands of Open Connect Appliances (OCAs). This helps ensure the recommended titles are also highly available for viewers to stream—because what’s the use of providing on-demand recommendations if they cannot be streamed seamlessly!
The recommendation service will need to readily predict with high accuracy what their users will watch and at what time of the day they will watch it, so they can make use of the non-peak bandwidth to download most of the content updates to their OCAs during these configurable time windows. You can learn more about Netflix’s Open Connect technology in this company blog post.
So, the challenge was to select the right production requirement before deploying their recommendation models that ensured:
- The recommendation service is highly available,
- Users are served fresh, personalized recommendations,
- Recommended titles are ready to be streamed to a user’s device from the OCA.
Selecting an optimal production requirement for both the business goal and engineering target
The team had to choose a production requirement that is optimal for both the engineering and business problem. Because recommendations do not have to change minute-over-minute or hour-over-hour for each user, model scoring could happen offline, with results served once a user logs into their device:
“When it comes to generating recommendations, what Netflix does is that they train their recommendation models offline and they will deploy that to generate a set of recommendations for every single consumer offline. And then they will store these generated recommendations in a database.”
Ex-Software Engineer at Netflix
This solves the engineering problem because:
- The large-scale recommendations are scored and pre-computed offline for each user.
- They also do not depend on highly available servers running the recommendation services at scale for each user – which would have been quite expensive – but depend on results stored in a database.
This allowed Netflix to scale recommendations to a global user base in a much more efficient manner.
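The precompute-and-store pattern described above can be sketched in a few lines. This is an illustrative sketch, not Netflix’s actual system: a hypothetical batch job scores recommendations for every user offline and writes them to a key-value store, so the login path only does a cheap lookup.

```python
from datetime import datetime, timezone

# Stand-in for the real recommendation model: here it just returns a
# deterministic per-user ranking. In production this would be a trained
# ranking model scored in a nightly batch job.
def score_recommendations(user_id, catalog):
    return sorted(catalog, key=lambda title: hash((user_id, title)))[:3]

# Stand-in for a low-latency key-value store.
recommendation_store = {}

def batch_precompute(user_ids, catalog):
    """Offline job: score every user and persist the results."""
    for user_id in user_ids:
        recommendation_store[user_id] = {
            "titles": score_recommendations(user_id, catalog),
            "computed_at": datetime.now(timezone.utc).isoformat(),
        }

def on_login(user_id):
    """Online path: a cheap lookup instead of a live model call."""
    return recommendation_store[user_id]["titles"]

catalog = ["House of Cards", "The Crown", "Stranger Things", "Narcos"]
batch_precompute(["user-1", "user-2"], catalog)
print(on_login("user-1"))  # served instantly from the store
```

Recommendations served this way can be a few hours stale, which, as noted above, is an acceptable trade-off for keeping the online path highly available.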
For the business problem, when a user logs into their device, the recommended titles are available to be displayed to them. Since the titles may have also been cached in the Open Connect CDN for the user, the recommended titles are ready to be streamed once a user hits “play”. One thing to note here is that if recommendations are slightly stale by a few hours, the user experience would likely not be impacted compared to when recommendations are slow to load or stale by days, weeks, or months.
In terms of high availability, online scoring or learning at Netflix’s scale will inevitably cause latency issues with servers. This would most likely stress infrastructure and operations, and in turn would affect the user experience, impacting the business. Choosing a production requirement that’s both optimal from an engineering and business perspective helped the team ensure this challenge was solved.
Challenge 2: Simplifying model deployment and machine learning operations (MLOps)
Team size: Small team
Industry: Public Relations & Communications, Media Intelligence
The use case is an ad-predictor feature that filters paid advertisements, received directly from publishing houses, out of thousands of pieces of media content (e.g., magazines and newspapers). The content arrives as digital files streamed into a data processing pipeline that extracts the relevant details from each source and predicts whether it is an ad or not.
In building the first version of the ad-predictor, the team opted to deploy the model on a serverless platform. They deployed a standalone ad predictor endpoint on an external service that would score data from the data pipeline and perform serverless inference.
While serverless deployment has benefits such as auto-scaling instances, running on-demand, and providing interfaces that are easy to integrate with, it also brought about some of its well-known challenges to the fore:
- The data pipeline was decoupled from the prediction service, making operations harder.
- Frequent network calls and long boot-up times (the cold-start problem) caused high latency in returning prediction results.
- Both the data pipeline and the prediction service had to be auto-scaled to compensate for the high traffic from the pipeline.
“Predictions had a higher latency because of network calls and boot up times, causing timeouts and issues resulting from predictor unavailability due to instance interruptions. We also had to auto-scale both the data pipeline and the prediction service, which was non-trivial given the unpredictable load of events.”
“Our solution to these challenges centered on combining the benefits of two frameworks: Open Neural Network Exchange (ONNX) and Deep Java Library (DJL). With ONNX and DJL, we deployed a new multilingual ad predictor model directly in our pipeline. This replaced our first solution, the serverless ad predictor.”
To tackle the challenges of the first version, they used ONNX Runtime to quantize the model and deployed it with the Deep Java Library (DJL), which was compatible with their Scala-based data pipeline. Deploying the model directly in the pipeline coupled the model to the pipeline, so it could scale along with the pipeline as the volume of streamed data grew.
The solution also helped improve their system in the following ways:
- The model was no longer in a stand-alone, external prediction service; it was now coupled with the data pipeline. This reduced latency and allowed inference to happen in real time, without the need to spin up another instance or move data from the pipeline to another service.
- It helped simplify their test suite, leading to more test stability.
- It allowed the team to integrate other machine learning models with the pipeline, further improving the data processing pipeline.
- It simplified model management, helping the team to easily spot, track, and reproduce inference errors if and when they occur.
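The difference between the two deployment styles can be sketched as follows. This is a hypothetical illustration (the names and the trivial keyword rule are mine, not Hypefactors’ code, and their real pipeline is Scala with an ONNX model loaded via DJL): in the first version every document incurs a network round-trip; in the second, the predictor lives inside the pipeline process and scales with it.

```python
class RemoteAdPredictor:
    """First version: a serverless endpoint reached over the network."""
    def predict(self, text):
        # response = requests.post("https://.../predict", json={"text": text})
        # return response.json()["is_ad"]   # network latency + cold starts
        raise NotImplementedError

class InProcessAdPredictor:
    """Second version: a model loaded once, called in-process."""
    def __init__(self):
        # In the real system this would load a quantized ONNX model; here a
        # trivial keyword rule stands in for it.
        self.ad_keywords = {"sale", "discount", "buy now"}

    def predict(self, text):
        return any(kw in text.lower() for kw in self.ad_keywords)

def run_pipeline(documents, predictor):
    """Each document is scored as it streams through; no external calls."""
    return [(doc, predictor.predict(doc)) for doc in documents]

docs = ["Huge SALE this weekend, buy now!", "Council approves new bridge"]
results = run_pipeline(docs, InProcessAdPredictor())
```

Because the predictor is just another stage of the pipeline, scaling the pipeline automatically scales inference with it.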
To learn more about the solution to this particular use case from the Hypefactors team, you can check out this article they published on the AWS blog.
Challenge 3: Navigating organizational structure for machine learning operations (MLOps)
Team size: 4 Data Scientists and 3 Data Analysts
Industry: FinTech – Market Intelligence
Thanks to Laszlo Sragner for granting me an interview and reviewing this excerpt before it was published.
A system that processed news from emerging markets to provide intelligence to traders, asset managers, and hedge fund managers.
“The biggest challenge I see is that the production environment usually belongs to software engineers or DevOps engineers. There needs to be some kind of communication between machine learning engineers and software engineers on how their ML model goes to production under the watchful eyes of the DevOps or software engineering team. There has to be an assurance that your code or model is going to run correctly, and you need to figure out what the best way to do that is.”
Laszlo Sragner, ex-Head of Data Science at Arkera
One of the common challenges faced by data scientists is that writing production code is quite different from writing code in the development environment. When they experiment and come up with a model, the hand-off process is tricky because deploying the model or pipeline code to the production environment poses different challenges.
If the engineering team and the ML team cannot agree on assurances that a model or pipeline code will not fail when deployed to production, failure modes that cause application-wide errors become likely. These failure modes fall into two categories:
- System failure: one that breaks down the production system through errors such as slow loading or scoring times, exceptions, and other non-statistical errors.
- Statistical failure: a “silent” failure where the model consistently outputs wrong predictions.
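A sketch of how a serving path might guard against both failure modes (the thresholds, the drift check, and all names here are illustrative assumptions, not Arkera’s actual setup): system failures surface as exceptions or blown latency budgets, while statistical failures are caught by comparing the output distribution against a known healthy baseline.

```python
import time

LATENCY_BUDGET_S = 0.5          # system failure: scoring too slow
EXPECTED_POSITIVE_RATE = 0.3    # statistical failure: healthy baseline rate
DRIFT_TOLERANCE = 0.2

def score_with_guards(model, batch):
    # --- System-failure guard: exceptions and slow scoring are caught here.
    start = time.perf_counter()
    try:
        preds = model(batch)
    except Exception as exc:
        return {"status": "system_failure", "error": str(exc)}
    elapsed = time.perf_counter() - start
    if elapsed > LATENCY_BUDGET_S:
        return {"status": "system_failure", "error": f"slow scoring: {elapsed:.2f}s"}

    # --- Statistical-failure guard: a silent model that predicts the same
    # class for everything shifts the output rate far from the baseline.
    positive_rate = sum(preds) / len(preds)
    if abs(positive_rate - EXPECTED_POSITIVE_RATE) > DRIFT_TOLERANCE:
        return {"status": "statistical_failure", "positive_rate": positive_rate}
    return {"status": "ok", "predictions": preds}

healthy = lambda batch: [1 if i % 3 == 0 else 0 for i in range(len(batch))]
broken = lambda batch: [1] * len(batch)  # silently predicts "1" for everything

print(score_with_guards(healthy, list(range(30)))["status"])  # ok
print(score_with_guards(broken, list(range(30)))["status"])   # statistical_failure
```

The system-failure guard belongs to the engineering team’s domain, and the statistical guard to the ML team’s, which mirrors the bounded-context split described below.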
Either or both of these failure modes need to be addressed by both teams, but before they can be addressed, the teams need to know what they are responsible for.
To tackle the challenge of trust between the ML and software engineering teams, there needed to be a way everyone could make sure the models shipped can work as expected. As of that time, the only way both teams could come to an agreement that the model would work as expected before deployment was to test the model.
“How did we solve this (challenge)? The use case was about 3 years ago, pretty much before Seldon, or any kind of deployment tool so we needed to do whatever we could. What we did was to store the model assets in protobufs and ship them to the engineering team where they could run tests on the model and deploy it into production.”
Laszlo Sragner, ex-Head of Data Science at Arkera
The software engineering team had to test the model to make sure it outputs results as required and is compatible with other services in production. They would send requests to the model and if the service failed, they would provide a report to the data team on what types of inputs they passed to the model.
The technologies they used at the time were TensorFlow, TensorFlow Serving, and a Flask-based microservice directing the TensorFlow Serving instances. Laszlo admitted that if he were to solve this deployment challenge again, he would use FastAPI and directly load the models into a Docker container, or just use a vendor-created product.
Creating bounded contexts
Another approach Laszlo’s team took was to create a “bounded context”, forming domain boundaries for the ML and software engineering teams. This allowed the machine learning team to know the errors they were responsible for and own them—in this case, everything that happened within the model, i.e. the statistical errors. The software engineering team was responsible for domains outside the model.
This helped the teams know who was in charge of what at any given point in time:
- If an error occurred in the production system and the engineering team traced it back to the model, they would hand the error over to the ML team.
- If the error needed to be fixed quickly, the engineering team would fall back to an old model (as an emergency protocol) to give the machine learning team time to fix the model errors, as they cannot troubleshoot in the production environment.
This use case was also before the explosion of model registries, so models (serialized as protobuf files) were stored in an S3 bucket and listed as directories. When an update was made to a model, it was done through a pull request.
In the case of an emergency protocol, the software engineer in charge of maintaining the infrastructure outside of the model would roll back to the previous pull request for the model, while the ML team troubleshot errors with the recent pull request.
Updating the prediction service
If the ML team wanted to deploy a new model and it didn’t require any change to how models were deployed, they would retrain the model, create new model assets, upload them to the S3 bucket as a separate model, and open a pull request with the model directory, so the engineering team would know an updated model was available to deploy.
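The versioned-directory scheme and the rollback protocol described above can be sketched like this, with local folders standing in for the S3 bucket (the paths and dated names are illustrative): each upload becomes a new directory, and rolling back just means pointing serving at the previous one.

```python
import os
import tempfile

# Stand-in for the S3 bucket: one directory per model version.
bucket = tempfile.mkdtemp()
for version in ["model-2021-01-10", "model-2021-02-03", "model-2021-03-15"]:
    os.makedirs(os.path.join(bucket, version))

def list_model_versions(bucket_path):
    """Model directories sorted oldest-to-newest (dated names sort lexically)."""
    return sorted(d for d in os.listdir(bucket_path)
                  if os.path.isdir(os.path.join(bucket_path, d)))

def current_model(bucket_path):
    """The model serving traffic: the most recent version."""
    return list_model_versions(bucket_path)[-1]

def rollback_model(bucket_path):
    """Emergency protocol: fall back to the previous model directory."""
    versions = list_model_versions(bucket_path)
    if len(versions) < 2:
        raise RuntimeError("no previous model to roll back to")
    return versions[-2]

print(current_model(bucket))   # model-2021-03-15
print(rollback_model(bucket))  # model-2021-02-03
```

Keeping every version around is what makes the emergency rollback cheap: it is a pointer change, not a redeployment.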
Challenge 4: Correlation of model development (offline) and deployment (online inference) metrics
Team size: Unknown
Industry: Business-oriented social network
Thanks to Skylar Payne for granting me an interview and reviewing this excerpt before it was published.
Recommended Matches is a feature in LinkedIn’s Jobs product that provides users with candidate recommendations for their open job postings, recommendations that get more targeted over time based on user feedback. The goal of this feature is to save users from wading through hundreds of applications and help them find the right talent faster.
Correlation of offline and online metrics for the same model
One of the challenges Skylar’s team encountered while they were deploying the candidate recommendation service was the correlation between online and offline metrics. With recommendation problems, it is usually challenging to link the offline results of the model with a proper metric online:
“One of the really big challenges with deploying models is having a correlation between your offline online metrics. Already, search and recommendation is challenging for having a correlation between online and offline metrics because you have a hard counterfactual problem to solve or estimate.”
For a large-scale recommendation service like this, one disadvantage of using models learned offline—while using activity features for online inference—is that it is difficult to take the recruiter’s feedback into account during the current search session, while they are reviewing the recommended candidates and providing feedback. This makes it hard to track model performance with the right labels online, i.e., the team could not know for sure whether a candidate recommended to a recruiter was a viable candidate or not.
Technically, you could classify this as a training-serving skew challenge, but the key point is that the team had parts of the recommendation engine’s ranking and retrieval stack that they could not reproduce effectively offline, so training robust models for deployment posed a model evaluation challenge.
Coverage and diversity of model recommendations
Another problem the team faced was with the coverage and diversity of recommendations, which made it difficult to measure the results of the deployed model. There was a lot of data on potential candidates who were never shown to recruiters, so the team could not tell whether the model was biased in its selection or whether the outcome simply reflected the recruiters’ requirements. Since these candidates were never scored, it was quite difficult to track their metrics and to understand whether the deployed model was robust enough.
“Parts of the challenge were biases and how things were presented in the product, such that when you make small tweaks to how the retrieval works, it’s very likely that the new set of documents that I would get from retrieval after reordering and ranking them will have no labels on for that query.
It’s partially a sparse label problem. That makes it challenging if you don’t think ahead of time, about how you’re going to solve this problem. In your model evaluation analysis, you can put yourself into a bad situation where you can’t really perform a robust analysis of your model.”
“It really boiled down to just being a lot more robust about how we were doing our evaluation. We used a lot of different tools…”
The team tried to solve the challenges with a couple of techniques:
- Using a counterfactual evaluation metric.
- Avoiding changes to the retrieval layer of the recommendation engine stack.
Using counterfactual evaluation techniques
A technique the team used to combat the model selection bias was the Inverse-Propensity-Scoring (IPS) technique, with the aim of evaluating the candidate ranking policies offline based on the logs collected from online recruiter interaction with the product. As Skylar explained:
“One technique that we often looked at and reached for was the inverse propensity scoring technique. Basically, you’re able to undo some of the bias in your samples, with inverse propensity scoring. That’s something that helped.”
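The idea behind inverse propensity scoring can be shown with a toy example (the numbers and variable names here are illustrative, not LinkedIn’s data): each logged interaction is reweighted by the inverse of the probability that the logging policy showed that candidate, which undoes the presentation bias in the sample when estimating a new policy’s value offline.

```python
# Each log entry: (propensity with which the OLD policy showed the candidate,
#                  reward observed, probability the NEW policy would show it).
logged = [
    (0.8, 1.0, 0.4),  # over-exposed candidate: down-weighted
    (0.1, 1.0, 0.3),  # rarely-shown candidate: up-weighted
    (0.5, 0.0, 0.5),
    (0.2, 1.0, 0.2),
]

def ips_estimate(logs):
    """IPS value estimate: mean of reward * new_prob / propensity."""
    return sum(new_p / prop * reward for prop, reward, new_p in logs) / len(logs)

value = ips_estimate(logged)  # offline estimate of the new policy's value
```

Note that small propensities blow up the weights, which is why IPS estimators are often clipped or normalized in practice.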
Avoid making changes to the retrieval layer
According to Skylar, the makeshift solution at the time was to avoid making any change to the retrieval layer in the recommendation stack that could have affected how candidates were recommended to recruiters, since such changes made it impossible to track model results online. As Skylar points out below, a better solution would have been to build more robust simulation or analysis tools to measure changes to the retrieval layer, but at the time, the resources to build such tools were limited.
“What ended up happening was that we just avoided making changes to the retrieval layer as much as possible, because if we had changed that, it was very uncertain if it would have been translated online.
I think the real solution there though would have been to build much more sophisticated tools like simulation or analysis tools to measure changes in that retrieval phase.”
Challenge 5: Standardizing the tooling used to develop and deploy models
Team size: 15 people on the team
Industry: Retail and consumer goods
Thanks to Emmanuel Raj for granting me an interview and reviewing this excerpt before it was published.
This use case was for a project developed for a retail client, helping the client to resolve tickets in an automated way using machine learning. When people raise tickets or they are generated by maintenance problems, machine learning is used to classify the tickets into different categories, helping in the faster resolution of the tickets.
Lack of standard development tooling between data scientists and ML engineers
One of the main challenges most data teams face when they have to collaborate is the diversity of tooling used by the people on the team. If there is no standard tooling, and everyone develops with the tools they know best, unifying those efforts will always be a challenge, especially when the solution has to be deployed. This was one of the challenges Emmanuel’s team faced while working on this use case. As he explained:
“Some of the data scientists were developing the models using sklearn, some were developing using TensorFlow, and different frameworks. There wasn’t one standard framework that the team adopted.”
Emmanuel Raj, Senior Machine Learning Engineer
As a result of these differences in tooling, and because the tools were not interoperable, it was difficult for the team to deploy their ML models.
Model infrastructure bottleneck
Another issue the team faced during model deployment was sorting out runtime dependencies of the model and memory consumption in production:
- In some cases, after model containerization and deployment, some packages become deprecated over time.
- At other times, when the model was running in production, the infrastructure was unstable: the container clusters would often run out of memory, forcing the team to restart them at intervals.
Using an open format for models
Because getting everyone to learn a common tool would have been far more difficult, the team needed a solution that could:
- Make models developed using different frameworks and libraries interoperable,
- Consolidate the efforts of everyone on the team into one application that can be deployed to solve the business problem.
The team opted for the popular open-source project Open Neural Network Exchange (ONNX), an open standard for machine learning models that lets teams share models across different ML frameworks and tools, facilitating interoperability between them. This way, the team could develop models using different tools while packaging them all in a single format, which made deploying those models less challenging. As Emmanuel acknowledged:
“Thankfully, ONNX came up, Open Neural Network Exchange, and that helped us solve that issue. So we would serialize it in a particular format, and once we have a similar format for serialized files, we can containerize the model and deploy it.”
Emmanuel Raj, Senior Machine Learning Engineer
Challenge 6: Dealing with model size and scale before and after deployment
Team size: 1 data scientist
Industry: Retail and consumer goods
Thanks to Emeka Boris for granting me an interview and reviewing this excerpt before it was published.
The transaction metadata product at MonoHQ uses machine learning to classify transaction statements that are helpful for a variety of corporate customer applications such as credit scoring, income statements, and asset planning/management. The transactions for thousands of customers are classified into different categories based on the narration and other metadata for the transaction.
Natural Language Processing (NLP) models are known for their size—especially transformer-based ones. The challenge for Emeka was to ensure his model met the size requirements for deployment to the company’s infrastructure. The models are loaded onto servers with limited memory, so they needed to fit under a certain size threshold to qualify for deployment.
Another issue Emeka encountered while he was trying to deploy his model was how the model would scale to score a lot of requests that it received when it integrated with upstream services in the system. As he mentioned:
“Because the model was in our microservices architecture integrating with other services, if we had 20,000 transactions, each of these transactions is processed discretely. When up to 4 customers queried the transaction metadata API, we observed significant latency issues. This was because the model processed transactions consecutively, causing a slow down in response to downstream services.
In this case, the model would be scoring up to 5,000 transactions for each customer and this was happening consecutively—not simultaneously.”
Emeka Boris, Senior Data Scientist at MonoHQ.
Accessing models through endpoints
Optimizing the size of NLP models is often a game of trade-offs between model robustness or accuracy and the inference efficiency of a smaller model. Emeka approached this problem differently. Rather than loading the model from S3 onto a server each time a request was made, he decided it was best to store the model on its own cluster and make it accessible through an API endpoint so other services could interact with it.
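The load-once pattern behind that decision can be sketched as follows (all names and the trivial classifier are hypothetical, not MonoHQ’s code): the expensive model load happens a single time when the service starts, and every request after that hits the cached model, ideally in batches rather than one transaction at a time.

```python
import time

_MODEL = None  # module-level cache, populated on first use

def load_model():
    """Expensive one-time load (stands in for downloading and
    deserializing a large NLP model from S3)."""
    time.sleep(0.1)  # simulate the slow load
    return lambda narration: ("transfer" if "transfer" in narration.lower()
                              else "other")

def get_model():
    """Return the cached model, loading it only on the first call."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL

def classify_endpoint(narrations):
    """API handler: classify a whole batch of transactions in one call,
    rather than invoking the model once per transaction."""
    model = get_model()
    return [model(n) for n in narrations]

print(classify_endpoint(["TRANSFER to savings", "POS purchase grocery"]))
```

Batching the transactions per request is what removes the consecutive, one-at-a-time scoring that caused the latency issues described above.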
Using Kubernetes clusters to scale model operations
Emeka, at the time of this writing, is considering adopting Kubernetes clusters to scale his models so they can score requests simultaneously and meet the required SLA for downstream services. He plans to use fully managed Kubernetes clusters so that he does not have to manage the infrastructure required to maintain them.
In this article, we learned that model deployment challenges faced by ML Engineers and data teams go beyond putting models into production. They also entail:
- Thinking about—and choosing—the right business and production requirements,
- Non-negligible infrastructure and operations concerns,
- Organizational structure; how teams are involved and structured for projects,
- Model testing,
- Security and compliance for models and services,
- And a whole slew of other concerns.
Hopefully, one or more of these cases are useful for you as you also look to address challenges in deploying ML models in your organization.