LLMs, generative AI loom large for MLOps practices

Aug 31, 2023

Articles / Analysis

The unique needs of artificial intelligence (AI) development spawned MLOps practices tailored to building and deploying machine learning models. Always in flux, those practices may be in for yet another shakeup as generative AI and large language models (LLMs) power new applications.

When past breakthroughs occurred in machine learning (ML) models, the news was confined to small communities of AI specialists. The 2012 ImageNet object recognition results and the Transformer neural architecture Google described in 2017 were minor ripples in the technology consciousness.

Not so with ChatGPT. It made a splash heard round the world when it was added to Bing and the Edge browser. C-suite execs had to take notice as generative AI, LLMs and foundation models seemed to point to significant innovations. Generative AI betokens new forms of chatbot interaction, content summarization and generation, software code generation and much more.

Consultancy Deloitte says generative AI is creating a wave of disruption. As much as 55% of those queried in a 2023 Deloitte/Forbes survey of 143 CEOs are evaluating or experimenting with generative AI.

Meanwhile, 79% agree that generative AI will increase efficiency, and 52% of the surveyed believe it will increase growth opportunities. Deloitte said 37% of respondents are already implementing generative AI to some degree.

The rush toward LLMs and the need for top-notch ML development tooling have accelerated acquisitions in the MLOps space. Some observers are beginning to distinguish an “LLM Ops” space as well.

Many see these types of purchases as talent acquisition plays, highlighting the skills issues that shadow the prospects of generative AI.

Teams now work to tame the new tech both in training and inference modes. The LLMs at the heart of generative AI’s innovations require large-scale hardware and software architectures that support distributed computing. Memory and compute resources must be tuned to reduce latency in human-machine interaction. All this quickly translates into costs that stymie some hopeful projects.

Moreover, LLMs feed on prodigious training data, which must be curated and governed. LLM output can be flaky; developers often fall back on iterative prompt engineering, repeatedly querying the model and weighing the variable responses that come back. Still, independent developers and vendors of all sizes see paths to solving the problems.
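To make that pattern concrete, here is a minimal Python sketch of the kind of retry loop teams wrap around a model call; the call_llm function is a hypothetical stand-in for whatever provider client a team actually uses.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a hosted LLM API."""
    raise NotImplementedError("wire this up to your provider's client library")

def query_with_retries(prompt: str, max_attempts: int = 3) -> dict:
    """Re-query the model until its response parses as JSON, up to a limit."""
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)  # success: structured, usable output
        except json.JSONDecodeError:
            # Tighten the instructions and try again -- crude, but common practice.
            prompt += "\nRespond with valid JSON only."
    raise ValueError(f"no parseable response after {max_attempts} attempts")
```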

“Large Language Models are amazing at general purpose reasoning but they are extremely brittle,” said Shreya Rajpal, who spoke at the recent Databricks Data and AI Summit 2023. “Getting correct outputs from large language models is hard.”

“When you scale it out, there are no guarantees that it is going to work as you expect it to,” she told Data and AI Summit attendees.

Rajpal is a former senior Apple ML engineer, and now founder of start-up Guardrails AI, which creates software to better guarantee quality of LLM outputs.

As LLMs are applied to enterprise uses, where correctness is critical, there is an acute need to validate model output, according to Rajpal. Validation revolves around language structures and types, checks for profanity, limits on response length, and much more. At Guardrails AI, Rajpal is building verification tooling to make such checks systematic.
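In practice, those checks can be layered around every model call. The snippet below is a hand-rolled illustration of that category of validation, not Guardrails AI's actual API; the expected "answer" field and the denylist are assumptions made for the example.

```python
import json

BANNED_TERMS = {"confidential", "guarantee"}  # placeholder denylist for the example

def validate_output(raw: str, max_chars: int = 2000) -> dict:
    """Apply output checks of the kind Rajpal describes: length, wording, structure and types."""
    if len(raw) > max_chars:
        raise ValueError("response exceeds the length limit")
    if any(term in raw.lower() for term in BANNED_TERMS):
        raise ValueError("response contains disallowed language")
    data = json.loads(raw)                       # must be well-formed JSON
    if not isinstance(data.get("answer"), str):  # expected field and type (assumed schema)
        raise ValueError("missing or mistyped 'answer' field")
    return data
```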

Container technology continues to drive automated ML development, promoting vital collaboration between data scientists and operations teams. The unique challenges of LLMs will require improved container management, according to Josh Poduska, chief field data scientist at Domino Data Lab, which has honed analytics skills for assorted Fortune 100 clients since its founding in 2013.

“Data science today is very much based on containers. At the enterprise level they play a huge role in building the foundation of a data science platform. LLMs require a different flavor of container than traditional machine learning and that puts new requirements on container management frameworks that support better collaboration, for better reproducibility,” he indicated.

In the latest rev of its Domino Enterprise MLOps Platform, Poduska said, Domino is including pretrained foundation models and project templates to help auto-scale users’ generative AI projects. The software supports the Apache Spark, Dask and Ray distributed computing frameworks used with LLMs, and adds a new Model Sentry that allows for control of model validation, review and approval processes.

Easing LLM development is also an objective at Nvidia, the producer of the GPUs that drive much of today’s AI work and a company that would like to see broad adoption of the technology. Nvidia has enhanced its containerized NeMo framework – already familiar from earlier waves of AI image and speech processing innovations – for LLM performance.

Kari Briski, VP of product management for AI and HPC software at Nvidia, describes NeMo as an end-to-end framework covering tasks from data curation to distributed training to AI inference. NeMo is now enabling scaled-up distributed processing for LLMs. As part of its efforts, Nvidia in April released NeMo Guardrails to help build AI chatbots that are “accurate, appropriate, on topic and secure.”
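For orientation, wrapping a chatbot with NeMo Guardrails follows the pattern in the project's published examples, roughly as sketched below; the configuration directory layout and the sample question are assumptions, and exact signatures may vary by version.

```python
# Sketch of wrapping a chatbot with NeMo Guardrails, per the project's examples.
# The config directory (Colang flows plus a config.yml) is an assumed local layout.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./rails_config")  # assumed config directory
rails = LLMRails(config)

reply = rails.generate(messages=[
    {"role": "user", "content": "Can you summarize our refund policy?"}
])
print(reply)
```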

Briski positions the new software as a natural step in evolution, but with a few twists that might come under the heading of “LLM Ops.”

“Code has evolved over the years, compilers and test suites and test cases too. ML Ops has just gone through the evolution of what we need in our software,” she said.

Where are the differences? The tone of answers to users’ questions is one.

“Evaluations tend to be subjective. Every company [working on] their personal data is going to be subjective,” Briski said. That carries through to the very “tone of voice” in responses to users’ queries. How responses are rated, for example, depends on how well they adhere to the company’s definition of its brand voice.

Evaluation of LLM output is among the more difficult problems teams need to solve these days, said Waleed Kadous, chief scientist at Anyscale, and a former engineering lead at Uber and Google.

“Evaluation is one of the trickiest and least solved problems with LLMs, compared to other ML operations,” he said.

If you’re trying to tell if something’s a cat or a dog, Kadous said, it’s very easy to decide whether you’re doing a good job. But when you’re giving people a block of text that may or may not answer their question, or that may be offensive, measuring success is much harder.
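The contrast can be sketched in a few lines of Python: classification yields one unambiguous score, while free-text evaluation forces teams to invent heuristics. The required and banned term lists below are hypothetical stand-ins for whatever rubric a team adopts.

```python
def classification_accuracy(predictions, labels):
    """Cat-vs-dog style scoring: one unambiguous number."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def rough_text_score(response: str, required_terms: list[str], banned_terms: list[str]) -> float:
    """A crude stand-in for free-text evaluation: reward coverage of key terms and
    penalize disallowed ones. Real teams layer human review or model-graded rubrics
    on top, because no single heuristic captures 'did this answer the question?'"""
    text = response.lower()
    coverage = sum(t in text for t in required_terms) / max(len(required_terms), 1)
    penalty = 0.5 * sum(t in text for t in banned_terms)
    return max(coverage - penalty, 0.0)
```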

Kadous said advances in Retrieval Augmented Generation (RAG) show some promise in dealing with the issue. The technique pairs LLMs with retrieval over industry-specific data, so answers can be grounded in relevant source material.
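A minimal sketch of the idea follows: retrieve the best-matching passages, then stuff them into the prompt. The word-overlap retriever and the call_llm stand-in are deliberately toy versions; production systems use vector embeddings and a dedicated index.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a hosted LLM call, as in the earlier sketches."""
    raise NotImplementedError

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Toy retriever that ranks documents by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def answer_with_rag(query: str, documents: list[str]) -> str:
    """Put the best-matching passages into the prompt so the model answers from them."""
    context = "\n".join(retrieve(query, documents))
    prompt = f"Answer using only the context below.\nContext:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```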

Meanwhile, he doesn’t dismiss the challenges of cost-efficient computation for generative AI, which is among the issues Anyscale is trying to address. The company offers the Anyscale distributed programming platform, a managed, autoscaling version of the open-source Ray framework. Ray is intrinsic to its mission: Anyscale’s founders began creating the framework while at the University of California, Berkeley. The Ray API recently gained streaming enhancements to support faster response times for LLM workloads.
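The basic Ray pattern of fanning work out across a cluster looks like the sketch below; the run_inference task is a placeholder for real model inference, not Anyscale's actual LLM serving stack.

```python
import ray

ray.init()  # connect to a local machine or a managed Anyscale cluster

@ray.remote
def run_inference(prompt: str) -> str:
    """Placeholder per-prompt task; declaring it a Ray task lets the cluster
    distribute the calls across workers. A real task would load or call a model."""
    return prompt.upper()  # stand-in for actual model output

prompts = ["summarize this ticket", "draft a reply", "classify the intent"]
futures = [run_inference.remote(p) for p in prompts]
results = ray.get(futures)  # block until all workers return
```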

In May, Anyscale launched the Aviary open-source project to help developers assess and deploy LLMs. The cloud-based service lets developers submit test prompts to a variety of open-source LLMs and try out different optimization techniques.

The newness of LLMs shouldn’t obscure much that is basically familiar to anyone who has worked in machine learning, according to Andy Thurai, VP and principal analyst, Constellation Research.

“LLM Ops is the equivalent of MLOps, but for LLMs,” he said in an email interview. “It is essentially about how to train the LLM models and get them into production in the most efficient way.”

Issues already familiar from previous MLOps work are in play, he added. These include model monitoring, model drift and model retraining. The timeless imperative to feed models good data also applies, he noted.

“If someone wants to build an LLM, the normal ML best practices would apply. Data acquisition, data curation/preparation, data cleansing, data management, feature engineering, data annotation, data privacy, data governance, and data lineage tracking will all come to play from the data engineering side of things,” Thurai said. “Bias removal and mitigation also play a role.”

Much about LLMs is familiar, but there is also plenty that is new about them. The degree of success development teams achieve with new tooling, frameworks and libraries will ultimately decide how soon this wave of AI innovation goes mainstream.