Description
This session explores how MLOps, LLMOps, and FMOps are part of GenAI development, ensuring scalable, reliable, and compliant AI deployment. Learn best practices for automated evaluation, monitoring, and governance, optimizing machine learning, large language models, and foundation models. Discover frameworks that enhance observability, bias mitigation, and performance while integrating with DevOps.
Shoby Abdi – Senior Growth Partner, Altimetrik, Inc. Shoby has been in the Salesforce ecosystem for over 18 years, in a number of different roles.
Transcript
Alright. Cool. I'm gonna get started.
Alright. Man, those lights are bright. Alright. Hi, everybody. Is everybody here at the correct space?
I guess we'll find out eventually. Alright. So, really quickly, my name is Shoby Abdi. Man, that's really blurry.
So I'm a senior growth partner at a company called Altimetrik.
Before I did that, I was on the Salesforce Architects and — man, those lights are really bright, they're killing me — the Salesforce Architects and Well-Architected team at Salesforce. Did that for about three and a half, four years.
I was a principal architect evangelist. Now, basically, I do all things GenAI. Right? So I'm having a pretty good time doing it for the last three months.
So what do we mean when we talk about GenAI DevOps? Right? Who's heard about GenAI DevOps so far?
God, you guys are at a DevOps conference. That's freaking me out a little bit. Right? So, when we talk about GenAI DevOps, the way that I wanna articulate it is I'm gonna use the Jeff Bezos spectrum of maturity.
So, here's the one thing about any kind of session I ever do. There's always audience participation so you will have to raise your hand and I will make you say things and I may call you out for whatever you say. But, who here has heard of the Jeff Bezos spectrum of maturity?
Thank god. I made it up. It's not really anything that Jeff Bezos has ever created. Right?
But what it essentially is is when you look at Jeff Bezos as a human being over time and how he's matured as a human being, it's pretty fascinating. Right? So when we talk about Gen AI DevOps, right, there's the early days of Gen AI DevOps. Right?
Basically, it's Jeff in a room. I want to sell books online. They're in bookstores right now but people should be able to order them and ship them. That's a quaint idea.
That level of Jeff Bezos is what I equate to today with using GenAI with development. Right? And some of the tools that are out there today are GitHub Copilot. Anybody ever use GitHub Copilot at all or heard of it?
Alright. One hand. Or Agentforce for Developers. Who here is using Agentforce for Developers?
Okay. So, the one thing I will tell you is people often ask, well, how do I get into Agentforce? How do I learn Agentforce? How do I engage with Agentforce?
One of the simplest, dumbest, quickest ways to get in on it is Agentforce for Developers. It's universal. It's unique. For those organizations that are really taking GenAI seriously, one of the first use cases they're going after is GenAI DevOps, and Agentforce for Developers is a fantastic tool, including the flow capability that Adam Sims came out with. Right?
So, that's kind of what we're starting with, right? Then it's like, alright, you know, I've sold a lot of books online. I'm selling other things online now. I'm gonna buy a newspaper because why not?
I've created an ebook kind of thing called a Kindle. I'm trying to take over the world a little bit. Right? This stage of Jeff is what I call GenAI with DevOps.
Now, what's a company that's utilizing GenAI with DevOps?
Shouldn't be hard to guess.
They brought us this event. Gearset.
Right? Gearset is a tool that utilizes these capabilities. So this isn't just using GenAI for the purposes of coding. This is using GenAI for the purposes of all-encompassing elements of your DevOps process. And that's a lot of what we're seeing as well from a maturity perspective.
But then we get to, like, this this Jeff Bezos level of maturity. Like, you know, yeah, I got a divorce. She took half. I don't care.
I'm still the richest man in the world. I'm gonna build rockets anyway. Come back down to Earth, wear a stupid cowboy hat. That's real Jeff Bezos maturity.
And that's when we're talking about GenAI with models. Right? So it's not simply utilizing GenAI, you know, on your DevOps process, but now how to use your DevOps process on your models, right? And for a lot of that, that's where we use tools like Hugging Face, which we'll talk about a little bit more, and AI Foundry.
Who's utilized or heard of either Hugging Face or AI Foundry? Right? Google has its own capability for that too, right? And I'll show it a little bit.
I actually got a demo. I'd never do a session without a demo. So we'll demo this a little bit. Right?
And for the purpose of today's session, we're really gonna focus on this version of Jeff, if you will. Right? Where we're gonna really talk about BYOM and BYOLLM. Has anybody done bring your own model or bring your own LLM with Salesforce yet?
To a degree, kinda, here. Okay. Yep. So we're gonna go into it a little bit more.
We'll talk about it from an introductory perspective, and why you should do it. Right? And then when we talk about the models themselves, there's also a spectrum of maturity when it comes to Jeff. Right?
So right now, a lot of what we see is prompt engineering. Right? It's how do we manipulate, how do we drive a specific prompt?
Well, it's a lot about typing — I'm typing, I'm typing, hoping to God to get a good response. Typing, typing, typing, hoping to God to get a good response, or someone gives me a good response. Right?
And for tools like that, that's where we utilize some of the declarative capabilities of Agentforce — Prompt Builder and all that — where you're not really getting into the model itself but you're really driving against it. Alright? But then we can go to another level of maturity. Right?
And this is Jeff Bezos — I don't even know what he was thinking at this stage. Right? And to a degree, there's a certain amount of insanity in wanting to do some of this, which is fine-tuning your models.
Fine-tuning models is a very fascinating process where, like, quite literally the terminology for it seems to change almost every week. Right? And can you do fine-tuning of models with Salesforce?
To a degree, yes. Right? And this is where we're gonna lean on our friend Codey a little bit more than Astro, right? We're gonna go into a little bit of the tech side, a little bit of code, a little bit of the API.
Right? And that's where we'll focus a little bit more in this session.
Right? Now, when we talk about GenAI and the DevOps ecosystem — I've spent the last few months across the board. Right? So my company, Altimetrik, we do Agentforce and Salesforce. Right? But surprisingly, like, very little. Right? OpenAI has six global partners across the world.
We are one of them. OpenAI, Azure, Vertex, Gemini, Snowflake, Databricks, Mistral, Anthropic. These are all our partners. So we do a lot of it.
And whenever we talk to customers or clients in terms of Gen AI, usually one of the first things that comes up is how do we utilize it to accelerate our DevOps? It always comes up. It's universal. Whose organization is investigating utilizing Gen AI to accelerate their own DevOps internally?
Well, there's one. Okay. Thank god. There should be and you should be and you better be.
Right? But really for the purpose of today, when we talk about the whole ecosystem, for the most part, I'm gonna ignore it because this is a DevOps conference. Hopefully, they're covering it. Right?
We'll talk about model eval and model drift and maintenance.
Right? Now, when you saw the whole slew of, you know, the actual name of the session — just a bunch of letters next to each other — what do those letters actually mean? Right? So MLOps — MLOps is not a new concept in any respect at all.
For those who've done predictive, you know, machine learning — whether it's Bayesian methods, clustering, supervised or unsupervised, whatever it is — it's been around for decades. Right? And really, it's the idea of how do you deploy ML models at scale. Right?
But then within that, we had what was called foundation model ops, or FMOps. Right? Now FMOps lives within machine learning operations, and it's just how do we take that generative AI model, that specific AI model, across any kind of people, process, and technology. Right?
One specific model that's foundational to multiple processes.
And then within FMOps, we have LLMOps. Right? LLMOps is more of a subset of FMOps. Right?
And for the most part, it's text to text. Right? There are all kinds of multimodal capabilities that you can get into when it comes to LLMs. Everybody knows what I mean when I say multimodal?
Right — I throw up an image and I tell it, tell me what's in the background of this image, describe this image, and it tells me. Or I tell the LLM, hey,
create an image for me that does this, or create a video that does this, or create a PDF or a PowerPoint or something that does this. That's multimodal, right? But for the most part, for this session, we'll focus on text, right? Now, we're gonna get a little deeper into FMOps and LLMOps, right?
Now, when we really talk about evaluating — like, the series and the stages of FMOps and LLMOps — there's usually, before anything else: what model should I use? And then once I've chosen it, once I'm starting to actually utilize it within a production or sandbox or pilot instance, how do I know if it's effective? Right? Now, selection and evaluation — people always think of it as kind of complex, but it usually goes into two specific buckets.
There's the automated version of selection, right — and we'll talk about some of those concepts — where you can use datasets, models, metrics. So that's tools like AI Foundry, Google's Model Garden, Hugging Face, which I'll show. Like, they do a great job of actually showing you what specific benchmarks this model should be achieving based on your particular use case. Right?
So that's an automated process.
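To make that concrete, here's a minimal sketch of what an automated evaluation pass can look like — assuming a local CSV of prompt/expected pairs, the Hugging Face Inference API via huggingface_hub, and a crude exact-match metric. The file name, model ID, and token are placeholders; real benchmark suites like the ones on those leaderboards use much richer scoring.

```python
# Minimal automated-evaluation sketch: run each prompt in a small eval set
# through a hosted model and compute a crude match rate.
import csv
from huggingface_hub import InferenceClient

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"  # assumed model id
client = InferenceClient(model=MODEL_ID, token="hf_...")  # your HF token

hits, total = 0, 0
with open("eval_set.csv") as f:  # hypothetical file with columns: prompt,expected
    for row in csv.DictReader(f):
        resp = client.chat_completion(
            messages=[{"role": "user", "content": row["prompt"]}],
            max_tokens=128,
        )
        answer = resp.choices[0].message.content.strip()
        # Crude exact-match scoring; swap in whatever metric fits your use case.
        hits += int(row["expected"].lower() in answer.lower())
        total += 1

print(f"match rate: {hits / total:.2%} on {total} prompts")
```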
But more often than not, it's the human assessment that's gonna drive a lot of it. Right? That's where we kind of see, like, why you need that human in the loop. Right? And in the case of Salesforce's agent capabilities, right, it's a lot of direct feedback and indirect feedback in that loop process, right — where if it gives you a good response, give it a thumbs up. Who here is using Agentforce?
Who here, when it gives you a good response, gives it a thumbs up?
Give it a thumbs up. Tell it it's doing a good job, right — and not because it gives me a warm and fuzzy, or gives the AI a warm and fuzzy before it comes and kills all of us. But really, it's to tell it that that was effective, because that's the only way it learns, right? That's the learning element. Like, just doing prompt engineering alone is not gonna cut it; that's where the human assessment comes in. You have to tell it if it was effective, or if it was ineffective with the thumbs down. Right?
So, that's where you as a human being, as an individual can look at the response and say, okay, did this work for what I'm trying to do? Right? Now that's from a selection and an evaluation process.
Right? So you can engage and articulate and look at multiple kinds of models in that instance, you know. But now you've selected your model. It could be GPT-4.1.
It could be Llama 3. It could be, you know, DeepSeek. Why not? It could be whatever you want.
Right? Now, how do you assess that it's actually viable, that it's working? Right? And a lot of it comes down to capabilities around observability and compliance, right?
So there are tools out there — AI Foundry, Hugging Face, others, right — that will actually monitor what your foundational model is doing, right? And it'll tell you, is this getting filled with PII data? It happens, right? Now all of a sudden your model is full of PII data, or your model is drifting. What often happens is that people hear about drift and they think, oh, the model is drifting away.
But you don't necessarily hear about what happens when it's drifted too far. You have to trash the model. You have to trash the model and completely restart from scratch. That's always the suggested approach.
Like, you can't deviate from that. It just is what it is. Even Salesforce engineering, they do the same thing. Right?
The reason behind that is sometimes it's just gone beyond what it can do. Right? So you have that constant observability to ensure it doesn't get to that point. Right?
And then driving compliance. Right? Like profanity, hate speech, right? Does it have a lot of PII data, right?
Are you constantly versioning the actual model itself — doing model versioning — so that you can ensure that when you're iterating and doing different updates, it's not just simply the net new version of what OpenAI comes out with, or what Meta comes out with, or what Anthropic comes out with, but it's also versions of the model that you're creating that you feel could be a little bit more secure, or a little bit more experimental, for specific use cases?
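A minimal sketch of that versioning idea — nothing fancier than keeping your own manifest of model variants, separate from the vendor's releases, so you can roll back or A/B a specific configuration. The file name and fields are illustrative placeholders, not a prescribed tool.

```python
# Track your own model variants (base model + prompt revision + config)
# in a simple manifest, independent of vendor release numbers.
import datetime
import json

MANIFEST_PATH = "model_versions.json"

new_version = {
    "name": "devops-streaming",
    "version": "2.0",
    "base_model": "mistralai/Mistral-7B-Instruct-v0.3",  # vendor release you built on
    "system_prompt": "v7-restricted-pii",                 # your own prompt revision
    "max_tokens": 512,
    "registered_at": datetime.datetime.utcnow().isoformat(),
}

try:
    with open(MANIFEST_PATH) as f:
        versions = json.load(f)
except FileNotFoundError:
    versions = []

versions.append(new_version)
with open(MANIFEST_PATH, "w") as f:
    json.dump(versions, f, indent=2)

print(f"registered {new_version['name']} {new_version['version']}")
```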
So now we're gonna play a guessing game, right, utilizing our favorite tool, GenAI, I guess, right, in ops. Right? Now when you look at this image, right, what do you think this image is supposed to represent?
I gave it a specific prompt. You don't have to tell me the whole prompt, but what do you think I kind of told the image generator to do? Come on. You gotta speak up loudly. I can't hear anybody.
Not even a single guess.
I mean, what do you see up here?
Yeah. That's it. Right? Really, the concept for that — and that's actually a great way to describe it, honestly — is what's called LLM as a judge.
Now, LLM as a judge is a little bit of a net new design pattern where you wanna compare one model against the other, right? And in many cases, it's what's called champion-challenger assessments, where some of our customers will run one model consistently out there, but we'll do A/B testing for them — like, we'll sometimes have a challenger model kind of pop up every once in a while. Right? And then we have an LLM running as an agent.
Right? It's like, you know — what century are we in? Twenty-first? Twenty-second?
I don't know, whichever. But essentially in this century, we like to joke that agents having agents is the new babies having babies, right? Where now we have an agent judging the performance of a specific agent and its output, right? The LLM is the judge.
It's pretty straightforward. If anybody's ever looked at OpenAI's agent capabilities or their SDK, right — it's one of the code examples that they actually provide. It's a very simple, straightforward way to do it.
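Here's a minimal sketch of that champion/challenger, LLM-as-a-judge pattern — not OpenAI's SDK sample itself; the model IDs, the prompt, and the judging rubric are all placeholder assumptions you'd swap for whatever you actually run.

```python
# Champion/challenger comparison with a third model acting as the judge.
from huggingface_hub import InferenceClient

def ask(model_id: str, prompt: str) -> str:
    """Send a single-turn prompt to a hosted model and return its reply."""
    client = InferenceClient(model=model_id, token="hf_...")  # your HF token
    resp = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return resp.choices[0].message.content

prompt = "Summarize our returns policy for a customer in two sentences."

# Champion: the model you run today. Challenger: the A/B candidate.
champion = ask("mistralai/Mistral-7B-Instruct-v0.3", prompt)
challenger = ask("meta-llama/Llama-3.1-8B-Instruct", prompt)

# A third model grades both answers against a simple rubric.
judge_prompt = (
    "You are grading two answers to the same prompt.\n"
    f"Prompt: {prompt}\n\nAnswer A: {champion}\n\nAnswer B: {challenger}\n\n"
    "Score each from 1-10 for accuracy and clarity, then name the winner."
)
print(ask("mistralai/Mixtral-8x7B-Instruct-v0.1", judge_prompt))
```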
You can point it at a couple of models and actually judge their performance against each other. Now, what about this one? What do we see? What do we think is going on here?
Remember, human assessment, always invaluable.
Humans cleaning. I think that works for me.
You know? Well — model cleansing. Right? So, model cleansing. Now, often when we talk about the trust layer, right, we talk about making sure that the data that shouldn't go into the model doesn't go into the model.
Right? And sometimes it works, sometimes it doesn't. But more often than not, organizations will implement model cleansing, where what it does is — it's actually a specific agent that you create, trained on that model, to identify PII or any kind of terminology it shouldn't be utilizing, like recommending, you know, competitor products. Right?
If I'm on Salesforce's agent on their support site, it shouldn't recommend really solid SAP use cases to me. Right? So how do we cleanse a lot of those terms out of it, from a capability perspective?
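Conceptually, the cleansing agent boils down to a post-processing pass over the model's response before it reaches the user. Here's a minimal sketch of that idea — the regex patterns and blocked-term list are illustrative placeholders, not a production-grade PII detector, and a real implementation would likely lean on an LLM or a dedicated classifier rather than regexes.

```python
# Screen a model response for PII patterns and off-limits terms before returning it.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}
BLOCKED_TERMS = ["SAP", "competitor-product-x"]  # hypothetical list

def cleanse(response: str) -> tuple[str, list[str]]:
    """Redact PII and flag blocked terms; return (clean_text, findings)."""
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(response):
            findings.append(f"pii:{label}")
            response = pattern.sub(f"[REDACTED {label.upper()}]", response)
    for term in BLOCKED_TERMS:
        if term.lower() in response.lower():
            findings.append(f"blocked-term:{term}")
    return response, findings

clean, issues = cleanse("Email me at jane@example.com about the SAP rollout.")
print(clean, issues)
```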
So these are some of the images. Like, LLM as a judge and model cleansing are some of the ones that really kind of drive into FMOps and LLMOps for the most part.
But really these are like, okay, great. Right? But how does this apply to Salesforce and Agentforce itself? Right?
So I'm gonna do a little bit of a demo. But to really kinda demo it, I'm gonna show you a little bit of an architecture diagram first. So on the Dev Evangelist team, Moe created this great LLM Open Connector. Has anybody looked at the LLM Open Connector at all that Salesforce created?
Alright, so the LLM Open Connector is a great tool, and what I've essentially done in the demo that we'll see, right, is I'm pointing it at Hugging Face using a Mistral LLM. Now, I like Mistral because it's a very lightweight LLM. It's mostly text to text.
It doesn't do much, but it does enough. It's cheap. Sometimes you can even get it down to, like, super free if you're using it. So it's always pretty solid that way.
You know? And for the sake of playing around with the LLM Open Connector, Moe was using 0.1; I'm using 0.3. So in the middle, I'm utilizing a Heroku app, right, which is where the connector lives.
Anybody here using Heroku at all for anything? Alright. Yeah. So with Heroku, I'm using an Eco dyno.
I use that Eco dyno for almost everything. It's literally five bucks a month. Right? It's fantastic.
You know? And what that's doing is that's allowing Einstein Studio to point to that foundational model within Hugging Face and utilize it in the context of Prompt Builder.
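To make that middle piece less abstract, here's a rough sketch of what a connector app like that can look like — assuming an OpenAI-style /chat/completions contract (check Salesforce's LLM Open Connector samples for the real spec), Flask, and the Hugging Face Inference API. This is not Moe's actual code, just the shape of it.

```python
# Minimal connector sketch: accept an OpenAI-shaped chat request from
# Einstein Studio, log the prompt, forward it to Hugging Face, return the reply.
import os
from flask import Flask, jsonify, request
from huggingface_hub import InferenceClient

app = Flask(__name__)
HF_TOKEN = os.environ["HF_TOKEN"]  # set as a Heroku config var

@app.post("/chat/completions")
def chat_completions():
    body = request.get_json()
    model_id = body.get("model", "mistralai/Mistral-7B-Instruct-v0.3")
    messages = body["messages"]
    # The "trust but verify" part of the demo: log exactly what Prompt Builder sent.
    app.logger.info("Prompt sent to %s: %s", model_id, messages)

    client = InferenceClient(model=model_id, token=HF_TOKEN)
    resp = client.chat_completion(messages=messages,
                                  max_tokens=body.get("max_tokens", 512))
    # Hand back an OpenAI-shaped response so Einstein Studio can consume it.
    return jsonify({
        "id": "chatcmpl-demo",
        "object": "chat.completion",
        "model": model_id,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant",
                        "content": resp.choices[0].message.content},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 5000)))
```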
Okay. So let's actually see what that looks like. So, yeah. Okay. Great. Alright. So, Hugging Face.
So, one of the things I want to first talk about is Hugging Face. So, anybody visit Hugging Face or know what it's about? Right. So, when you're training a model, driving a model — people talk about models like DeepSeek, OpenAI, and all that.
There's the other critical element of judging the performance of any model: the dataset, right? So tools like Hugging Face, AI Foundry, Model Garden, and others, right — it's not just all these great models. Like, alright, these are great, right?
But when you look at the datasets themselves, these are open-source, managed datasets with thousands, millions, or even billions of rows. You can get one that's, like, a hundred thousand rows of perfect, immaculate data. Or you can get a hundred thousand rows of just god-awful racist data. You get all of it.
Right? It's great. Right? But what it allows you to do is to judge the performance of that model based on these datasets.
Right? So for our use case, I'm using Mistral 0.3. Right? Very simple model. Right? Much simpler than OpenAI. Straightforward.
And what I've essentially done is utilize this LLM Open Connector. Right? So there is Moe up there. Right?
And the LLM Open Connector doesn't just work with Hugging Face. Right? It's, you know, AWS with Bedrock, you know, MuleSoft, and even Vertex's capabilities, you know. So the way that you can see it in action, to a degree — it's like, well, why do this?
Why build your own model? Right? So here we've got Prompt Builder, good old-fashioned Prompt Builder. And what I'm gonna do — I've already got it kind of prepped, but I'm gonna refresh it because why not?
Be dangerous.
You know, and what I'm gonna do is I'm gonna show this thing. Everybody know what this thing is?
Yeah. It's terrifying. Right? You know, whenever you show VS Code at a Salesforce conference, someone's gonna throw something at me.
But, you know — so, Edge Communications, my favorite company. I'm gonna hit preview. Now here, you can see I'm using custom models, not some of the standard ones that are provided by our friends. Using a custom model: DevOps Streaming 2.0.
What happened to 1.0?
That's a great question. I really don't remember. You know, but I'm gonna say English and, actually, what's going on here? This is why I love live demos because sometimes it doesn't do what you need it to do.
But let's see. I'm gonna choose Edge again. It's already set.
I'm gonna preview, right? And I'm gonna go over here, right? So essentially, what this is now doing — I updated Moe's code a little bit to do a little bit more stack tracing. Now, when you guys are using Agentforce Prompt Builder, right, do you know what's being sent to the LLM?
Not really.
It's part of the trust layer. It's a key element of trust.
But sometimes, like, when you want to do model evaluations — you know, what's the Russian proverb? Doveryai, no proveryai. Right? Trust but verify. Right?
So, in this instance, the virtue of using BYOLLM is that I can verify and validate what is being sent to the model. Right? Because now I see this giant thing. I thought it was gonna be my little series of, like, you know, text there.
But no. Instead, it's quite a bit of content being sent. Because now what I can do — oh, go ahead.
True. Yeah. So, if you're on Data Cloud using audit logs, you can look at all this as well, right? Yeah — but I don't want to pay for it.
But in the end, you got it right. In the end, like, the goal of this, right, is that now I can take this message, this big bulk of text, and I can utilize a tool like Hugging Face and compare it, contrast it, from a performance perspective. Right? Because this is the response that I got — you know, maybe I can zoom in a bit more.
This is the response that I got. It's an okay response. Right? And I'll be frank, Mistral is okay.
Right? It's an okay LLM. You know? I'm not using an advanced one. Right? But it serves the purpose of the demo.
But the goal is — like, why roll out your own LLM? You know? Because you want to test and trial. Like, maybe there are compliance reasons.
Most financial institutions, pharmaceutical organizations, like, it's hard to believe, right? But, like, one of the largest instances of OpenAI that exists in the world right now is hosted. It lives within the data centers of an extremely large bank. It's hosted.
It's behind the firewall. Right? And the reason behind it is because your financial data, like the globe's financial data runs through them and they want to ensure that it's protected even with GenAI. Right?
So how do we get to this stage? Right? How did I get to even this option over here?
So this is where we get to Einstein Studio. Now, has anybody ever played with Einstein Studio when it comes to predictive or generative capabilities at all? Right? So this is where you see the list of all the different models.
So, like, this is the one I created for the purposes of our session today. Right? DevOps Streaming. Right?
But then you see all the other ones that are available from Salesforce, like Azure, OpenAI.
If I had Vertex enabled here, with some of the new announcements, that would be there too. But really, when you look at what the model's doing, right — the model itself, the foundational element of it — it's pointing at that URL. So that URL — is it fuzzy to everybody else or just me?
I don't know. But that URL is that Heroku app that's running right now, right? And what I've essentially done in the configuration is I've said, okay, it's a generative-type model, this is the URL you should point to for that model, here's the token limit, right?
And then I've essentially put in what the foundational model is, right? Now, this is what tells Hugging Face, through its API call, what to use. If I want to use DeepSeek, if I want to use OpenAI, if I want to use Uncle Charlie's LLM, right? I can get in on that if I've got access, right?
So this is what allows me to do it, and it's a very easy way of doing it, right? You can literally tweak and update it just here, just basic text. Right? There's nothing hard coded in Heroku.
Nothing hard coded in Hugging Face. It's just — once you've provided access to Hugging Face and access to the model, it's there. Even this one that I'm referencing — I've done nothing within Hugging Face. I just signed up for a free account and pointed it at it.
That's it. Right? So it's pretty straightforward on that.
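And once the Heroku app is up, you can sanity-check it outside of Prompt Builder with a plain HTTP call — the app URL and payload below are placeholders for whatever you configured in Einstein Studio.

```python
# Quick smoke test of the deployed connector endpoint.
import requests

resp = requests.post(
    "https://your-connector-app.herokuapp.com/chat/completions",  # hypothetical URL
    json={
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Summarize Edge Communications' open cases."}],
        "max_tokens": 256,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```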
You know, and then, once you've got that, that's where you have this configured model, right? This is what shows up within our friend Prompt Builder, right? So Prompt Builder is actually looking at this one, specifically DevOps Streaming 2.0, and it's going through the process where, when I hit preview, it calls out to the Heroku app, calls out to Hugging Face, puts that prompt against that model, and gets back a response, right? Now, because I can see what's being sent, I can experiment further and further and further.
That's one of the key value propositions — that's why you roll your own foundational model, that's why you do LLMOps: because you wanna see the performance. Like, imagine if you did DevOps without being able to see the code. Well, I can't see the code, but I'm deploying something.
Oh, why isn't this working? I don't know. Right? So that's one of the key drivers behind it, right?
And a demo. We did a demo. And that's it. Any questions?
Okay. No questions. Alright. If no questions, I'm good. Thank you. Thank you, Gearset. Thank you for joining the session.
Good day.