Description
AI tools like Agentforce Vibes, Copilot and Cursor are helping Salesforce teams develop and innovate faster. But poorly managed AI development can exacerbate existing challenges, like poor quality changes, technical debt, and CI/CD bottlenecks.
In this webinar, DevOps Architect Sam Crossland explains how leaders can combine AI-assisted development with robust governance to move fast and stay in control. You’ll learn how to use guardrails to preserve code quality, maintain compliance, and scale AI-driven delivery with confidence.
This session will cover:
- The Salesforce-specific risks (and blind spots) AI can introduce
- Code review considerations to help catch risky AI changes
- The dangers of using AI tools to review AI code, and the need for deterministic code reviews
- Where and how to shift left and implement quality gates in your process
- The tools that help teams scale AI safely, including Gearset Code Reviews
By the end of the session, you’ll have a clear framework for responsible AI adoption in Salesforce: how to scale AI safely, reduce risk, and ensure every change is reviewable, auditable, and high quality.
Learn more:
Info sheet: Responsible AI for Salesforce DevOps
Video: Why automated code review is non-negotiable
Transcript
Fantastic. So we're a couple of minutes in. I think it's probably time to get started. Thank you so much to everyone who's joined me this evening to talk about scaling AI safely: guardrails for Salesforce development. As I mentioned, we've got a couple of folks from Gearset here today on the chat, so feel free to raise anything in there or use the Q&A function.
If I get us started with the actual presentation now, a quick introduction to myself. So I'm Sam Crossland. I'm one of the DevOps architects here at Gearset. I've been here for the last three and a half years or so, but in the technology space for the last ten years. I used to work as an automation engineer and, obviously, seeing how things have moved along from the days of autocompleting in your IDE to everything that AI brings has given me a really good opportunity to lean into this as my first webinar, and a real privilege to be able to talk to you about it.
There's quite a few things that I'd love to be able to cover today.
We've got how AI has actually changed development workflows over the last few years. We then need to consider things like the Salesforce nuances and risks that working in this ecosystem brings.
Then I wanna talk about AI on AI in the code review process and, normally, that first gate where we're bringing code together and actually doing a real review about whether it should move forward. That leads in really nicely to talking about guardrailing AI and then layering your quality gating throughout your DevOps life cycle, and I've got a couple of diagrams I'll use to talk through that. We'll then wrap up moving into a bit of a framework about scaling AI and bring some of those key points together, and then finish off with talking about what we actually want from that code review stage, and that will give me a chance to do a small Code Reviews demo before I finish up with some final thoughts.
So first of all, let's talk a little bit about AI changes to development workflows over the past couple of years. We know AI came bursting onto the scene really heavily in late 2022. Vibe coding was a word of the year last year. So it's a term that's used an awful lot now, and it talks about prioritizing velocity over precise planning up front. That gives a lot of benefits for developers in terms of quickly churning things out, but then we need to be able to actually monitor them and make sure they're adhering to the right practices.
It has inserted itself into multiple parts of the DevOps life cycle. So we've got story generation, supporting developers as I just talked about.
We've got the review stage, maybe actually assisting in moving deployments through as well. Teams are using it for observation and even document generation. There's been plenty of things in the news recently around organizations actually generating things, and that's a really good example of where we need to check the outputs coming from AI. There's a lot of different options to leverage.
We've got things that are pushed directly by Salesforce, in terms of Agentforce Vibes and setups with Agentforce inside your sandboxes. Then there's things like Cursor and Copilot that are gonna support you in that development stage, or some of the more standard LLMs and agent-based options like Gemini. So there's quite a lot of things that are available to use. One thing it has really done is highlighted bottlenecks in the process.
So if we're really adding a lot of horsepower into our development phase, what if our delivery pipes aren't actually wide enough? We're getting a lot of extra work coming from AI-assisted development, but are we causing a bottleneck at that review stage? And I like to think of this with a bit of a car analogy using a motorway or a highway. We're adding a lot of extra cars with a lot of speed onto that motorway.
Have we got the right gates in place? That might be a toll booth. And have we got the relevant number of lanes to really allow that extra work to move forward?
And there's a report here from Faros AI that I found really interesting, which shows a big increase in the throughput per developer on the left hand side here, but then we're also seeing the review time go up substantially because of the amount of work that AI is actually providing. So that highlights a bottleneck at that stage.
Next, I wanna talk a little bit about Salesforce nuances and some of the risks they introduce for AI.
So we've got metadata complexity across the existing org structure. Does our AI, whichever agent we happen to use, have the context to decide between us being a trigger-based organization or switching to flows? Are there associated knock-on effects of making new objects, maybe against limits that we're already hitting in our org?
Does it also have visibility of the newest concepts, things like Agentforce in the last couple of years, or Data 360, or upcoming releases like Spring '26? Does it have the context from those release notes to give us really detailed information, as it should do?
Another thing that we see is about unit testing. Are the tests that are being generated by AI actually built just to pass, following the happy path, or are they going down the routes of the unhappy paths, the boundaries of what we should allow as entries into various fields, and also permissions around who should be able to do different things?
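To make that concrete, here's a minimal sketch in Apex using a hypothetical DiscountCalculator class (nothing from the demo), contrasting the happy-path-only test AI tools often produce with the boundary tests we'd actually want a quality gate to insist on.

```apex
// Hypothetical service class, included only so the sketch is self-contained.
public with sharing class DiscountCalculator {
    public static Decimal apply(Decimal price, Decimal percentOff) {
        if (percentOff < 0 || percentOff > 100) {
            throw new IllegalArgumentException('Discount must be between 0 and 100');
        }
        return price - (price * percentOff / 100);
    }
}

// Test class: the first method is the happy path; the second covers the boundaries.
@isTest
private class DiscountCalculatorTest {
    @isTest
    static void appliesTenPercentDiscount() {
        // Happy path: the kind of test that tends to be generated by default
        Decimal result = DiscountCalculator.apply(100, 10);
        System.assert(result == 90, '10% off 100 should be 90');
    }

    @isTest
    static void rejectsOutOfRangeDiscounts() {
        // Unhappy paths and boundaries: invalid inputs must be rejected, not quietly accepted
        try {
            DiscountCalculator.apply(100, -5);
            System.assert(false, 'Negative discount should have thrown');
        } catch (IllegalArgumentException e) {
            System.assert(e.getMessage().contains('between 0 and 100'));
        }
        try {
            DiscountCalculator.apply(100, 150);
            System.assert(false, 'Discount over 100% should have thrown');
        } catch (IllegalArgumentException e) {
            System.assert(e.getMessage().contains('between 0 and 100'));
        }
    }
}
```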
Agentforce Vibes is a really good example of that. You can specify markdown files to give a set of guidelines using dev agent rules. But when I was testing this a few months ago, we can see that I'd used it to generate some code. I then asked, has that generation actually adhered to the rules that I'd put in place in that markdown file?
And I found a really interesting answer where I'd advised it to have 95% code coverage, because I'm a good organization and I wanna make sure I'm way above the 75% minimum. And I was told that it achieved 94%, which is "very close to the requirement". So almost as if it was good enough, and we don't want that at the review stage.
If we've got some organizational priorities, we wanna make sure they're enforced and adhered to, and we don't want AI making those kind of murky decisions for us.
Another thing is being aware of governor limits. So, obviously, in the Salesforce space, we know that we're using essentially shared infrastructure, and I think Trailhead has got a really good example of how that looks: we're essentially individual rooms in an office building, and we're all sharing central resources like heat, electricity, Wi-Fi bandwidth, etcetera. Whenever we're using AI, we need to make sure that it's aware of those concepts. Does it know about bulkification? Is it aware of the limits that apply in testing? It's not like Java or C# that's just running on your own system where, obviously, you have control over that. We need to be aware of these limits.
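As a quick illustration of what "aware of bulkification" means in practice, here's a hedged sketch of a hypothetical after-insert trigger (the names don't come from the demo): the commented-out version queries inside the loop and falls over on a bulk load, while the version below collects IDs first and does one query and one DML statement for the whole batch.

```apex
// A minimal sketch of the bulkification we'd want any AI-generated trigger to follow.
// Hypothetical example: stamping a note on the Account when Contacts are inserted.
trigger ContactAccountStamp on Contact (after insert) {
    // Anti-pattern (what unguided generation can produce): one SOQL query per record,
    // which hits the 100-query governor limit as soon as a bulk load comes through.
    //
    // for (Contact c : Trigger.new) {
    //     Account a = [SELECT Id, Description FROM Account WHERE Id = :c.AccountId];
    //     ...
    // }

    // Bulkified version: collect the IDs first, query once, update once.
    Set<Id> accountIds = new Set<Id>();
    for (Contact c : Trigger.new) {
        if (c.AccountId != null) {
            accountIds.add(c.AccountId);
        }
    }
    if (accountIds.isEmpty()) {
        return;
    }

    List<Account> accountsToUpdate = new List<Account>();
    for (Account a : [SELECT Id, Description FROM Account WHERE Id IN :accountIds]) {
        a.Description = 'Contact added on ' + System.today().format();
        accountsToUpdate.add(a);
    }
    update accountsToUpdate; // single DML statement for the whole batch
}
```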
And then finally, thinking about the hard coding trap. Are we having things generated by AI that might be hard-coded record IDs or URLs rather than environment variables? They're the kind of things that aren't gonna scale correctly across our deployment pipeline, so that's something we need to be aware of.
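For illustration, here's a small hedged sketch of that trap in Apex, with hypothetical names: the point is resolving records by DeveloperName and deriving URLs at runtime, rather than baking in values that only exist in one environment.

```apex
// A sketch of the hard-coding trap and one way around it (names are hypothetical).
public with sharing class CaseRouting {

    // Anti-pattern: a record Id copied out of a sandbox URL. This Id won't exist in the
    // next environment, so the change breaks as soon as it moves up the pipeline.
    // private static final Id SUPPORT_QUEUE_ID = '00G5g000004XyZAEA0';

    // Safer: resolve the queue by its DeveloperName at runtime, which is stable across orgs.
    public static Id getSupportQueueId() {
        Group supportQueue = [
            SELECT Id
            FROM Group
            WHERE Type = 'Queue' AND DeveloperName = 'Support_Queue'
            LIMIT 1
        ];
        return supportQueue.Id;
    }

    // The same applies to URLs: derive them rather than hard-coding a My Domain host.
    public static String getBaseUrl() {
        return URL.getOrgDomainUrl().toExternalForm();
    }
}
```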
I've got a couple of examples here of where I'd used Agentforce Vibes and also Gemini, giving them the same prompt, which we can see in the top left hand corner, about creating me an Apex trigger on the Contact object to basically look for duplicate emails. First of all, you can see that there were some syntax errors that I got out of Agentforce, and, obviously, we iterated those away, but it was an interesting thing to see first off. Next, when I actually scanned the code that was generated in the first couple of iterations, one of the most interesting things that came out to me is that the cognitive complexity was very high according to Code Analyzer. So in terms of the difficulty of reading and maintaining, we can see that would be very high on the list. And would we be happy if one of our developers had created something with that level of complexity? That's something that we would need to be aware of.
When I was actually using Gemini instead of Agentforce Vibes, I got a very different set of results. It did have test classes generated, but didn't have the meta XML files. So, amusingly, they wouldn't have been deployable anyway until we sorted that. But we did have a much smaller set of code. It did generate me the relevant test class, as I mentioned, and the cyclomatic complexity here is reduced at the bottom. So it's not as complex a piece of code, and this is just good proof that, using different agents both in and out of Salesforce, you're gonna get slightly different results due to that probabilistic nature. So something that you need to be aware of, with some hard quality gates about what you expect.
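For reference, and not as the code either tool generated, here's a deliberately simple, bulkified sketch of the kind of trigger that prompt describes, the sort of shape a deterministic quality gate would be comfortable with.

```apex
// A deliberately simple, bulkified sketch of the trigger the prompt describes
// (not the generated code): flag Contacts whose email already exists in the org.
// Intra-batch duplicates are left out for brevity.
trigger ContactDuplicateEmailCheck on Contact (before insert, before update) {
    // Collect the candidate emails from this transaction
    Set<String> emails = new Set<String>();
    for (Contact c : Trigger.new) {
        if (String.isNotBlank(c.Email)) {
            emails.add(c.Email.toLowerCase());
        }
    }
    if (emails.isEmpty()) {
        return;
    }

    // On update, exclude the records being edited from the duplicate check
    Set<Id> currentIds = Trigger.isUpdate ? Trigger.newMap.keySet() : new Set<Id>();

    // One query for the whole batch: find existing Contacts that already use these emails
    Set<String> existingEmails = new Set<String>();
    for (Contact existing : [
        SELECT Email FROM Contact
        WHERE Email IN :emails AND Id NOT IN :currentIds
    ]) {
        existingEmails.add(existing.Email.toLowerCase());
    }

    // Block any record whose email is already taken
    for (Contact c : Trigger.new) {
        if (String.isNotBlank(c.Email) && existingEmails.contains(c.Email.toLowerCase())) {
            c.Email.addError('A Contact with this email address already exists.');
        }
    }
}
```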
And this is something that we also found really interesting. I've got a quote here from a mid-2025 paper about AI code review tools: the convergence of AI generation with AI review actually amplifies the risks. Because if we're reducing the human oversight and we're adding AI into more and more stages of our life cycle, are we actually amplifying the risk of how accurate we're being in terms of the generation and the review stage? So that's something that we found really interesting in terms of the breadth across the market and the ecosystem using AI.
So let's dive into that a little bit more, AI on AI review in the code review process.
So we've already talked a little bit about probabilistic versus deterministic. Probabilistic, we know that we could put the same prompt in tens of times across different LLMs, and we're essentially gonna end up with roughly the same thing but in a slightly different way. It's not deterministic in terms of its verification. So are we actually, like the quote said, just compounding the issue of accuracy?
Are we having a 90% or 95% success rate at the dev level, and then compounding it with 95% accuracy from an LLM at the review stage? And are the proposed changes even going back into testing? That's a really key point. Have we got any verification, before that code is added back to the pull request, that it's gonna be in the right state?
We talked a little bit about team-specific context and your conventions as an organization. So we wanna make sure that there's knowledge of those full system interactions across metadata boundaries, and that means we don't want one file at a time. We want the context of an Apex class being linked to other things: it might be called by a flow, it might be called by Agentforce. We need to make sure that extra context is available.
Another thing is how verbose a lot of the AI outputs are. Usually, we get an awful lot of code from AI, as we saw with my example a few moments ago. We don't want the reviewers skipping it due to the sheer amount of supposedly good code, and I'm sure everyone's kind of seen this meme before: if you pass a senior developer ten lines of code, they'll probably find ten issues or ten pieces of advice for you. But if we give them five hundred lines of code, it's normally "looks good to me" and a tick. And we don't wanna get into that process of rubber stamping. Just because it's come from AI, it doesn't mean that all that code is legitimate for our conventions.
So that brings me to the final point there about being aware of too much trust at the review stage. There are certainly things where AI can support you, things like summaries of your proposed pull request. They should be supplemental though and not stand alone by themselves, and we wanna be really conscious of AI marking its own homework, especially if it's going through a different model.
And that will matter depending on which ones you're using for the different stages.
So in terms of guardrailing, there's a few pieces that I wanted to pull together here just for you to think about. The first one is around whether we're inside the Salesforce trust boundary or not, and that's Salesforce's Einstein Trust Layer as that secure airlock, making sure everything stays within that particular system. Are we happy with a public LLM knowing our Salesforce information, you know, typing in relevant pieces of our code base and generating things? Because, essentially, you're in a public pool there, depending on whether you're using a private pod. So are we happy with it being in or out of the Salesforce trust boundary? Then we've got dev agent rules, or the equivalent, for those team expectations. So similar to having a coding framework in place for your developers, that might be about the routes that you follow, triggers versus flows, or it might be about the actual format of the code.
We don't wanna just focus on linting and formatting, but also those preferred routes, as I mentioned, because that will impact you depending on your limits around your Salesforce org and, obviously, the overall architecture of your code base.
We wanna make sure the relevant context is provided, and that's both locally when you're developing, so not seeing just a set of files in a silo, and also across the org as a whole. So we want those changes to be double checked against the rest of what your org structure is set up like.
And in terms of reviewing, I've already talked about it, but when we think about continuous delivery, which in its truest form is allowing our features to move all the way up to production as quickly as possible through a load of automated checkpoints, would we be happy with only an AI sign off at a certain checkpoint? Probably not. We wanna make sure that we've got some deterministic guardrails in there as well to say that we're adhering to some standards.
So now I wanna move into a little bit about talking about the quality gate stages, and I'm sure you might have seen a couple of different examples of this infinity loop before.
This is the one that Gearset follows, where we move from plan all the way through to observe, and I wanna use that to talk about layered quality gating today. So when we think about building up through those different layers, there are a couple of advisories that I wanted to make clear if you're using AI to help you facilitate some of those. In the planning stage, for example, where people are actually using it to maybe plan user stories or refine them, you might think about something like Gearset org intelligence, or making sure that you've got the relevant level of context to actually pass into the AI so it's giving you the right amount of information and clarity that you can then pass on to developers. So think about things like prompt templates to make sure you've got the right definition of done, and architectural considerations in those routes and standards that we talked about earlier.
In terms of build, we wanna shift those guardrails left. We want rules and scans to be enforced really early on in the life cycle. So that's in your IDE. That's actually inside the Salesforce orgs where you are developing things. So static code analysis, conventions, and consistency, we want the outputs to be consistent across our team.
When we think about the validation and review section, there's a really interesting concept around risk-based review tiers. That might be that you flag different areas of your code base as being more acceptable for a lesser sign off, the same as you would with a junior or a senior human reviewer. You might say that certain bits of the code base aren't financially or compliance sensitive, and we're happy for AI to review them.
Do we only want new issues to be detected, or does it also flag existing tech debt? Because I know there's nothing worse than putting a new feature request up and then having a load of old tech debt flagged by scanners and other things. So would we want that level of flexibility?
You wanna also be tracking those quality gate infringements. So how many times across your entire development team, whether that does or doesn't include AI, are we hitting those quality gates when we get to our source control system? Because we wanna be keeping that number as low as possible and shifting those resolutions left, and that builds in nicely to what level of autonomy you want to allow for validating and fixing those issues. So would we allow an LLM to go off and search for some common validation issues or merge conflict problems and try and resolve them for us, or would you actually want those deterministic gates in place to make sure that you're adhering to them?
And deterministic should be the winner for the release stage as well. So would you really trust just AI to decide if a feature could move all the way through to production? We certainly don't think so, and it should be a combination of a lot of different quality gates: static code analysis, reviews, pre and post deployment steps. There's a number of other unique pieces in Salesforce that we need to be aware of.
And then finally, in terms of observability, when we're thinking about technical debt, if you're using something like Agentforce you might be thinking about health and studio monitoring. But through a DevOps lens, we could be leveraging AI to look at some of your observability outputs. So there might be Security Center and Scale Center findings or technical debt analysis, or you might be conducting something like root cause analysis from Gearset observability. You need to make sure that they're balanced with your organizational priorities, because AI will be working off the rest of the ecosystem, things like the OWASP framework, from however old its data may be. We wanna make sure that's balanced against your organizational priorities.
So moving on now just to kind of bring everything together into a framework for scaling your AI.
First of all, we've already talked about this quite a bit: the guardrails are gonna be really key. It's all well and good to go at higher velocity, and things like vibe coding and using AI for development have really proved that that's successful. But human interaction should still definitely be in place to balance any AI ownership and how much trust we put in it. We've talked a lot about deterministic versus probabilistic. There are certain stages of the life cycle where probabilistic AI and its results would be a lot less risky, for example the plan stage. Whereas when we're thinking about the validate stage, and reviewing to say whether a feature is good enough to move forward or not, we want those deterministic guardrails in place.
We want a set of standardized rules for your team, including AI. So see it as kind of a junior developer that's joining the team: we wanna make sure it's aware of coding conventions, architectural preferences, and security and quality minimums around violations. So we should be shifting those scans left as much as possible. We wouldn't just point a junior developer at the team and say, go and write some code.
We need to make sure there's a set of standards that are actually followed, and consider building it in a phased approach. So where we think about bringing AI into the life cycle, phase that both in terms of capabilities and the level of trust. Maybe you're just using it to help you generate some PR descriptions. Maybe you're only using it for unit test boilerplate before you actually build up into more complex logic, but still having your other automated quality gates and a human-level review for the key parts of your code base.
And this is one quote that I wanted to put in from one of our customers, Jolene at HackerOne: when you're reviewing by eye, even if you've got a really senior team with architects, having a second set of eyes on every pull request with something like Gearset Code Reviews means you're looking for exactly the right things. And that's something that you don't know from an AI perspective, whether that LLM is baked in to look for exactly the right things in the Salesforce space. So that becomes really important. So what do we actually want from the code review stage?
So to be confident in that stage, there's a few capabilities that we wanna make sure are available. Deterministic is one of them. We wanna make sure we're always adhering to a set of rules and a hard quality gate way of working, not a "good enough" or "we've nearly hit the requirements". We need to be hitting those. Building in the guardrailing, but also being flexible, so giving you the power to say whether something is an error or a warning in terms of your organization.
We need the context awareness across the full Salesforce platform, not just the type of code or type of metadata that we're looking at, and that needs to be aligned to Salesforce best practices. So we want an understanding of more than just Apex, things like flows, and we want to include those newer capabilities, things like Agentforce or Data 360. So we need to make sure they're in play.
Having the right level of automation to support that review stage is important too, including auto fixes with and without AI, because otherwise you get back into that question of whether we're always gonna get the same result in the conversation. So we wanna make sure there's capabilities that include AI and exclude it, and it's really clear for you to tell which is which.
You want accurate and low noise results because one of the worst things, as I said, is as a developer getting onto a pull request and seeing that there's been two hundred code violations flagged, and that's just because you changed five lines. All of those violations are on an older set of the code. So should that be your responsibility to resolve at that point? Probably not. So we wanna make sure we're avoiding overwhelming developers at that stage.
And another key point is making sure you've got the metrics to track your team performance over time. So if we've got team members that are potentially always committing the same violations, that's something that we can flag into a training program, and you wanna be able to flag that in terms of the metrics over a long period of time. So let's take a little look at Code Reviews, and that'll give me a chance to do a small demonstration now for you.
So I'll just head over to my pipeline.
This is a pretty standard pipeline in terms of a developer sandbox and moving all the way through integration, UAT and production. This is my Gemini-generated code that I showed off earlier. So if we think about creating a pull request to the next stage, there's a really good example here where I could actually use AI to generate a summary for me. That goes through what's in the pull request, takes a look at all the information, and then generates me a good breakdown of exactly what's changed, what the impact of those changes is, and the testing recommendations, and this is really clear for any of my reviewers to actually see. So if I just create that pull request now.
So we can see that's gonna start creating against the next stage, and I've got a couple of pull requests that I've already opened. The first one here was the one that was created by Agentforce Vibes earlier on. I just wanted to show that we've got automated code review tied into your pipeline. So this is Code Reviews scanning against particular branches.
We can see the reports in app, so we can fetch those pull request issues. We can see which policy, so this is the Salesforce Well-Architected framework in terms of the policy that's taking place, the severity, how many issues, and then also a recommended fix effort, which can be really useful to see how difficult something might actually be to resolve. The interesting bit that I wanted to show here is that cyclomatic complexity that I mentioned earlier, and that might be something that we're really interested in to say, well, actually, this is a lot of code for the prompt that I gave it, and that might be something that we want to then link out to, create a bug, and make sure that's pushed back into our development team to resolve.
If we shoot back to the pipeline, another thing I wanted to show is we've got a pull request here for flows, and that's something that I see a lot of scanners don't really pay too much attention to. Because they're XML files, what kind of scanner do we need to actually check them with? But there's a lot of things that can go wrong with flows. You can see here that we've got a report from where Gearset Code Reviews has taken a look at that, and we can see there's a missing fault path in my flow. That means it's only expecting to go down the happy path, so we wanna make sure we're handling errors as well. And if I just shoot over to the policies section in the Code Reviews view, you can see that all of these different policies are related to the Salesforce Well-Architected framework.
We've got a number of different rules covering all of the metadata types. So with three hundred or so metadata types included in that, it's not just the code-based ones. We can see a really good example if I dive into automated here: not only are we covering things to do with triggers and code, as you would actually expect, but you've also got things like flows. We don't want hard-coded IDs in flows, for example, and that's something that we wanna make sure is caught, because it's not scalable and it's gonna cause problems later down the line.
If I head back to my pipeline view now, we can see that the pull request that I just raised, which is this one ending in Gemini that has now finished, we've got all of those different quality gates that I mentioned in terms of layering things up. We've got the merge checks. We've got validation checks and the Apex code coverage. We might have different testing types, and then Code Review is coming in and giving me some information here about what the quality of that pull request looks like. If I head into here, we can see a similar set of results as to what I saw with the other pull request, but there's an interesting one here around sharing clauses.
And Gearset Code Reviews has got a number of different auto fixes available. So we can see I've filtered now by incorrect sharing clauses. I've got a class here where I haven't specified a sharing clause, and that matters from a permissions perspective. We need to make sure that permissions filter down, and it's not just kinda loose in the system.
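As a sketch of what that rule is driving at (on a hypothetical class, and not necessarily the exact diff the auto fix PR produces), the change is as small as declaring the sharing intent explicitly:

```apex
// Before: no sharing keyword, so the class inherits the sharing context of its caller
// and record-level access may not be enforced where you expect it to be.
//
// public class OpportunitySummaryService { ... }

// After: declare the intent explicitly so record visibility filters down to the running user.
public with sharing class OpportunitySummaryService {
    public static List<Opportunity> openOpportunities() {
        // Only returns Opportunities the running user can actually see
        return [SELECT Id, Name, Amount FROM Opportunity WHERE IsClosed = false];
    }
}
```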
So what I can do is create an auto fix PR. That particular rule is gonna go through the engine and make sure that we're fixing it. This backs onto your source control system; in my example, I'm using GitHub.
And then when we head back to our code review section, one of the loops that we need to make sure is tied up is that instead of just adding the fix to the pull request itself, it should also be added to the dev sandbox. So what will happen shortly is we'll get a little agent view here, and we can see we've got fix pull requests available in the top right hand side. If I was to follow through that route, we would have the deployment to the sandbox, so that I can go and do some manual testing, and also the application to the pull request, saving developers time from doing things manually.
That's all we've got time for in terms of the demonstration today.
If I just move on to wrapping some things up. So this diagram is another way that I wanted to try and bring everything together. We can see this as a normal Salesforce workflow, a couple of different streams, and then sandboxes on the way to production. There are a number of different areas where you can see that AI might actually be inserted into the process, as we talked about with the DevOps infinity loop. So you might be using it early on at the planning stage to actually pass over to development. You might be using some areas to both aid development and potentially review at that first stage. There might be some elements in terms of testing where you're actually using it to analyze some of your testing areas.
And if we now just layer on some of the different testing types and where you might use something like Gearset Code Reviews, you can see there's quite a lot of overlap as to where those things actually take place. So you might be using Code Reviews to help you in your actual development, and definitely at that first review stage, and that's the key important piece that I've just talked through: we wanna be shifting left, and we wanna be finding those problems before they take place, especially if we're bottlenecking at code review with the amount of AI-generated development that we're supporting.
And this one here, I just wanted to flag this from the SFXD founder, Geoffrey Vazerofornier, as part of the Spring '26 release notes, which I found really interesting: we should be treating AI as a suggestion layer, but we need that deterministic business logic validation as well. It shouldn't be an autonomous system in itself, and we need those kind of guardrails to make sure it's working as expected. So partner those two pieces together, rather than just handing the keys over to AI.
In terms of wrapping up with some key takeaways, because I'm pretty spot on time: I wanted to say that AI can certainly accelerate development, and it's been really positive for the ecosystem in that way. But we need to make sure the guardrails are in place, both in terms of what we're allowing it to generate and also the checks and balances that we've got in place. Don't fall into that pattern of just rubber stamping all of those higher-velocity reviews coming in from AI-assisted development, because the later down the line that we allow issues to go, the more difficult and potentially expensive and complex they'll be to fix. So make sure that your PR quality gates aren't just AI reviewing its own homework. You want a set of deterministic checks at the same time.
AI trust can take months to build, but it takes just one hallucinated production bug to burn all of that confidence down. We don't want to see that lack of oversight hampering further AI usage and building it into the rest of your workflow, so making sure you get it right early on is really key. And build defense in depth to protect your workflows, so not just relying on single quality gates all the way through your process. We wanna build up the usage of AI alongside some of the set gates that we've talked about in the rest of the presentation.
Fantastic. So that's everything that we've got prepared for today. Thank you so much for listening to me and engaging with the session. I haven't had a chance to take a look at any of the chat or any of the questions, but if we haven't had a chance to answer any of them, that's something that we will follow up with you on afterwards. Feel free to book a call in with the team if you did wanna see anything else about Gearset Code Reviews or how we're thinking about AI moving forward as well. And thank you very much for your time today.