Shipping production agents with confidence

In this talk from Dreamforce 2025, discover a smarter approach to Agentforce testing that removes uncertainty and makes deployments predictable, accurate, and stress-free.

Speaker:

Laurence Boyce, Sales Engineer

Hi, everyone. I'm really excited to be here looking at this topic with you. Today, we're going to learn what good testing for agents really means, how to replace manual testing and monitoring with efficient automation, and how automating testing builds confidence as you scale your agent suite.

As a quick introduction, I'm Lawrence, one of the SEs at Gearset, and day to day, I work with large teams handling complex requirements from their enterprises. Many of those are diving deeply into their agent force journey, so I'm greatly looking forward to sharing some of my experiences and learnings with you today.

For any folks not familiar with Gearset, we're the leading DevOps platform for Salesforce. We support over three thousand teams globally, ranging from small businesses to large enterprises. And for them, we've delivered over thirty five million deployments, and they really enjoy working with us. Our ninety eight percent customer satisfaction score is something we're really proud of.

And for our three thousand plus customers, over ten percent are already deploying AgentForce, and over sixty percent of these have deployed into production.

And because of this, we're your partner for delivering agent force at scale. But let's see why.

This is the third of a series of talks from Gearset, and if you've been to the previous two, you'll have seen with Kevin, Deerset CEO, why agent force systems differ from traditional Salesforce development due to unpredictable agent behavior and then deep reliance on massive high quality datasets, which necessitates a new approach to testing and observability by shifting left.

And also with Javi and Eamon, we learned about the three key pillars for success when it comes to confident agent force delivery at pace. Solid foundations, smooth delivery, and automated protection.

And it's that third point which brings us on to a commonly overlooked part of the agent force development life cycle, but arguably one of the most important, testing our agents. None of us want to build, plan, and deploy our agents without the confidence of knowing that they'll be highly successful in production.

But before we get into this in detail, it's worth a quick summary upfront of why testing exists, why especially in the Salesforce context, and even more so why for Agentforce.

Testing is a critical part of a healthy software development life cycle, not just the final step. For Salesforce, this is especially important. The platform's multitenant architecture means that small changes can have huge ripple effects across your business critical processes.

Add in multiple releases a year and a mixture of declarative and programmatic changes, let alone the nondeterministic nature of AI agents, a robust testing strategy isn't just a luxury. It's essential for protecting your user experience and deploying with confidence.

And the demand for faster innovation on Salesforce is creating new challenges. While many tools help to streamline development and deployment, testing often remain remains complex and time consuming.

So without a solid strategy, it's difficult to gain the confidence needed to deploy and iterate quickly.

In the agent force context, this could mean, how do you know your lead qualification agent isn't stuck in a loop repeatedly creating duplicate leads?

How can you verify your case routing agent is correctly assigning cases to the right folks internally based on their skills and workload?

Or how do you know your agents will work as designed in production? They might not do anything at all.

So testing is a very broad word, and it's worth considering what types of testing exist then, and also what is good.

Well, Mike Cohen's test pyramid offers a great starting point for building a test suite. The x axis represents the number of tests, and the y axis, the cost and complexity, as well as the time to run them.

It's broken into three layers.

So from bottom to top, unit found unit tests are the foundation of your testing. They're fast, easy to maintain, and you should have lots of them to provide fast feedback during the development process.

Service and integration tests as the middle layer provide greater confidence than unit tests but are more complex, thus you'll have fewer of them.

An end to end or UI tests are that smallest layer at the top. They're the most complex, slowest to run, and are the most expensive. However, they provide the highest level of confidence that your application is working as intended.

So the core idea is simple. Write tests with different levels of granularity. You should have many fast low level ones and fewer slower high level ones. But fundamentally, the one thing you should be thinking about when writing tests is how much confidence they bring you that your project is free of bugs.

To translate this to AgentForce, Apex and Flow tests will remain. You'll create many of these, and they'll run regularly.

Testing center tests will serve as your integration tests, and end to end or UI tests are a critical part of your testing strategy.

So it's worth noting at this stage that although agent force testing center is a core part of this testing journey, it isn't designed to address all of the parts, and it shouldn't.

This is all very well and good. However, manually testing and deploying a growing amount of software is unsustainable.

It's repetitive, time consuming, and prone to human error. So to deliver software faster without sacrificing quality, you must automate.

Adopting continuous delivery practices where automated pipelines test and deploy your software will eliminate tedious manual testing and and ensure you can release new capabilities with cadence and confidence.

So let's turn to Gearset and see how the Gearset platform enables teams to leverage multiple stages of automated testing and provides confidence that your agents operate in line with expectations when deployed to production.

We're gonna work bottom to top in the diagram, but also left to right in the software development life cycle, focusing on a few core areas.

Automating APEX and flow tests to ensure we're not introducing issues in these foundational areas. A thorough code review process to prevent introducing vulnerabilities or anti patterns.

Triggering agent test suites to get fast feedback that you're getting the expected outcomes in your eval suites, and UI testing to, test that the agents are providing the expected results when embedded in the Salesforce UI.

So let's start with the foundation of any Salesforce developer's toolkit, Apex and Flow tests. They're more than just a checkbox for code coverage. An agent force bot built on faulty code can lead to terrible customer experiences. So we need to be absolutely sure the underlying Apex and Flows work correctly before we deploy.

Here in Gearset, we've performed a commit to our branch to update the pricing calculator agent along with its topics, actions, and a prompt template that triggers an Apex class to run. We're ready to deploy this upstream, so we open a pull request.

Before we can deploy, Gearset runs all relevant Apex tests. This critical safety net ensures our new code doesn't break existing functionality.

And without this, we'd be hoping the AI agent behaves correctly without an automated way to verify its underlying processes.

As you can see, our test failed. Gears has blocked the deployment with a quality gate, and this saved us from deploying these untested updates to a live environment, a costly mistake, frustrating users, and eroding trust.

So the criticality of this testing, especially with a new technology like AgentFalse, cannot be overstated.

So if we fast forward for a minute, and now it's been fixed, and the validation is complete, we'll be able to proceed with confidence.

And this is the power of a complete DevOps platform like Gearset. It's not just about deployments. It's about making every part of your release release cycle faster, safer, and more reliable.

But before we deploy, there's another critical juncture. A code review process is nonnegotiable. A simple typo in a in a prompt template, a lack of agent guardrails, or an oversight in the Apex class it calls could cause the entire action to fail. The AI bot would simply get a useless error or worse, provide an incorrect answer to the user.

Gearset's solution for this is simple and incredibly effective.

When a pull request is opened, Gearset runs an automated code review process that scans all Salesforce metadata types to ensure these align with best practices curated around the well architected framework. This also provides fast feedback on code quality and standards, blocking vulnerabilities and anti patterns for being deployed.

As we can see here, Gearset has detected a potential vulnerability.

Of course, agent guardrails ensure AI agents are secure, compliant, and on brand by controlling their behavior, preventing unauthorized actions, and managing sensitive topics. So detecting that these are missing provides a massive relief to me that this was caught predeployment before even getting to a testing environment.

And in the world of agent force where metadata directly dictates the AI behavior, this isn't just a best practice. It's a critical safety measure. It's the difference between a bot that's a trusted assistant and one that becomes a liability.

This process also directly aligns with Salesforce's secure by design principles and generative AI security guidelines. A potential vulnerability on our pricing calculator would be a clear indicator that the agent's call lodge call logic is flawed, violating those principles of security and trust.

So this not only saves countless hours of debugging, prevents business disruptions, but also allows us to innovate at speed.

So with Gearset, we're shifting left, ensuring quality, trustworthiness, security, and performance, facilitating us to deploy with confidence.

So you've written your code, and it's been reviewed.

What's next?

Well, Gearset CICD pipelines will automate your code promotions. But as you deploy your AI agents, it's critical to continually test their behavior. Do they understand the user's intent? Do they call the right flows? Do they provide accurate, grounded responses without hallucinating?

Salesforce's agent force testing center is designed for this exactly and lets you create repeatable test scenarios to ensure your agents are accurate and effective.

But the challenge traditionally has been making these part of our automated release process. Let me explain.

When you create tests in your dev org, you have to do a manual song and dance to get these upstream, which involves downloading a CSV template and uploading it again to the next environment. So it's incredibly common for eval suites to drift if not source controlled.

Additionally, it's a manual process to go and click go to run them, So this adds a whole heap of burden on the deployment team.

So the two main testing center challenges we hear from customers are how to ensure tests are consistent and triggering them.

Gearset is unique in this area and allows users to solve both of these problems. We treat your eval suites as metadata, allowing them to be moved seamlessly and automatically through your deployment pipeline alongside your agents and any other metadata. And to ensure they're always run, the tests are automatically triggered after each deployment.

So Gearset has enabled Salesforce teams to bring the rigor of a modern DevOps process to the unique challenge of testing AI agents.

By ensuring your agents are thoroughly tested, you'll have confidence that we're deploying a fully tested functional AI assistant that we know will perform correctly every time.

So it's deployed, and your tests have run, and they've passed your expectations.

Now what?

Well, the agent force testing center is great for testing an agent's logic, but it doesn't cover the core user experience within the Salesforce console. You still need UI testing to ensure you don't disrupt critical business processes for end users.

These tests are vital for preventing promotion until they pass.

So it's worth remembering here that we're at the top end of the testing pyramid. UI tests are the most expensive to maintain, so you'll have fewer of them, but provide immense confidence in your agent's performance.

Gearset's in house UI test builder simplifies this process and lets you provide a prompt of the job you want to test.

So with our new pricing calculator agent, we wanna create a scenario where a user logs in. They ask about the cost of Gearset licensing, such as how much is five deployment users and a hundred backup licenses.

The AI then takes this prompt, translates this into a series of user steps, and captures the UI in real time to adapt to the next step, making tests resilient to UI changes that would typically break hard coded scripts.

And because of this architecture, we can even prompt the UI test tool to wait for the agent force bot to respond, verify the response is as expected, and ask follow-up questions too to simulate complex multi step scenarios like we see here.

This is also a critical part of adhering to Salesforce's security guidelines, as a major concern with AI agents is over permissioning, where an agent has access to too much data.

So these tests will not only verify the agent is performing the correct actions, but is staying within its assigned permissions. We're confirming our agent isn't just thinking correctly, but its actions are manifesting correctly within the confines of our least privileged security model.

So our UI tests have passed then, and we can deploy into production confidently knowing that the underlying Apex and flows are thoroughly tested. The changes have passed through a thorough code review process to align with our coding standards without bugs or vulnerabilities.

We've had fast feedback that the changes I've made are having the expected outcomes in our eval suites, and we've performed UI testing in our UAT environments to ensure the agents are providing the expected results with sample data.

But now those changes are live, our job isn't over. In fact, for a nondeterministic system like an AI agent, it's just the beginning.

We need to know what's happening. We need observability to continually stay on top of these changes, monitoring for anything untowards taking place.

And monitoring is also the final phase of the secure by design framework for Agentforce. You must not only build and validate agents securely, but continually monitor their behavior in production.

Gearset's observability platform is a game changer in this area and in having a bird's eye view of the health of the Salesforce org as a whole.

The single pane of glass dashboard provides critical insights into your org's health, including API limits, email usage, and platform events, along with Apex exceptions and flow errors, so you can easily track improvements over time.

You can also get right into the details in your org without having to rely on user reports. For instance, if a key part of your agent's functionality, like the pricing escalation flow, isn't working, you can immediately identify the issue. See a visualization of the flow to understand the exact node where the issue arose and create a Jira ticket directly within Gearset to begin resolving the problem.

This level of insight empowers you to make data driven decisions as opposed to using gut feel or guessing.

So by turning confusing, unexpected behavior into actionable, explainable insights, we're ensuring our AgingFore solution is not only reliable, but continuously improving for our users.

So let me summarize not only this talk, but all three of Gearset's AgentForce talks as a whole.

AgentForce systems differ from traditional Salesforce development due to unpredictable agent behavior and their deep reliance on massive, high quality datasets. These then necessitate a new approach to testing and observability by proactively shifting left.

And to be confident and successful with agent force delivery, there's a few nonnegotiables.

Solid foundations, smooth delivery, and automated protection. And it's that third one, especially with the importance of testing software, even more so for nondeterministic systems like an AI agent, it just can't be over overstressed.

But often the biggest challenge is knowing what to test and how to test it in a way that gives true confidence rather than leaving you with unknowns. And this is where Gearset is uniquely placed to support you across the DevOps life cycle, all underpinned by observability.

So by all means, please come and grab us if you'd like to talk further. And, of course, just get in touch with the Gearset team. Thank you very much. We've been Gearset.

Compare & deploy

CI/CD pipelines

Backup & restore

Shipping production agents with confidence

Compare & deploy

CI/CD pipelines

Backup & restore

DevOps done right

Ebooks & whitepapers

Webinars

Blog

Podcast

DevOps report 2025

DevOps training

Help center

DevOps assessment

Why choose Gearset

Customer stories

Integrations

Security & trust

Events

DevOps Leaders

Feedback forum

Upcoming event

TDX

New from the blog

Salesforce solution design: step-by-step implementation guide

New from the blog

How to deploy Salesforce Agent Script metadata

Shipping production agents with confidence

Description

Transcript

Contact us

Customer support