Description
Deploying an Agentforce agent without a robust testing strategy means trusting a non-deterministic system to behave correctly in production — with no automated way to verify it. Over ten percent of Gearset’s three thousand customers are already deploying Agentforce, and over sixty percent of those have deployed agents into production. The teams doing this successfully aren’t testing manually. They’re automating at every layer.
In this session, Laurence Boyce, Lead Sales Engineer at Gearset, walks through a practical testing strategy for Agentforce using the test pyramid as a framework — from Apex and Flow tests at the base through automated code reviews, Testing Center eval suites, and UI testing at the top — all integrated into a CI/CD pipeline.
What you’ll learn:
- Why Agentforce demands a layered testing strategy — non-deterministic agent behavior, shifting models, and prompt sensitivity mean that traditional manual testing is no longer sufficient.
- How automated Apex and Flow tests serve as the foundation, catching breaking changes in the underlying code before an agent ever reaches a testing environment.
- How Gearset Code Reviews scan every metadata change against best practices and the Well-Architected framework — flagging missing agent guardrails, vulnerabilities, and anti-patterns before deployment.
- How Gearset treats Agentforce Testing Center eval suites as metadata, moving them through the pipeline alongside your agents and automatically triggering them after each deployment — eliminating the manual CSV export process.
- How AI-powered UI testing verifies that agents behave correctly within the Salesforce console, staying within assigned permissions and providing accurate responses across multi-step scenarios.
Learn more:
- Deploying Agentforce with Gearset
- Gearset’s Salesforce automated testing solution
- Automate AI agent testing at scale with the Agentforce Testing Center
- Salesforce-aware automated code reviews
- How HackerOne keeps Salesforce secure and proactively monitors org health with Gearset
Transcript
Hi, everyone. I'm really excited to be here looking at this topic with you. Today, we're going to learn what good testing for agents really means, how to replace manual testing and monitoring with efficient automation, and how automating testing builds confidence as you scale your agent suite.
As a quick introduction, I'm Laurence, one of the SEs at Gearset, and day to day, I work with large teams handling complex requirements from their enterprises. Many of those are diving deeply into their Agentforce journey, so I'm greatly looking forward to sharing some of my experiences and learnings with you today.
For any folks not familiar with Gearset, we're the leading DevOps platform for Salesforce. We support over three thousand teams globally, ranging from small businesses to large enterprises. And for them, we've delivered over thirty five million deployments, and they really enjoy working with us. Our ninety eight percent customer satisfaction score is something we're really proud of.
And of our three thousand plus customers, over ten percent are already deploying Agentforce, and over sixty percent of these have deployed into production.
And because of this, we're your partner for delivering Agentforce at scale. But let's see why.
This is the third of a series of talks from Gearset, and if you've been to the previous two, you'll have seen with Kevin, Gearset CEO, why Agentforce systems differ from traditional Salesforce development due to unpredictable agent behavior and their deep reliance on massive, high quality datasets, which necessitates a new approach to testing and observability by shifting left.
And also with Javi and Eamon, we learned about the three key pillars for success when it comes to confident Agentforce delivery at pace. Solid foundations, smooth delivery, and automated protection.
And it's that third point which brings us on to a commonly overlooked part of the Agentforce development life cycle, but arguably one of the most important, testing our agents. None of us want to build, plan, and deploy our agents without the confidence of knowing that they'll be highly successful in production.
But before we get into this in detail, it's worth a quick summary upfront of why testing exists, why especially in the Salesforce context, and even more so why for Agentforce.
Testing is a critical part of a healthy software development life cycle, not just the final step. For Salesforce, this is especially important. The platform's multitenant architecture means that small changes can have huge ripple effects across your business critical processes.
Add in multiple releases a year and a mixture of declarative and programmatic changes, let alone the nondeterministic nature of AI agents, a robust testing strategy isn't just a luxury. It's essential for protecting your user experience and deploying with confidence.
And the demand for faster innovation on Salesforce is creating new challenges. While many tools help to streamline development and deployment, testing often remains complex and time consuming.
So without a solid strategy, it's difficult to gain the confidence needed to deploy and iterate quickly.
In the Agentforce context, this could mean, how do you know your lead qualification agent isn't stuck in a loop repeatedly creating duplicate leads?
How can you verify your case routing agent is correctly assigning cases to the right folks internally based on their skills and workload?
Or how do you know your agents will work as designed in production? They might not do anything at all.
So testing is a very broad word, and it's worth considering what types of testing exist then, and also what is good.
Well, Mike Cohn's test pyramid offers a great starting point for building a test suite. The x axis represents the number of tests, and the y axis, the cost and complexity, as well as the time to run them.
It's broken into three layers.
So from bottom to top, unit tests are the foundation of your testing. They're fast, easy to maintain, and you should have lots of them to provide fast feedback during the development process.
Service and integration tests as the middle layer provide greater confidence than unit tests but are more complex, thus you'll have fewer of them.
End-to-end or UI tests are the smallest layer at the top. They're the most complex, slowest to run, and are the most expensive. However, they provide the highest level of confidence that your application is working as intended.
So the core idea is simple. Write tests with different levels of granularity. You should have many fast low level ones and fewer slower high level ones. But fundamentally, the one thing you should be thinking about when writing tests is how much confidence they bring you that your project is free of bugs.
To translate this to Agentforce, Apex and Flow tests will remain. You'll create many of these, and they'll run regularly.
Testing Center tests will serve as your integration tests, and end-to-end or UI tests are a critical part of your testing strategy.
So it's worth noting at this stage that although Agentforce Testing Center is a core part of this testing journey, it isn't designed to cover every layer, and it shouldn't.
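To make the layering concrete, here's a minimal Python sketch of the fail-fast ordering the pyramid implies: run the cheap layers first, and never spend time on expensive layers once a cheaper one has failed. The layer names and runner functions are hypothetical stand-ins, not Gearset's or Salesforce's APIs.

```python
# Illustrative sketch (not a real implementation): the three pyramid
# layers run in order, cheapest first, failing fast.

def run_unit_tests():
    """Apex and Flow tests: many, fast, run on every commit."""
    return {"layer": "unit", "passed": True}

def run_eval_suites():
    """Agentforce Testing Center evals: fewer, slower integration tests."""
    return {"layer": "integration", "passed": True}

def run_ui_tests():
    """End-to-end UI tests: fewest, slowest, highest confidence."""
    return {"layer": "ui", "passed": True}

def run_pipeline():
    # Cheapest feedback first; stop as soon as a layer fails so the
    # expensive layers never run against known-broken changes.
    for runner in (run_unit_tests, run_eval_suites, run_ui_tests):
        result = runner()
        if not result["passed"]:
            return f"blocked at {result['layer']} layer"
    return "all layers passed"

print(run_pipeline())
```

The ordering is the whole point: a failed Apex test costs seconds to discover, while the same bug found by a UI test costs minutes of runtime and far more maintenance.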
This is all very well and good. However, manually testing and deploying a growing amount of software is unsustainable.
It's repetitive, time consuming, and prone to human error. So to deliver software faster without sacrificing quality, you must automate.
Adopting continuous delivery practices where automated pipelines test and deploy your software will eliminate tedious manual testing and ensure you can release new capabilities with cadence and confidence.
So let's turn to Gearset and see how the Gearset platform enables teams to leverage multiple stages of automated testing and provides confidence that your agents operate in line with expectations when deployed to production.
We're gonna work bottom to top in the diagram, but also left to right in the software development life cycle, focusing on a few core areas.
Automating Apex and Flow tests to ensure we're not introducing issues in these foundational areas. A thorough code review process to prevent introducing vulnerabilities or anti-patterns.
Triggering agent test suites to get fast feedback that you're getting the expected outcomes in your eval suites, and UI testing to verify that the agents provide the expected results when embedded in the Salesforce UI.
So let's start with the foundation of any Salesforce developer's toolkit, Apex and Flow tests. They're more than just a checkbox for code coverage. An Agentforce bot built on faulty code can lead to terrible customer experiences. So we need to be absolutely sure the underlying Apex and Flows work correctly before we deploy.
Here in Gearset, we've performed a commit to our branch to update the pricing calculator agent along with its topics, actions, and a prompt template that triggers an Apex class to run. We're ready to deploy this upstream, so we open a pull request.
Before we can deploy, Gearset runs all relevant Apex tests. This critical safety net ensures our new code doesn't break existing functionality.
And without this, we'd be hoping the AI agent behaves correctly without an automated way to verify its underlying processes.
As you can see, our test failed. Gearset has blocked the deployment with a quality gate, and this saved us from deploying these untested updates to a live environment, a costly mistake that would frustrate users and erode trust.
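The kind of quality gate described here boils down to a simple check over test results. The result shape, test names, and coverage math below are invented for illustration; this is not the actual Salesforce test API or Gearset's implementation. The 75% default reflects Salesforce's minimum Apex coverage requirement for production deployments.

```python
# Hedged sketch of a CI quality gate: block the deployment if any Apex
# test failed or if (simplified, averaged) coverage is below a threshold.

def quality_gate(test_results, min_coverage=75.0):
    failures = [r["name"] for r in test_results if r["outcome"] != "Pass"]
    # Averaging per-test coverage is a simplification for illustration.
    coverage = sum(r["coverage"] for r in test_results) / len(test_results)
    if failures:
        return (False, f"blocked: failing tests {failures}")
    if coverage < min_coverage:
        return (False, f"blocked: coverage {coverage:.0f}% below {min_coverage:.0f}%")
    return (True, "deployment may proceed")

# Hypothetical results for the pricing calculator's test class.
results = [
    {"name": "PricingCalculatorTest.calculatesTotal", "outcome": "Pass", "coverage": 92.0},
    {"name": "PricingCalculatorTest.handlesZeroQty", "outcome": "Fail", "coverage": 88.0},
]
ok, reason = quality_gate(results)
print(ok, reason)
```

A gate like this returns a blocked status the pipeline can act on, rather than relying on someone reading a test report.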
So the criticality of this testing, especially with a new technology like Agentforce, cannot be overstated.
So if we fast forward for a minute, and now it's been fixed, and the validation is complete, we'll be able to proceed with confidence.
And this is the power of a complete DevOps platform like Gearset. It's not just about deployments. It's about making every part of your release cycle faster, safer, and more reliable.
But before we deploy, there's another critical juncture. A code review process is nonnegotiable. A simple typo in a prompt template, a lack of agent guardrails, or an oversight in the Apex class it calls could cause the entire action to fail. The AI bot would simply get a useless error or worse, provide an incorrect answer to the user.
Gearset's solution for this is simple and incredibly effective.
When a pull request is opened, Gearset runs an automated code review process that scans all Salesforce metadata types to ensure they align with best practices curated around the Well-Architected framework. This also provides fast feedback on code quality and standards, blocking vulnerabilities and anti-patterns from being deployed.
As we can see here, Gearset has detected a potential vulnerability.
Of course, agent guardrails ensure AI agents are secure, compliant, and on brand by controlling their behavior, preventing unauthorized actions, and managing sensitive topics. So it's a massive relief that the missing guardrails were caught pre-deployment, before the change even reached a testing environment.
And in the world of Agentforce where metadata directly dictates the AI behavior, this isn't just a best practice. It's a critical safety measure. It's the difference between a bot that's a trusted assistant and one that becomes a liability.
This process also directly aligns with Salesforce's secure by design principles and generative AI security guidelines. A potential vulnerability on our pricing calculator would be a clear indicator that the agent's call logic is flawed, violating those principles of security and trust.
So this not only saves countless hours of debugging, prevents business disruptions, but also allows us to innovate at speed.
So with Gearset, we're shifting left, ensuring quality, trustworthiness, security, and performance, facilitating us to deploy with confidence.
So you've written your code, and it's been reviewed.
What's next?
Well, Gearset CI/CD pipelines will automate your code promotions. But as you deploy your AI agents, it's critical to continually test their behavior. Do they understand the user's intent? Do they call the right flows? Do they provide accurate, grounded responses without hallucinating?
Salesforce's Agentforce Testing Center is designed for exactly this and lets you create repeatable test scenarios to ensure your agents are accurate and effective.
But the challenge traditionally has been making these part of our automated release process. Let me explain.
When you create tests in your dev org, you have to do a manual song and dance to get these upstream, which involves downloading a CSV template and uploading it again to the next environment. So it's incredibly common for eval suites to drift if not source controlled.
Additionally, running them is a manual process: someone has to go and click to run them, so this adds a whole heap of burden on the deployment team.
So the two main Testing Center challenges we hear from customers are keeping tests consistent across environments and triggering them automatically.
Gearset is unique in this area and allows users to solve both of these problems. We treat your eval suites as metadata, allowing them to be moved seamlessly and automatically through your deployment pipeline alongside your agents and any other metadata. And to ensure they're always run, the tests are automatically triggered after each deployment.
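As a rough illustration of that flow, here's a sketch of a pipeline step in which the eval suite travels with the agent's metadata and is triggered immediately after the deployment. All function and metadata names below are hypothetical placeholders, not real Gearset or Salesforce APIs.

```python
# Illustrative sketch: eval suites treated as metadata, deployed with the
# agent, and auto-triggered post-deploy (no CSV export/import step).

def deploy(metadata_items, target_org):
    # Stand-in for a real metadata deployment to the target org.
    return {"org": target_org, "deployed": metadata_items}

def trigger_eval_suite(suite_name, target_org):
    # Stand-in for kicking off a Testing Center run after deployment.
    return {"suite": suite_name, "org": target_org, "status": "queued"}

def promote_agent(target_org):
    # The eval suite is just another item in the same deployment package.
    items = ["PricingCalculatorAgent", "PricingTopics", "PricingEvalSuite"]
    deployment = deploy(items, target_org)
    # Because the suite travelled with the agent, the run can be triggered
    # automatically the moment the deployment completes.
    run = trigger_eval_suite("PricingEvalSuite", target_org)
    return deployment, run

deployment, run = promote_agent("uat")
print(run["status"])
```

The design point is that the suite and the agent can never drift apart, because they move through the pipeline as a single unit.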
So Gearset has enabled Salesforce teams to bring the rigor of a modern DevOps process to the unique challenge of testing AI agents.
By ensuring your agents are thoroughly tested, you'll have confidence that you're deploying a fully tested, functional AI assistant that will perform correctly every time.
So it's deployed, your tests have run, and they've met your expectations.
Now what?
Well, the Agentforce Testing Center is great for testing an agent's logic, but it doesn't cover the core user experience within the Salesforce console. You still need UI testing to ensure you don't disrupt critical business processes for end users.
These tests are vital for preventing promotion until they pass.
So it's worth remembering here that we're at the top end of the testing pyramid. UI tests are the most expensive to maintain, so you'll have fewer of them, but they provide immense confidence in your agent's performance.
Gearset's in house UI test builder simplifies this process and lets you provide a prompt of the job you want to test.
So with our new pricing calculator agent, we wanna create a scenario where a user logs in. They ask about the cost of Gearset licensing, such as, how much are five deployment users and a hundred backup licenses?
The AI then takes this prompt, translates this into a series of user steps, and captures the UI in real time to adapt to the next step, making tests resilient to UI changes that would typically break hard coded scripts.
And because of this architecture, we can even prompt the UI test tool to wait for the Agentforce bot to respond, verify the response is as expected, and ask follow-up questions to simulate complex multi-step scenarios like we see here.
This is also a critical part of adhering to Salesforce's security guidelines, as a major concern with AI agents is over permissioning, where an agent has access to too much data.
So these tests will not only verify the agent is performing the correct actions, but also that it's staying within its assigned permissions. We're confirming our agent isn't just thinking correctly, but that its actions are manifesting correctly within the confines of our least-privilege security model.
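A multi-step, permission-aware UI scenario like this can be sketched as a loop of prompt, wait, verify. Everything below — the agent stub, the step list, the price figure, and the forbidden-terms check — is invented for illustration and is not the UI test builder's actual interface.

```python
# Hedged sketch of a prompt-driven, multi-step UI test with a simple
# permission-leak check. All names and values here are hypothetical.

FORBIDDEN_TERMS = {"internal discount matrix"}  # data the agent must never expose

def run_scenario(steps, agent):
    """Send each prompt, check the reply, and confirm no permission leaks."""
    transcript = []
    for prompt, expected_fragment in steps:
        reply = agent(prompt)
        # Verify the response is as expected before moving to the next step.
        assert expected_fragment in reply, f"unexpected reply to {prompt!r}"
        # Confirm the agent stays within its assigned permissions.
        assert all(term not in reply.lower() for term in FORBIDDEN_TERMS), "permission leak"
        transcript.append((prompt, reply))
    return transcript

def fake_pricing_agent(prompt):
    # Stand-in for the real agent embedded in the Salesforce console.
    if "5 deployment users" in prompt:
        return "5 deployment users and 100 backup licenses come to $1,250/month"
    return "Could you clarify which licenses you need?"

steps = [
    ("How much are 5 deployment users and 100 backup licenses?", "/month"),
    ("And for backup licenses only?", "clarify"),
]
transcript = run_scenario(steps, fake_pricing_agent)
print(len(transcript))
```

The follow-up step is what makes this an end-to-end test rather than a single-turn check: the scenario only passes if the whole conversation behaves.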
So our UI tests have passed then, and we can deploy into production confidently knowing that the underlying Apex and flows are thoroughly tested. The changes have passed through a thorough code review process to align with our coding standards without bugs or vulnerabilities.
We've had fast feedback that the changes we've made are having the expected outcomes in our eval suites, and we've performed UI testing in our UAT environments to ensure the agents are providing the expected results with sample data.
But now those changes are live, our job isn't over. In fact, for a nondeterministic system like an AI agent, it's just the beginning.
We need to know what's happening. We need observability to continually stay on top of these changes, monitoring for anything untoward taking place.
And monitoring is also the final phase of the secure by design framework for Agentforce. You must not only build and validate agents securely, but continually monitor their behavior in production.
Gearset's observability platform is a game changer in this area, giving you a bird's-eye view of the health of the Salesforce org as a whole.
The single pane of glass dashboard provides critical insights into your org's health, including API limits, email usage, and platform events, along with Apex exceptions and flow errors, so you can easily track improvements over time.
You can also get right into the details in your org without having to rely on user reports. For instance, if a key part of your agent's functionality, like the pricing escalation flow, isn't working, you can immediately identify the issue. See a visualization of the flow to understand the exact node where the issue arose and create a Jira ticket directly within Gearset to begin resolving the problem.
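The flow-error triage described above amounts to grouping error events by flow and node and flagging hotspots worth a ticket. The event shape and threshold here are hypothetical, purely to illustrate the idea, not Gearset's actual observability data.

```python
# Illustrative monitoring sketch: group recent flow error events by flow
# and node, and surface any node whose error count crosses a threshold.

from collections import Counter

def failing_nodes(flow_errors, threshold=3):
    counts = Counter((e["flow"], e["node"]) for e in flow_errors)
    return {key: n for key, n in counts.items() if n >= threshold}

# Four repeated failures at the same node of the pricing escalation flow,
# plus a one-off elsewhere that shouldn't trigger an alert.
errors = [{"flow": "Pricing_Escalation", "node": "Lookup_Account"}] * 4
errors.append({"flow": "Case_Routing", "node": "Assign_Owner"})

print(failing_nodes(errors))
```

Pinpointing the exact node, rather than just the flow, is what turns an alert into something actionable, like the Jira ticket mentioned above.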
This level of insight empowers you to make data driven decisions as opposed to using gut feel or guessing.
So by turning confusing, unexpected behavior into actionable, explainable insights, we're ensuring our Agentforce solution is not only reliable, but continuously improving for our users.
So let me summarize not only this talk, but all three of Gearset's Agentforce talks as a whole.
Agentforce systems differ from traditional Salesforce development due to unpredictable agent behavior and their deep reliance on massive, high quality datasets. These then necessitate a new approach to testing and observability by proactively shifting left.
And to be confident and successful with Agentforce delivery, there are a few non-negotiables.
Solid foundations, smooth delivery, and automated protection. And it's that third one in particular: the importance of testing software, even more so for nondeterministic systems like an AI agent, just can't be overstressed.
But often the biggest challenge is knowing what to test and how to test it in a way that gives true confidence rather than leaving you with unknowns. And this is where Gearset is uniquely placed to support you across the DevOps life cycle, all underpinned by observability.
So by all means, please come and grab us if you'd like to talk further. And, of course, just get in touch with the Gearset team. Thank you very much. We've been Gearset.