Description
What’s breaking your org right now — and how long will it take to find out? Rewatch this session from Dreamforce 2025 to discover the power of observability to proactively catch Flow and Apex errors before they impact users.
Speakers:
Kenny Vaughan, Development Team Lead
Transcript
Flows and Apex are costing you more than they should.
They might be costing you late nights, user satisfaction, or time for admins and devs to build new features. I'm talking about production issues.
Hello. I'm Kenny, and it's a pleasure to be here with you. I'm an engineer at Gearset, and for the last two years, I've been leading our team building our observability solution.
In my eleven years in the industry, I've felt the pain of a bug grinding a production system, and the teams that use it, to a halt.
All sorts of teams are embracing new parts of DevOps. If we think about Salesforce backup, just a few years ago, adoption hovered at around fifty percent. Now it's a standard practice at over seventy percent.
The entire industry has learned that hope is not a strategy.
I'm here to tell you that observability is on that same journey. It isn't just a buzzword. It's our team's most powerful tool to stop flows and Apex costing more than they should.
There are three things I want to talk about today. I'll talk about the true cost of production errors, from revenue loss to team burnout. I'll introduce observability and how you can shift your mindset from reactive firefighting to proactive fire prevention.
I'm then gonna show you Gearset's observability solution in action, putting that theory into practice.
We'll be using Gearset's solution as an example, but I want you to leave understanding why observability is a cornerstone for building happy, high-performing teams and robust, reliable systems. It's not just a nice-to-have.
Let's set some context: flow and Apex errors.
I know there are likely many of you who already know this, but please bear with me. I want to make sure that we're all talking about the same thing. So here's my thirty second primer on flows and Apex.
Flows are a declarative way of defining business processes without the need to write any code.
A flow error is when something happens that your flow didn't anticipate. It could be bad data, a bug introduced in your most recent deployment, or something else the flow relies on just isn't working correctly.
Apex is Salesforce's programming language, and it lets your developers extend the platform with custom code.
Errors can occur in Apex for the same reasons as flows: bad data or bugs.
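To make "bad data" concrete, here's a minimal, hypothetical Apex sketch (the class name and logic are invented purely for illustration) of the kind of code that fails at runtime when a record isn't in the shape it expects:

```apex
// Hypothetical helper behind an invoicing process; names invented for illustration.
public with sharing class InvoiceRegionHelper {
    public static String regionFor(Account acct) {
        // If BillingCountry was never filled in (bad data), this call
        // dereferences null and throws:
        //   System.NullPointerException: Attempt to de-reference a null object
        // Left unhandled, that exception rolls back the transaction, and
        // Salesforce surfaces it in the error emails described below.
        return acct.BillingCountry.toUpperCase();
    }
}
```

A flow fault is the same idea without the code: an element hits data or state it didn't anticipate, the interview fails, and an error is recorded.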
You get some out-of-the-box observability from Salesforce for flow and Apex errors without having to do anything. Salesforce will send you emails when errors occur.
They'll either go to the last person who modified the flow, or they can be configured to go to a central inbox or distribution list.
You have to remember to check, dig through the noise, and work out what to triage.
Prioritization is a huge factor here. How do you know which issue needs your attention right now and which issues can wait?
Some devs have told us, "We sort of just know," which is great, but how does someone else come along and make those same well-reasoned decisions?
Or what if you sort of don't know?
For most teams, these emails are where they start when it comes to monitoring flow and Apex errors.
And if these emails are what you're relying on, you should be asking: how many of these errors are on the same flow or Apex class?
Are these the same errors, or are they new ones?
Have I seen a spike or a reduction? What changed and when? And, ultimately, what impact and cost is this having?
Finding this information from error emails is manual, time consuming, and it involves a bit of effort to know what needs your attention right now.
Observability starts by telling you that something has gone wrong, but good observability provides more.
You should be able to quickly prioritize what needs your attention.
You should be able to pull in more context to help diagnose the issue and prevent it from escalating in the future.
You shouldn't just be reacting to incidents. Your tooling should enable you to be more proactive.
Incorporating observability practices into your workflow is the real way to take control of flow and Apex errors. So let's dig in further. What is observability?
Simply put, observability is about understanding the state of your whole system.
Are things running properly? Do we have any errors? Is the system getting slower? Are we at risk of hitting org limits?
It's not just about logs and having alerts, but if we're going to be really practical, that's a great starting point for any team.
Imagine it's five PM on a Friday, and you've just finished a release, and you're ready for the weekend.
Well, that's our first mistake. Don't release on a Friday.
You open up Slack, and you see a message from your sales channel: "I can't generate an invoice."
Your stomach drops. Before you can respond, another message comes in and another and another, and quickly, the channel is becoming a cascade of notifications.
Then you get the DM you never want to see from your chief revenue officer, your CRO: "What's going on with payment processing? My whole team is blocked."
You dive into your inbox, but it's a flood of two hundred error emails, most of them just noise. At least half are errors you've been meaning to fix but just haven't got around to yet, and you're digging for a needle in a haystack. Meanwhile, leadership is demanding answers.
Is this a full blown production outage? Kenny, what's the financial impact? And the worst, when will it be fixed? We need a firm ETA.
There goes your Friday, maybe your weekend, and the next few weeks aren't going to be fun.
Retrospectives, RCAs, cleaning up corrupt data.
After all, once production systems are back online, that's not the end of the story.
Alright. I think I've stressed myself out, but that's the reality of many teams' lives. That's the chaotic world that they live in.
And that's not just a feeling. Data backs it up.
Seventy four percent of teams without observability only find out about issues when their users, or maybe your CRO, report them, and that's just the issues that get reported.
This comes with a cost. The processes you've automated are likely critical to several business functions, everything from generating quotes and invoices to reports and system integrations.
If your invoice process fails, for example, you could have an entire revenue generating department like sales unable to do their work.
A recent report from the ITIC found that for forty one percent of companies, a single hour of downtime can cost over one million dollars.
That makes digging through emails a pretty expensive activity.
Meanwhile, the admins and developers have to investigate what's going on.
This can be time consuming, and it takes a skilled admin or developer to do. And while they're hunting bugs, they're not adding any new value to the company: no new processes are being automated, no existing processes are being improved, and no new features are being built.
Let's not neglect the impact on people and culture.
Teams rely on these processes to do their day-to-day work, and if these fail, your sales and support teams are likely going to bear the brunt of customer frustrations.
The internal Salesforce teams are gonna feel frustrated too, and trust can be eroded in the systems that were built, in the teams that manage them, and in the teams that use them.
And the cost to you? You've paid with your time: maybe weekends, family dinners, time with loved ones and friends.
Your health: that gut-wrenching stomach drop every time Slack goes off after hours, and that ticking anxiety of waiting for the next incident, who knows when. And your job satisfaction.
You were hired to build and to innovate, but you're fighting fires and explaining what went wrong.
So the cost is immense, impacting our revenue, our innovation, and most importantly, our people.
And, of course, we can try to prevent all of this. We have strategies. We enforce code reviews, we write good tests, and we build robust deployment pipelines. And if you're not doing those, come and speak to Gearset. But we can try to do everything right, and as we say in software engineering, ship happens.
And why is that? For a few key reasons. First, we all ship bugs. High performing teams deploy twelve times more frequently than their peers, and this is a good thing. But with more code, there may be more bugs.
And it's no surprise that bugs cause twenty one percent of Salesforce outages.
Even if you do everything right, you can still run into an error. Governor limits are a great example, and we've all seen those.
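To ground that, here's a simplified, hypothetical Apex sketch (the class and logic are invented for illustration) of the classic pattern that trips a governor limit even when every individual record is fine. It's the same "Too many SOQL queries" error we'll run into in the demo shortly:

```apex
// Hypothetical, deliberately non-bulkified sketch: a SOQL query inside a loop.
public with sharing class LineItemPricer {
    public static void priceLineItems(List<OpportunityLineItem> items) {
        for (OpportunityLineItem item : items) {
            // One query per item: with more than 100 items in a single
            // transaction, this hits the synchronous governor limit and throws
            //   System.LimitException: Too many SOQL queries: 101
            PricebookEntry entry = [
                SELECT UnitPrice FROM PricebookEntry
                WHERE Id = :item.PricebookEntryId
            ];
            item.UnitPrice = entry.UnitPrice;
        }
        // The bulkified fix: query all the PricebookEntry records once before
        // the loop and read them from a Map keyed by Id.
    }
}
```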
We can't anticipate everything that could go wrong, but what we can do is put the tools and processes in place to prepare ourselves to deal with them when they do arise.
With observability, we can be forewarned, and we can be prepared for when things go wrong.
In our earlier example, we can be the one to drop a message to the sales team. We can let them know that we're aware of an emerging issue. We're already working on a fix. Maybe there's a workaround they can use right now. Maybe we even choose to roll back a deployment, but no one's work has to grind to a halt.
And we could use those tools to go beyond that reactive firefighting and start thinking proactively. So how can my observability, the metrics and numbers I'm seeing, help me make my org healthier, more efficient, and ultimately more stable?
That stability has a profound impact on your team. The twenty twenty four DORA report is unequivocal: the primary driver of team burnout is not technical debt, it's the unstable priorities caused by constant firefighting.
Let's talk some numbers.
Seventy four percent of teams that don't have observability tools most often learn about issues from end users or the CRO, seventy four percent.
Teams with an observability solution are fifty percent more likely than other teams to catch bugs within a day and forty eight percent more likely to fix them within a day. Let's just think about that within one day.
Alright. We've talked about the theory. Let's put it into practice.
I'd love you to get excited about our product and buy it, of course, but what I really want you to do is walk away from this session convinced that every Salesforce team needs a good observability strategy.
At Gearset, we've built a platform that we believe is the best way to do that, and I want to show you what good looks like in action.
I've navigated to the Gearset observability dashboard.
We usually have a few theories about what went wrong, and it's always best to try and find the data to either prove or disprove those theories.
My two theories are that I have breached my org limits, which is causing the error, or that something in today's deployment has broken something.
Immediately from the numbers in the top right, I can see that it's not an issue with org limits. We haven't breached any of the limits. Nothing's in critical and nothing's in warning.
Instead, I'll look to the large numbers on the left hand side.
I can see that there's been a five percent increase from the previous time period in the number of users impacted, but the number of flow and Apex errors has decreased by thirty three and fifty five percent respectively.
This isn't really telling us much.
If we look at the flow and Apex errors graph, and focus only on flows for now, I can see there has been a rather large spike. Let's dig deeper into that.
I've chosen a larger time range, and it starts to uncover something much more telling.
If we look all the way back to early September, we can see that we have a baseline of errors that just continually occur.
But near the end of September, we did a deployment, and from there we can see an uptick in the number of errors happening on these flows.
We can then sort and filter the data to see if we can figure out which of the flows is causing us the most trouble. If we sort by the flow with the most errors, we can see the weekly invoice flow impacting twenty users, with three hundred and thirty five errors in that time range. We can then drill down further and see the most problematic areas. I'm going to choose this top error here: too many SOQL queries.
And now we've got a really full picture of everything that's been going on and what this error is. Along the left hand side, we can see all of the users that are impacted. That's quite useful, because the user in the sales team who told me they're having an issue is Anika, so I know that she's been impacted by this. I'll just hide that for now. I can also remind myself within Gearset, without having to go to Salesforce, what this flow was doing.
I can navigate through it, and I can see exactly the element that the error was occurring on.
From here, I can go and immediately create a support ticket, prepopulated with the information we need, ready for the development team to pick up and triage further.
Further to that, I can leave some notes, and I can take a look at when that flow was last deployed. Drilling in deeper shows me that there had been some spikes of errors previously, but something really went wrong on the ninth of October.
All of that in less than five minutes.
So we've disproven some of our initial theories. We have identified the exact flow that was erroring.
We've pinpointed the deployment that may have caused it, prioritized the fix by the number of users impacted and the number of errors, and assigned a Jira ticket with all the context anyone could need. Now let's go and set up a proactive alert as well.
I don't want an alert every time this throws an error. We know it happens every now and again. But say if we have ten errors in one hour, I want to be notified in my Slack channel. Great. That's all done.
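If you're wondering what that kind of threshold rule boils down to, here's a rough, hypothetical sketch in Apex purely to show the shape of the logic. Flow_Error__c, Flow_Name__c, and the Slack_Webhook named credential are invented names, and in practice the observability tooling does all of this for you without any custom code:

```apex
// Hypothetical threshold alert: ten or more errors in the last hour posts to Slack.
public with sharing class FlowErrorThresholdAlert implements Queueable, Database.AllowsCallouts {
    public void execute(QueueableContext ctx) {
        // Count errors logged for this flow in the last hour.
        Datetime oneHourAgo = Datetime.now().addHours(-1);
        Integer recentErrors = [
            SELECT COUNT() FROM Flow_Error__c
            WHERE Flow_Name__c = 'Weekly_Invoice'
              AND CreatedDate >= :oneHourAgo
        ];
        if (recentErrors >= 10) {
            // Threshold breached: notify the team's Slack channel via webhook.
            HttpRequest req = new HttpRequest();
            req.setEndpoint('callout:Slack_Webhook');
            req.setMethod('POST');
            req.setHeader('Content-Type', 'application/json');
            req.setBody('{"text":"Weekly invoice flow: ' + recentErrors +
                        ' errors in the last hour"}');
            new Http().send(req);
        }
    }
}
```

You'd enqueue something like this from an hourly scheduled job; the essentials are just a threshold, a time window, and a notification channel.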
We've shown you what it's like fighting fires with Gearset's observability by your side. But the real transformation and magic happens when you stop fighting fires altogether and use these insights to become more proactive.
We could spend another twenty minutes showing you all of the other features, from monitoring your org limits to prevent breaking integrations to analyzing error trends so you can refactor your most fragile flows before they break your business.
But the features aren't the most important takeaway.
The real point is the operational shift: moving away from reacting to your users and toward informing your business.
So as you think about bringing this capability to your own team, here's what you really should look for.
Clarity in minutes, not months. A single click to get things set up.
Signal, not noise. So filter, sort, and slice the data in a way that helps you find issues.
Insights connected to action.
Get notified where you work, create tickets, and integrate with the ways that you work and the tools that you use.
And fire prevention, not just firefighting. Spot trends and know what to look for next.
The choice between these two approaches doesn't just get you a new tool. It creates two completely different realities for your teams, your business, and for you.
So as a final thought, I want you to picture two worlds.
In the first world, an error happens, and your team's first clue is a vague Slack message from a confused sales team or, worse, a frantic message from the CRO. You're scrambling, digging through emails, unsure how long it's been going on, trying to figure out how many people are affected. Is this a one-off, or is it the tip of the iceberg of a major outage? That's the reactive world most teams live in.
Now picture the second world.
There's an error on our create invoice flow, and you know that because your observability alerted you in Slack as soon as it happened.
Before it escalates, you're already investigating.
You've told the sales team, you've rolled back your deployment, and you're looking into the root cause.
You're in control.
That is the proactive world of observability.
Which will you choose?