Description
Having a development pipeline to move code from lower environments to higher is one part of the process – how do you use it effectively and tie it back to your roadmap? We’ll cover DORA metrics as a primer, but also how to pull in other metrics such as velocity and planning. All of these metrics together help with the health of your team and pipelines.
Matt Pieper – Director of Enterprise Engineering, LeafLink. Matt currently leads Salesforce and Backend engineering teams at LeafLink. He has worked across product engineering and business systems teams, giving a perspective into tech outside of Salesforce. When not talking Salesforce, he enjoys travel, street photography, and volunteering in disaster response and animal welfare.
Transcript
I hate it here.
Okay.
My friends used to play a game where we would pick a decade we wished we could live in instead of this one.
I'd say the eighteen thirties, but without all the racism and the diseases. Bear with me here for a second.
So, today we're gonna talk about: you have a pipeline, now what? So, raise your hand if you want to sit in a session where you don't have to buy anything or hear about Agentforce or AI.
Yes, you're in the right spot. So, you have a pipeline. Now what? Right? So, we've all heard about establishing that pipeline.
How do we push changes through it, right? Like the table stakes, right? Getting away from change sets, being able to have a more mature process. So, we're gonna dive in.
So, great headshot, right?
So my team gave this to me. It's hilarious because I'm not even that big. But I especially love my Salesblazer, or rather "Salt Blazer," hoodie. I'm gonna get a custom one made and take it to Dreamforce this year.
What about me? Why am I up here? Why should you listen to me? So, my name is Matt Pieper. I'm currently an engineering director at LeafLink.
I own multiple back end engineering pods, front end engineers, and then Salesforce. Over the last twenty years, I've been an application developer and back end engineer by trade. Got pulled into the Salesforce world on a project and then stuck, right? So about twenty five percent of my time these days is Salesforce, seventy five percent is regular back end engineering.
DevOps is a very important thing for us.
We deploy across our engineering org over twenty five times a day. So, true CI/CD, where we're pushing things out in our application behind feature flags.
Always constantly pushing those small bits of code that are in testable units so we can continue to go.
Most of my time over those years has been in startups. So, value to market, right? How do we get speed to market? How do we test? How do we get feedback? If we shipped on weekly, monthly, or quarterly releases, we'd be dead in the water.
So, whatchu talkin' 'bout, Willis? Or, in this case, Matt?
We're gonna have an intro. We're gonna talk about some of the metrics that we're gonna dive into. How to surface that data and look at it. We're gonna walk through a little example and then I hope you have questions.
I really want questions. That's the big feedback loop here, because it means I'm not just talking at you. DevOps is not a one-time activity, right? As you've seen in the icons, right?
It's that infinity loop, right? You're constantly iterating, you're constantly evolving. Just because you set up your pipeline, and your code or flows or metadata or data are pushing through it, doesn't mean that you're done, right? We're constantly monitoring.
We're constantly monitoring our teams.
You have to build the foundation to get what you want here, right? So, we're gonna have to use issue management, right? This is Jira, which is a four-letter word for most people, right? Or Linear, or Shortcut.
Why does this matter? These are the building blocks that we're actually working on. A lot of the time, as people get into these ALM or issue tracking systems, they tend to put work into one giant ticket, right? I'm gonna do X. I'm gonna ship a new flow, right? Really, what we want is to break it down into individual user stories that each tie to an individual PR, or pull request.
If you don't break it down to this level you'll lose that tracking, right? What did this PR go to? How can I use git blame to figure out who changed something or where something broke? Make sure you start with the issue management and make sure that every single commit has a ticket established with it. That is your backbone, right? You can go back and look and say, where did this work come from? Why did I deploy this?
Connect your issues to source. If that user story isn't tracked, you can't measure it. Going back to what Ian said, right? All these features get built and then never shipped.
If you're starting to track that, you can go back and say, okay, in my retro, why did we build it this way? What were those requirements? What were those comments? And you tie it to that source and you can always go back and figure out exactly which story it has.
Raise your hand, right, if you've seen, in a description field, ticket number SF-853.
What does that even mean to you, right? What does that even do? Yeah, I can go back into my ticket management platform. But what happens when I decide to move from Jira to Linear? What happens if I move from Jira Service Management to ServiceNow? Those ticket numbers mean nothing.
Use your stories, use your code commits, use documentation.
Traceability.
Traceability means several different things in my world, right? But basically: use consistent branch naming. If it's a feature, right, in our org we always use the ticket number, dash, a small description of what we're shipping. That way you know what that branch is, and you can go back to that ticket.
Bonus step: when you include the story number in your commit, your PR, and your branch, Jira automatically associates them. So, in that ticket you can go back and show: hey, this is the commit, this is the PR, and this is the deployment it went out in. Very important when a P1 or a P2 comes up, because you can run a git blame, figure out exactly which PR caused your issue, and then go back into it.
Same thing with the ticket to deployment releases, right? Every release should have that story so you can go back and track.
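To make that convention concrete, here is a minimal sketch of a commit-msg hook that rejects commits without a ticket key. The SF-style key, the regex, and the sample message are illustrative assumptions, not the speaker's actual setup.

```python
#!/usr/bin/env python3
"""Minimal commit-msg hook sketch: reject commits whose message doesn't
start with a ticket key like SF-853. Save as .git/hooks/commit-msg and
mark it executable. The key format here is an assumed convention."""
import re
import sys

TICKET_KEY = re.compile(r"^[A-Z][A-Z0-9]{1,9}-\d+\b")  # e.g. SF-853

def main() -> int:
    # Git passes the path of the commit message file as the first argument.
    with open(sys.argv[1], encoding="utf-8") as f:
        first_line = f.readline().strip()
    if TICKET_KEY.match(first_line):
        return 0
    print("Commit rejected: start the message with its ticket key, e.g.")
    print("  SF-853 add provisioning status to order flow")
    return 1

if __name__ == "__main__":
    sys.exit(main())
```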
Development is a team sport.
It's not you, it's not your ideas, it's everybody on the team and you need to be able to prove and show exactly what you were doing at the time so your users and your other devs or admins on your team can figure out what you're doing.
Be agile.
It's easy, right, to say agile, right? Because it's like, oh, we're just gonna keep on shipping, right? We're shifting left and we're gonna continue to work on things. That doesn't mean you don't have a plan, right? It doesn't mean, hey, I'm just gonna keep throwing things out and see what sticks.
What that means is you're shipping features bit by bit along the way versus this giant waterfall release at the end. That allows you to see is your new code playing well with others? Did it cause any issues?
You don't have that midnight release where you're spending three hours trying to figure out why that metadata failed.
Go in small chunks. Use your sprints. Your sprints are there.
Work that starts in a sprint should be concluded in that sprint. If you are seeing tickets roll over, that means your tickets are too big. You need to sit there and break them down. A story should always fit within one sprint.
That's where the plan and that's where the feedback comes from.
So, raise your hand if you've heard about DORA.
No, it's not that one. So, in the DevOps world, right, these are the metrics we live by, including in the app dev world, right? There are just four easy metrics.
Anyone can do these; they're pretty easy to keep up with. Deployment frequency.
How often am I successfully releasing to production?
From that commit, is the code showing up there? High frequency means you have smaller, more frequent releases.
Low frequency, you have giant, slower releases.
Lead time for change.
This is where discipline comes into play with your commits, right? A lot of the time, when people are new to source control, they commit, commit, commit, and push everything straight up.
When you're working locally, keep your branches local, right? Don't push them up to your origin branch on GitHub until you're ready, right? Because that's what starts the lead-time-for-changes clock. How long does it take for your initial commit to reach production?
What does that mean? If I push code into my branch and then open a pull request for someone to review, how long does it take from me pushing that code and saying I'm code complete, to someone peer reviewing it, maybe QA getting involved, and then me getting to production? If it takes me one day to build a feature, say an Apex class or a Flow, and then it doesn't get to production until nine days later, what happened? Where are your bottlenecks? Are there any bottlenecks?
Raise your hand if you have a UAT that involves, you know, a business user that, you know, is always busy and can never review your tickets, right? This is where you start to surface those things.
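As a rough illustration of measuring this, here is a sketch that computes lead time from exported commit and deploy timestamps. The field names and dates are made up for the example.

```python
"""Lead-time-for-changes sketch, assuming you've exported a (first
commit, production deploy) timestamp pair per change from your source
control and pipeline tools. All field names and data are illustrative."""
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%dT%H:%M"

changes = [
    {"pr": 101, "first_commit": "2024-05-01T09:00", "deployed": "2024-05-02T16:00"},
    {"pr": 102, "first_commit": "2024-05-01T11:00", "deployed": "2024-05-10T10:00"},
]

def hours_between(start: str, end: str) -> float:
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 3600

lead_times = [hours_between(c["first_commit"], c["deployed"]) for c in changes]
print(f"median lead time: {median(lead_times) / 24:.1f} days")

# The one-day build that landed nine days later shows up as an outlier:
for change, hrs in zip(changes, lead_times):
    if hrs > 7 * 24:
        print(f"PR {change['pr']} took {hrs / 24:.1f} days: look for a review or UAT bottleneck")
```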
Mean Time to Recover. No, "mean" doesn't mean angry. It doesn't mean that you're a jerk. It's the average time it takes you to recover from a major incident, right? Whether that's a P1, P2, or P3. If you're not familiar with the P numbering system, it's a common convention, right? P0 or P1 means major outage, P2 is less severe, and P3 is like, okay, that's annoying.
If you get that lower MTTR, that means that you're able to recover faster, right? That means that you're able to figure out where that issue arose in your bug process and then solve it and revert back.
It is unique to every single company and every single piece of software. Just because you have an MTTR of three hours doesn't mean that it's bad. That's just your baseline right now. Over time, you wanna see that decrease because that means that you've documented your systems. That means you're able to swarm as an incident. That means that you've shipped small enough that you're not causing those issues.
And then change failure rate, also known as CFR.
How many of your releases create a quality issue, whether that's a bug or an outage? Lower is better; this is your quality score, right? Four easy metrics that anyone can do. The best part is, this is all free, right? If you have any sort of source control plus your ticketing system, you can easily pull these together.
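For instance, here is a minimal sketch of the other three metrics computed from two flat lists you could export from a pipeline tool and a ticketing system; the records and the seven-day window are assumptions for illustration.

```python
"""Sketch of deployment frequency, change failure rate, and MTTR from
exported deployment and incident records. Field names, numbers, and the
reporting window are illustrative."""
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M"

deployments = [
    {"at": "2024-05-03T10:00", "caused_incident": False},
    {"at": "2024-05-06T14:00", "caused_incident": True},
    {"at": "2024-05-08T09:00", "caused_incident": False},
]
incidents = [
    {"opened": "2024-05-06T14:30", "resolved": "2024-05-06T21:00"},
]

window_days = 7
cfr = sum(d["caused_incident"] for d in deployments) / len(deployments)
mttr_hours = sum(
    (datetime.strptime(i["resolved"], FMT) - datetime.strptime(i["opened"], FMT)).total_seconds() / 3600
    for i in incidents
) / len(incidents)

print(f"deploys/day: {len(deployments) / window_days:.2f}")
print(f"change failure rate: {cfr:.0%}")
print(f"MTTR: {mttr_hours:.1f} hours")
# Lead time for changes comes from the commit-to-deploy sketch earlier.
```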
Speed versus quality.
DORA metrics balance that, right? You can pull these individual levers. It doesn't mean that you have to be good at all of them.
High-performing teams are good at all four. But if you're starting off, pick one or two that you wanna improve. It's pretty easy. If you have a lot of high-pressure stakeholders, yeah, keep that failure rate down, right? Because that means you won't get any noise.
You're balancing throughput and stability, right? How quickly can you get something into your org?
If you're working on a quarter or an annual release basis, these metrics aren't really that great, right?
DORA is meant for CI/CD: continuous integration, continuous deployment. How often are you pushing? Because these things matter. It goes back to, you know, Facebook's annoying "move fast and break things" attitude.
Cycle time. This is about identifying your bottlenecks, as we talked about earlier, right? It helps you figure out where your issues are. This is very critical.
These are things that we live and breathe every single day, because my responsibility as an engineering director, and my EMs', right, is: one, people leadership, that's always first. Two, being a corporate financial whiz. And three, delivery, right? I'm responsible for the delivery of features.
My teams are responsible for doing the actual hard work but it's up to me to monitor that health. How quickly are we doing things?
Planning.
Not a DORA metric. One of my favorites, though; it goes into delivery.
How well does a team deliver what it planned?
Keyword plan. If you're not planning your sprints, if you're not planning your work, this metric goes out the window.
Raise your hand if you've ever sat there and done sprint planning and you're like, yes, this is an awesome plan, and then you get to the end of the sprint and none of it was done.
Right? That's where you pull in that unplanned work, right? Marketing needs a campaign built over here. Oh, finance needs a new product set up. All of that unplanned work goes into it and when you start to track this you can go to your C suite or your boss or other units and say, look, this is the work that we're doing, right? Unplanned work is killing us. We're sitting there reacting the entire time and that's taking up our team's energy.
High accuracy, ninety five percent, that's an elite metric. Shoot for like eighty; that's how you're gonna get there. Low accuracy means you're either committing to too much or you're getting a lot of that unplanned work.
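A minimal sketch of that math, with made-up sprint numbers: planned points completed over points committed at sprint start, with unplanned work tracked separately so you can show it to stakeholders.

```python
"""Planning accuracy sketch. The field names and point counts are
illustrative, not real sprint data."""
sprint = {
    "planned_points": 40,        # committed at sprint planning
    "planned_points_done": 31,   # of those, actually finished
    "unplanned_points": 12,      # pulled in mid-sprint
}

accuracy = sprint["planned_points_done"] / sprint["planned_points"]
print(f"planning accuracy: {accuracy:.0%}")  # 78%, near the ~80% target
print(f"unplanned work: {sprint['unplanned_points']} points pulled in mid-sprint")
```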
Allocation.
I was sitting there at lunch, listening to someone talk, right? And they said, oh, we need a roadmap. When do we bring in that dev? How do we know what they're working on?
Does that dev's time cost anything? When you start to track your work with your source control and with Jira or another ticket management system, you can track that resource allocation. You can say: on average, right, my admin gets paid, I don't know, seventy K a year, right? My developer gets paid a hundred K a year.
When they're working on this feature, this feature cost us twenty thousand dollars. We budgeted fifteen K. Where did that come from? That's just from people-hours, right? So, you can start to say, here's what our projects actually cost us in development time, if you're working internally, right?
Or, if you force your consultants to do this, then you can say: okay guys, you're over-billing us. Maybe we should hire our own internal team. Sorry to any partners out there for stealing your work.
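Here is a back-of-the-envelope sketch of that cost math using the salary figures from the talk; the roughly 2,080 working hours per year and the logged hours are assumptions for illustration.

```python
"""Feature cost from tracked hours. Salaries match the talk's example;
hours per year and logged hours are assumed for illustration."""
HOURS_PER_YEAR = 2080

hourly_rate = {
    "admin": 70_000 / HOURS_PER_YEAR,
    "developer": 100_000 / HOURS_PER_YEAR,
}

# Hours logged against the feature's tickets, e.g. from Jira worklogs.
feature_hours = {"admin": 120, "developer": 330}

cost = sum(hourly_rate[role] * hrs for role, hrs in feature_hours.items())
print(f"people-hours cost: ${cost:,.0f}")  # roughly $20k against a $15k budget
```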
So, what does this look like? What do I look at on a daily basis?
For instance, every two weeks I sit with thirty people across our product and engineering teams to say, here's what we're working on, here's how well our teams perform.
What do we use?
First of all, this is my Datadog dashboard that I check every single day.
If you look at this, some things will scream out to you, right? That 3.98k, that's my total number of logs. That middle number, 772, that's my number of warning logs, right? Something possibly is gonna happen. And on the far right, I have six errors.
Look at your warnings, look at your errors; they inform your DORA metrics. If you see those start to creep up, you have an issue somewhere. You need to plan for that work before it becomes an incident.
Also, things jump out at me as I look at this. Why is one of my flows running 1,740 times a day while the next one down runs 787? Do we need to put a condition on that flow, perhaps, so it doesn't run as much?
This is my quality dashboard that I start off with every single day. I use a product called Nebula Logger and then I ship that to Datadog. It's part of our engineering stack.
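If you want the same numbers outside a dashboard, here is a hedged sketch of pulling daily warning and error counts from the Datadog Logs API with plain requests; it follows Datadog's v2 logs aggregate endpoint, but verify the exact request and response shape against their docs before relying on it.

```python
"""Daily warning/error log counts from Datadog, so you can watch them
creep up before they become an incident. Endpoint per Datadog's v2 logs
aggregate API; the response is printed raw rather than assumed."""
import os
import requests

HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def log_count(query: str) -> dict:
    body = {
        "compute": [{"aggregation": "count"}],
        "filter": {"from": "now-1d", "to": "now", "query": query},
    }
    resp = requests.post(
        "https://api.datadoghq.com/api/v2/logs/analytics/aggregate",
        json=body, headers=HEADERS, timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # the count lives in the response's buckets

for query in ("status:warn", "status:error"):
    print(query, log_count(query))
```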
These next couple dashboards, we use a product called LinearB. There's several different metric dashboards out there if you wanna purchase them but we love this one.
What you're looking at here is a random snapshot in time that I chose because it had a lot of data. You'll notice our code changes. You have a very big exception here; that's because somebody pulled a bunch of code in from a legacy platform and pushed it into the repo, right? So that skewed everything. But if you look at an average week, we're shipping about a hundred lines of code each week.
Our commits: thirteen.
Our PRs opened: three. Deployment frequency: one. That's an important metric, right? Why are we opening three PRs a week when we're shipping less than one a week?
That tells us we have a bottleneck somewhere. Maybe we're not reviewing. Maybe there's some dependencies. Maybe we don't want to overwrite.
Maybe we had some merge conflicts. Maybe this PR was like just too big and took people a long time to review.
Cycle time. These are our DORA metrics, right? Cycle time, we had two days. That's pretty elite, right? So, from the time I was code complete and pushed, two days later it was in production, meaning we were able to QA and UAT in that window.
Deploy frequency, we saw it in the last one: deploying once weekly. MTTR, we only have one incident here so it's a little skewed, but it took us six and a half hours to recover. That's basically an entire working day. What took us so long? Why was that? Maybe we just didn't close the ticket out, and so the metric over-counted.
On change failure rate, we're well under fifteen percent, but if you look at that spike, the CFR pointed directly to our MTTR right there.
My favorite look that we have.
This is basically all of our planning and guidance here. So, we can see that we finished less than fifty percent of our planned work this sprint.
But on capacity, we're at seventy three percent. So, we were able to sit there and figure out: hey, this is how many points we could work on. We knew what we could work on; we just didn't do it.
And then our delivery.
Right? In this particular week, you saw that we added sixty-seven points' worth of work. Or actually, maybe that was story points. Yes, story points.
So, sixty-seven story points. On average, an engineer in our planning, right, in our planning poker, can only do thirteen points a week. There are only four people on this team: four times thirteen is fifty-two. We added way too much work on top of our planned work.
And then investment profile.
Bugs were only nine percent of our work here. That's well within our targets, right? We decided as a team that we want to work sixty percent on features, twenty percent on KTLO, keep the lights on, ten percent on tech efficiency, and ten percent on bugs. We're right in that wheelhouse.
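A minimal sketch of that investment profile check, bucketing resolved issues by work type against the 60/20/10/10 split from the talk; the labels and counts are illustrative.

```python
"""Investment profile sketch: compare actual work mix to the team's
agreed split. Issue labels and counts are made up for the example."""
from collections import Counter

resolved = ["feature"] * 19 + ["ktlo"] * 6 + ["tech_efficiency"] * 3 + ["bug"] * 3
targets = {"feature": 0.60, "ktlo": 0.20, "tech_efficiency": 0.10, "bug": 0.10}

counts = Counter(resolved)
total = sum(counts.values())
for bucket, target in targets.items():
    actual = counts[bucket] / total
    flag = "" if abs(actual - target) <= 0.05 else "  <-- off target"
    print(f"{bucket:16s} actual {actual:4.0%}  target {target:4.0%}{flag}")
```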
And then the people side, right? We had six and a half engineers typically on an iteration, and what did they actually work on?
This is directly from Jira. This is our org over time. And you can see, over on this side, my two pods, right? The issues resolved and the story points resolved.
We can see that on average, right, we resolved thirty different issues, resulting in fifty-nine points.
Where do those break down? Tech OKRs, right? Those are OKRs within our org that we wanna do, right? Data efficiency, refactoring, cleaning up data points, those types of work.
Product OKRs, this is what our product teams want, right? We're shipping these to either external customers or our internal customers. And then bugs. Bugs are bad; chase them, squash them.
Not P3s and P4s, though. You can let those linger.
But then on the right, you can see where our bottlenecks happen, right? Our average in-progress time on a ticket was five days' worth of work. But it then required two more days to review and six more days in QA. That is a long time. How can we get our QA days down so we can get those features into production faster?
So, an example. How do you use these metrics in the wild? No one's laughing at my Dr. Evil.
So, Friday, ten am, we ship a release. No issues, right? Boom. Pipeline's clean. We're off to the races.
Then your users get their hands on it, right? All of a sudden, Slack starts to blow up. For some reason you don't have monitoring on it.
Eleven am, support notifies us that customers are not being provisioned in their app. All right. So, as soon as they complete their order form and sign up for our services, our bespoke application isn't receiving those notifications. That's a P1, all hands on deck.
An incident team pops into Zoom.
That's our swarm. Members of our platform/DevOps team, Salesforce, and product engineering teams join. So, now we have fifteen people on a call, including the CTO.
We do a root cause analysis. The issue is traced back to changes in our code in the latest release, right? We look at the PRs that hit the repo, use a git blame to identify any new code, and laser-focus on it.
Then, at two o'clock, we have that hotfix, we push it and it takes thirty minutes for our test to run, meaning we're not recovered until two thirty.
By the way, Chumbawamba. I don't know if anyone's a Tubthumping fan; I was waiting for some jokes there.
Random trivia: that song was written about a guy they watched outside their window who was drunk, trying to put his key into his door, and kept falling down and literally getting back up again. That's where the song came from. And the "Danny Boy" part: he was literally listening to "Danny Boy" as he fell down over and over again. So, trivia.
So, three and a half hours. What's our impact?
Our P1 incident means that in this sprint, with one out of five deployments failing, we have a twenty percent CFR, higher than our usual.
Clearly, something's wrong, right? Why didn't we catch and test this? Our MTTR was three and a half hours.
Was that good? What if our baseline was an hour? What if our baseline was eight hours? That's really what you wanna focus in on. Maybe you can get to thirty minutes. Where in our process, in our recovery, could we have zeroed in to make that faster?
And then sprint velocity. This is the great thing to take to your stakeholders, right? Because of this P1, we now have an impact on our overall roadmap. Our velocity dropped. We didn't ship the work we wanted, because we spent almost half a day fixing a bug that should never have made it through our testing process to production.
Our planning accuracy took a hit, right, because now we're not able to complete that work.
Team morale is impacted, right? Who wants to sit there banging their head against a wall, troubleshooting something, with all that stress and people yelling at you? They want to build features, right? They don't want to sit there and debug.
And then our roadmap health check. Everything starts to shift right, instead of what we want to do: shift left and actually get into production faster.
So, when you leave here, what's next? What do you do on Monday?
Integrate your pipeline with issue tracker. That's easy.
Measure your metrics. Start with that baseline.
Pick one metric that you want to track and follow up on.
That's up to you. That's up to your teams, right? The best thing about Agile is we're self-organizing teams. It's up to your pod, plus some direction from your directors and CTO and VP of engineering, or whoever you report to, right, to inform that. But as a team, you have the leeway to say, yeah, we're gonna focus on MTTR, and see how it goes in our retros.
Leverage tools like Gearset, and then, in your retro, talk about these things. Get them out in the open. Like I said, every two weeks I sit with all of my peers to talk about these things. Where do we benchmark against other pods? Do that within your own pod too. Let them know what MTTR is. Let them know what CFR is, and drive that adoption.
With that being said, I am right on your time. That's a link QR code to my LinkedIn. I do business memes and grumpy old man stuff. With that being said, what questions do you have? There's going to be at least three of them. I know it. No?
Come on. There we go.
Oh, for the logs?
Or the dashboards?
Yeah. So, I use Datadog, and LinearB for the dashboarding, and then I use Nebula Logger for all my logs. It's an open source package developed by Jonathan Gillespie, who's one of the engineers at Salesforce. I highly recommend you check it out.
Ian: what about when people say to you, "we're using Agile, so I don't need to do any upfront planning"?
I laugh.
Agile doesn't mean that you're not planning, right? It doesn't mean that you're not gathering requirements, and that's what we lean into.
Hey, can you add that field?
Cool. Let's go into our refinement session. Let's make sure we have all the requirements and then we'll do the work. I don't pull it into that sprint directly, right?
Because, as you know, there are always hidden requirements to something. And so we want to, as a team, right, in that refinement session, sit there and question each other and figure out: do we have those requirements, and the right level of effort? But we rarely ship anything without an arch spec.
Like, if we're gonna do a full feature, we have a complete architecture spec that we've designed for those impacts and the change management. But yeah, just because you're agile doesn't mean you don't plan, and it doesn't mean you don't do feature analysis.
Yeah.
Yeah. Just like Agile, just like DevOps, right? It's been a process.
We used to have an OKR of how many P1s we had this quarter, right? And that went to the board. We realized that was unfair, right? Because we saw this game happen, right?
Where it's like, oh, I'm not gonna declare a P1 because that means my metric's gonna go up. And so, we saw a little bit of people trying to hide things, right? Once we said, okay, that's not what we wanna focus on, we wanna focus on quality,
we started down this route, and it's been a process, right? We started at the engineering director and CTO level and came up with those metrics. Then it was slowly talking to our engineering managers, right? Then our ICs, because the feedback we heard, right, is: I'm too busy.
I don't have time to look at this. It's busy work. Like, why do I care? Once we started the conversation of, this "busy work" actually reduces the amount of stress you have, that's when it started to sink in.
We actually then gamified it across pods: hey, why are you so high? Why are you so low? Kind of evening it out. And we use it in our review process to kinda calibrate, right?
It doesn't drive you to a five or a one on its own, but it's one of those things of, like: hey, your pod is having some serious quality issues, what's up, right? Oh, we're in a legacy code base.
Everyone who was doing this is no longer here. We had to figure it out. Oops, we missed something, right? So, it's not an overnight thing, and we're still iterating on it, right?
Like, you see our metrics, they're not elite, right? And so, we are striving every single day and every single sprint and every single retro to get there. And ultimately what it's leading us to is thinking, okay, maybe we need to do something different, right? Maybe we don't wanna work on our legacy code base.
We have to re platform. And so, long story short, right, we're still going through the process and we'll never stop.
That was exactly three questions. Can we go one over?
Yeah. So, I'm not a current Gearset customer, but back when I was at CareRev, that's one thing we pulled into our data mart, our warehouse, and we were able to use our pipeline data in the warehouse and start to report on it.
The best thing about LinearB is you don't need those tools, because you're using Jira and you're using source control, it automatically pulls it together. But I would say Gearset's API is fantastic.
If you don't have one of these tools, the API gives you those metrics for free, assuming you can code.
It depends on what metric you're looking for, right? If it's velocity, stay in Jira, right, or Azure DevOps, and look at it there. If you're looking for MTTR and CFR rates, I would look at the Gearset API and start to pull that in. So, it's really: which metric do you wanna start pulling levers on, and which ones matter to you the most. Cool.
With that being said, I know I'm over time; it's a cascading effect. But I appreciate your time. I'll be around if you feel like asking me questions. Thanks.