The Case for the Diagnostics Team

I recently watched a lecture by Kevin Hale, who co-founded a startup named WuFoo back in 2006, grew it over five years to millions of customers, and sold it to SurveyMonkey for $35M. He subsequently became a partner at Y Combinator for several years. The lecture was about making products people love, and one of the points he made was about WuFoo’s obsession with its customers:

  1. Each team member had a turn in the customer support rotation
  2. Their response time to customer support issues was a few minutes during the day, and a little longer at night
  3. They hand-wrote personalized thank you cards to random customers weekly
  4. Even though their business (form creation) was dry, the website was designed to be fun and warm, not business-y

It’s a great 45-minute video, and absolutely worth watching — it’s embedded down at the end. But what really drew my attention was that first point above, about everyone doing a customer support rotation. And that’s because at Voalte, which also had a customer obsession, we took a similar approach that we called The Diagnostics Team.

Voalte mobile app
Voalte is a communication platform for hospital staff

The team was like the cast of House: expert detectives in their domain who could tackle the hairiest problems, sometimes getting that “eureka!” moment from the unlikeliest of events. I/O throughput was our lupus.

The mission was a take on the support rotation, but with some twists:

  1. The team handled “Tier 4” support issues: the kind of stuff where a developer with source code knowledge was needed because the previous three tiers couldn’t figure out the issue.
  2. It was cross-functional, so that each codebase (Erlang backend, iOS, Android, JavaScript) was represented on the team
  3. The rotation was 6 months
  4. The team priorities were:
    1. Any urgent issues
    2. Code reviews, with a support and maintainability point of view
    3. Any customer-reported bugs
    4. Proactive log analysis, to find bugs before they’re noticed in the field (a sketch of this idea follows the list)
    5. Trivial but noticeable bugs that would never get prioritized by the product teams
  5. Team members nominally did at least one customer visit during that 6-month rotation
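To make that fourth priority a bit more concrete, here’s a minimal sketch of what proactive log analysis can look like: scan an application log for error-level lines, collapse them into rough signatures, and flag signatures that recur often enough to be worth investigating. The file name, log format, and threshold below are illustrative assumptions, not a description of Voalte’s actual tooling.

```python
# Illustrative sketch of proactive log analysis (hypothetical log file and format).
# Idea: surface recurring error signatures before a customer reports them.
import re
from collections import Counter
from pathlib import Path

LOG_PATH = Path("app.log")            # hypothetical log file
ERROR_LEVEL = re.compile(r"\b(ERROR|CRITICAL)\b")
NOISE = re.compile(r"\d+")            # mask numbers so similar lines group together
THRESHOLD = 5                         # flag signatures seen at least this many times

def signature(line: str) -> str:
    """Collapse a log line into a rough signature by masking variable parts."""
    return NOISE.sub("<n>", line.strip())

def scan(path: Path) -> Counter:
    """Count error-level lines by signature."""
    counts: Counter = Counter()
    with path.open(errors="replace") as f:
        for line in f:
            if ERROR_LEVEL.search(line):
                counts[signature(line)] += 1
    return counts

if __name__ == "__main__":
    if not LOG_PATH.exists():
        raise SystemExit(f"no log file at {LOG_PATH}")
    for sig, count in scan(LOG_PATH).most_common():
        if count >= THRESHOLD:
            print(f"{count:5d}  {sig}")
```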

The model worked really well, and I think the team is still around, two acquisitions later, at Baxter. It wasn’t perfect (we never got that good at proactive log analysis while I was there, and customer visits ebbed and flowed depending on priorities and budgets) but overall, we hit the goals. “And what were those goals?”, you say. I’m glad you asked!

Cast of House, season 1

Remove uncertainty from the product roadmap

This was the main reason I pitched the idea of a Diagnostics team. After our initial release of Voalte Platform, we were constantly getting team members pulled off of product roadmap work to take a look at some urgent issue that a high-profile customer was complaining about. And you could never tell how long they’d be gone: a day? a week? 3 weeks? How long does it take to find your keys? And if we had a couple of these going on at the same time, it would derail an entire release train.

The thinking was that a dedicated team to handle those issues, while costly, would probably cost less than the revenue lost to release delays, while also saving us money in the long run by preventing urgent issues in the first place.

And it worked: our releases became a lot more predictable. Not perfect of course, but a big improvement.

Keep a focus on customer needs and pain-points

Our customers were hospitals, and we wanted to make sure things worked well in our app, because lives were literally on the line. Having a team that was plugged in to the voice of the customer meant that fewer complaints fell through the cracks of prioritization exercises. And while the Diagnostics team generally didn’t build features, once in a while they did, if the feature fixed a big pain-point.

The focus on Tier-4 support, though, is one major way this differed from WuFoo’s model: the team wasn’t exposed nearly as much to the Tier-1 issues that the frontline customer support people knew well. When developers hear about a frustrating bug for the 4th time, they tend to just go ahead and fix it. But if they’re only exposed to that bug via a monthly report, it won’t frustrate them as much.

Our ideal here, though, was to crush the big rocks, improve our operational excellence so that no more big rocks would form, and then let the team focus on the pebbles. We had varying success with this, depending on the codebase.

The other prong was customer visits. Each developer would pick a hospital and arrange a ~2 day visit. The hospital would generally assign them a buddy, and they would get the ground truth both from that buddy and by walking around to as many nurses’ stations as possible and asking them about the app.

Most of the time, they wouldn’t have anything to say. When they did, most of the time it was some known problem. But maybe 10% of the time, it would be revelatory: some weird issue because they tapped a combination of buttons we’d never thought of, or used a feature in a completely different way than we intended. And we’d write debriefs of the visit after the fact to share with the team.

No matter what was learned on the trip though, the engineers came back with a renewed sense of purpose and empathy for the customer, not to mention a much better understanding of how hospital staff work and use the product.

The House version of customer visits was rotations in the free clinic.
Great supercut on how not to act on your customer visits.

Improve the quality of the codebase over time

One of the things we were worried about in creating this team was that it would disconnect the developers on the product teams from the consequences of their actions. They’d release all kinds of bugs into the field, never be responsible for fixing them, and so never improve. This was part of the reason we wanted Diagnostics to be a rotation. (Though it ended up mostly not being a rotation, but more on that later.)

Our main tactic to prevent this problem was to make the Diagnostics team a specific and prominent part of the code review process. Part of the team’s remit was to review every PR for the codebase they worked in and look for any potential pitfalls around quality and maintainability. Yes, those are already supposed to be facets of every code review, but:

  1. The Diagnostician would have a better sense of what doesn’t work, and
  2. They would have more of a stake in preventing problematic code from seeing the light of day

Build expertise around quality and maintainability

To our great surprise, at the end of the team’s very first 6-month rotation, half of the members wanted to stay on indefinitely. They found the detective work not only interesting, but also varied in its breadth and depth, and fulfilling in a way that feature work just isn’t.

We debated whether to allow long-term membership on the team, because we did want to expose all of the team members to this kind of work. But ultimately, we decided that the expertise these veterans would build would be more valuable to the effort, especially when combined with them sharing that expertise through code reviews and other avenues.

Over the years, they got exposed to more and more issues reported by customers (which are the ones that matter most), and they developed an intuition about what bothers customers most and what kinds of mistakes cause those issues. They also developed a sense of which programming patterns cause the Diagnosticians themselves problems, both in terms of monitoring and observability (can they easily diagnose issues?) and in terms of refactoring code to fix problems, as well as what characteristics problematic components have in common.

That’s the kind of insight from which arises the most valuable part of the return on investment: preventing painful tech debt and convoluted bugs from ever getting shipped. It more than makes up for the cost of the team.