If it isn't visible, it's probably broken
In my career building products at startups and in big tech, I've noticed a pattern: anything that isn't really visible yet is probably broken in some way.
"Broken" can be anything from a glitchy UI, to a bug, to a major data pipeline creating rubbish, to you losing all your data or losing $ 440 million. The point is, that if nobody is looking at it, it decays or was never working to begin with.
When something is visible, issues are found quickly, because there is someone or something that knows it's supposed to work and verifies that it stays that way:
- Users use your product and notice regressions because they know how it's supposed to behave.
- Integration tests expect your app to behave in a certain way and fail when it stops doing so.
- Your QA team checks features before they are launched.
- Your on-call is pinged at 3:14 AM by a threshold alert and then checks what is happening.
Put like that, why wouldn't you make everything visible?
Because visibility isn't free.
- Letting your users be the integration tests is just wrong on so many levels.
- Integration tests are non-trivial to write and maintain.
- QA time is expensive and limited.
- Your on-call will reflect on their life decisions the more often they get pinged while sleeping.
So you have to decide where to pay for visibility, and how much. As with most things, the first step is simply being aware of how visible something currently is.
The concept of visibility is already well established in infrastructure under the name observability, but my thinking here came more from product work than from SRE blogs. Similar to how every animal's final evolution is a crab, making things more visible (in whatever way) is the logical conclusion of many disciplines.
The visibility spectrum
I've started to think in terms of three axes for visibility. It doesn't matter if the "thing" is a feature, a data pipeline, or an internal process - you can ask the same questions:
- Who can spot issues - and who can actually debug them?
- How much effort does it take to verify?
- How often is it actually verified?
You can imagine a feature sitting somewhere on each of those axes. The more things cluster on the "only one dev knows how to check this, slowly, by running custom code, and nobody ever does" side, the more you want to turn in your resignation.
Let's go through these.
1. Who can spot issues (and who can investigate them)?
This is the most intuitive dimension: who could notice that this thing is broken? Just the original developer? Any teammate? The end user?
Most bugs are not found by the person who wrote the code; they are found by the user. And the user can be a lot of things:
- The users of your product
- The CEO looking at an automatically created business report
- The new hire who notices that links in the documentation are dead
There's also an important distinction between:
- Spotting that something looks off ("this spike is weird"), and
- Verifying whether it's actually wrong ("yes, this is not valid").
You want both, but they require different skills, different levels of access, and different knowledge. To improve this axis, start by increasing the number of people who can spot issues. That means making things accessible:
- If I can't see the data, I cannot spot issues in it.
- If the process is not documented anywhere, I cannot even think about it.
- If the feature cannot be toggled via a feature flag, I cannot try it myself.
- If the business report cannot be exported to Excel, I cannot run my own calculations.
Even a very crude UI or CSV export goes a long way compared to "only accessible via SQL on the production database". I've been surprised many times by how much time is saved simply by giving other people the ability to see and poke at things.
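To make this concrete, here is a minimal sketch of what the crudest form of access can look like: a read-only CSV endpoint in an internal app. It uses TypeScript with Express and node-postgres purely as an illustration; the route, table, and column names are all made up.

```typescript
// Hypothetical sketch: the cheapest "let people see the data" endpoint.
// Route, table, and column names are invented for illustration.
import express from "express";
import { Pool } from "pg";

const app = express();
const db = new Pool(); // connection settings come from the environment

app.get("/internal/report.csv", async (_req, res) => {
  const { rows } = await db.query(
    "SELECT id, customer_email, amount, created_at FROM report_rows ORDER BY created_at DESC LIMIT 1000"
  );

  // A naive CSV dump is enough for someone to open the data in a spreadsheet
  // and start poking at it - no SQL access to production required.
  const header = "id,customer_email,amount,created_at";
  const lines = rows.map((r) =>
    [r.id, r.customer_email, r.amount, r.created_at.toISOString()].join(",")
  );

  res.type("text/csv").send([header, ...lines].join("\n"));
});

app.listen(3000);
```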
As mentioned, spotting an issue is different from being able to verify it. Just because someone thinks a number looks sketchy in a report doesn't mean they can determine why it is the way it is.
If you want to give your customers a way to connect their bank accounts, you have to rely on a service like Plaid or GoCardless to do so, which also means that every time something goes wrong, it could be any of these things:
- The bank (outage, random error)
- The open banking provider (implementation bug, API change)
- Us (a bug in our integration)
- The customer (wrong credentials, wrong bank selected, etc.)
Initially, when our support asked "what went wrong for this user?", I had to go dig through logs. Finding the right entries took time and was annoying.
So I started saving all relevant connection attempts into a table (bank_account_connection), which we already had in place to handle webhooks anyway. Now I just had to run a simple SQL query to see all attempts and their status.
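For illustration, that query could look roughly like the sketch below (TypeScript with node-postgres; only the bank_account_connection table name comes from the text, the columns are made up):

```typescript
// Hypothetical sketch of the "show me this user's connection attempts" query.
// Only the table name is from the post; columns are illustrative.
import { Pool } from "pg";

const db = new Pool();

export async function connectionAttemptsFor(customerId: string) {
  const { rows } = await db.query(
    `SELECT created_at, bank_name, provider, status, error_code
       FROM bank_account_connection
      WHERE customer_id = $1
      ORDER BY created_at DESC
      LIMIT 50`,
    [customerId]
  );
  return rows; // one row per attempt, newest first
}
```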
Then I added a very simple table view for this to our internal operations app. I actually only did this to make my own life easier, but in the end, I didn't have to look at this stuff at all anymore:
- Non-technical team members could see these attempts.
- After I explained a few error codes, our support team could interpret them on their own.
- Every customer request of this kind since then has been handled by support alone.
Team members could spot issues before as well - a customer would reach out - but they didn't have enough access to verify anything on their own. Given even the crudest debug view, they could debug these cases themselves. This is in line with the famous saying:
"Take over a support ticket for someone and you helped them for a day; give the person access to a debug table, and they can answer support tickets for their lifetime."
2. How much effort does it take to verify?
Even if people can see an issue and are allowed to investigate, there's still a question: how painful is it to actually verify that something works? The more painful it is, the less often you'll do it.
For me, "effort to verify" usually comes from four things:
- How long it takes
- Ease of access
- Representation
- Required knowledge
Let's look at each one.
2.1 How long does it take to verify?
When I was working at YouTube, there was often a rush on Fridays to get experiments launched before the weekend. Not because enabling experiments on a Friday is fun, but because:
- If you got the experiment out on Friday,
- You could look at the numbers on Monday,
- And decide quickly whether to ramp up, roll back, or iterate.
The weekend basically worked like a time-skip cheat for verification. If you only managed to launch on Monday, you'd have to wait until Wednesday to get two full days of data.
The same idea shows up everywhere:
- Running your test suite before you go to lunch.
- Running multiple agents or jobs in parallel to shorten feedback loops.
- Using a small sample dataset locally before touching the full one in CI.
If checking whether something works takes weeks or months, it effectively never gets checked unless you have very disciplined automation.
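The sample-dataset idea from the list above is cheap to wire up: an environment variable can decide which fixture the tests load. A minimal sketch, with made-up file names and a made-up FULL_DATASET flag:

```typescript
// Sketch: run against a small fixture locally, the full dataset only in CI.
// File names and the FULL_DATASET flag are made up for illustration.
import { readFileSync } from "node:fs";

export function loadTransactions(): unknown[] {
  const path = process.env.FULL_DATASET
    ? "fixtures/transactions_full.json" // slow and exhaustive: CI only
    : "fixtures/transactions_sample.json"; // small enough to run on every save
  return JSON.parse(readFileSync(path, "utf8"));
}
```

The tests stay identical either way; only the feedback time differs.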
Pre-commit hooks let you run checks whenever you commit code. The pitch is great: never forget to run formatting, type checks, unit tests, and so on - simply put these checks into the pre-commit hook.
But if you have ever worked at a place where the pre-commit hook kept getting slower, you are also familiar with the --no-verify flag that skips it. Because you don't want to wait that long, you start running the checks less and less often.
We faced this exact issue: more and more team members started skipping the checks, me included. So we simply removed the biggest offenders from the hook, since we ran them in CI anyway. One of them was the code formatter - which in theory shouldn't break anything, as it also runs whenever you save.
This definitely helped... but it didn't actually get PRs merged faster, because more often than not the formatting was still broken. Claude Code or Codex skip your on-save hook and aren't the most reliable at following instructions, so the formatting step was often missed.
The formatter we used was an improved version of the language default, but it took much longer to run and there was no option to make it faster.
The solution was to drop the fancy but slow formatter and go back to the fast default one. We could run it in the pre-commit hook again without noticeable lag. The trade-off for better formatting wasn't worth it for us - or in general: bias towards feedback time.
The anecdote shows that how long something takes directly correlates with how often it gets done. Developers already know how annoying long-running tests or compilers are and strive to build faster tools (thanks to everyone rewriting slow tools in Go, Rust, or Zig). I still think decreasing the time-to-feedback for anything is underrated. If your whole test suite ran in a second instead of half an hour, you (and your AI agents) would be able to develop very differently.
2.2 Ease of access
The easier it is to access the feature, data or process, the more likely you are to actually verify it. If you have to:
- Write custom SQL to read a handful of rows,
- Download and manually grep logs from a particular day,
- Or sign in with a special account using an unusual 2-factor setup,
you'll simply do it less often. It might be "possible", but it's not easy. The nice thing: improvements in "who can access this?" also usually improve ease of access in general.
This obviously correlates with how long things take, but it's mainly about friction: even small annoyances compound until you stop doing the check at all.
YouTube is one of the biggest apps ever; as such, it runs on every device that can theoretically run it. To make sure we didn't break anything when changing the mobile apps, we had plenty of test devices of different form factors and types lying around (e.g. an Android tablet, an older iPhone).
Testing on the devices was easy if your feature was already launched - simply sign in with the pre-configured test accounts and check it out. But that's already too late if you want to be careful. The issue was that the test accounts on these devices didn't allow manually overriding feature flags. You could do it with your own corporate account, but that meant:
- Signing in with your account on every test device you want to test on
- Doing the security challenges
- Flipping the flag and testing the new feature
- Cleaning up afterwards, as you don't want anyone else to have access to your account
Only the third step should actually be necessary. I was a bit confused why this was so annoying and why nobody had fixed it yet. Reading the docs, it became clear that a solution actually existed, but only in the main YouTube office in San Bruno: they had a custom WiFi setup that allowed setting feature flags on test accounts.
As we were a sizeable YouTube operation in Zurich back then, I was able to get our own custom WiFi setup, making testing on devices much easier.
This is the part where I should mention how this transformed our device testing, but sadly Covid hit and we were working from home. Since I then left to join re:cap, I never got to see the full glory of the testing WiFi.
The anecdote should show that it was mostly an annoying access problem that made you not want to test quickly on a device. The steps weren't difficult, nor did they take that long. But they were annoying enough that you really didn't want to do them often.
2.3 Representation
Representation matters a lot. The more data points you have, the more important it becomes: 1000 rows of data are not intuitive; a bar chart is.
I once had to make sure a critical piece of money-handling code was tested properly so its business logic could be changed safely. We would "buy" our customers' contracts to pay them their worth upfront with a discount (factoring). They then had to pay us back over the next months. There were multiple tables involved:
- The payout to the customer
- Monthly payback schedules
- Underlying contracts
- Invoices belonging to those contracts
- Future expected invoices
Every month, the data had to be updated with the current status:
- Did some contracts churn?
- Did the invoices we expected actually get paid?
- Do we need to replace a contract or move expected cash flows to a later month?
In short: a lot of data that changed in non-trivial ways.
I created an integration test that snapshotted these tables at various important stages so we could see how any code change affected the structures. On paper, this made things "visible".
In practice, when I changed the underlying code, the snapshot diffs were huge and noisy. I could see that something changed, but not whether it changed in the right way.
The solution wasn't to become a human diff engine. It was to make the data readable. I created a custom aggregated structure that summarized the important aspects instead:
```json
{
  "financeableContracts": 62,
  "financedContracts": 0,
  "activeContracts": 62,
  "rebatedContracts": 0,
  "replacedContracts": 0,
  "activeInvoices": 1008,
  "paidInvoices": 0,
  "residualInvoices": 0,
  "payoutAmount": "428355",
  "paybackAmount": "450900",
  "financingFee": "0.05",
  "missingPaybackAmount": "0",
  "collectedPaybackAmount": "0",
  "remainingPaybackAmount": "450900",
  "payoutStatus": "requested",
  "monthlyPaybackStats": [
    {
      "paybackAmount": "68610",
      "paidAmount": "0",
      "status": "active",
      "invoices": 160
    }
  ]
}
```
Now I could quickly sanity-check:
- Do the totals still make sense?
- Are there unexpected churned or replaced contracts?
- Are invoices missing or mis-classified?
Any change in the underlying code showed up as a simple, readable diff in this structure. The invisibility problem wasn't "lack of tests"; it was a data representation that no human could parse.
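As a rough sketch of the pattern (not the real code): aggregate first, then snapshot the aggregate. This assumes a Jest-style test runner; the types, fields, and sample data below are simplified stand-ins for the real tables.

```typescript
// Sketch of "snapshot a human-readable summary instead of raw rows".
// Jest-style test; types and sample data are simplified stand-ins.
import { expect, test } from "@jest/globals";

interface Contract { status: "active" | "churned" | "replaced" }
interface Invoice { status: "active" | "paid" | "residual"; amount: number }

function summarize(contracts: Contract[], invoices: Invoice[]) {
  return {
    activeContracts: contracts.filter((c) => c.status === "active").length,
    replacedContracts: contracts.filter((c) => c.status === "replaced").length,
    activeInvoices: invoices.filter((i) => i.status === "active").length,
    paidInvoices: invoices.filter((i) => i.status === "paid").length,
    paybackAmount: invoices.reduce((sum, i) => sum + i.amount, 0).toString(),
  };
}

test("financing state stays readable in snapshots", () => {
  const contracts: Contract[] = [{ status: "active" }, { status: "replaced" }];
  const invoices: Invoice[] = [
    { status: "active", amount: 100 },
    { status: "paid", amount: 50 },
  ];
  // Any behavioural change now shows up as a small, reviewable diff.
  expect(summarize(contracts, invoices)).toMatchSnapshot();
});
```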
The anecdote shows that if the representation is lacking, visibility can completely tank, even if every other dimension is fulfilled. Ask yourself if a different representation would make your life easier.
2.4 Knowledge
Legacy code can be understood as code that nobody on the team has a mental model of anymore. This ties directly into how hard it is to verify something. If you set up the business process, you know the idea behind it and whether it still makes sense. If you wrote that weird part of the code with the cryptic comments, you have a better chance of understanding your past self.
If there's a playbook for recurring issues, more people can help.
I won't tell you "just write documentation" - docs have their own problems and are not a silver bullet. But you should keep a paper trail:
- Pull requests should have a description and link to a ticket.
- Commits should have real messages.
- Somewhere, you should write down why you did something a certain way.
The easier it is to find that context, the better, but even a small breadcrumb helps. You will forget your own reasoning, and it's a humbling feeling to stare at code you wrote two years ago and think "why on earth did I do this?".
At Google, they're very good at this via tooling. They don't have much traditional documentation, but they have excellent code history tools and strong habits around leaving traces in code reviews and commits. You quickly learn to navigate through the history of a file and understand why something looks the way it does. That's also a form of visibility.
3. How often is it actually verified?
Even if something is easy and quick in theory, the important bit is: how often does anyone actually do it?
Some examples:
- If my test suite takes 1 second, I'll run it on every save.
- If it takes 5 minutes, I'll probably rely on CI.
- If it takes hours, maybe I'll run it once a day, or just before releases.
The only advice here is to automate whatever you can and put every check you can into CI. Put required password and token rotations into your calendar. The bare minimum is a recurring reminder of some sort.
Making things visible on purpose
So what do you do with all of this? When you work on a feature, piece of data, or internal process, ask yourself:
- Who can tell if this is broken?
- Only me?
- Any team member?
- The end user?
- How much effort does it take to verify it, without talking to the original author?
- Is there a debug view?
- Is there a clear representation (aggregate, graph, table)?
- Is there a quick path to the relevant data and history?
- How often does this actually get exercised?
- Tests on every commit?
- Dashboards someone looks at weekly?
- A manual run once per quarter?
- Never?
Often, problems you've been fighting for months ("that report that is always wrong", "that feature that breaks every second release") are just symptoms of low visibility:
- Nobody can easily see when it drifts.
- Or the only person who can isn't looking anymore.
Making something visible doesn't magically fix it, but it changes the odds:
- It gives people a chance to notice.
- It gives them a representation that matches their mental model.
- And it makes deletion an explicit option when nobody looks at it at all.
If it isn't visible, assume it's broken - and then decide whether it's worth making visible or worth deleting. Both are better than pretending it's fine in the dark.