You Don’t Have Netflix’s Problem

Great engineers mitigate risk. They don’t chase it.

A Skara Brae Systems field paper · by David Green

At some point, we couldn’t build anymore.

Not in any real sense. To start the application for a smoke test, you had to spin up somewhere between twelve and twenty-three separate services. By the end there were so many of them that a full run would not fit in a development machine’s memory. So we stopped trying. We ran the system in pieces, one section at a time, each of us keeping a private map of which services had to be alive and which we could leave dark to free up the RAM. We were a team of senior and principal engineers. And we could no longer turn on the thing we were building.

It was an inventory system. I want to be clear about that, because everything that went wrong follows from it. We were tracking the health of a fleet of utility meters. Each meter sent a small packet every eight hours: an ID, an up-or-down flag, sometimes a word about its battery. When a meter went quiet, we checked it once an hour until it reported back. A handful of employees at the utility watched the results. That was the entire job. A trickle of tiny status messages, and a few people reading them.
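
To make the size of that job concrete, here is a minimal sketch of the whole domain in Python. It is an illustration, not the code we shipped. The names `StatusPacket`, `is_quiet`, and `next_check` are hypothetical, and the rule that a meter counts as quiet once a full eight hours pass without a packet is an assumption; only the eight-hour and one-hour intervals come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

REPORT_INTERVAL = timedelta(hours=8)   # from the text: one packet every eight hours
RECHECK_INTERVAL = timedelta(hours=1)  # from the text: hourly checks once a meter goes quiet


@dataclass
class StatusPacket:
    """One status message from a meter (field names are hypothetical)."""
    meter_id: str
    is_up: bool
    battery_note: Optional[str] = None  # only sometimes present


def is_quiet(last_seen: datetime, now: datetime) -> bool:
    """Assumption: a meter is quiet once a full report interval passes with no packet."""
    return now - last_seen > REPORT_INTERVAL


def next_check(last_seen: datetime, now: datetime) -> datetime:
    """Return when this meter should next be looked at."""
    if is_quiet(last_seen, now):
        # Quiet meter: check again in an hour.
        return now + RECHECK_INTERVAL
    # Healthy meter: wait for its next scheduled report.
    return last_seen + REPORT_INTERVAL
```

One small record type and two short functions cover everything the paragraph describes, which gives a sense of how little the problem asked for.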

We had built it on microservices, running on Azure Service Fabric, which was Microsoft’s newest and most fashionable platform at the time.

We had built Netflix to run a spreadsheet.

How we got there is the more interesting story. And it started, the way these things often do, with a book.

Our boss had just read The Phoenix Project.

If you have not had the pleasure, it is a novel about a DevOps transformation, and it is genuinely good, and it has converted more engineering organizations than most religions convert souls. He read it, and then he handed it to the rest of us, and reading it stopped being optional. We were going to do DevOps. We were going to do microservices. The decision arrived fully formed, and the project’s actual requirements were invited, afterward, to fit it.

Almost overnight, the software engineers were renamed “DevOps engineers.” Same people, same desks, new title and new scripture. We were converts, whether we had asked to be or not.

Nobody in that room stopped to ask the only question that mattered, which was whether a meter-monitoring tool used by a dozen people had any of the problems that DevOps and microservices were invented to solve. The answer, had anyone asked, was no. But you do not ask that question once the architecture has become an identity. By then it feels like heresy.

I want to be fair to the idea. Microservices were not a mistake when Netflix and Amazon built them. Those companies had thousands of engineers who could not deploy without colliding, and billions of requests that no single machine could serve. For them the architecture was a cure for a disease they actually had. The mistake was ours. We saw the cure, admired it, and swallowed it without ever checking whether we were sick.

The bill came due immediately, and it never stopped arriving.

The tooling was new, and new tooling is buggy, even when it comes from Microsoft. We spent our days discovering defects that had nothing to do with our software and everything to do with the platform underneath it. That is the tax on being early. You pay it in other people’s bugs, and you pay it first.

The model itself fought us. Service Fabric leaned on stateful services and an actor pattern, and most of us had spent our careers thinking in straightforward client-server terms. We had to rewire how we reasoned about the simplest operations. None of it was impossible. All of it was slow, in the particular way that unlearning is slower than learning.

And then there was the matter of not being able to run our own system, which is where this story started. When you cannot stand up the whole application on your own machine, you cannot really test it. We could not do a clean regression run. So we tested in fragments and hoped the seams held, which is a poor way to find out whether the seams hold.

The whole time, a meter was running, and not the kind we were monitoring. The contract carried liquidated damages: four thousand dollars for every day we delivered late. We delivered late. The architecture we had chosen to look modern was now costing the company four thousand dollars a day, in the most literal sense. And we were late largely because we spent our hours fighting the environment to verify a single bug fix, and fighting a whole new class of bugs that existed for one reason only: we had chosen an architecture capable of producing them.

I have thought a lot about what we should have built, because it is not a hard question, and that is the painful part. A plain client-server application on .NET and SQL Server would have done the job. The one genuinely novel thing the problem required was a bit of middleware to absorb the influx of pings, and that was a small, contained, well-understood piece of work. Everything else was ordinary. It would have been easy to unit test, easy to integration test, easy for QA to exercise and regress by hand. It would have shipped on time. There was no death march hiding in that version of the project. The death march was something we built ourselves, on purpose, with the best intentions.
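
For a sense of scale, the ingestion middleware could look something like the sketch below. It is written in Python for brevity rather than .NET, and it is an assumption about the shape of such a component, not a record of anything we built. `PingBuffer`, `accept`, `flush`, the `(meter_id, is_up)` tuple, and the default sizes are all hypothetical; `write_batch` stands in for whatever would insert rows into the database.

```python
import queue
from typing import Callable, List, Tuple

Ping = Tuple[str, bool]  # hypothetical shape: (meter_id, is_up)


class PingBuffer:
    """Absorbs bursts of pings in memory and hands them to storage in batches."""

    def __init__(
        self,
        write_batch: Callable[[List[Ping]], None],  # stand-in for a database insert
        batch_size: int = 500,      # illustrative default, not from the project
        capacity: int = 10_000,     # illustrative default, not from the project
    ) -> None:
        self._queue = queue.Queue(maxsize=capacity)
        self._write_batch = write_batch
        self._batch_size = batch_size

    def accept(self, ping: Ping) -> bool:
        """Called by the receiving endpoint. Returns False if the buffer is full."""
        try:
            self._queue.put_nowait(ping)
            return True
        except queue.Full:
            return False

    def flush(self) -> int:
        """Drain up to one batch and write it in a single call. Returns the count written."""
        batch: List[Ping] = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._write_batch(batch)
        return len(batch)
```

A timer or background worker would call `flush` on a schedule. The point of the sketch is only that a bounded queue in front of a batched write is a familiar pattern that is easy to test in isolation.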

Here is the part I have made my peace with.

We were not fools. The people on that team were some of the best engineers I have worked with, and I have watched the same thing happen on project after project since, with different people every bit as capable. Good engineers sink good projects this way all the time.

The pull is real, and it is structural. Novelty is interesting and maintenance is not. The newest stack is the one that earns the conference talk, fills the resume, and wins the respect of the other engineers in the room. Nobody was ever promoted for choosing boring. And when a book or a vendor or a fashionable idea shows up wearing the costume of strategy, it lends the whole thing a justification that feels like rigor and is really just appetite.

What goes missing in all of it is business sense. The quiet question, the one that never got asked in our room, is the only one that ever mattered: what does this particular problem actually need, and how much risk can this particular project afford? We were optimizing for the system we wanted to build instead of the system the client needed. That is not a technical error. It is a failure of judgment, and judgment is supposed to be the senior part of the job.

There is a myth that the best engineers, like the best entrepreneurs, are bold risk takers. It is backwards. The best operators I know are relentless risk mitigators. They do not run from risk; they price it, and they refuse to pay for the kind that buys them nothing. New technology is a bet. Sometimes the bet is worth making, when you genuinely have the problem the new thing solves. Most of the time it is uncompensated risk, where you take on the bugs and the learning curve and the operational weight and get back, in return, nothing the boring option would not have handed you for free. Choosing proven, boring technology is not timidity. On that project it would have been the bravest thing in the building.

Years later I found a clean way to say it, in Mark Pincus’s Life at the Speed of Play. He sorts product work into three buckets: Proven, Better, and New. The discipline is the order. You earn the right to New by first nailing Proven, and most of the value in anything lives in Proven and Better, with New reserved for the rare place that genuinely demands it. Our project needed New in exactly one spot, the ingestion middleware, and Proven everywhere else. We chose New everywhere, and we paid for New everywhere. Pincus’s larger point is that you are in business to make something people love, and it is worth remembering who loves what. Users love what works and what ships. No one has ever loved a microservices diagram. They never see it. They only ever feel its consequences, and ours arrived late.

This firm is named for a village in Orkney that has stood for five thousand years. Skara Brae was not built from the experimental material of its age. It was built from stone, the most proven thing its builders had, by people who understood exactly what they needed, which was shelter that would last. That is why we can still walk through it. The systems that last get built the same way, out of the right thing rather than the new thing. So before the diagram, before the vocabulary, before the new religion, it is worth asking the unglamorous question and sitting with the honest answer.

What does this problem actually need?

It is almost never the exciting thing. And you almost certainly do not have Netflix’s problem.

Skara Brae Systems modernizes legacy government systems while preserving the knowledge buried inside them. This is the second in a series of field papers on why modernization goes wrong, and what to do about it.