Thursday, April 18, 2013

Architecture-Breaking Bugs – when a Dreamliner becomes a Nightmare

The history of computer systems is also the history of bugs, including epic, disastrous bugs that have caused millions of $ in damage and destruction and even death, as well as many other less spectacular but expensive system and project failures. Some of these appear to be small and stupid mistakes, like the infamous Ariane 5 rocket crash, caused by a one-line programming error. But a one-line programming error, or any other isolated mistake or failure, cannot cause serious damage to a large system, without fundamental failures in architecture and design, and failures in management.

Boeing's 787 Dreamliner Going Nowhere
The Economist, Feb 26 2013

These kinds of problems are what Barry Boehm calls “Architecture Breakers”: where a system’s design doesn't hold up in the real world, when you run face-first into a fundamental weakness or a hard limit on what is possible with the approach that you took or the technology platform that you selected.

Architecture Breakers happen at the edges – or beyond the edges – of the design, off of the normal, nominal, happy paths. The system works, except for a “one in a million” exceptional error, which nobody takes seriously until a “once in a million” problem starts happening every few days. Or the system crumples under an unexpected surge in demand, demand that isn't going to go away unless you can’t find a way to quickly scale the system to keep up – and if you can’t, you won’t have a demand problem any more because those customers won’t be coming back. Or what looks like a minor operational problem turns out to be the first sign of a fundamental reliability or safety problem in the system.

Dreamliner is Troubled by Questions about Safety
NY Times, Jan 10, 2013

Finding Architecture Breakers

It starts off with a nasty bug or an isolated operational issue or a security incident. As you investigate and start to look deeper you find more cases, gaping holes in the design, hard limits to what the system can do, or failures that can’t be explained and can’t be stopped. The design starts to unravel as each problem opens up to another problem. Fixing it right is going to take time and money, maybe even going back to the drawing board and revisiting foundational architectural decisions and technology choices. What looked like a random failure or an ugly bug just turned into something much uglier, and much much more expensive.

Deepening Crisis for the Boeing 787
NY Times, Jan 17 2013

What makes these problems especially bad is that they are found late, way past design and even past acceptance testing, usually when the system is already in production and you have a lot of real customers using it to get real work done. This is when you can least afford to encounter a serious problem. When something does go wrong, it can be difficult to recognize how serious it is right away. It can take two or three or more failures before you realize – and accept – how bad things really are and before you see enough of a pattern to understand where the problem might be.

Boeing Batteries Said to Fail 10 Times Before Incident
Bloomberg, Jan 30 2013

By then you may be losing customers and losing money and you’re under extreme pressure to come up with a fix, and nobody wants to hear that you have to stop and go back and rewrite a piece of the system, or re-architect it and start again – or that you need more time to think and test and understand what’s wrong and what your options are before you can even tell them how long it might take and how much it could cost to fix things.

Regulators Around the Globe Ground Boeing 787s
NY Times, Jan 18 2013

What can Break your Architecture?

Most Architecture Breakers are fundamental problems in important non-functional aspects of a system:

  • Stability and data integrity: some piece of the system won’t stay up under load or fails intermittently after the system has been running for hours or days or weeks, or you lost critical customer data or you can’t recover and restore service fast enough after an operational failure.
  • Scalability and throughput: the platform (language or container or communications fabric or database – or all of them) are beautiful to work with, but can’t keep up as more customers come in, even if you throw more hardware at it. Ask Twitter about trying to scale-out Ruby or Facebook about scaling PHP or anyone who has ever tried to scale-out Oracle RAC.
  • Latency – requirements for real-time response-time/deadline satisfaction escalate, or you run into unacceptable jitter and variability (you chose Java as your run-time platform, what happens when GC kicks in?).
  • Security: you just got hacked and you find out that the one bug that an attacker exploited is only the first of hundreds or thousands of bugs that will need to be found and fixed, because your design or the language and the framework that you picked (or the way that you used it) is as full of security holes as Swiss cheese.
These problems can come from misunderstanding what an underlying platform technology or framework can actually do – what the design tolerances for that architecture or technology are. Or from completely missing, overlooking, ignoring or misunderstanding an important aspect of the design.

These aren’t problems that you can code your way out of, at least not easily. Sometimes the problem isn't in your code any ways: it’s in a third party platform technology that can’t keep up or won’t stay up. The language itself, or an important part of the stack like the container, database, or communications fabric, or whatever you are depending on for clustering and failover or to do some other magic. At high scale in the real world, almost any piece of software that somebody else wrote can and will fall short of what you really need, or what the vendor promised.

Boeing, 787 Battery Supplier at Odds over Fixes
Wall Street Journal, Feb 27 2013

You’ll have to spend time working with a vendor (or sometimes with more than one vendor) and help them understand your problem, and get them to agree that it’s really their problem, and that they have to fix it, and if they can’t fix it, or can’t fix it quickly enough, you’ll need to come up with a Plan B quickly, and hope that your new choice won’t run into other problems that may be just as bad or even worse.

How to Avoid Architecture Breakers

Architecture Breakers are caused by decisions that you made early and got wrong – or that you didn't make early enough, or didn't make at all. Boehm talks about Architecture Breakers as part of an argument against Simple Design – that many teams, especially Agile teams, spend too much time focused on the happy path, building new features to make the customer happy, and not enough time on upfront architecture and thinking about what could go wrong. But Architecture Breakers have been around a lot longer than Agile and simple design: in Making Software (Chapter 10 Architecting: How Much and When), Boehm goes back to the 1980s when he first recognized these kinds of problems, when Structured Programming and later Waterfall were the “right way” to do things.

Boehm’s solution is more and better architecture definition and technical risk management through Spiral software development: a lifecycle with architecture upfront to identify risk areas, which are then explored through iterative, risk-driven design, prototyping and development in multiple stages. Spiral development is like today’s iterative, incremental development methods on steroids, using risk-based architectural spikes, but with much longer iterative development and technical prototyping cycles, more formal risk management, more planning, more paperwork, and much higher costs.

Bugs like these can’t all be solved by spending more time on architecture and technical risk management upfront – whether through Spiral development or a beefed up, disciplined Agile development approach. More time spent upfront won`t help if you make naïve assumptions about scalability, responsiveness or reliability or security; or if you don’t understand these problems well enough to identify the risks. Architecture Breakers won’t be found in design reviews – because you won’t be looking for something that you don’t know could a problem – unless maybe you are running through structured failure modelling exercises like FMEA (Failure mode and effect analysis) or FMECA (Failure mode, effects and criticality analysis), which force you to ask hard questions, but which few people outside of regulated industries have even heard about.

And Architecture Breakers can't all be caught in testing, even extended longevity/soak testing and extensive fuzzing and simulated failures and fault injection and destructive testing and stress testing – even if all the bugs that are found this way are taken seriously (because these kinds of extreme tests are often considered unrealistic).

You have to be prepared to deal with Architecture Breakers. Anticipating problems and partitioning your architecture using something like the Stability Patterns in Michael Nygard’s excellent book Release It! will at least keep serious run-time errors from spreading and taking an entire system out (these strategies will also help with scaling and with containing security attacks). And if and when you do see a “once in a million” error in reviews or testing or production, understand how serious it can be, and act right away – before a Dreamliner turns into a nightmare.

No comments:

Site Meter