Everything is broken, and it’s okay (increment.com)
bezout 1144 days ago [-]
Just because a piece of software won’t kill people if it fails doesn’t mean we should drop the ball on trying to make it as reliable as possible. And I’m not talking about OSS. I’m talking about the software you pay for. You are giving away your money and you expect something as valuable in return.
CaptArmchair 1144 days ago [-]
From the article:

> Once you accept failure is inevitable and commit to avoiding the concatenation of failures that causes catastrophes, you—and your organization—need to decide what risk looks like and what to prioritize. If everything is the most important thing, then nothing is.

[...]

> The more complex the system, the more likely it is that some part of it is broken. As engineers, we’re at the apex of literally millions of hours of design and engineering time to help us, and our users, with everything from finding our phone to continuously integrating our code. We have to assume that some of that infrastructure, and some of our work, is broken.

> If we accept these imperfections, we can work toward building resilient systems that can handle a little static and deal with flawed foundations without falling over. We don’t stop working toward perfection just because it’s impossible.

The author's point isn't to throw up her arms in defeat. Her point is to "courageously change the things you can change, accept the things you can't change, and find the wisdom to know the difference between the two".

There will always be tension between striving for the ideal world where everything is always written to perfection using the latest innovative technology, and the real world where legacy code and technical debt are par for the course. It's all too easy to compare software on which lives depend (avionics, healthcare,...) with video games, short-lived marketing apps or business suites. "value for your money" means very different things to different people in different contexts.

mgkimsal 1143 days ago [-]
> courageously change the things you can change, accept the things you can't change, and find the wisdom to know the difference between the two

And trying to tell a client to 'accept the things they cannot change'... doesn't often go over too well. It's harder still in some cases because, technically, almost anything can be changed, but estimating the time/effort/cost is a difficult task in itself.

afarrell 1143 days ago [-]
> doesn't often go over too well

Beware the difference between being nice and being genuinely kind. The latter requires the courage to be willing to upset somebody and lose a lucrative contract.

CaptArmchair 1143 days ago [-]
It's true that anything can be changed. Improvement, however, is a form of value attribution. It's always in the eye of the beholder. It's totally valid to hold different ideas on how things ought to be improved.

It's important to acknowledge that and bring it to the table. Worse, compromising for the sole sake of signing a contract is a risky proposition. Projects fail because visions and ideas are sold without acknowledging the reality represented by the limitations of existing technological solutions or the infrastructure that needs to be maintained.

MattPalmer1086 1144 days ago [-]
I agree with your main point that the article is not suggesting giving up.

It does amuse me though that an ideal of perfection is having no legacy code, all written with the latest technologies.

I once spent some time refactoring some really old gnarly code to be beautiful and well designed. The rest of the team looked on in horror and asked why I was replacing highly reliable, battle tested code that had been in successful operation for nearly ten years!

mgkimsal 1143 days ago [-]
> The rest of the team looked on in horror and asked why I was replacing highly reliable, battle tested code that had been in successful operation for nearly ten years!

And they were probably correct in doing so. If people are expected to be making changes to it, but it's generally working well otherwise, adding explanatory docs, notes and tests around that code would likely be far more valuable than rewriting it. You have to understand the problem before rewriting anyway - writing up that documentation and supporting evidence is often a better goal than "rewrite" when a system is working pretty well in service.

I've also seen the "don't rewrite it!" on code that is breaking/wrong constantly - like... daily. People have adjusted habits around it, and making a change is seen as 'risky' but there's a lot of unacknowledged operational debt that can be cleaned up with updating the code (investigate, doc beforehand too).

throwaway346434 1143 days ago [-]
The tragedy of flawed paid software is that the people who pay and the people who use it are often different. Accepting this after seeing it is equivalent to saying "I am okay with trading misery for profit". That bit isn't a software problem, but it's not a good look!
namelosw 1144 days ago [-]
The customer of course desires the high-quality product they have paid for.

The problem is more often than not, the guy who pays developer salaries wants the developers to spend less time on polishing.

pif 1143 days ago [-]
> The customer of course desires the high-quality product they have paid for.

People pay for features first, and UX/UI second, if at all.

People pay for software that makes them money. When two competing products offer the same feature set, then and only then does UX/UI become important.

UX/UI is not for your customers: it's for attracting pigeons to your CRUD website in order to sell them to your advertising customers.

pif 1143 days ago [-]
In other words, UX/UI is for when you don't have anything else to offer.
snarfy 1144 days ago [-]
Very true, and the guy paying the salaries is wrong.

The customers want the polish without knowing that's what they want. Customers want good software without knowing how to quantify good in software terms.

marcus_holmes 1144 days ago [-]
The purpose of commercial code is to make money. If the cost of annoyed customers is less than the cost of polishing the software, then the guy paying the salaries is right.
snarfy 1143 days ago [-]
In the short term. At some point you annoy your customers so much they get a Mac and never buy your software again. Compared to PCs, Macs are highly polished and just work (even if that's not entirely true today). Apple's success is a testament to the PC's lack of polish, the point being polish does matter.
marcus_holmes 1143 days ago [-]
I don't think this is true. I don't think people buy Macs because "they just work" - as you admit this is no longer the case and yet people still buy them.

I run a Windows machine to play games on. I would dearly love to not run a Windows machine at all. But to play the games I want, I have to run a Windows machine. I'm not buying Windows because I love Windows and think it's amazing tech. I'm forced into buying Windows because the thing I want to do will only work on Windows.

I'm forced to consider buying a macbook again, despite hating how closed the OS is and how dependent it forces me to be on Apple Support's weird shitty lying statements. I have to consider this because I will probably need to develop an iPhone app in the near future and I can't do that on a non-Apple machine.

People buy tech to do things, not because they love the tech.

pcstl 1143 days ago [-]
I believe the author's point is that accepting that errors happen and that we need to deal with them - fail gracefully, as it were - produces more reliable systems than trying to obsessively eliminate every single point of failure.
MattPalmer1086 1144 days ago [-]
This article makes some very good points. I've been thinking about resilience for a while. Nothing is perfect, and if your system can only work with that assumption, you are bound for failure.

I always loved the Erlang philosophy of "let it crash". The idea that you can build a highly resilient system out of error-prone parts rocked my world when I first encountered it.

None of this is to say we should accept broken things or not try to make them better. We should instead accept some parts will always fail and design accordingly.
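Not Erlang, but the supervision idea is small enough to sketch in a few lines of Python (the names below are made up, purely for illustration): the worker is allowed to crash, and a tiny supervisor decides whether to restart it or give up, instead of the worker trying to anticipate every failure itself.

    import random

    def flaky_worker(job):
        # Hypothetical unit of work that fails randomly, simulating transient faults.
        if random.random() < 0.3:
            raise RuntimeError(f"transient fault while handling job {job}")
        return job * 2

    def supervise(jobs, max_restarts_per_job=3):
        # Let the worker crash; restart it instead of sprinkling defensive code everywhere.
        results = []
        for job in jobs:
            for attempt in range(1, max_restarts_per_job + 1):
                try:
                    results.append(flaky_worker(job))
                    break
                except Exception as exc:
                    print(f"worker crashed on job {job} (attempt {attempt}): {exc}")
            else:
                print(f"giving up on job {job} after {max_restarts_per_job} restarts")
        return results

    print(supervise(range(5)))

The point of the sketch is where the error handling lives: in the supervisor, not the worker.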

lmm 1144 days ago [-]
This is true of current systems, which tend to simply automate human behaviour with human failings. It doesn't have to be that way though. It's possible to produce 100% correct components, and a system made of that kind of component can be dramatically simpler to understand.
daliusd 1144 days ago [-]
Could you give an example of a component that is 100% correct? I feel that even components can be 100% correct only within a context, e.g. within an ideal automaton under expected limitations, and we don't have that.

While the idea of constructing everything from components often seems to work better than building monoliths (SpaceX vs NASA, Linux vs many other OSes), it doesn't always (e.g. SQLite).

lmm 1144 days ago [-]
> Could you give an example of a component that is 100% correct? I feel that even components can be 100% correct only within a context, e.g. within an ideal automaton under expected limitations, and we don't have that.

Verified* instances in safe Idris. Of course GIGO is still valid and failures in lower-level components (such as a processor just not behaving as specified) can cause failures in higher-level components, but it's simply impossible for there to be a failure in that component itself - any observed failure will necessarily be the result of a lower-level failure.
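As a rough illustration of what "verified" means here, sketched in Lean 4 rather than Idris (the names are made up): the function is ordinary code, and the theorem below it is a machine-checked proof that it satisfies a small formal spec. If the file compiles, that property cannot fail at runtime; any observed failure has to come from outside the component.

    def myReverse : List Nat → List Nat
      | []      => []
      | x :: xs => myReverse xs ++ [x]

    -- Spec: reversing a list never changes its length. The proof is checked
    -- by the compiler, so this component cannot violate the property.
    theorem myReverse_length (xs : List Nat) :
        (myReverse xs).length = xs.length := by
      induction xs with
      | nil => simp [myReverse]
      | cons x xs ih => simp [myReverse, ih]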

azornathogron 1144 days ago [-]
I think you are ignoring both failures to correctly specify what was intended (the design was right conceptually, but you specified it incorrectly when formalising it so now you have a perfectly verified incorrect implementation), and failures in the design (the thing works as you designed it, but it turns out your design isn't perfect).

More likely, you're not unaware of these possibilities, you're just using a meaning of "failure" that excludes them. But regardless of whether you call these failures or call them something else, they prevent you from guaranteeing perfection.

lmm 1144 days ago [-]
On the contrary, there is guaranteed perfection: I can guarantee that they perfectly implement the given formal specification. I think we're in agreement that one can't produce a guaranteed-perfect system (since the system will always have a part that must interact with humans), but I don't think that actually supports the approach of treating every component as fallible.
krageon 1144 days ago [-]
If the formal specification is wrong, the component is still broken in practice. Your distinction is convenient as a vehicle for feeling better about your code (and I would argue you should feel good about writing such code), but it doesn't really have any bearing on the problem given the way you present it here.
lmm 1144 days ago [-]
> If the formal specification is wrong, the component is still broken in practice.

What does it even mean for an internal specification to be "wrong"? The specification for the system can be wrong, and maybe the component wouldn't be useful in a correctly-specified system, but even then I think it's misleading to say the component is "broken".

Joeri 1143 days ago [-]
All software runs on hardware and all hardware is fallible, ergo all software is fallible, even provably correct software. If you take a networked components view, then the assumption of failure must be baked into every network connection. That covers the vast majority of real world systems.
lmm 1143 days ago [-]
Not all hardware fails undetectably though, and certainly it's possible to separate failure of a component from failure of the network connection to that component. There are definitely parts of a system that need to handle failure, but I think that's different from saying that every component needs to be treated as fallible.
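A sketch of that separation in Python (hypothetical names): the caller can tell "the connection to the component failed" apart from "the component itself violated its contract", and handle the two differently.

    def call_component(send_request):
        # send_request is any callable that talks to the remote component.
        try:
            reply = send_request()
        except (ConnectionError, TimeoutError) as exc:
            # Failure of the network connection: retry, reroute, or degrade.
            raise RuntimeError("transport failure; the component itself may be fine") from exc
        if not isinstance(reply, int) or reply < 0:
            # Failure of the component: its (assumed) contract is "a non-negative int".
            raise RuntimeError(f"component violated its contract: {reply!r}")
        return reply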
marcus_holmes 1144 days ago [-]
And don't ignore the fact that needs change over time - a perfectly designed and functioning component will become imperfectly specified over time as the need for it changes.

And there is the cost/benefit calculation - the effort required to get something perfect is often greater than the benefit of having a perfect component. The purpose of commercial code is to make money, and if technically perfect code makes less money than technically imperfect code, then is it fulfilling its purpose perfectly?

lmm 1143 days ago [-]
> And don't ignore the fact that needs change over time - a perfectly designed and functioning component will become imperfectly specified over time as the need for it changes.

Not my experience; the system may need new components or need to rearrange the existing ones, but the components themselves don't become imperfect. Consider something like a prime factorization algorithm; maybe your system will change to the point where you don't need it, or need to extend it, but it continues to be perfect.
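For concreteness, here is that kind of small, stable component as a toy Python sketch (not anyone's production code). Its spec, that the returned factors are prime and multiply back to n, doesn't drift as the surrounding product changes.

    def prime_factors(n: int) -> list[int]:
        # Trial division: the product of the returned factors equals n,
        # and every factor is prime.
        if n < 2:
            raise ValueError("n must be at least 2")
        factors = []
        d = 2
        while d * d <= n:
            while n % d == 0:
                factors.append(d)
                n //= d
            d += 1
        if n > 1:  # whatever remains is itself prime
            factors.append(n)
        return factors

    # prime_factors(360) == [2, 2, 2, 3, 3, 5]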

> And there is the cost/benefit calculation - the effort required to get something perfect is often greater than the benefit of having a perfect component. The purpose of commercial code is to make money, and if technically perfect code makes less money than technically imperfect code, then is it fulfilling its purpose perfectly?

Tech companies go bankrupt all the time, so I think it's a stretch to say that commercial coding is working perfectly. Automating a sloppy human process in an equally sloppy way can certainly make you money, and maybe there's more money in doing that than in making perfect things, but we should at least acknowledge that there's an unexplored alternative here.

marcus_holmes 1143 days ago [-]
Your example of a component is kinda trivial: this sort of component is usually handled by a standard library and not developed in-house anyway. The stuff that's developed in-house is to service a customer need, and customer needs definitely change over time.

> Tech companies go bankrupt all the time

generally speaking they don't go bankrupt because of bad tech, though. It's usually imperfectly understood customer needs. No amount of perfect tech is going to stop that.

lmm 1143 days ago [-]
> this sort of component is usually handled by a standard library and not developed in-house anyway

Currently yes; that's what I'm arguing could change. Every library has to be written somewhere, and at a lot of jobs I've found myself rewriting the same should-be-standard pieces again and again.

> The stuff that's developed in-house is to service a customer need, and customer needs definitely change over time.

Yes and no. There are definitely parts of the system that change over time, but a lot of what gets written in-house ends up being surprisingly basic/generic things like database mapping, job scheduling etc.. You'd be surprised how often there just isn't a publicly available implementation of the thing you need even when it really "should" be part of the standard library.

> generally speaking they don't go bankrupt because of bad tech, though. It's usually imperfectly understood customer needs. No amount of perfect tech is going to stop that.

I'm not convinced. The way bad tech generally manifests itself is that the company is slow to react to changing customer needs, because it takes too long to make changes to their systems. Having been on the inside, I've definitely seen things that you would think from the outside are business failures but had poor technical choices at the root of why they happened.

marcus_holmes 1142 days ago [-]
I agree with everything you said :) Except I've seen (way, way) more failures from bad business choices than bad technical choices. I'm not saying it doesn't happen. But there's always the example of Twitter and the Fail Whale. That was almost-complete technical failure, and yet the business didn't fail because of it.
Talanes 1144 days ago [-]
>since the system will always have a part that must interact with humans

Yes, such as the part that judges whether something is fallible or not.

bezout 1144 days ago [-]
Well, your argument would only apply to software whose operating contexts are not known to the developers ahead of its release. Is that the case for every piece of software? I’d argue not. Most of the time, you know what to expect. And you can always put up some guards to ensure it fails gracefully.
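A minimal sketch of such a guard in Python (the function and defaults are hypothetical): instead of letting malformed input crash the caller, it degrades to a safe value.

    def parse_page_size(raw, default=20, maximum=100):
        # Return a sane page size even when the input is missing or malformed.
        try:
            value = int(raw)
        except (TypeError, ValueError):
            return default          # malformed or missing input: fall back
        if value < 1:
            return default
        return min(value, maximum)  # clamp absurd values instead of failing

    # parse_page_size("25") == 25; parse_page_size("banana") == 20; parse_page_size(None) == 20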
daliusd 1144 days ago [-]
So the expected limitations in your case are "known to the developers ahead of its release", but what happens with the software after release, e.g. after 10 iterations, or 2 years, or 10 years? What happens if your product grows from tens of users to thousands, or millions? What happens when new functionality must be added and it collides with the existing functionality at multiple points?

You can and you must do your best, but you can't do everything - you must choose what to sacrifice.

di4na 1144 days ago [-]
[Citation needed]

If anything, all the research on complexity and correctness shows that correctness does not compose and does not survive in a dynamic system. I would love to see the research that supports this claim. It would be a revolution in these fields.

v-erne 1144 days ago [-]
Could you point me towards research papers about complexity and correctness? I have this unconfirmed intuition about the way we manage complexity and correctness (that we are building castles on quicksand with our leaky abstractions that try to fit into a constantly changing reality).

I would like to broaden it a bit and find out if there possibly is some truth to it.

fsflover 1144 days ago [-]
stan_rogers 1144 days ago [-]
Blogs are not research papers. No, not even in computing.
fsflover 1143 days ago [-]
The author is a well known security researcher. She probably has corresponding publications, too.
di4na 1143 days ago [-]
The good news: yes.

The bad news: it depends a lot on what you mean, it may not be the answer you're looking for, and I do not have a "perfect" reference, because it is considered mostly obvious in these fields...

But before I get there, a few things.

> (that we are building castles on quick sands with our leaky abstractions that try to fit into constantly changing reality)

Nearly all the references I point to will tell you that this leakiness and this constantly changing reality are exactly what make these systems work in the real world.

Now onto references. A lot of them can be found at http://resiliencepapers.club/

I particularly advise starting at https://www.youtube.com/watch?v=Pb_zYs8G6Co

https://www.youtube.com/watch?v=PGLYEDpNu60

For more depth, I particularly like https://www.ida.liu.se/~729A15/mtrl/CSEnew.pdf

And this is basically a whole meta-study of all the research and ideas that have come to this field since WW2. Beware, it is long: https://www.sintef.no/globalassets/upload/teknologi_og_samfu...

For an example of how things go wrong when the parts are individually correct but do not interact well together, I advise these two papers on space systems. (I link to Adrian's blog posts on them because they may be easier to read, but there is a link to the papers in there.)

https://blog.acolyer.org/2017/11/30/the-role-of-software-in-... https://blog.acolyer.org/2017/12/01/analyzing-software-requi...

If these things make you curious for more, there is a lot. Or if you want to talk about, or have help navigating, what is in the http://resiliencepapers.club/ link, feel free to reach out to me on Twitter or by email. I will be happy to talk more.

SunlightEdge 1144 days ago [-]
This could have been written by CD Projekt Red à la Cyberpunk 2077.

The company knew the game was still full of bugs, but they were under immense pressure to ship it way before it was ready.

rob74 1144 days ago [-]
Well, most AAA titles are released full of bugs - those are the "failures" the article mentions. But most of them manage to avoid the "catastrophe" of being deemed completely unplayable...
doggodaddo78 1143 days ago [-]
Shrugging apathy is the enemy of excellence (not perfection). Btw, with very simple programs it is possible to have zero bugs, but that becomes less and less possible as programs increase in complexity.

As an example, Android is Humpty Dumpty. Manufacturers just try to glue as many pieces together as possible.

ArcMex 1144 days ago [-]
I take this philosophy to heart when developing software. I expect my first version to be imperfect. I then chip away at the imperfections. Before, I wanted to build the perfect system right away. Now failure is planned and part of the process. Some people call it TDD.
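A tiny sketch of that loop in pytest style (the names are made up): the test pins down the imperfection first, then the implementation gets chipped at until it passes.

    def test_slugify_replaces_spaces():
        assert slugify("Everything Is Broken") == "everything-is-broken"

    def slugify(title):
        # First cut: just enough to make the test above pass, nothing more.
        return title.lower().replace(" ", "-")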
JackPoach 1144 days ago [-]
It's OK if you are going to fix it. Otherwise, try living in a house where things are broken.
bloak 1144 days ago [-]
Every house I've lived in has had several things - doors, window blinds, appliances - that only I can operate because it takes practice to work around their brokenness. It's only when I see someone else trying and failing to use them that I remember that they're broken. But it doesn't convince me to repair them. I get used to things and don't like unnecessary change.
afarrell 1143 days ago [-]
It's much harder to fix things if people are unwilling to hear the idea that things are broken.
durnygbur 1144 days ago [-]
So a rented place? Where the owner has no incentive to fix it, and the tenant oftentimes doesn't have the authorization to do so. Especially when the owning entity is focused exclusively on extracting the profits. You accept it, move out, or turn insane.
moocowtruck 1143 days ago [-]
This is where, as a user of software that is oftentimes broken, I wish I had better protections on when/how to get my money back. It's not always obvious when software is broken, and I feel like buyers have very little recourse.
mshaler 1143 days ago [-]
...that's how the light gets in.