Ask HN: Is there academic research on software fragility?
46 points by fedeb95 on Jan 6, 2023 | 46 comments
I keep finding articles that more or less talk about this, but not serious research on the topic. Does anyone have a few pointers?

Edit: to clarify what I mean by fragility: it's how likely complex software, when changed, is to break with unexpected bugs, i.e., fixing a bug causes more bugs.



As others point out, more detail about what you're looking for would help.

I found Out of the Tar Pit[1] somewhat useful. I thought the back half of the paper was disappointing (sorry, functional programming is not the cure to all problems, and state is something inherent that we must deal with), but the definitions of "essential complexity" and "inessential complexity" from that paper are invaluable. Too often I see people/devs/PMs going "simpler is better" where "simpler" would not address the essential complexity: i.e., their simpler == broken for the use case at hand.

But once you have that, then when you see a fragile system, you can start looking at it through a more productive lens of "okay, what of this must I keep, and what complexity can I dispense with?"

[1]: https://curtclifton.net/papers/MoseleyMarks06a.pdf


> the definition of "essential complexity" and "inessential complexity" from that paper are invaluable

To be fair, those terms were defined 20 years earlier by Fred Brooks.


The paper is indeed interesting and a good starting point, thanks for that. It also has some references worth checking out. However, it is rather informal in its reasoning; I have yet to find papers that try to formalise this kind of problem a bit.


While I agree that FP doesn't solve all problems, I've personally found it to remove a large class of complexity regarding state.

For example, separating the deterministic pure logic from the effectful parts increases the percentage of your codebase that's deterministic.

Just personal experience so your mileage may vary.
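
To make that concrete, here's a minimal sketch of the kind of separation I mean (the file name and fields are made up for illustration):

    # Effectful shell: small, boring, and kept at the edges.
    def load_orders(path):
        # Assumes simple "item,price" lines.
        with open(path) as f:
            return [line.strip().split(",") for line in f if line.strip()]

    # Pure core: deterministic, trivially unit-testable, safe to change.
    def total_revenue(rows, tax_rate):
        return sum(float(price) for _item, price in rows) * (1 + tax_rate)

    if __name__ == "__main__":
        print(total_revenue(load_orders("orders.csv"), tax_rate=0.2))

The pure part never touches the file system, so a fix there can be covered by plain unit tests without setting up any environment.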


I think there is a fair bit of discussion on software fragility under the rubric of "robust" software in the SE literature -- sort of the negative space of what you are looking for, but within that topic, causes of fragility are examined.

Sussman wrote an essay in 2007 called "Building Robust Systems": https://groups.csail.mit.edu/mac/users/gjs/essays/robust-sys... It's not a study, admittedly, but it's an example of the term under which you might find what you're seeking.


Could you qualify what you mean by "fragility"?

I think the relevant academic research areas would be software resiliency, software reliability and error recovery, static and dynamic analysis, fuzzing, as well as conceptual frameworks like LANGSEC[1].

[1]: https://langsec.org/


I've clarified in my edit, thanks for pointing that out.


When I look for research, I use Google Scholar (among the last places I can find it).

https://scholar.google.com/scholar?q=software+fragility

A lot of the results are about seismic simulations, but some are about software defects:

Fragility of evolving software - https://dial.uclouvain.be/downloader/downloader.php?pid=bore...

Software is not fragile - https://hal.archives-ouvertes.fr/hal-01291120/document

Overcoming Software Fragility with Interacting Feedback Loops and Reversible Phase Transitions - https://www.scienceopen.com/hosted-document?doi=10.14236/ewi...

Agile or Fragile? - The Depleting Effects of Agile Methodologies for Software Developers - https://core.ac.uk/download/pdf/301378665.pdf


You might need to reframe your terminology on what "software fragility" is referring to. If you mean looking into critical failures in systems due to vulnerabilities in third-party libraries (I'm thinking of Log4j), then the terminology currently used in security is "software supply chain" [1].

My general process is: once you've found an article, like the "Backstabber's knife collection: A review of open source software supply chain attacks" from my link, read through it with a text editor open. Make notes: a summary of what they did, what they found, strengths and weaknesses, and what I like to call "the rabbit hole". If you see a reference to something you're curious about, find the citation for that piece of information and follow up with a reading of that article next. Repeat until you've exhausted the findings, then move on to the next interesting reference.

[1] https://scholar.google.com/scholar?as_ylo=2019&q=software+su...


That's not what I'm referring to. It's how likely a change in software is to cause more bugs. Specifically, how a bug fix could result in more bugs because of interconnections (that's a layman's definition I've constructed based on online material).


Looking at your other responses, "software supply chain" may still be a fruitful term to look through. While my example referred to how a dependency can be a vulnerability, you might find something about how updates to those dependencies introduce bugs. I'm thinking of how the move to Python 3 broke everything. Software supply chains aren't my area of research, but given your descriptions, it sounds like the term will still turn up worthwhile information.


How software changes over time?

API versioning, API deprecation

Code bloat: https://en.wikipedia.org/wiki/Code_bloat

"Category:Software maintenance" costs: https://en.wikipedia.org/wiki/Category:Software_maintenance

Pleasing a different audience with fewer, simpler features

Lack of acceptance tests to detect regressions

Regression testing: https://en.wikipedia.org/wiki/Regression_testing

> Regression testing (rarely, non-regression testing [1]) is re-running functional and non-functional tests to ensure that previously developed and tested software still performs as expected after a change. [2] If not, that would be called a regression.

Fragile -> Software brittleness https://en.wikipedia.org/wiki/Software_brittleness


Thanks, I will check out the references mentioned by Wikipedia and also include the term "software brittleness" in my searches. It seems related, even if maybe not exactly my goal.


Yeah IDK if that one usage on Wikipedia is consistent:

> [Regression testing > Background] Sometimes re-emergence occurs because a fix gets lost through poor revision control practices (or simple human error in revision control). Often, a fix for a problem will be "fragile" in that it fixes the problem in the narrow case where it was first observed but not in more general cases which may arise over the lifetime of the software. Frequently, a fix for a problem in one area inadvertently causes a software bug in another area.


Perhaps it's because I am inexperienced, but I don't understand why software is so much more complicated than other fields, like electronics engineering or industrial engineering. Those are expensive to debug and have complicated interactions, yet engineers somehow find ways to make them robust without the cost skyrocketing.

Perhaps it's because software is now the interface between different systems, and we are desperately trying to abstract away the underlying systems, yet the details eventually leak through and cause other issues? Perhaps because complexity multiplies: physical systems are limited by 3D space, while software systems can become entangled without bound? Just some naive thoughts.


I think it's actually because in a modern environment, software is cheap to deploy, debug and update. That leads to under-investment in design and testing. For many software environments, I think the root cause of unreliability is that it's perceived to be cheaper to build and observe failures in production than it is to invest in adequate design review and QA.

It's instructive to consider how things went in parts of the software industry where failures are more expensive, for example:

- Avionics firmware gets a lot more scrutiny than a trendy website.

- In the days when consoles shipped games on cartridges, a lot of time went into QA. Today, a day-one patch is considered table stakes -- if the game even has a physical release.


Here's a shot at an answer, lifted from 'The Mythical Man Month', Fred Brooks:

"Software entities are more complex for their size than perhaps any other human construct, because no two parts are alike (at least above the statement level). If they are, we make the two similar parts into one, a subroutine, open or closed. In this respect, software systems differ profoundly from computers, buildings, or automobiles, where repeated elements abound."


Part of the issue is that there is "no cost" to changing something in the software stack, while there are very tangible costs and barriers to entry to modifying a physical structure. This tends toward a more conservative culture, where controls are placed on designs and changes. This culture also exists in safety-critical software, like that used in medical devices, aviation, and industrial automation.

Otherwise, robustness is relatively expensive because it requires the organization to value the long-term quality and function of the system, at the expense of short-term velocity and malleability. If you are competing with others who can hack together an MVP with 90% functionality overnight, then waiting for the engineered product may be problematic.


Software is more malleable than, say, a bridge. This leads to a lot of input about how to best do things, which generates a lot of rework, which generates a lot of billing hours, and keeps us all employed.

Does it need to be like this? No. But until the VC MBA gods get their hands off the tiller and we face actual energy scarcity, expect more stupidity.


Software fragility is actually very similar to physical fragility.

Think of a Jenga tower: the more blocks you have, the more fragile it is. And that is despite Jenga being nicely layered (each block has a limited number of direct dependencies).

There are two main ways to decrease fragility:

1) lay out your blocks more carefully.

2) decrease the total number of blocks.

What's interesting is that most software development practices focus on 1. How do you make a complex system less fragile? Use a big framework, do unit tests, use static typing, have protected branches, write documentation.

While the biggest payoffs are always in the reduction of blocks. KISS.


Years ago I was working on an architecture where multiple threads were processing data in parallel. But there were dependencies.

I spent several weeks trying to figure out how to coordinate these processes.

That is until someone suggested I design it so no coordination is needed at all. It was a good lesson learned.

I was trying to solve a problem that shouldn't have been solved in the first place.

It taught me to ask the question often: do I need this at all?


Yes, I am aware of this. Your points are valid, but I am trying to gather some proper research on actual projects about this and not just anecdotal evidence (which is fine, I have many too, but should be supported by something else ideally)


You have to accept some level of complexity in order to have the features users want. Suckless software is KISS AF, but a common complaint is that it is feature-poor and thus unpleasant to use.


I could not agree more. I feel like there is constant push back against attempts at decreasing the total number of blocks.


What if some problems simply are complex? How do you reduce the number of blocks then?


Typically you can't reduce the number of essential blocks. But what you can do is make them easier to comprehend through divide and conquer. Build two smaller Jenga towers instead of one big one. You will have to sacrifice something (e.g. performance) for stability and clarity.

Gotta be careful though. There's a religion that claims every 10-block Jenga tower should be built as 10 single-block towers.


David Parnas coined the phrase "Software Aging" to explain why software tends to get more fragile over time. The references in the wikipedia entry on it might be a good place to start.


That seems to be related to a running executable, and the process of restarting the application to get it back to a known good state.

In my experience with software that is still being developed or patched over time, I commonly see cases where the initial specification is good. The problem comes when that specification gets extended piecemeal and you run into "the straw that broke the camel's back": 'A + B' is OK, but 'A + B + C' has far more failure modes, because you're massively increasing the amount of system usage and testing necessary to validate the product.

For example, adding more metrics to a piece of software. I've commonly seen this done by extending an existing table already in the DB. The developer tests the workflows they can see and implements the correct indexes. Then a month later some other seemingly unrelated feature gets added, and when it tries to pull some of those metrics into a report (or chart or whatever), the system falls over because it's doing a full table scan for joined data that someone missed.
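
As a minimal sketch of that failure mode (SQLite purely for illustration; the table and column names are made up), this is the kind of check that would have caught it:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE metrics (
            id          INTEGER PRIMARY KEY,
            device_id   INTEGER,
            recorded_at TEXT,
            value       REAL
        );
        -- Index added for the workflow the original developer tested.
        CREATE INDEX idx_metrics_device ON metrics (device_id);
    """)

    # The later report filters on recorded_at, which no index covers.
    report_query = "SELECT value FROM metrics WHERE recorded_at >= ?"

    for row in con.execute("EXPLAIN QUERY PLAN " + report_query, ("2023-01-01",)):
        print(row[-1])  # something like 'SCAN metrics', i.e. a full table scan

Running EXPLAIN (or your database's equivalent) on the new query before shipping makes the hidden dependency on the missing index visible.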


A well-designed architecture will always be the guard against fragility. I do think a lack of architecture design and technical design discussions contributes to the current state of software fragility. There are too many tools contributing to the noise, and not enough tools reducing that noise.

Software is hard because it is the last 10% ... Hardware was the first 90%, but anyone with project experience knows how long the last 10% lasts.


This is somewhat of a tautology: if your architecture was good then it still is. But it is hard to define "good" or "well designed" or "better than that other one".

A slightly different take: there are a couple of categories for system changes: (a) adaptive changes to respond to a changing environment or requests for new functions, and (b) corrective changes which fix bugs (bugs of any age).

Examining a proposed architecture with these two categories in mind might help. When changes are correctly categorized, and the actual scope of a change matches what the architecture anticipated for that category, the architecture can be judged as better or worse. Or more or less survivable.

And from yet another perspective entirely: choose between two or three possible architectures. If you haven't got a choice, then you need to fix that.


In very long-lived software (I'm talking 20+ years) you don't have room for such questions; can something still be done to reduce fragility?


That may be true, but many, myself included, live in a world where only so-so design is allowed, with many teams touching many products, often with different people, bad decisions, and pushy managers. Can something be done under these constraints to reduce fragility? I'm starting from a definition of it, or maybe a description or measure.


One bad bit can stop the show; all else follows from that, pretty much.

Even if there is redundancy in hardware to catch a bad bit, software contains a lot of inter-connected logic in which there is no mitigation for an unexpected, incorrect value.

There are chains of dependencies such that the correct behavior is a giant conjunction of propositions: if this works, and this is correct, and this configuration is right, and so on, then we get stable behavior with good results. Conjunctions are fragile; one incorrect proposition and the whole conjunction is false.
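
To put rough numbers on that: assuming (generously) that the propositions fail independently, a behavior that depends on 100 of them, each holding 99% of the time, holds as a whole only 0.99^100 ≈ 37% of the time.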


Yes, but I'd like something more like: can two codebases be compared in terms of fragility? Can fragility increase and decrease with certain practices?


This isn't a typical academic research paper, but it is from a University of Chicago researcher, and I think it gives a great overview of failure modes of complex systems in general (software included): https://how.complexsystems.fail

As someone who writes a lot of complex/evolving data analysis software that needs to work correctly, I find some of the considerations listed in the above to be immensely helpful.


Thanks, I will look at it more carefully later, but it seems to be about runtime failures of complex systems, for instance because of stress or emergent behaviours. My goal is to find the fragility that is somewhat static. For instance: I have a bug in a complex codebase, I fix it in the best possible way, and I end up with more bugs because of hidden dependencies and interconnections. This has happened to me multiple times. I need to study the problem of fragility with respect to fixes, in a way, not to other external conditions. But maybe the same reasoning as for other problems applies.


It's not academic research, but I recommend Marianne Bellotti's "Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones)", about modernizing legacy software systems and managing the teams maintaining them. The author worked at the United States Digital Service.

https://nostarch.com/kill-it-fire


Here's an empirical example. I followed a step-by-step tutorial on how to set up a server with Apache + Linux + SSL.

The backbone of the internet.

It's still not working, and it's been two days.

Software is cobbled together with duct tape and cow chips.

After 10 years in the field, my take is:

We’ve lied to ourselves so much that we believe our own lies


Counterpoint: follow a different guide, use nginx or caddy instead of Apache, and you can be live in under thirty minutes.

In any case, this has nothing to do with the question. It wasn't about how smart you have to be to set it up yourself; it was about how robust or fragile it is once put into place. A better point would be that we used to have certificates with ten-year expirations (not fragile) but susceptible to a myriad of advanced security issues. The solution was to reduce that to certificates that expire every few months, held together by daemons and scripts that renew them automatically in the background, significantly increasing the number of simultaneously moving parts and making the whole more fragile (with obvious security benefits).


It is my belief that engineers get bored and then create the most convoluted solutions for the simplest problems.

More complex = more smartsss.

I want to move away from the Model T; I want an F-150 with push-to-start.


Eh, I wouldn't use the Model T as a good example. The hand crank could kill you, and the acceleration and gear system were a gigantic mess compared to the cars that showed up a decade or two later.

There are plenty of later model cars that add back seemingly unneeded complexity for little gain that would be better examples.


I can agree with that. Right now going live is all that matters.

To add to your point: I chose the most popular stack with the biggest mindshare, just in case something goes wrong. I'm a one-man shop, so why fight against it?

I'll try Nginx…


Apache is from the days where everything in computing was difficult and a massive number of lessons about complexity and security had not been learned yet.

It really only exists because it was the only option at some point in the past and software grew up around it so it kept momentum.

If you're doing anything new you wouldn't use it at all; you'd instead use one of the multitude of servers that are far simpler and better at the same tasks.


Apache isn't known to be the most user-friendly; instead, consider Caddy[0], where a full-featured web proxy/file server can be as easy as:

    example.com {
        reverse_proxy localhost:9000
    }
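
Run caddy with that Caddyfile in the working directory and it also obtains and renews the HTTPS certificates automatically (assuming the domain's DNS already points at the box), which covers the SSL part of your two days.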

[0] https://caddyserver.com/docs/getting-started


I'm doing research on this right now actually.


That's interesting. Where can I follow your work, if I may ask?



