On errors masked and revealed

In a recent talk, Professor Daniel Kahneman looks back at his research in cognitive psychology. In a side remark he notes that errors are "much more diagnostic about the mechanisms, so there might be different mechanisms for doing the same thing perfectly, and you're not going to be able to distinguish the mechanism by observing it". The approach of focusing on errors (or anomalies) when trying to model something accurately is of course a long-standing one. For example, we have the study of anomalies in medicine; natural experiments in the social sciences; and the practice of intentionally breaking and perturbing things in just about any other discipline, including software engineering.

However, one of the striking characteristics of software engineers (perhaps shared with other kinds of designers, too) is that, unlike those scientists who attempt to peek under the hood of nature's work, we are obsessed with the idea of hiding the "implementation details" of our own creations - that is, abstracting away the mechanisms by which they operate to the greatest possible extent at all conceivable levels. I think that the driving force behind this is not so much the traditionally touted benefits: interoperability and the flexibility of exchanging implementations (which, to be honest, seldom needs to be done). Rather, it is our desire to simplify our own mental models, to reduce the cognitive load of having to think about (and worse, test) too many elements and interactions at once. Maybe so much so as to shift some of the processing from System 2 to System 1, in Kahneman's terms. We strive to reduce the probability of mistakes - which on the economic level also means reducing development and maintenance costs for our clients.

When software systems fail, the tables may be turned on us - the hidden mechanisms then become apparent, and the usually blessed lack of familiarity with the implementation details becomes a curse. We are then forced to learn things about our systems that we have never bothered to understand. In more extreme cases, this aspect may even represent a significant liability and risk to the maintainer, if contractual obligations demand quick, correct reactions.

I think a few conclusions may be drawn from the above:

Less intricate, "uglier" designs that reveal more of the underlying mechanisms, contain fewer layers of indirection, and put smaller demands on supporting technology and tools might actually be preferable for economic reasons - they are more attractive from the point of view of contingency planning (if a total or maximum time-to-repair is a concern).
The skills of a software developer and those of a maintainer or troubleshooter are intricately linked, but nevertheless quite distinct. Great software developers know which internal aspects of (their) software to hide and which to reveal to make its normal operations easy to grasp and the rare breakdown possible to handle. Great troubleshooters know how to reveal those aspects that were intentionally hidden by the (possibly not-so-great) developers in order to understand the causal connections and required interventions. Ideally, they also have an extensive knowledge of the under-the-hood mechanisms of a particular software product or system which they maintain.
The skill set of a software developer is rightly viewed as a superset of (and thus more valuable than) that of a troubleshooter or system administrator: a developer typically not only has to troubleshoot and analyze other developers' software because of the interfacing and teamwork requirements, but he also has to anticipate and support the future troubleshooting scenarios that will occur with respect to his own software. On the other hand, he can usually avoid accumulating truly significant amounts of factual knowledge about particular products or operating environments and rely on more general procedural knowledge.

I don't know about others, but I do seem to spend most of my software development time thinking about errors and mistakes - real and possible, static, dynamic, happening inside of my code and in the users' world around it.

plosquare.com blog