Coding for failure

Well, there is a week of my work life gone that I'll never get back. Doesn't sound so bad I guess. But it is a little frustrating. There isn't really anyone to blame. But there are faults to be addressed and things to be considered.

It is work related, so I'll avoid getting too detailed.

The short version of the story is this: we've been seeing a problem lately which only seems to happen under load, and very rarely even then. It isn't easily reproducible, and we haven't seen it on other types of setups. But we have multiple environments which are set up like this one, and we can reproduce it in those.

After ruling out how the clients are talking to the system, the next course of action was to dive into the server side and throw in code to try and detect when the situation was happening.

When I first started looking, I noticed a setting we were using which was suspicious. I knew that it could, in the right circumstances, cause the sort of problem we were seeing. But I ruled it out as the culprit, because the setting I expected to see should behave the same way in our environment as the one being used.

I also ruled this out in the short run for a few other reasons. I wasn't the original architect of this code, and that developer was with another company now, so I couldn't confirm whether the decision was intentional. Also, I couldn't easily detect the effect of a change, and at this point I was having enough trouble reproducing things that I ran the risk of making changes and getting false negatives. When you have a bug you only see a few times an hour when you're lucky, you need to stick to solutions you can test more thoroughly.

I noted it though, because I wanted to investigate it further later. As far as I was concerned, using this setting was a bug.

So, over the days after that, I improved my testing tool. I was now able to reproduce the error to the point of getting it 10-15 times every 15 minutes, which is frequent enough that I could test the impact of any change and know fairly quickly if it worked. I then tossed in debugging, getting progressively deeper into the code.

At the end of last week, I had reached a place where I was certain: the application was doing pretty much everything we expected it to do. It was completely unable to detect that this behavior was happening. As far as it was concerned, it wasn't making any mistakes. If I can't detect the error condition, then I can't intentionally correct it in the code when it happens. Which meant I needed to start changing other things and see if they would stop it from happening in the first place.

My testing harness had gotten good enough that if I could change things selectively, then I could probably just run my tests longer to increase the result set as a means of validating.
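To put a rough number on "run my tests longer", here is a minimal sketch. It assumes the errors arrive independently at a roughly constant rate (a Poisson model, which is my assumption and not something I measured), and the function name is just for illustration. It estimates how likely a completely clean run would be if a change had fixed nothing.

```python
import math

def prob_zero_errors(errors_per_window, window_minutes, run_minutes):
    """Chance of seeing no errors in a run if the old error rate still applied.

    Assumes independent errors at a constant rate (Poisson arrivals).
    """
    rate_per_minute = errors_per_window / window_minutes
    return math.exp(-rate_per_minute * run_minutes)

# Using the low end of the observed rate: 10 errors every 15 minutes.
print(prob_zero_errors(10, 15, 15))    # clean 15-minute run
print(prob_zero_errors(10, 15, 180))   # clean 3-hour run
```

Under that assumption, even a short clean run would be very unlikely by chance at the old rate, and a run of a few hours makes it vanishingly so.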

My first change? That setting. It drove functionality in a 3rd party library, which is why I couldn't debug into it. To test, I changed the code to use the original value for most accounts, but for one account to use the value I expected. Slowly, as the results came in one after another, the errors only affected the accounts using the original setting. Then I switched it around and used the value I expected for every account but one. Just that one account hit the errors. Pretty much the nail in the coffin at this point. I enabled it globally and let the tests run for a few hours. 0 errors.
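The per-account split can be sketched roughly like this. Everything here is hypothetical: the setting values, the function names, and the fake request function standing in for the real system are made up for illustration, since I'm not naming the actual setting or library.

```python
import random
from collections import Counter

# Hypothetical names for the two setting values; the real ones are not given.
ORIGINAL = "permissive"
EXPECTED = "strict"

_rng = random.Random(0)  # seeded so the simulation is repeatable

def setting_for(account, special_accounts, special_value, default_value):
    """Pick the setting value for one account."""
    return special_value if account in special_accounts else default_value

def run_trial(accounts, special_accounts, special_value, default_value,
              request_fn, rounds):
    """Send `rounds` requests per account and count errors per setting value."""
    errors = Counter()
    for _ in range(rounds):
        for account in accounts:
            value = setting_for(account, special_accounts,
                                special_value, default_value)
            if not request_fn(account, value):
                errors[value] += 1
    return errors

def fake_request(account, value):
    """Stand-in for the real system: fails sometimes, only under ORIGINAL."""
    return not (value == ORIGINAL and _rng.random() < 0.1)

accounts = ["a", "b", "c"]
# First pass: one account on the expected value, the rest on the original.
print(run_trial(accounts, {"a"}, EXPECTED, ORIGINAL, fake_request, 200))
# Second pass: flipped, so only one account stays on the original value.
print(run_trial(accounts, {"a"}, ORIGINAL, EXPECTED, fake_request, 200))
```

Running the split in both directions is the useful part. If the errors follow the setting value and not any particular account, the account itself is ruled out as the cause.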

Get permission to reach out to the original developer and ask if there was a reason for the choice of setting. They say that there isn't, but like me they thought it should behave the same.

At this point, they express concern over changing it though. And I understand. We have several applications which share this library. It is really difficult to be sure that changing this won't break something else.

I'm still not sure why this change fixes it. By all accounts, while my setting is the more correct one, they both should act the same in our application. It is a mystery why they don't.

But, it is also a mystery why it was coded that way in the first place. The original author felt the need to explicitly overwrite the default setting in the 3rd party library. But, he overwrote it, not with the value which represented the behavior he expected, but rather with a more permissive behavior that made this situation possible.

The code was designed to permit failure.

And thus, I hit a dilemma. What do I take away from this experience? I believe I was right in NOT changing it when I first encountered it. I didn't honestly believe it was the cause of the problems and I don't believe developers should be changing code without a well documented reason. A younger version of me did that and got burned on numerous occasions.

At the same time, I can't simply fault the original developer either. We can't always research every facet of every decision ahead of time. I don't know what he knew at the time he wrote it or what he may have been considering. So, I can't guarantee I wouldn't make the same mistake.

Ideally, I would come out of this with some guidance for myself so that I can either catch and fix these errors quicker in the future, or suggestions on how to prevent my own code from causing similar problems down the road.

I don't want to walk away from a problem which took well over a week with nothing gained except the problem being solved. Especially not in a case where I looked at and thought about the thing which ended up being the solution much earlier on in the process.

I guess in the spirit of honestly trying to take something away from this... I should have pursued the matter. Part of the reason I held back was because I suspected that if I had made the change and it hadn't resolved the problem that I might have left it in there simply because it was my preference. In the end though, I rolled back my other changes which didn't solve the problem. I may need to learn to be stricter on myself. Though it will be hard. That was really just a hunch. I'll also need to practice distinguishing good hunches from bad. I could easily go way off course following hunches.
