Iain McLaren/That time when I spent two weeks hunting down one software bug

Created Tue, 27 Aug 2024 00:00:00 +0000 Modified Tue, 27 Aug 2024 00:00:00 +0000

Bug hunt

This post is a response to two questions that I am commonly asked: ‘Is coding fun?’ and ‘What is coding like?’

TV shows about commercial lawyers are nothing like being a commercial lawyer

The lawyers on these shows never do any work! Actual commercial lawyers spend a lot of their time sitting in front of a computer drafting documents. This clearly does not make good television.

TV shows about computer programmers are also nothing like being a coder

Similarly, TV shows about any type of computer programmer, particularly ‘hackers’, often show teenagers wearing hoodies in dark rooms typing at 3000 words per minute and saying “I’m in” like Neo from The Matrix.

The reality is that a lot of computer programming, and bug or exploit hunting, involves the coder sitting in front of a computer reviewing code for hours and then saying … ‘Hold on, that doesn’t work’. Again, not great television.

Once you start coding it is hard to stop

I learned early on in my career that building software that is used by other people is addictive. Joy comes from creating something that makes the lives of other people easier. Once you start it is hard to stop.

Coding is, by its nature, a solitary activity. It involves solving hard problems in a way that is difficult to describe and not necessarily interesting even for other coders.

Bug hunting can be frustrating … but fun

I recently spent two weeks hunting down an elusive bug. The process was frustrating but ultimately fun. I am writing this because I think that the details might be interesting even for people who are not particularly interested in technology.

My software had a bug

For context, I have built and run software that works like Google Drive or Dropbox. The software stores files in the cloud. The files are automatically synchronised between multiple Apple computers.

There is also an iPhone and iPad app on the Apple App store which allows users to view and interact with the uploaded files.

The bug … and why I was fooled

I went on a trip for two weeks.

When syncing the files, the file sync service runs a full check of all downloaded files and then compares (amongst other things) statistics like the number of files, the number of folders, and the total stored size of the files on both the laptop and in the cloud. These numbers are always changing as different users change the files and folders on their laptops.

If these statistics on the server and the laptop match, then all files are synced, and the software can stop checking files and folders. If not, then the software on the laptop runs another full check of all downloaded files and keeps doing so until these numbers do match.
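The check-and-compare loop described above can be sketched roughly as follows. This is a minimal illustration only, not the code of the actual software: the names Stats, sync_until_matched, full_check and server_stats are hypothetical, and the max_passes limit is an assumption added so that the sketch always terminates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    files: int
    folders: int
    total_bytes: int


def sync_until_matched(full_check, server_stats, max_passes=5):
    """Repeat the full check until local and server statistics agree.

    Returns the number of passes used, or None if the statistics never
    matched within max_passes (the symptom described in this post).
    """
    for attempt in range(1, max_passes + 1):
        local = full_check()         # hypothetical: scans all downloaded files
        if local == server_stats():  # hypothetical: asks the server for totals
            return attempt
    return None


server = Stats(files=5, folders=1, total_bytes=1000)
healthy = sync_until_matched(lambda: Stats(5, 1, 1000), lambda: server)
stuck = sync_until_matched(lambda: Stats(4, 1, 900), lambda: server)
```

In the healthy case the loop stops after one pass. In the stuck case the numbers never agree, so without a limit the loop would churn forever.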

The problem was that the laptops were running this full check, and these statistics on the server and laptops never matched. The software never stopped churning through all of the files and folders to check for changes even when the users were not changing any files or folders.

The journey

(1) Bugs in this software tend to be scale issues or edge cases

Writing the basic code to synchronise files between a computer and the cloud is relatively simple. The tricky parts are:

  • scale issues, i.e. ensuring that the software works for millions of files, with a collective size of multiple terabytes, for large numbers of users, at a price that is cheaper than Apple iCloud / Google Drive / Dropbox; and
  • dealing with edge cases.

(2) The edge cases for this software are generally file edge cases, and internet/network edge cases

(a) File edge cases

File edge cases include dealing with what happens if:

  • two or more files with the same name but different contents are uploaded at the same time;
  • file and folder names use characters that are problematic on Apple laptops (e.g. characters affected by Unicode string normalisation);
  • a folder is deleted on one computer and a file on another computer is updated at the same time;
  • file attributes (such as making the file read only) for the same file are different on two or more different computers;
  • etc.
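To illustrate the string normalisation point in the list above: the same visible file name can be stored as two different sequences of Unicode code points. The sketch below uses Python's standard unicodedata module. The file name and the same_name helper are made up for illustration, and comparing names in NFC form is just one possible approach, not necessarily what the software does.

```python
import unicodedata

composed = "caf\u00e9.txt"     # 'é' as a single code point (NFC form)
decomposed = "cafe\u0301.txt"  # 'e' followed by a combining acute accent (NFD form)


def same_name(a, b):
    """Compare two file names after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


names_differ_raw = composed != decomposed
names_match_normalised = same_name(composed, decomposed)
```

Both names look identical on screen, but a naive string comparison treats them as two different files.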

(b) Internet/network edge cases

Networking edge cases generally involve dealing with flaky internet. For example, we need to deal with what happens if:

  • the internet is not available for one or more computers for seconds, hours, days, or weeks; and
  • the internet is available but flaky in weird ways.

(3) Wait … how can internet access be flaky in weird ways?

I am glad (the hypothetical) you asked. Wi-Fi, particularly hotel Wi-Fi, is often bad in strange ways. For example, some Wi-Fi services:

  • stop downloading files after an arbitrary number of megabytes of data, or after an arbitrary time period;
  • stop uploading files after uploading an arbitrary number of megabytes of data, or after an arbitrary time period; and/or
  • randomly get stuck and pause uploads and/or downloads forever.
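One common way to cope with connections that are cut off part way through a transfer is to resume from the last byte received instead of starting again. The sketch below simulates this without any real network. The names download_with_resume and flaky_fetch, and the 4 byte cut-off, are invented for illustration. A real client would also need timeouts to deal with transfers that hang forever.

```python
def download_with_resume(fetch_from, total_size, max_attempts=10):
    """Keep requesting the remaining bytes until the whole file has arrived."""
    received = bytearray()
    for _ in range(max_attempts):
        if len(received) >= total_size:
            break
        chunk = fetch_from(len(received))  # resume at the current offset
        received.extend(chunk)
    if len(received) < total_size:
        raise IOError("gave up: the connection kept stalling")
    return bytes(received)


# Simulated hotel Wi-Fi: every connection is cut after 4 bytes.
PAYLOAD = b"0123456789"


def flaky_fetch(offset, limit=4):
    return PAYLOAD[offset:offset + limit]


result = download_with_resume(flaky_fetch, len(PAYLOAD))
```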

Travelling is great because it helps test for these types of problems.

(4) Ah, so our bug has to be a weird Wi-Fi issue right?

To find a new bug that we have not encountered in the past, we generally look at what has changed. In this case I was accessing the internet using potentially flaky Wi-Fi. This is not the first time that I have had this kind of problem while travelling and have had to update the software to cater for various types of bad Wi-Fi. It is how I relax while on holiday!

But I couldn’t find the problem. The laptop checked all of the files and folders without reporting any errors. The server was processing these folders without any errors. However, the statistics (e.g. the total number and size of files and folders) on the laptop and server were never the same.

I even tried creating a new test account with the same files and folders, and uploaded all of the folders to the cloud server again, in case this was a strange database corruption issue. The numbers still did not match on the server and laptops.

I was stumped.

(5) Hmm. Still not magically fixed after I returned home

I was unable to fix the issue while travelling and, interestingly, the problem did not go away when I returned home to reliable internet access. Well, maybe saying that any home internet in Australia is ‘reliable’ is too strong. Let’s just say that my home internet access is unreliable in a predictable way?

Anyway, this did not seem to be an internet/network edge case issue.

(6) Time to use brute force. This is where I gave up and manually checked the files and folders to diagnose the issue

In the end I gave up. I ran a brute force comparison between all of the files and folders on the server and one laptop to identify which files and folders were consistently on the server and not on the laptop (and vice versa). For context, this software stores millions of files.
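A brute force comparison of this kind can be as simple as two set differences. This is a rough sketch with made-up paths. The compare_listings function is hypothetical and assumes that both listings are already available as lists of path strings.

```python
def compare_listings(server_paths, laptop_paths):
    """Return (only_on_server, only_on_laptop) as sorted lists."""
    server, laptop = set(server_paths), set(laptop_paths)
    return sorted(server - laptop), sorted(laptop - server)


server_listing = ["/path/to/a.txt", "/path/to/b.txt", "/path/to/c.txt"]
laptop_listing = ["/path/to/a.txt", "/path/to/c.txt"]
only_on_server, only_on_laptop = compare_listings(server_listing, laptop_listing)
```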

Eventually, I tracked this problem down to a folder containing the following files (I have changed the real name of the person to ‘Alex Correia’ here):

  • “/path/to/17635-Alex Correia.pdf”
  • “/path/to/17635-Alex Correia.PDF”
  • “/path/to/17635-Alex Correia.RTF”
  • “/path/to/18340-Alex Correia.RTF”
  • “/path/to/23073-Alex Correia.RTF”

I looked at these file names for about 20 seconds before it clicked. The problem is with the names of the first two files on this list.

The problem

Apple laptops, by default, use filesystems that are case insensitive (although they preserve the case of a name as it was first written). This means that all of the following differently written file names refer to the same file:

  • “/path/to/ABC.DOC”;
  • “/path/to/ABC.doc”;
  • “/path/to/aBc.doc”;
  • “/path/to/abc.doc”; and
  • “/path/to/abc.DOC”.

These cannot be five different files.
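One way to spot names that would collide on a case insensitive filesystem is to group paths by a case-folded key. The sketch below uses Python's str.casefold as an approximation. The find_case_collisions function is hypothetical, and a real filesystem applies its own folding and normalisation rules, which may differ in corner cases.

```python
from collections import defaultdict


def find_case_collisions(paths):
    """Group paths that differ only by letter case."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    return {key: names for key, names in groups.items() if len(names) > 1}


paths = [
    "/path/to/17635-Alex Correia.pdf",
    "/path/to/17635-Alex Correia.PDF",
    "/path/to/17635-Alex Correia.RTF",
    "/path/to/18340-Alex Correia.RTF",
    "/path/to/23073-Alex Correia.RTF",
]
collisions = find_case_collisions(paths)
```

Run against the five file names from the problem folder, only the .pdf and .PDF pair is reported.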

I knew this. However, this had not been a problem for the sync software in the past because there will never be a file on any one laptop that is named “/path/to/abc.DOC” at the same time as another file exists on the same laptop named “/path/to/abc.doc”. Therefore, the software on one laptop will not try to upload both “/path/to/abc.DOC” and “/path/to/abc.doc”.

What seems to have happened is that one laptop uploaded “/path/to/17635-Alex Correia.pdf” to the server and another laptop uploaded “/path/to/17635-Alex Correia.PDF” to the server at the same time, so the server recorded entries for both files. Each laptop then checked whether both “/path/to/17635-Alex Correia.pdf” and “/path/to/17635-Alex Correia.PDF” (i.e. two files) were stored on the laptop. The laptop operating system indicated that yes, both “/path/to/17635-Alex Correia.pdf” and “/path/to/17635-Alex Correia.PDF” were indeed stored on each laptop, because on a case insensitive filesystem both names resolve to the same single file.

In other words, the server indicated that all files had been uploaded without issue, the laptops indicated that all files had been downloaded without issue, but the software insisted that there was one more file stored on the server than had been downloaded to each laptop.
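The mismatch can be reproduced with a toy model of a case insensitive disk. The CaseInsensitiveDisk class below is purely illustrative and is not how a real filesystem or the sync software is implemented.

```python
class CaseInsensitiveDisk:
    """Toy stand-in for a case insensitive, case preserving filesystem."""

    def __init__(self):
        self._files = {}  # case-folded path -> name as first written

    def write(self, path):
        self._files.setdefault(path.casefold(), path)

    def exists(self, path):
        return path.casefold() in self._files

    def count(self):
        return len(self._files)


server_entries = [
    "/path/to/17635-Alex Correia.pdf",
    "/path/to/17635-Alex Correia.PDF",
]
disk = CaseInsensitiveDisk()
for path in server_entries:
    disk.write(path)

all_present = all(disk.exists(path) for path in server_entries)
```

Every individual existence check succeeds, yet the server counts two files while the disk holds only one. That is why the totals never agree.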

Classic edge case. Classic race condition.

The solution … which took about 15 minutes

Once I found the problem, the fix took about 15 minutes.

I just:

  • updated the server software to always treat files with the same case insensitive file name (e.g. “/path/to/17635-Alex Correia.pdf” and “/path/to/17635-Alex Correia.PDF”) as the same file; and
  • built a mechanism to deal with what happens if “/path/to/17635-Alex Correia.pdf” and “/path/to/17635-Alex Correia.PDF” are uploaded from two or more different laptops at the same time.
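A rough sketch of what a fix along these lines could look like is below. It is an assumption-laden illustration, not the actual server code: the ServerIndex class is hypothetical, str.casefold is used as an approximation of case insensitive matching, and the rule that the first spelling to arrive wins is just one possible policy for simultaneous uploads.

```python
import threading


class ServerIndex:
    """Index keyed by case-folded path so that 'x.pdf' and 'x.PDF' share one entry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # case-folded path -> canonical spelling

    def register_upload(self, path):
        """Return the canonical path; the first spelling to arrive wins."""
        key = path.casefold()
        with self._lock:  # serialises uploads that arrive at the same time
            return self._entries.setdefault(key, path)

    def file_count(self):
        with self._lock:
            return len(self._entries)


index = ServerIndex()
names = ["/path/to/17635-Alex Correia.pdf", "/path/to/17635-Alex Correia.PDF"]
threads = [threading.Thread(target=index.register_upload, args=(n,)) for n in names]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever upload arrives first, the index ends up with a single entry, so the server and laptop counts can match again.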

And everything started working again.