Postmortem of a fun couple bugs
Story at my previous job:
Tieg: Hey Tyrel, I can't run invoke sign 5555, can you help with this?
This is How my night started last night at 10pm. My coworker Tieg did some work on our CLI project and was trying to release the latest version. We use invoke to run our code signing and deployment scripts, so I thought it was just a quick "oh maybe I screwed up some python!" fix. It wasn't.
I spent from 10:30 until 1:30am this morning going through and looking into why Tieg wasn't able to sign the code. The first thing I did was re-run the build on CircleCI, which had the same error, so hey! at least it was reproducible. The problem was that in our Makefile scripts we run tidelift version > tidelift-cli.version and then upload that to our deployment directories, but this was failing for some reason. We let clients download this file to see what the latest version is and then our CLI tool has the ability to selfupdate (except on homebrew) to pull this latest version if you're outdated.
Once I knew what was failing, I was able to use CircleCI's ssh commands and log in, and see what happened, but I was getting some other errors. I was seeing some problems with dbus-launch so I promptly (mistakenly) yelled to the void on twitter about dubs-launch. Well would you know it, I may have mentioned before, but I work with Havoc Pennington.
Havoc Pennington: fortunately I wrote dbus-launch so may be able to tell you something, unfortunately it was like 15 years ago
Pumped about this new revelation, I started looking at our keychain dependency, because I thought the issue was there as that's the only thing that uses dbus on Linux. Then we decided (Havoc Pointed it out) that it was a red herring, and maybe the problem was elsewhere. I at least learned a bit about dbus and what it does, but not enough to really talk about it to any detail.
Would you know it, the problem was elsewhere. Tieg was running dtruss and saw that one time it was checking his /etc/hosts file when it was failing, and another time it was NOT, which was passing. Then pointed out a 50ms lookup to our download.tidelift.com host.
Tieg then found Issue 49517 this issue where someone mentions that Go 1.17.3 was failing them for net/http calls, but not the right way.
It turns out, that it wasn't the keyring stuff, it wasn't the technically the version calls that failed. What was happening is every command starts with a check to https://download.tidelift.com/cli/tidelift-cli.version which we then compare to the current running version, if it's different and outdated, we then say "you can run selfupdate!". What fails is that call to download.tidelift.com, because of compiling with go1.17.3 and a context canceled due to stream cleanup I guess?
Okay so we need to downgrade to Go 1.17.2 to fix this. Last night in my trying, I noticed that our CircleCI config was using circle/golang:1.16 as its docker image, which has been superseded by cimg/go:1.16.x style of images. But I ran into some problems with that while upgrading to cimg/go:1.17.x. The problem was due to the image having different permissions, so I couldn't write to the same directories that when Mike wrote our config.yml file, worked properly.
Tieg and I did a paired zoom chat and finished this up by cutting out all the testing/scanning stuff in our config files, and just getting down to the Build and Deploy steps. Found ANOTHER bug that Build seems to run as the circleci user, but Deploy was running as root. So in the build working_directory setting, using a ~/go/tidelift/cli path, worked. But when we restored the saved cache to Deploy, it still put it in /home/circle/go/tidelift/cli, but then the working_directory of ~/go/tidelift/cli was relative to /root/. What a nightmare!
All tildes expanded to /home/circleci/go/tidelift/cli set, Makefile hacks undone, (removing windows+darwin+arm64 builds from your scripts during testing makes things A LOT faster!) and PR Merged, we were ready to roll.
I merged the PR, we cut a new version of TideliftCLI 1.2.5, updated the changelog and signed sealed delivered a new version which uses Go 1.17.2, writes the proper tidelift-cli.version file in deployment steps, and we were ready to ROCK!
That was fun day. Now it's time to write some rspec tests.