Tyrel's Blog

Code, Flying, Tech, Automation

Nov 11, 2021

Postmortem of a fun couple of bugs

A story from my previous job:

Tieg: Hey Tyrel, I can't run invoke sign 5555, can you help with this?

This is how my night started last night at 10pm. My coworker Tieg had done some work on our CLI project and was trying to release the latest version. We use invoke to run our code signing and deployment scripts, so I thought it would be a quick "oh, maybe I screwed up some Python!" fix. It wasn't.
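For context, our release entry point is just an invoke task. A minimal sketch of the shape of it (the task body here is hypothetical; ours wraps the real signing and deploy scripts):

from invoke import task

@task
def sign(c, build_number):
    # Pull the CI artifacts for this build number and run our
    # code-signing scripts over them (details elided).
    c.run("./scripts/sign-artifacts.sh %s" % build_number)

With that in place, invoke sign 5555 just maps the positional argument onto build_number.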

I spent from 10:30 until 1:30am this morning looking into why Tieg wasn't able to sign the code. The first thing I did was re-run the build on CircleCI, which hit the same error, so hey! at least it was reproducible. The problem was that in our Makefile scripts we run tidelift version > tidelift-cli.version and then upload that file to our deployment directories, and this was failing for some reason. We let clients download this file to see what the latest version is, and our CLI tool has the ability to selfupdate (except on homebrew) to pull that latest version if you're outdated.

Once I knew what was failing, I was able to use CircleCI's SSH access to log in and see what happened, but I was getting some other errors. I was seeing some problems with dbus-launch, so I promptly (mistakenly) yelled into the void on Twitter about dbus-launch. Well, would you know it, I may have mentioned before, but I work with Havoc Pennington.

Havoc Pennington: fortunately I wrote dbus-launch so may be able to tell you something, unfortunately it was like 15 years ago

Pumped about this new revelation, I started looking at our keychain dependency, because that's the only thing that uses dbus on Linux, so I thought the issue was there. Then we decided (Havoc pointed it out) that it was a red herring, and the problem was probably elsewhere. I at least learned a bit about dbus and what it does, but not enough to talk about it in any detail.

Would you know it, the problem was elsewhere. Tieg ran dtruss and saw that on a failing run the binary was checking his /etc/hosts file, while on a passing run it was NOT. He then pointed out a 50ms lookup to our download.tidelift.com host.

Tieg then found Go issue 49517, where someone mentions that Go 1.17.3 was breaking their net/http calls, though not in quite the way we were seeing.

It turns out it wasn't the keyring stuff, and technically it wasn't the version calls that failed. Every command starts with a check to https://download.tidelift.com/cli/tidelift-cli.version, which we compare to the current running version; if it's different and you're outdated, we say "you can run selfupdate!". What fails is that call to download.tidelift.com, because of compiling with go1.17.3 and a context canceled due to stream cleanup, I guess?
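The CLI itself is written in Go, but the shape of that startup check is roughly this, sketched here in Python (the URL is real; the function and comparison details are illustrative):

import urllib.request

VERSION_URL = "https://download.tidelift.com/cli/tidelift-cli.version"

def check_for_update(current_version):
    # This runs before every command; it's the call that was dying
    # with "context canceled" when built with go1.17.3.
    response = urllib.request.urlopen(VERSION_URL, timeout=5)
    latest = response.read().decode().strip()
    if latest != current_version:
        print("you can run selfupdate!")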

Okay, so we needed to downgrade to Go 1.17.2 to fix this. In my poking around the night before, I had noticed that our CircleCI config was using circle/golang:1.16 as its docker image, which has been superseded by the cimg/go:1.16.x style of images. But I ran into some problems while upgrading to cimg/go:1.17.x: the new image has different permissions, so I couldn't write to the same directories that had worked properly back when Mike wrote our config.yml.

Tieg and I did a paired Zoom chat and finished this up by cutting all the testing/scanning stuff out of our config files and getting down to just the Build and Deploy steps. We found ANOTHER bug: Build seems to run as the circleci user, but Deploy was running as root. So in Build, a working_directory setting of ~/go/tidelift/cli worked fine. But when we restored the saved cache in Deploy, the cache still landed in /home/circleci/go/tidelift/cli, while Deploy's working_directory of ~/go/tidelift/cli expanded relative to /root/. What a nightmare!

With all tildes expanded to /home/circleci/go/tidelift/cli, the Makefile hacks undone (removing windows+darwin+arm64 builds from your scripts during testing makes things A LOT faster!), and the PR ready to merge, we were ready to roll.

I merged the PR, we cut a new version of TideliftCLI 1.2.5, updated the changelog, and signed, sealed, delivered a new version which uses Go 1.17.2 and writes the proper tidelift-cli.version file in the deployment steps. We were ready to ROCK!

That was a fun day. Now it's time to write some rspec tests.

 · · ·  Go  dbus  bugs

Jan 28, 2015

Too many open files

When I worked at Propel Marketing, we used to outsource static websites to a third party vendor and then host them on our server. It was our job as developers to pull down the finished website zip file from the vendor, check it to make sure they used the proper domain name (a lot of the time they didn't), and make sure it actually looked nice. If these few criteria were met, we could launch the site.

Part of this process was SCPing the directory to our sites server, which was where we had Apache running with every custom static site as a vhost. We would put the website in /var/www/vhosts/domain.name.here/ and then create the proper files in sites-available and sites-enabled (more on this in another entry). After that, the next step was to run a checkconfig and restart Apache.

Here's where it all went wrong one day. If I recall correctly, my boss was on vacation, so he had me doing a bit of extra work and launching a few more sites than I usually do. Not only that, but we also had a deadline at the end of the month, which was either the next day or the day after. I figured I'd just set up all my sites for both days and then have some extra time the next day for other things to work on. So I started launching my sites. After each one, I would add the domain it was supposed to be at to my /etc/hosts file and make sure it worked.

I was probably halfway done with my sites when suddenly I ran into one that didn't work. I checked another one to see if maybe it was just my network being stupid and not liking my hosts file, but no, that wasn't the problem. Suddenly, EVERY SITE stopped working on this server. Panicking, I deleted the symlink in sites-enabled and restarted Apache. Everything worked again. I put that site aside; maybe something in its PHP files was breaking the server, who knows, but I had other sites I could launch.

I set up the next site and the same thing happened again: no sites worked. Okay, now it's time to freak out and call our sysadmin. He didn't answer his phone, so I called my boss JB. I told him the problem and he said he would reach out to the sysadmin and see what was going on, all the while me telling JB "It's not broken broken, it just doesn't work, it's not my fault" etc etc. A couple hours later, our sysadmin emailed us back and said he was able to fix the problem.

It turns out there's a hard limit to the number of files a user can have open, and it was set to 1000 for the www-data user. The site I launched was coincidentally the 500th site on that server, and each site has an access.log and an error.log. Apache keeps those two files open constantly for every site so it can log to them. He changed www-data's ulimit to something a lot higher (I don't recall now what it was), which gave a lot more leeway in how many sites the sites server could host.
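If you ever want to check those limits yourself, here's a minimal sketch using Python's resource module (the shell equivalent is ulimit -n):

import resource

# Soft/hard limits on open file descriptors for the current process.
# Each Apache vhost holds an access.log and an error.log open, so
# 500 sites means ~1000 descriptors charged to the www-data user.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file limit: soft=%d, hard=%d" % (soft, hard))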

 · · ·  python  linux  ulimit  bugs

Aug 06, 2013

Help, I have too many Django ManyToMany Queries [FIXED]

My boss tasked me with getting the load time of one page down from 90 seconds (HOLY CARP!). The first thing I did was install the Django Debug Toolbar to see what was really happening.

There are currently 2,000 users in the database. The way our model is set up, a UserProfile can have other UserProfiles attached to it in one of three M2M relations, which in the Django Admin causes 2,000 queries PER M2M field. This is very expensive: even at 0.3ms each, you really don't want 10,000 queries on one page.

The solution, after a day and a half of research, is to override the formfield_for_manytomany method in the Admin class for our UserProfile object.

Our solution is to prefetch the related user for any M2M field that points back at the current model.

def formfield_for_manytomany(self, db_field, request, **kwargs):
    # Only touch M2M fields that point back at this model, i.e. the
    # self-referential UserProfile <-> UserProfile relations.
    if db_field.__class__.__name__ == "ManyToManyField" and \
            db_field.rel.to.__name__ == self.model.__name__:
        # Prefetch each profile's related user so the admin widget
        # doesn't fire one query per choice it renders.
        kwargs['queryset'] = db_field.rel.to.objects.prefetch_related("user")
    return super(UserProfileInline, self).formfield_for_manytomany(
        db_field, request, **kwargs)

This goes inside our admin class UserProfileInline(admin.StackedInline). Simple, clean, and easy to drop into another ModelAdmin with minimal changes.

Another thing I pondered was setting all our M2Ms as raw_id_fields and then using Select2 or Chosen to query our UserProfiles as the related users were being selected. This would take a lot of the load off the initial page render, but it's more of a bandaid than a real fix.
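For reference, that bandaid would have looked something like this (the three field names here are hypothetical stand-ins for our real M2M relations):

from django.contrib import admin

class UserProfileInline(admin.StackedInline):
    model = UserProfile
    # Render M2M fields as plain ID inputs instead of loading every
    # UserProfile into the widget up front; a JS picker like Select2
    # or Chosen would then search profiles on demand.
    raw_id_fields = ("mentors", "mentees", "peers")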

I also tried overriding the Admin class's def queryset(self, request):, but that didn't affect anything.

 · · ·  python  django  bugs