This blog post describes how I discovered a linux feature called “OOM Killer” that can have strange effects if it interrupts a program at a place where it really shouldn’t be interrupted.
I have a low-end VPS (Virtual Private Server), or at least: it used to be low-end, now it’s at least one step above lowest and cheapest.
On this server I’m running various personal stuff: email (with spamassassin), some public web pages, and some private web pages with various small webapps for the household (e.g. the “weekly allowance app” registering work done and payments made).
I had noticed occasional slow startups of the webapps, and in particular in September this year, when I was demonstrating/showing off the webapps at this year’s JavaZone the demos were less than impressive since the webapps took ages to load.
I was quick to blame the slowness on the wi-fi, but as it turns out, that may have been completely unfair.
The webapps had no performance issues on my development machine even when running with a full desktop, IDE and other stuff. The webapps on the VPS also seemed to have no performance issues once they loaded.
I thought “this is something I definitely will have to look into at some later time…” and then moved on with doing more interesting stuff, i.e. basically anything other than figuring out why the webapps on a VPS were slow to start.
But then the webapps started failing nightly and I had to look into the problem.
What I saw in the logs was that the reason the webapps were broken in the morning was that they were stuck waitning for a liquibase database changelog lock that never was released.
Liquibase is how my webapps set up and update database schemas. Every time a webapp starts it connects to the database and checks what liquibase scripts that have been run against that database and applies the ones that have not already been run. The list of scripts that have been run is a tabled called databasechangelog. And to avoid having more than one liquibase client attempting to modify the database schema, liquibase uses a different table called databasechangeloglock to moderate write access to the database,
I.e. the databasechangeloglock is just a database table that has one or 0 rows. A liquibase client tries to insert a lock into the table at startup and waits and retries if this fails (and eventually completely fails).
In my case the webapps were failing because they were hanging at startup, trying to get a liquibase lock and failing to get one and were hanging in limbo and never completing their startup process. Manually clearing the lock from the table and restarting the webapps made the webapps start up normally. However, the next day the webapps were failing again for the same reason: the webapps were stuck waiting for a liquibase lock.
I initially suspected errors in my code, specifically in the liquibase setup. But googling for similar problems, code examination and debugging revealed nothing. I found nothing because there was nothing to be found. The actual cause of the problem had nothing to do with the code or with liquibase.
I run my webapps in an instance of apache karaf that is started and controlled by systemd. And I saw that karaf was restarted 06:30 (or close to 06:30) every morning. So my next theory was that systemd for some reason decided to restart karaf 06:30 every morning.
No google searches for similar symptoms found anything interesting.
So I presented my problem to a mailing list with network and linux/unix experts and got back two questions:
- Was something else started at the same time?
- Did that something else use a lot of memory and trigger the OOM killer?
And that turned out to be the case.
I have been using and maintaining UNIX systems since the mid to late 80ies and setting up and using and maintaining linux systems since the mid to late 90ies, but this was the first time I’d heard of the OOM killer.
The OOM killer has been around for a while (the oldest mention I’ve found is from 2009), but I’ve never encountered it before.
The reason I’ve never encountered it before is that I’ve mostly dealt with physical machines. Back in the 80ies I was told that having approximately two and a half times physical memory was a good rule of thumb for scaling swap space, so that’s a rule I’ve followed ever since (keeping the ratio as the number of megabytes increased, eventually turning into gigabytes).
And when you have two and a half times the physical memory as a fallback, you never encounter the conditions that make the OOM killer come alive. Everything slows down and the computer starts trashing before the condtions that triggers the OOM killer comes into play.
The VPS on the other hand, has no swap space. And with the original minimum configuration (1 CPU core, 1GB of memory), if it had been a boat it would have been said to be riding low in the water. It was constantly running at a little less than the available 1GB. And if nothing special happened, everything ran just fine.
But when something extraordinary happened, such as e.g. spamassassin’s spamd starting at 06:30 and requiring more memory than was available, then OOM started looking for a juicy fat process to kill, and the apache karaf process was a prime cadidate (perhaps because of “apache” in its name combined with OOM killer’s notorious hatred of feathers?).
And then systemd discovered that one of it’s services had died and immediately tried to restart it, only to have OOM killer shoot it down, and this continued for quite a while.
And in one of the attempted restarts, the webapp got far enough to set the databasechangeloglock before it was rudely shot down, and the next time(s) it was attempted started it got stuck waiting for a lock that would never be released.
The solution was to bump the memory to the next step, i.e. from 1GB to 2GB. Most of the time the VPS is running at the same load as before (i.e. slightly below 1GB) but now a process that suddenly requires a lot of memory no longer triggers the OOM killer and everything’s fine. Also the available memory is used for buff/cache and everything becomes much faster.
I bumped the memory 8 weeks ago and the problem hasn’t occurred again, so it looks like (so far) the problem has been solved.