Cassandra flush-to-disk delay and docker images

2015-12-12

At mnubo, we use docker every day. The other day we had an issue with an image generated by a database schema manager tool that we built internally. Here is how we found out about the problem, troubleshot it and eventually fixed it.

The schema manager tool

First, let me give you a brief explanation of what this database schema manager tool is. We use different kinds of databases depending on what we want to do with the data: cassandra, MySQL and ElasticSearch. To manage all of these databases' schemas, we built a tool.

The tool is an sbt plugin written in scala. So for every database schema we have a git repo. This repo is an sbt project using the plugin. The project has to follow a specific directory structure. It looks like this:

    ├── build.sbt
    ├── config
    │   └── logback.xml
    ├── db.conf
    ├── migrations
    │   ├── V0_0_0_1
    │   │   ├── downgrade.cql
    │   │   └── upgrade.cql
    │   ├── V0_3_4_29
    │   │   ├── downgrade.cql
    │   │   └── upgrade.cql
    │   └── V0_3_7_0
    │       ├── downgrade.cql
    │       └── upgrade.cql
    ├── src
    └── version.sbt

The migrations to be applied are located in the migrations folder. Each migration requires an upgrade and a downgrade file. A migration can be SQL/CQL (or whatever the database's query language is) or it can be a java/scala class. If it is the former, the upgrade/downgrade files contain the migration queries to be executed. If it is the latter, the files contain the fully qualified name of the class to be used to upgrade/downgrade. This class implements a specific interface that gives you access to the database connection to perform your operations.
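
To give you an idea, a class-based migration for cassandra could look something like this. This is only a sketch: the Migration trait and the class shown here are hypothetical, not the plugin's actual interface.

    import com.datastax.driver.core.Session

    // Hypothetical interface; the plugin's actual one may differ.
    trait Migration {
      def upgrade(session: Session): Unit
      def downgrade(session: Session): Unit
    }

    // Illustrative migration adding a column to a table.
    class AddGenreToAlbums extends Migration {
      override def upgrade(session: Session): Unit =
        session.execute("ALTER TABLE albums ADD genre text")

      override def downgrade(session: Session): Unit =
        session.execute("ALTER TABLE albums DROP genre")
    }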

When the migrations are applied, the database contains a versions table where all the applied migrations are listed along with an MD5 hash of the migration itself.
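
Recording an applied migration then boils down to something like this (again a sketch: the table and column names are illustrative, and md5Hex and upgradeCql are hypothetical helpers holding the MD5 digest function and the migration content):

    // Sketch: ensure the versions table exists, then record the migration.
    // Table/column names are illustrative; md5Hex and upgradeCql are hypothetical.
    session.execute(
      """CREATE TABLE IF NOT EXISTS versions (
        |  version text PRIMARY KEY,
        |  checksum text,
        |  applied_at timestamp
        |)""".stripMargin)

    session.execute(
      "INSERT INTO versions (version, checksum, applied_at) VALUES (?, ?, ?)",
      "V0_3_7_0", md5Hex(upgradeCql), new java.util.Date)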

The db.conf file contains the connection parameters of every environment where the database is located and the migration is to be performed. It also contains the name of the database schema.
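
Here is a hypothetical sketch of what a db.conf could contain; the actual keys expected by the plugin may differ:

    # Illustrative only; not the plugin's actual configuration format.
    schema = "music"

    environments {
      dev {
        hosts = ["cassandra-dev.example.com"]
        port  = 9042
      }
      prod {
        hosts = ["cassandra-prod-1.example.com", "cassandra-prod-2.example.com"]
        port  = 9042
      }
    }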

When your migration is ready, you can use the plugin to do multiple things:

  1. apply the pending migrations to one of the environments defined in db.conf
  2. roll a migration back using its downgrade file
  3. generate a test database at a given schema version

The last one is the most interesting one in my opinion. The tool can be used to generate a database with a schema at the version you want. The database is packaged as a docker image.

For example, say the database schema name of the repo shown above is music. From the migration files above, we can see the database uses CQL files, so we can deduce it is a cassandra database. You can generate a test database for version V0_3_7_0 and start it like this: docker run -d -p 9042:9042 test-music:V0_3_7_0. The started container will have a running cassandra instance.
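
From there, your tests can connect to it like to any other cassandra instance, for example with the DataStax java driver (assuming the schema name maps to a keyspace named music):

    import com.datastax.driver.core.Cluster

    // Connect to the test database container started above.
    val cluster = Cluster.builder()
      .addContactPoint("127.0.0.1")
      .build()
    val session = cluster.connect("music") // keyspace = schema name (assumed)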

The problem

This tool is used internally by Jenkins jobs to generate a test database for every migration available. The process starts with a clean database container. It applies the first migration to the container's database and, if successful, adds an entry to the versions table. The last step is to commit the database container and tag it with the version of the migration it just ran. The process then does the same thing for all of the remaining migrations, in order. In the end, if you have 5 different migrations, you end up with 5 different docker images containing different versions of the database schema.
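
In pseudo-scala, the process looks roughly like this. applyMigration and recordVersion are hypothetical stand-ins for what the plugin actually does, and the base image name is illustrative:

    import scala.sys.process._

    // The three migrations of the example repo, in order.
    val migrations = Seq("V0_0_0_1", "V0_3_4_29", "V0_3_7_0")

    // Start from a clean cassandra container (image name is illustrative).
    val containerId = Seq("docker", "run", "-d", "cassandra-base").!!.trim

    for (version <- migrations) {
      applyMigration(containerId, version) // hypothetical: runs upgrade.cql or the class
      recordVersion(containerId, version)  // hypothetical: inserts into the versions table
      // Snapshot the container's current state as a tagged image.
      Seq("docker", "commit", containerId, s"test-music:$version").!!
    }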

The problem was that the versions table did not contain the correct information: there were missing entries in the table. The problem was only happening with cassandra. So the last step of the process, writing the version to the versions table, was not producing consistent results.

Hunting down the problem and fixing it

We started debugging the plugin and realized that, while debugging, we could end up with a correct test database container. That meant that the delay introduced while debugging affected the resulting docker image. After some time researching the issue, we realized that cassandra acknowledges a write operation as soon as it is in memory; the data is only written to disk after a specific delay.

This is a problem because committing a docker container pauses it and produces a docker image from its filesystem. If the write is only recorded in memory, the resulting docker image's data will not contain it. The delay is configurable: it is the commitlog_sync_period_in_ms setting (see the cassandra.yaml config file reference).

To confirm this was the problem, we added a sleep(10) before committing the image. It worked, so this was our problem.

The solutions

Right away, we thought of two solutions:

  1. change the config to reduce the delay from the default 10 seconds to a few milliseconds
  2. keep the sleep between every commit

Changing the configuration was suboptimal because it would mean that the test database we use for our applications' tests has a different configuration than our production database. That leads to a whole class of other problems! Sleeping had no downside apart from slowing the whole process down a little bit. We went with this very inelegant solution first.

Shortly after, we discovered the nodetool utility. It has a flush operation that writes the memtables to SSTables on disk, on demand. We got rid of the sleep call and instead added a call to nodetool flush before committing the image. We now had a much more stable solution.
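
The commit step from the loop sketched earlier then becomes something like this:

    import scala.sys.process._

    // Force cassandra to write its memtables to SSTables on disk...
    Seq("docker", "exec", containerId, "nodetool", "flush").!!
    // ...so that the committed image contains all the applied writes.
    Seq("docker", "commit", containerId, s"test-music:$version").!!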