Troubleshooting a mysterious Mastodon bug: the Accept-Encoding header and federation

28 May, 2018

The story

As you may all know, I am the administrator of a Mastodon instance, https://sn.angry.im. One thing that is really fun doing this job (and every SysAdmin job) is that you run into different problems from time to time, sometimes without doing anything or sometimes after some upgrade.

Last week, Mastodon v2.4.0 was out and I, along with my friend, admin at https://cap.moe, decided to upgrade to the new release as quickly as possible. Since there was nothing breaking in the new version, it didn't take long before we both finished executing a few Docker commands and restart into the new version. As usual, we tried to post something to ensure that everything works fine after any upgrade, and this is where things started to break.

We first noticed that I cannot see anyone on cap.moe on my home timeline, while he could see everyone from my instance on his timeline. We thought this was a problem of subscription, so we both did a resubscription task in the administrator panel of our Mastodon instances. However, it was not fixed in any way by this. We then tried to mention each other in a toot to find out if it was because a timeline logic error, but it was not. Still, he could see me but I can't see anyone on his instance.

One thing interesting is that, since some other instances, for example, pawoo.net, can see both of our instances' posts, I can simply retoot one of his toots on pawoo and I will receive the toot on my instance in several seconds. I didn't know what this meant, but it was really something 面白い.

Since other mysterious bugs have happened before and just magically fixed themselves after a while, I decided that it was a good idea to leave it alone and see if things go back to normal. Now it is a week after the initial upgrade, and nothing has changed throughout the entire week, and I can't bear a Mastodon timeline without the jokes from fakeDonaldTrump account of cap.moe to fill my spare time anymore. I finally decided to troubleshoot this "bug".

Attempts

My first idea was that it could be caused by some errors in the task queue or something in the database, both of which could be reset by applying an instance block and removing it after everything is cleared from my instance, at least this was what I believed. This, obviously, was not the case. After removing the instance block, everything was still like what they were before. Mastodon provides no support for really removing users anyway, at least in the database. As what the admin of cap.moe said:

This is completely suicide attack.

If you are an administrator, do NEVER attempt anything that works like a suicide attack, because it solves nothing but adds complexity.

The only option left here is to dump all the traffic and see what's going wrong with the requests. As I had already known, the ActivityPub protocol, which Mastodon relies on, uses active pushes rather than passive pulls to distribute messages. Thus, it could be something on my side that prevented the push to succeed. I decied to capture all the traffic by tcpdump and inspect it using Wireshark.

Since all the traffic of my Mastodon instance is HTTPS-encrypted behind a reverse proxy, I could only dump all the traffic between Nginx and the upstream, then feed all of them into Wireshark to filter by HTTP headers. This was a pain, but I eventually did it and figured out something from the traffic: My instance was replying with 401 Unauthorized to the pushes from cap.moe.

A little inspection into the source code indicated that such error is linked to signature verification. Each ActivityPub needs to be signed by an Actor's private key, which can be verified using the public key. I assumed that this could only be caused by database errors -- my database must have stored a different public key from the original one, either by an error in database upgrade or some random cosmos radiation. I checked the public key by

account = Account.find(id_on_cap_moe)
account.public_key

in the Ruby console of Mastodon. I also asked the admin of cap.moe to run the same command with the id on his own instance, and then we compared the output public key. Unfortunately, they are exactly the same -- This can't be the problem either.

The solution

With all the attempts above failed, I decided that I should compare the request of a successful delivery with the failed one. I tried to toot something on pawoo and then toot something on cap.moe, while I kept tcpdump running. After this, I fed them to Wireshark as usual and followed the individual HTTP streams. The Siganture header drew my attention.

This is the header in the failed request

Signature: keyId="https://cap.moe/users/PeterCxy#main-key",algorithm="rsa-sha256",headers="(request-target) user-agent host date accept-encoding digest content-type",signature="ZC4c0wxPRn+RVYTeAaPjEgA3PDW/jHQ3CdUSn3u+mH2HUxsiQV3TV0dObzC4Z9VGOmY0ZE0cbQ9KiketDxPAq99InDnDjJ49aUT6/L0gSXJQlpM4SGGT8VyipkFm/dzoxbJ8jiT9WjcrXwD1/sJV4IvuA0LJs96mRkuexykguSu2PefvS7PTw5ufAxGTWn3YmtvkMeYLBi5V7LUz3xcONe2iqcSO6hKZ77puTvvWJZgfeNxMyoRXyrcrKUSUZhgfR8z7rwPgxvcoigfiL/SH0xrKyBIdO6HjjjuMsTOSa4xRsrGgopowpAx19ya83YiTRdvkO720u3Dy3ZsWifoRCw=="

And in the successful request from pawoo

Signature: keyId="https://pawoo.net/users/PeterCxy#main-key",algorithm="rsa-sha256",headers="(request-target) user-agent host date digest content-type",signature="Esf8TAlrYId7XhP7AKlRdGTz+tWXT+/ehYCrCLKCgx3UWPxnzNBssawr7oG5xPuB1QU/TLw6M09Rp9pd+0+F20GaEVUE2UTLNwKDizDbEj2XmK7RjEE4ys3Md1b8E+d4YbTVnUWqi0WnufUNTrjLCdyPCPHn3fqJ5Bv9/W4aUDF+nFbJAZr2n1cmu6Nb28nhS1PQAz7AzzsZy/Du+R6S3x91OjRMIa7Xt1EgLWH6/TEchUsxiP78QKZIbzIlEca+BhWCQiQ2qjO+VtwNDDypqh9HheNn23iuy4xm6hKwjHiVVkfekbEK47fNRXH5fakhmHmN7Zl813lrotkIGbDrdA=="

Notice that the headers in the failed signature indicated that the accept-encoding header is also signed, while it was absent in the successful request.

Now I knew what was wrong with the Mastodon stuff: I erased the Accept-Encoding header in my Nginx reverse proxy configuration! This was due to the use of sub_filter, since I needed to insert something into the HTML of Mastodon while I was too lazy to modify the source code and re-build the Docker image myself.

The solution seems easy now. Originally, my Nginx configuration included

proxy_set_header Accept-Encoding "";

Since I do still want to use sub_filter for HTML pages, I changed it to

set $my_encoding $http_accept_encoding;
if ($http_content_type != "application/activity+json") {
  set $my_encoding "";
}
proxy_set_header Accept-Encoding $my_encoding;

This erases the Accept-Encoding header except when the content type is application/activity+json, which is used to communicate between Mastodon nodes.

Save and reload the Nginx configuration, everything works fine now.

The cause and more questions

After asking the maintainer of Mastodon, @Gargron@mastodon.social, I figured out where was this problem introduced:

https://github.com/tootsuite/mastodon/pull/7425/commits/4de98db0312de2a45d8f08d6f6611ebc64eed8b1

This pull request added direct support of gzip compression in Mastodon, thus bringing the Accept-Encoding header into the signature. My erasure of this header, obviously, broke the signature check and made all of these happen.

However, these questions are still not answered after all of these:

  1. Why am I only losing federation with some 2.4.0 instances but not all? The pull request seemed to be enabled by default and there should be no way to disable it.
  2. What's the point of including this header in the signature?

I couldn't find the answer on my own, and I decided not to because nothing is wrong now.

And that's it, the process of troubleshooting a mysterious bug.