External connections unreliable
Between ~00:08 and ~08:41 UTC, pretix had issues reaching external systems, including payment providers and other APIs. This was mostly visible as failed payments, but may also have caused delayed webhooks, missing data transfers to external systems, and similar problems.
Root cause analysis:
- To prevent SSRF-style attacks, we run a local proxy server for all outgoing HTTP requests, such as webhooks or API calls (a minimal sketch of this setup follows after this list). This proxy server runs in a Docker container.
- To stay safe in case of software vulnerabilities, we have implemented a mechanism that automatically installs new versions of much of the software we run, including Docker itself. To minimize the risk of a new version breaking the system for you, we do not run the updates on all servers at the same time, but roll them out gradually across our infrastructure (see the rollout sketch below).
- Tonight, around 00:08 UTC, the first server started installing Docker Engine update 29.1.0. This release contained a faulty change that broke Docker's internal DNS resolver. By around 04:00 UTC, the update had been rolled out to around 70% of our infrastructure. On these hosts, our application was no longer able to make any external HTTP requests.
- Since our monitoring covers the health of our service itself as well as the ability to purchase tickets with a fake payment provider, none of our monitoring checks caught the problem. Only some of the affected code paths triggered exception reports to our development team, and those reports were not noticed early enough.
- After a customer made our support team aware of a significant problem with their payments, the issue was escalated internally around 08:12 UTC. Around 08:25 UTC, an engineer was able to start investigating. By 08:41 UTC, we had implemented a temporary workaround by redirecting DNS requests to an external DNS resolver (see the DNS sketch after this list).
- Around 13:08 UTC, Docker released Docker Engine 29.1.1 containing a rollback of the faulty change, giving us the final puzzle piece to understand what was going on.
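For illustration, this is roughly what routing an outgoing request through such a proxy looks like from the application side. The proxy hostname, port, and function below are placeholders for this sketch, not our actual configuration:

```python
import requests

# Hypothetical address of the local egress proxy container; the real
# hostname and port in our setup may differ.
EGRESS_PROXY = "http://egress-proxy:3128"

def deliver_webhook(url: str, payload: dict) -> requests.Response:
    """Send an outgoing HTTP request through the local proxy.

    Routing every outgoing request through one proxy lets us filter
    destinations centrally (e.g. block internal IP ranges) to prevent
    SSRF-style attacks.
    """
    return requests.post(
        url,
        json=payload,
        proxies={"http": EGRESS_PROXY, "https": EGRESS_PROXY},
        timeout=10,
    )
```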
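The following is a minimal sketch of the staggered-rollout idea, deriving a stable per-host delay from the hostname. Our real update mechanism may be implemented differently, and the window size is an example:

```python
import hashlib
import socket

# Minimal sketch of a staggered rollout: derive a deterministic per-host
# delay from the hostname, so each server picks up a new package version
# on a different day. Names and window size are illustrative only.
ROLLOUT_WINDOW_DAYS = 4

def rollout_delay_days(hostname=None):
    """Return a stable delay (in days) for this host."""
    hostname = hostname or socket.gethostname()
    digest = hashlib.sha256(hostname.encode()).hexdigest()
    return int(digest, 16) % ROLLOUT_WINDOW_DAYS

if __name__ == "__main__":
    print(f"This host installs new updates {rollout_delay_days()} day(s) after release.")
```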
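One way such a DNS workaround can be applied is through the dns option in Docker's daemon configuration, pointing containers at external resolvers directly. The resolver addresses below are examples, and the exact mechanism we used in production may differ:

```python
import json
from pathlib import Path

# Sketch of one possible emergency workaround: set the daemon-level "dns"
# option so containers use external resolvers. The resolver IPs here are
# examples, not necessarily what we used.
DAEMON_JSON = Path("/etc/docker/daemon.json")
EXTERNAL_RESOLVERS = ["9.9.9.9", "1.1.1.1"]

def add_external_dns() -> None:
    config = json.loads(DAEMON_JSON.read_text()) if DAEMON_JSON.exists() else {}
    config["dns"] = EXTERNAL_RESOLVERS
    DAEMON_JSON.write_text(json.dumps(config, indent=2) + "\n")
    # The Docker daemon has to be restarted (e.g. `systemctl restart docker`)
    # for the new DNS configuration to take effect.

if __name__ == "__main__":
    add_external_dns()
```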
Going forward, we will:
- Add additional monitoring checks to make sure problems with the outgoing HTTP proxy can no longer go unnoticed (a sketch of such a check follows below).
- Review our update practices and attempt to find a way to run less bleeding-edge versions of Docker while still ensuring timely automated security patches.
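A check of this kind could look roughly like the sketch below: a synthetic request through the egress proxy that fails loudly when external connectivity breaks. The endpoint, proxy address, and exit-code convention are assumptions for this sketch, not our production monitoring:

```python
import sys
import requests

# Illustrative synthetic check: verify that an outgoing HTTP request through
# the local egress proxy still succeeds. Proxy address and probe URL are
# placeholders.
EGRESS_PROXY = "http://egress-proxy:3128"
PROBE_URL = "https://example.org/"  # any stable external endpoint

def check_outgoing_http() -> bool:
    try:
        response = requests.get(
            PROBE_URL,
            proxies={"http": EGRESS_PROXY, "https": EGRESS_PROXY},
            timeout=5,
        )
        return response.status_code < 500
    except requests.RequestException:
        return False

if __name__ == "__main__":
    # A non-zero exit code signals failure to the monitoring system running the check.
    sys.exit(0 if check_outgoing_http() else 1)
```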
We have analyzed the root cause of this issue and added it to the incident description.
Our fix seems to work and all systems are running. We will need some more time to research the root cause of this.
A preliminary solution has been rolled out; we are watching the system closely.
We are diagnosing the issue and looking for solutions.