Emerald Outage Postmortem

August of 2022.


By EmeraldPay, Inc. Management

August 25th, 2022


Emerald Wallet incident postmortem.

On Thursday, August 18th, 2022, we received an alert from our monitoring service that the Emerald servers were not responding.

It turned out that Google Cloud, where we host all our services, had suspended our account for an unknown reason and asked for our company's identification documents. We were disappointed that we did not get any prior notification, which would have allowed us to provide everything without disrupting the service. We immediately provided all the requested information and hoped they would quickly restore the service.

Unfortunately, by Friday, August 19th, the service had still not been restored, and Google was still verifying the documents.

All our internal infrastructure is built on Google Cloud, including the websites, development tools, and user support. The suspension therefore affected all our users and internal processes.

Please note that our desktop application, Emerald Wallet, manages our users' private keys, and that neither we nor Google can access them. We use Google Cloud to run the services that provide access to the blockchain: broadcasting transactions, reading balances and transaction statuses, fetching exchange rates, etc. Unfortunately, without those services, the wallet cannot function.

By the weekend, it became clear that we could not resolve the issue as quickly as we wanted, and our users could still not receive or transfer their cryptocurrencies. At that time, we were looking for a way to recover at least basic functionality. The backend part is complex software with many dependencies on services provided by Google Cloud and, therefore, cannot be effortlessly migrated to other providers.

Only by the evening of Monday, August 22nd, were we able to set up basic functionality of the Emerald backend on Amazon AWS, which was still not enough to provide everything needed for the desktop application.

On the morning of Tuesday, August 23rd, after five days of outage, Google verified the provided documents and restored our account.

That was a very stressful five days for our customers and for the team.

We learned that we cannot rely on Google Cloud to provide the service.

From the beginning, we understood that the infrastructure for Emerald should be decentralized, and we were planning to start using at least two cloud providers at some point. At the time, we didn't expect that an outage of one of our providers would last more than an hour or two, so we decided that for the initial development, we could use just one provider and redesign our backend when we had more development resources. We were building the backend with the intention that it would eventually be decentralized, but it wasn't there yet, and we were not ready for this problem.

Now we realize we cannot keep building on top of just one cloud provider, so we are raising the priority of the tasks related to having a multi-cloud backend. The end goal is to have two or three independent providers, with the servers spread over different data centers on different continents.
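As a rough illustration of the multi-provider goal, the following minimal sketch shows one way a client could try several independent backends in order and use the first one that responds. It is not the actual Emerald implementation: the function name `call_with_failover`, the `AllProvidersFailed` exception, the simulated request, and the `example.invalid` URLs are all hypothetical and exist only for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when none of the configured providers could serve the request."""


def call_with_failover(providers: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each provider in order and return the first successful result.

    Assumption for this sketch: any exception means "provider unavailable".
    A real client would likely distinguish error types and add timeouts.
    """
    errors: List[Tuple[str, Exception]] = []
    for base_url in providers:
        try:
            return request(base_url)
        except Exception as exc:
            errors.append((base_url, exc))
    raise AllProvidersFailed(errors)


def simulated_balance_request(base_url: str) -> str:
    """Stand-in for a real network call; provider A plays the suspended account."""
    if "provider-a" in base_url:
        raise ConnectionError(f"{base_url} is not responding")
    return f"balance served by {base_url}"


# Hypothetical endpoints, one per independent cloud provider.
PROVIDERS = [
    "https://provider-a.example.invalid",
    "https://provider-b.example.invalid",
    "https://provider-c.example.invalid",
]

result = call_with_failover(PROVIDERS, simulated_balance_request)
print(result)
```

With this ordering, the unavailable first provider is skipped and the request is served by the second one. Sequential failover is only one possible design; it keeps the client simple but assumes every provider exposes the same API.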

We have also decided to immediately start winding down our dependencies on Google Cloud and moving our infrastructure to other providers. In the following months, we will move the most critical parts of the backend to Amazon AWS and Cloudflare.

We learned the lesson early enough to build an infrastructure that will be decentralized and uninterruptible.

Thank you for reading this incident postmortem report.

The EmeraldPay, Inc. Management.