FiveM outage postmortem: 2018-10-13


#1

There was an incident today where FiveM services suffered a brief major outage as a result of a cascade of failures triggered by a routine service upgrade.

Timeline

  • 5:30 PM CEST: An upgrade of the Discourse forum software was initiated.
  • 5:33 PM: It was noticed that the upgrade hung while pulling the Docker base image, potentially due to network issues. The upgrade was canceled, and retried.
  • At this time, the forums and policy services had gone offline. This downtime was expected to be brief enough to go unnoticed by users.
  • 5:40 PM: Docker had corrupted parts of the new base image on vedic. Internet search results indicated that a reset of Docker was the only solution.
  • 5:4x PM: Stopping the Docker service hung indefinitely, prompting a host reboot.
  • 5:48 PM: The host hadn’t come up, and accessing iLO initially failed due to a misconfiguration. PXE boot was reconfigured to boot from a rescue filesystem, and the host was power-cycled again.
  • 5:56 PM: The misconfiguration was noticed, and prefixing https:// to the incorrect iLO link resolved an invalid redirect to a LAN IP.
  • 5:57 PM: Connecting to the iLO console instantly made the machine resume booting. For safety purposes, a backup was initiated to a remote host.
  • 5:58 PM: A tweet was posted indicating that this issue is being worked on.
  • 6:25 PM: The backup procedure completed, and vedic could be rebooted to continue rebuilding the Docker data store.
  • 6:35 PM: Since Postgres wasn’t shut down cleanly, the Discourse start scripts had to be modified to allow Postgres time to recover.
  • Around the same time, we re-added the new Docker host data to the Rancher cluster.
  • 6:51 PM: Users started reporting downtime of CnL heartbeats. Investigation showed that oceanic2 had its database service suspended, leaving the second shard of the heartbeat table with only a single replica. This meant the table could not accept writes destined for that shard, so players encountered errors when joining servers.
  • At this point, war mode engaged, and timestamps weren’t kept.
  • Reconfiguring Docker left the DNS settings incorrect, which prevented Rancher from bringing up the new host in time. We therefore attempted to reconfigure the heartbeat table instead.
  • The heartbeat table was flushed and recreated after attempts to set a 2/5 configuration (2 shards, 5 replicas; allowing for 3 failures per shard) led to IO overload of all servers in the cluster.
  • This also mandated a rolling recycle of all database nodes, leading to multiple intermittent outages of CnL.
  • 7:26 PM: All services resumed normal operation, and monitoring indicated heartbeats were being kept in the transient dataset.
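The 6:35 PM step, where the Discourse start scripts had to be modified to give Postgres time to recover, boils down to a poll-until-ready loop before the dependent service starts. A minimal sketch of such a wait helper, with a hypothetical `probe` callable standing in for whatever readiness check the real script uses (e.g. a `pg_isready` invocation):

```python
import time

def wait_until_ready(probe, timeout_s=120.0, interval_s=2.0):
    """Poll `probe()` until it returns True or `timeout_s` elapses.

    Returns True if the service came up in time, False on timeout.
    `probe` is any zero-argument callable returning a truthy value
    once the service (here: Postgres finishing crash recovery) is up.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

In a start script this would gate the launch of the application container: keep probing until recovery completes, and only fail hard after the timeout instead of immediately.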

Lessons learned

  • The external backup system of the forums should be reconfigured so that it automatically saves backups of all data, not just the Postgres database.
  • A 1/5 configuration (1 shard, 5 replicas) should be used instead of a 2/3 configuration, since two servers failing can leave a whole shard unreachable.
  • Monitoring should not run on the same node as other services: we only found out about the CnL outage a few minutes late, because the Docker host running the monitoring service was still being rebuilt.
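The shard/replica arithmetic above can be sketched in a few lines. This assumes, consistent with the incident description (one surviving replica meant writes failed; a 5-replica shard was said to tolerate 3 failures), that a shard stays writable as long as at least two replicas survive; the exact write rule of the actual database is not stated in the postmortem.

```python
def shard_writable(replicas_per_shard, failures, min_writers=2):
    """A shard keeps accepting writes while at least `min_writers`
    replicas survive (assumption drawn from the incident: a single
    surviving replica was not enough for writes)."""
    return replicas_per_shard - failures >= min_writers

def max_tolerable_failures(replicas_per_shard, min_writers=2):
    """Replica failures a single shard can absorb before writes stop."""
    return replicas_per_shard - min_writers

# Old 2/3 config: 3 replicas per shard tolerates only 1 failure.
# 5 replicas per shard tolerates 3 failures, matching the timeline.
```

This is why losing two servers under the 2/3 configuration can take a whole shard offline, while a 5-replica shard survives the same event with room to spare.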

#2

I love reading postmortems, I find them interesting :smiley: Nice job fixing it as quickly as possible nonetheless


#4

They should be back already.

Hard to tell without any information?

Weird - there should be no caching for e.g. the clothing streaming benefit? Are you sure it’s ‘nothing’?


#6

NVM Cancel <3 Just popped up right now


#8

Just popped up. Had to restart the client a few times and it popped.


#9

Love the transparency. Thanks for the clear and interesting info.


#10

Wow.
Very professionally done.

Thank you for your hard work.