Arbitrum Foundation: Post-mortem Analysis Report on Arbitrum Sequencer Outage

The Arbitrum Foundation has released a post-mortem on the recent sequencer outage, covering the root cause, the timeline of events, the fix, and follow-up action items. The foundation also stated that the lesson learned from the failure is to make sure that temporary workarounds added as fixes for issues are cleaned up afterward and that their configuration options are removed.

Root Cause: A bug was introduced in batch poster version v2.1.0-beta.1. When the batch poster updated its L1 state, it recorded the new L1 block number but fetched the nonce as of the previous L1 block number, so the two values fell out of sync. To make matters worse, on the first update of the L1 state the previous L1 block number was "nil", which caused the nonce to be queried at the latest block instead.
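The mismatch described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the actual batch poster code: `FakeL1`, `PosterState`, `update_buggy`, and `update_in_sync` are hypothetical names invented here, and the in-sync variant simply shows the nonce and block number being read for the same block rather than reproducing the real fix.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class FakeL1:
    """Hypothetical stand-in for an L1 node: maps block number -> account nonce."""

    def __init__(self, nonce_by_block: Dict[int, int]) -> None:
        self.nonce_by_block = nonce_by_block

    def nonce_at(self, block_number: Optional[int]) -> int:
        # Assumption: a missing (nil) block number is treated as "latest block".
        if block_number is None:
            block_number = max(self.nonce_by_block)
        return self.nonce_by_block[block_number]


@dataclass
class PosterState:
    """Hypothetical cached L1 state: a block number and the nonce believed valid at it."""

    block_number: Optional[int] = None
    nonce: int = 0


def update_buggy(state: PosterState, l1: FakeL1, new_block: int) -> None:
    # Pattern from the report: the nonce is read at the PREVIOUS block number
    # (nil on the first update), while the NEW block number is stored.
    previous_block = state.block_number
    state.nonce = l1.nonce_at(previous_block)
    state.block_number = new_block


def update_in_sync(state: PosterState, l1: FakeL1, new_block: int) -> None:
    # Illustrative alternative: read the nonce at the same block that is stored.
    state.nonce = l1.nonce_at(new_block)
    state.block_number = new_block


l1 = FakeL1({100: 5, 101: 6, 102: 8})

buggy = PosterState()
update_buggy(buggy, l1, 100)
first_buggy = (buggy.block_number, buggy.nonce)   # nonce comes from the latest block
update_buggy(buggy, l1, 101)
second_buggy = (buggy.block_number, buggy.nonce)  # nonce comes from block 100

synced = PosterState()
update_in_sync(synced, l1, 100)
first_synced = (synced.block_number, synced.nonce)
update_in_sync(synced, l1, 101)
second_synced = (synced.block_number, synced.nonce)

print(first_buggy, second_buggy, first_synced, second_synced)
```

In this toy run the buggy path stores block 100 together with the nonce of the latest block on the first update, and then block 101 together with the nonce of block 100 on the second, so the stored pair never describes a single block.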

Solution: The problematic state was removed from the batch poster's Redis storage, and the batch poster was restarted on an older version that does not have the issue. Action items include creating a public Arbitrum status page to reduce confusion when the service runs into problems, re-evaluating sequencer client and server timeouts to improve reliability under a transaction backlog, and building a new test release, "v2.1.0-beta.2", which fixes the root cause and adds various batch poster hardening measures to prevent similar issues in the future. That release is currently undergoing comprehensive testing on Arbitrum Goerli.


