Launching an OP Stack L2 is well-documented. Running one in production — with real users, real assets, and uptime expectations — is an entirely different engineering problem. After deploying and operating an L2/L3 network across AWS and GCP with Terraform, we have a clear picture of what the OP Stack documentation doesn't tell you.
This post covers the infrastructure patterns, observability requirements, and operational realities of running rollup infrastructure that teams building on OP Stack actually need to know.
The OP Stack Component Map
Before you can operate an OP Stack network, you need a clear mental model of what's actually running. The core components are:
The sequencer is the single node that orders transactions and produces L2 blocks. It's your most critical component — if it goes down, transaction submission stops. Its state must be backed up continuously, and your recovery procedure for a sequencer failure needs to be tested, not assumed.
op-batcher takes the L2 blocks produced by the sequencer and submits them to L1 (Ethereum) as compressed transaction data. This is your L1 data availability submission pipeline. If it falls behind, your L2 state is not being committed to L1, which weakens the security guarantees that depend on L1 data availability until the backlog clears.
op-proposer periodically submits L2 state roots to the L1 dispute game contracts. This is what enables withdrawals from L2 to L1. A delayed proposer means delayed withdrawal finality for users.
RPC nodes are what your users and dApps actually connect to. These can be horizontally scaled and are stateless enough to recover from failures without ceremony. The sequencer and batcher cannot.
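The component relationships above are visible directly in op-node's sync status RPC: the unsafe head is what the sequencer has produced, the safe head is what the batcher's data on L1 covers, and the finalized head is what L1 has finalized. A minimal sketch (the endpoint URL is an assumption; substitute your own op-node address):

```python
import json
import urllib.request

# Assumed local op-node RPC endpoint; substitute your own.
OP_NODE_RPC = "http://localhost:9545"

def sync_status(url: str = OP_NODE_RPC) -> dict:
    """Call op-node's optimism_syncStatus and return the decoded result."""
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": "optimism_syncStatus", "params": [],
    }).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["result"]

def head_summary(status: dict) -> dict:
    """Map the sync status onto the component roles described above."""
    return {
        "unsafe_head": status["unsafe_l2"]["number"],        # sequencer's latest block
        "safe_head": status["safe_l2"]["number"],            # covered by batcher data on L1
        "finalized_head": status["finalized_l2"]["number"],  # finalized on L1
    }
```

A growing gap between unsafe and safe heads points at the batcher; a stalled unsafe head points at the sequencer itself.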
IaC from Day One: Why Terraform Saved Us
Standing up OP Stack components manually is fast. Reproducing that environment for staging, disaster recovery, or a second region is where manual setups collapse. We Terraformed everything from day one — VPCs, compute instances, load balancers, DNS, secrets management, and the deployment configurations for each OP Stack component.
The multi-cloud requirement (AWS primary, GCP for certain components and DR) made this even more critical. A consistent Terraform module structure meant that spinning up a GCP replica of an AWS environment was a variable change, not a re-architecture. The discipline paid off the first time we needed to run a failover drill.
One specific recommendation: version-lock your OP Stack component Docker images in Terraform variables. It's tempting to track the latest tag, but an unplanned update to the sequencer image in production is not a debugging situation you want to be in.
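This is easy to enforce in CI. A sketch of a guard that fails the pipeline if any image variable in a tfvars-style file floats on a mutable tag (the variable naming convention and the set of tags treated as mutable are assumptions):

```python
import re

# Tags we treat as mutable; an assumed list, extend for your registries.
MUTABLE_TAGS = {"latest", "develop", "nightly"}

def unpinned_images(tfvars_text: str) -> list:
    """Return image references that use a mutable tag or no tag at all.

    Matches lines like: sequencer_image = "registry.example/op-node:v1.7.7"
    (the *_image variable naming convention is an assumption).
    """
    bad = []
    for match in re.finditer(r'^\s*\w*image\w*\s*=\s*"([^"]+)"', tfvars_text, re.M):
        ref = match.group(1)
        if "@sha256:" in ref:
            continue  # pinned by digest: fine
        last_segment = ref.rsplit("/", 1)[-1]
        tag = ref.rsplit(":", 1)[1] if ":" in last_segment else ""
        if not tag or tag in MUTABLE_TAGS:
            bad.append(ref)
    return bad
```

Run it against every tfvars file in the plan step and fail the build on a non-empty result.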
Observability: What Actually Matters for an L2
Standard web application metrics (p99 latency, error rate, CPU) are necessary but not sufficient for blockchain infrastructure. The metrics that actually tell you if your L2 is healthy are different.
Sequencer block production rate. The sequencer should be producing blocks at a consistent rate. A gap or slowdown is the earliest signal of a problem — often before any user-facing error appears.
Batcher L1 submission lag. How far behind is op-batcher from the sequencer's latest block? If this number grows over time, your L1 data availability is degrading. Set an alert well before it becomes a user problem.
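A sketch of that alert logic, using the unsafe-head/safe-head delta as the lag signal. Both thresholds are illustrative assumptions; tune them to your batch submission interval and L2 block time:

```python
from collections import deque

class BatcherLagMonitor:
    """Alert when batcher lag exceeds a threshold or grows monotonically."""

    def __init__(self, max_lag: int = 100, window: int = 5):
        # max_lag and window are assumed values, not OP Stack defaults.
        self.max_lag = max_lag
        self.samples = deque(maxlen=window)

    def observe(self, unsafe_head: int, safe_head: int) -> bool:
        """Record one sample; return True if an alert should fire."""
        lag = max(0, unsafe_head - safe_head)
        self.samples.append(lag)
        if lag > self.max_lag:
            return True
        # Sustained growth across the whole window is a problem even
        # before the absolute threshold is reached.
        full = len(self.samples) == self.samples.maxlen
        values = list(self.samples)
        growing = all(b > a for a, b in zip(values, values[1:]))
        return full and growing and lag > 0
```

Alerting on the trend, not just the absolute number, is what catches a slowly degrading batcher early.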
L1 gas price vs. batcher wallet balance. The batcher pays L1 gas fees to submit data. An underfunded batcher wallet will stop submitting, silently. We set aggressive alerts on both the wallet balance and on submission gaps.
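One way to make that alert concrete is to express the balance as hours of runway at current gas prices. A sketch, assuming stock JSON-RPC for the balance read; the gas-per-batch and batches-per-hour figures are illustrative assumptions, so measure your own batcher's consumption:

```python
import json
import urllib.request

def l1_balance_wei(address: str, l1_rpc: str) -> int:
    """eth_getBalance against an L1 RPC endpoint (standard JSON-RPC)."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                          "method": "eth_getBalance",
                          "params": [address, "latest"]}).encode()
    req = urllib.request.Request(l1_rpc, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return int(json.load(resp)["result"], 16)

def batcher_runway_hours(balance_wei: int, gas_price_wei: int,
                         gas_per_batch: int = 300_000,
                         batches_per_hour: int = 6) -> float:
    """Rough hours of submission runway left in the batcher wallet.

    gas_per_batch and batches_per_hour are assumed values for the sketch.
    """
    burn_per_hour = gas_price_wei * gas_per_batch * batches_per_hour
    if burn_per_hour == 0:
        return float("inf")
    return balance_wei / burn_per_hour
```

Alerting on "under N hours of runway" keeps the threshold meaningful when L1 gas prices spike.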
RPC node sync status. Nodes that fall behind the sequencer serve stale state to users. Monitor the block height delta between each RPC node and the sequencer.
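The check itself is a one-liner once you have the heights. A sketch, with the tolerance as an assumed value (5 blocks is roughly 10 seconds at a 2-second block time):

```python
def stale_nodes(sequencer_height: int, node_heights: dict,
                max_delta: int = 5) -> list:
    """Names of RPC nodes more than max_delta blocks behind the sequencer.

    node_heights maps node name -> its latest block number (from
    eth_blockNumber); max_delta of 5 is an assumed tolerance.
    """
    return [name for name, height in node_heights.items()
            if sequencer_height - height > max_delta]
```

Stale nodes should be pulled from the load balancer automatically rather than left serving old state.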
Our Prometheus + Grafana + Loki stack with pre-built dashboards for each of these metrics reduced MTTR significantly — the difference between "something is wrong, go investigate" and "the batcher is 40 blocks behind, here's why" is measured in hours of engineer time.
Disaster Recovery: Test It or It Doesn't Exist
The sequencer is a single point of failure by design in the current OP Stack architecture. This means your DR plan for a sequencer failure is the most important runbook you'll write.
Our procedure: continuous snapshots of sequencer state to object storage (S3/GCS), with a tested restore procedure that brought a replacement sequencer online from the latest snapshot within a defined window. "Tested" means we actually ran the restore, not that we believed it would work.
Secrets management for node keys (sequencer private key, batcher wallet, proposer wallet) used cloud-native secrets managers (AWS Secrets Manager and GCP Secret Manager) rather than environment files. These keys are operationally critical and losing them — or exposing them — has irreversible consequences. Treat them accordingly.
CI/CD for Blockchain Infrastructure
GitHub Actions pipelines ran on every infrastructure change: Terraform plan validation, security scanning with Checkov, and a staged apply flow that required manual approval before touching production. The temptation to skip the approval gate for "small changes" is real — and has caused more incidents than the approval gate ever prevented.
Deployment to staging always preceded production, with a smoke test suite that verified block production, RPC responsiveness, and batcher/proposer health before the production deploy was unblocked. It added time to the deployment cycle. It also caught three configuration issues that would have caused production incidents if deployed directly.
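The gate logic behind that smoke suite can be very small: take two samples of chain state a few seconds apart and compare. A sketch, where the sample field names and the sampling mechanism are assumptions for illustration:

```python
def smoke_checks(first: dict, second: dict, min_new_blocks: int = 1) -> dict:
    """Compare two samples of chain state taken a few seconds apart.

    Each sample is assumed to look like:
    {"unsafe": <sequencer head>, "safe": <L1-committed head>, "rpc_ok": bool}
    """
    return {
        "block_production": second["unsafe"] - first["unsafe"] >= min_new_blocks,
        "batcher_progress": second["safe"] >= first["safe"],
        "rpc_responsive": first["rpc_ok"] and second["rpc_ok"],
    }

def gate_deploy(results: dict) -> bool:
    """Production deploy is unblocked only if every check passes."""
    return all(results.values())
```

Each named check maps to a specific failure message in the pipeline output, so a blocked deploy tells you which component to look at first.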
