Rejourney Hits 1.3 Million Session Replays in 3 Months
Lessons in edge security, durable pipelines, and scaling during a 10x traffic spike.
Lessons in edge security, durable pipelines, and scaling during a 10x traffic spike.
Quick growth for any startup is hard. However, the difficulty is far greater when your startup handles hundreds to thousands of small files ingested every minute. Every startup dreams of that "hockey stick" moment, but as we recently learned at Rejourney, the infrastructure that supports 10,000 sessions a day doesn't always handle 100,000 with the same grace.
Last month, we officially onboarded new customers with expansive user bases across several high-traffic mobile apps. It was a milestone for our team, but it also quickly became an "all-hands-on-deck" engineering challenge.
The trouble started almost immediately after they loaded Rejourney live on their apps. Within minutes, our ingestion metrics spiked by an order of magnitude.
At the edge, Cloudflare’s automated security systems saw this sudden, massive influx of traffic to our API and did exactly what they were programmed to do: they flagged it as a Distributed Denial of Service (DDoS) attack. Legitimate session data from thousands of users was being dropped before it even reached our infrastructure.
Our immediate fix was to implement a bypass filter for our specific API endpoint. We wanted to ensure no data was lost and that the onboarding experience was seamless. We flipped the switch, the "Attack Mode" subsided, and the floodgates opened.
Opening the floodgates is only a good idea if your reservoir can handle the volume. By bypassing the edge protection, we redirected the full, unthrottled weight of the traffic directly to our origin server.
At the time, our backend was running on a single-node K3s cluster. While we’ve optimized our ingestion pipeline to be lean, no single node is immune to a "thundering herd." As thousands of concurrent connections hit our API, our Ingest Pods were pinned at Max CPU, and the server eventually became unresponsive.
We realized that scaling "up" (getting a bigger VPS) was no longer enough. We needed to scale "out."
The biggest bottleneck in our old setup was the "monolithic" nature of ingestion. If a pod restarted, in-memory tasks were lost. We’ve now decomposed the pipeline into five specialized, durable stages:
ingest-upload layer. These pods act as a relay to Hetzner S3, ensuring that a flood of incoming bytes doesn't starve our core API of resources.ingest-workers handle lightweight metadata like events and crashes, while replay-workers tackle the heavy lifting of screenshots and hierarchies.session-lifecycle-worker performs periodic sweeps to recover stuck states or abandon expired artifacts.By using Postgres as the source of truth for state and S3 for storage, our system is now remarkably resilient. Even if Redis or individual pods face transient issues, the state survives and processing resumes exactly where it left off.
We’ve moved away from the single-node bottleneck to a High Availability configuration. We now run HA Postgres and Redis with automated failover. If a VPS goes down, the databases automatically fall back to a replica. The platform keeps moving, and the data stays safe.
As we scaled, we hit a literal physical limit: providers like Hetzner often impose a 50-million-object limit per bucket. To bypass this, we implemented a dynamic multi-bucket topology.
Instead of hard-coding storage locations in environment variables, we moved the source of truth to a storage_endpoints table in Postgres. This allows us to manage storage with extreme granularity:
endpoint_id on every artifact. This "pins" future reads to the correct bucket, even as global defaults change.Despite the intensity of the traffic spike, we managed to implement these changes with less than five minutes of total downtime.
This incident reinforced why we focus so much on performance. Our lightweight SDK ensures we aren't taxing the user’s device, while our new HA infrastructure ensures we can handle whatever volume the next "hockey stick" growth moment throws at us.
We’re now back to 100% stability, with a much larger "reservoir" ready for the next wave of growth. If you’ve been looking for a session replay tool that respects your app’s performance as much as you do, we’re more ready for you than ever.