A platform seven years old, one developer who understood it, and a company that depended on it running every single day. This is how we rebuilt it without anyone noticing.
The first thing the CEO said to us was: we cannot go down.
Not for a day. Not for a weekend. Not even for a planned maintenance window that clients would notice. The platform processed transactions. It sent invoices. It triggered fulfilment. Every hour it was offline was an hour the business was not running.
We had heard versions of this before. What we had not seen before was the specific combination of conditions they were carrying: a codebase written by a single developer over seven years, almost no documentation, dependencies on libraries that had not been updated since 2017, and a business that had grown to the point where the platform was visibly struggling to keep up.
The system was not broken. It was just built for a company that no longer existed.
The company it was built for had twelve clients and a team of five. The company it now served had over two hundred clients, three times the staff, and a product that had evolved in directions the original architecture was never designed to accommodate.
The developer who built it had left eighteen months earlier. He had been careful. He had left comments in the code. But comments are not documentation, and no one still at the company fully understood what the system did or why certain decisions had been made the way they were.
When we first sat down with the technical team, the honest answer to most of our questions was: we think it works this way, but we are not certain.
The Decision We Had to Make First
There are two ways to approach a rebuild of this kind. The first is the full rewrite: you stop everything, spend six months building the new system in isolation, then switch over. It is cleaner architecturally. It is also, in a live business, almost always wrong.
The second is a parallel migration. You build the new infrastructure alongside the old. You sync data across both environments. You verify feature parity module by module. You migrate traffic incrementally, with rollback capability at every stage.
The parallel approach is slower. It requires you to maintain two systems simultaneously for a period of time. It requires discipline about not changing scope halfway through. It requires the client's team to operate normally, in their existing system, while you build the replacement underneath them.
It is the difference between replacing the engine of a plane while it is on the ground and replacing it mid-flight. One of those is theoretical. The other is what we were being asked to do.
We chose the parallel approach. There was no serious alternative given the constraint.
The migration ran across fourteen months. Four modules. Each verified independently before traffic was moved. Rollback capability maintained throughout.
What the Work Actually Looked Like
The first three months were not about building anything. They were about understanding what existed.
We mapped every data flow. We traced every integration. We documented every dependency. Some of what we found was elegant: the original developer had made smart architectural choices under constraint. Some of what we found was the accumulated weight of seven years of small decisions that had never been revisited.
There was a module that processed a specific type of transaction in a way that seemed inefficient until we understood why. The inefficiency was intentional: it had been built around a quirk in a third-party API that the company still used. Removing it would have broken the integration. We kept it.
The job of understanding a legacy system is not to judge it. It is to understand the decisions that produced it well enough to know which ones you can change and which ones you cannot.
By month four we were building. The new infrastructure ran in parallel from the start. Every record written to the old system was mirrored to the new one. We ran both environments simultaneously for weeks before moving any live traffic, comparing outputs at every step.
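The mirroring described above can be reduced to a dual-write pattern. This is an illustrative sketch, not the system's actual code: the store objects and field names are hypothetical, and the real implementation would sit at the database or message-queue layer rather than in application code.

```python
# Dual-write sketch: every record goes to the legacy store (the source
# of truth), then is mirrored to the new store. A mirror failure is
# queued for replay, never surfaced to the caller.

class DualWriter:
    def __init__(self, legacy_store, new_store):
        self.legacy = legacy_store
        self.new = new_store
        self.failed_mirrors = []  # records to replay into the new store

    def write(self, record):
        self.legacy.append(dict(record))      # authoritative write
        try:
            self.new.append(dict(record))     # best-effort mirror
        except Exception:
            self.failed_mirrors.append(dict(record))

    def replay_failed(self):
        """Retry failed mirrors, e.g. from a background job."""
        pending, self.failed_mirrors = self.failed_mirrors, []
        for record in pending:
            self.new.append(record)


legacy, new = [], []
writer = DualWriter(legacy, new)
writer.write({"id": 1, "amount": 120})
writer.write({"id": 2, "amount": 75})
print(len(legacy), len(new))  # → 2 2
```

The important design choice is asymmetry: the legacy write must succeed for the request to succeed, while the mirror is allowed to fail and catch up later, so the new system can never take the business down before it has earned trust.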
The first module we migrated was reporting. It was the lowest-risk starting point: if something went wrong with a report, the business did not stop. It gave us confidence in the data sync before we touched anything transactional.
The second module was client onboarding. Higher stakes. We ran it in shadow mode for two weeks: new clients were onboarded in both systems, with the old system as the live environment and the new one processing in parallel. When the outputs matched consistently, we flipped the switch.
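Shadow mode can be sketched in a few lines. This is a simplified illustration under assumed interfaces (the onboarding handlers and log shape here are invented for the example): the old system serves the client, the new system processes the same input silently, and a match log decides when the switch can be flipped.

```python
# Shadow-mode sketch: the live (old) path is what the client sees;
# the new path runs in parallel and only feeds the match log.

def onboard_in_shadow(client, old_onboard, new_onboard, log):
    result = old_onboard(client)   # live path: this output counts
    shadow = new_onboard(client)   # shadow path: never user-visible
    log.append({"client": client["name"], "match": shadow == result})
    return result

def ready_to_cut_over(log, streak=50):
    """Flip the switch only after the last `streak` onboardings matched."""
    recent = log[-streak:]
    return len(recent) == streak and all(entry["match"] for entry in recent)


log = []
old_system = lambda c: {"account": c["name"].lower()}
new_system = lambda c: {"account": c["name"].lower()}
onboard_in_shadow({"name": "Acme"}, old_system, new_system, log)
print(ready_to_cut_over(log, streak=1))  # → True
```

The streak threshold is the judgment call: "matched consistently" has to be turned into a concrete number before anyone flips anything.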
The third module was transaction processing. This was the one that kept us careful. We built a reconciliation layer that compared every transaction processed by both systems in real time and flagged any discrepancy immediately. In eight weeks of parallel running, there were eleven discrepancies. We investigated all eleven. Seven were edge cases in the old system that we had reproduced correctly. Four were genuine issues in the new system that we fixed before cutting over.
Eleven discrepancies.
Four real issues.
All caught before they became client-visible problems.
That reconciliation layer was the thing we were most glad we built.
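The core of a reconciliation layer like the one described above is a field-by-field comparison of the same transaction as processed by both systems. The sketch below is hypothetical in its names and record shapes; the real layer ran in real time against live transaction streams, but the comparison logic is the same idea.

```python
# Reconciliation sketch: diff one transaction as seen by each system,
# skipping fields that are expected to differ (e.g. timestamps).

def reconcile(old_txn: dict, new_txn: dict, ignore=("processed_at",)):
    """Return (field, old_value, new_value) tuples for every mismatch."""
    discrepancies = []
    for field in sorted(set(old_txn) | set(new_txn)):
        if field in ignore:
            continue
        old_val, new_val = old_txn.get(field), new_txn.get(field)
        if old_val != new_val:
            discrepancies.append((field, old_val, new_val))
    return discrepancies


old = {"id": "t-100", "amount": 120.00, "status": "settled"}
new = {"id": "t-100", "amount": 120.50, "status": "settled"}
for field, o, n in reconcile(old, new):
    # flags the amount mismatch for investigation
    print(f"MISMATCH {old['id']}: {field} old={o} new={n}")
```

Note that a discrepancy is only a flag, not a verdict: as the eleven cases above showed, the mismatch can sit in either system, so every flag needs a human to decide which side is wrong.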
The Moment It Was Over
The final cutover happened on a Tuesday morning. We had scheduled it deliberately: not a Friday, when a weekend problem would be harder to staff, and not a Monday, when the week's transaction volume was highest.
We had briefed the client team. We had prepared rollback scripts that could reverse the cutover within four minutes if needed. We had people monitoring both environments from the moment the switch happened.
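Conceptually, a cutover with fast rollback amounts to a traffic switch that can be flipped back in one operation. The real rollback scripts were far more involved (data state has to be reversible too, not just routing); this sketch, with invented handler names, shows only the routing half of the idea.

```python
# Cutover sketch: route requests to the legacy or new handler behind a
# single flag, so rollback is one call rather than a redeploy.

class CutoverRouter:
    def __init__(self, legacy_handler, new_handler):
        self.legacy = legacy_handler
        self.new = new_handler
        self.use_new = False  # legacy stays live until we flip

    def cut_over(self):
        self.use_new = True

    def roll_back(self):
        self.use_new = False

    def handle(self, request):
        handler = self.new if self.use_new else self.legacy
        return handler(request)


router = CutoverRouter(lambda r: ("legacy", r), lambda r: ("new", r))
print(router.handle("invoice-1"))  # → ('legacy', 'invoice-1')
router.cut_over()
print(router.handle("invoice-2"))  # → ('new', 'invoice-2')
router.roll_back()
print(router.handle("invoice-3"))  # → ('legacy', 'invoice-3')
```

Keeping the switch this cheap is what makes a four-minute rollback promise credible: the expensive part of reversing a cutover should be decided in advance, not improvised under pressure.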
The platform migrated. The transactions continued. The invoices went out.
Nothing stopped.
The CEO sent us a message that afternoon. It said: I was watching the dashboard the whole time. I could not tell when it happened.
That is the only measure of success that matters in a migration of this kind. Not that it was technically impressive. Not that the new architecture was cleaner or more scalable or easier to maintain, though all of those things were true. But that the business kept running, the clients never noticed, and the team could now build on a foundation that would carry them for the next decade rather than the next eighteen months.
What We Learned From It
Every project teaches you something you did not know before. This one had three lessons worth writing down.
The first: legacy systems are almost always more coherent than they appear. The code looks strange until you understand the constraints it was written under. Spend time with it before you decide what to throw away.
The second: the reconciliation layer is not optional. In any migration where both systems run in parallel, you need a mechanism that compares outputs and surfaces discrepancies in real time. Building it feels like overhead. Not building it means you only find the problems after they have reached clients.
The third: rollback capability is not a fallback. It is what gives you permission to move. Every time we cut over a module, we knew we could reverse it in four minutes. That knowledge changed how we made decisions. We were less cautious than we would have been without it, because caution was no longer our only safety net.
Infrastructure work is not glamorous. It does not make a good pitch deck slide. But it is the work that determines whether everything else holds.
The company is still running on the platform we built. Two years later, they have added three new product lines on top of it. None of them required a rebuild.
That is what it looks like when the architecture decision was right.
If you have a system that is becoming a liability, DM KIVARA before it becomes a crisis.