I’ve consistently underestimated the effects of tech debt. All the books and blog posts have branded its negative impact in my mind, but I still tend to prioritize business needs as long as the software works.
But working software is only good here and now. Well-designed software is good in the long term.
When we think of tech debt, we think of the small hacks we do to get something faster to production. But we can accumulate it just the same by not doing anything. For example, not upgrading our tools to newer versions.
I once inherited the heroic task of migrating our clusters to use a new version of Kafka. Tens of microservices publishing messages non-stop had to be moved to a new message system with, ideally, zero downtime.
The guy who tried to do it before me had quit.
This didn’t set off any alarms for some reason, so I downloaded the free Kafka book and got to work. Four weeks later, after going through my predecessor’s notes and drying up every last drop of hope, I realized that this went beyond anything else I had ever done.
But not for the reasons you’d think.
I designed the migration in a way that would allow us to move without turning anything off. No downtime. Both Kafkas would run simultaneously, replicating messages from the old to the new one.
Once this was set up, we’d start moving the consumers to read the replicated messages from the new cluster. We found a way to manually set the offset for a given consumer group before switching clusters so no messages would be duplicated or missed.
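The offset bookkeeping can be sketched roughly like this. This is a simplification, not our actual tooling: it assumes the new cluster's partition started empty at offset 0, and the function name and parameters are mine.

```python
def translate_offset(old_committed: int, replication_start: int) -> int:
    """Hypothetical offset translation for a single topic partition.

    Assumes replication into the new cluster began when the old partition
    was at offset `replication_start`, and that the new partition started
    empty at offset 0. A consumer group that had committed `old_committed`
    on the old cluster should resume from the returned offset on the new one.
    """
    # Messages consumed before replication began have no copy in the new
    # cluster, so clamp at 0.
    return max(0, old_committed - replication_start)
```

For example, a group that had read up to offset 1500 on the old cluster, with replication having started at offset 1000, would resume from offset 500 on the new one.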
Then once all consumers were reading from the new cluster, we’d move the producers to publish directly to it as well, strangling the old Kafka and preparing it to be decommissioned.
This was an operationally complex approach: it required solid observability and careful tracking of which service was talking to which cluster. But the real surprise wasn’t the complexity of the migration.
It was the overall state in which I found the system.
The team that built this pipeline followed the old mantra that working software is good enough. As long as this version of Kafka was working, there was no point in moving to a newer one.
But when I started looking at the services and the pipeline, I realized that sometimes you could accumulate tech debt even if you don’t do anything.
Our version was so old that there were no resources to learn from; most of the material out there covered newer Kafka versions. I attempted to use APIs that didn’t exist and tried to replicate messages with incompatible tools.
There was no community to take advantage of.
With so many tools and libraries around, an engineer must always consider when to use an available solution and when to build one. Each option has its trade-offs. But when you’re running old software, you only have the latter available.
You build the things you don’t have.
Replicating messages between Kafka clusters is a matter of using a single tool called Mirror Maker. At least in the latest versions. But since our Kafka was so old, we had to build a consumer and producer ourselves.
Multiply that by the number of topics, and you end up with many extra components to maintain and monitor.
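At its core, each of those hand-rolled bridges boils down to a loop like the one below. This is a minimal, stdlib-only sketch of the idea rather than the code we shipped: in the real setup the records would come from a Kafka consumer on the old cluster, and `send` would wrap a producer pointed at the new one. All names here are illustrative.

```python
from collections import namedtuple
from typing import Callable, Iterable

# Stand-in for a consumed message; with a real client library this would be
# the record type the consumer returns.
Record = namedtuple("Record", ["topic", "key", "value"])


def replicate(records: Iterable[Record],
              send: Callable[[str, bytes, bytes], None]) -> int:
    """Forward every record to the new cluster as-is.

    Key and value are preserved so that partitioning by key stays
    consistent for downstream consumers. Returns the number of
    records forwarded.
    """
    forwarded = 0
    for record in records:
        send(record.topic, record.key, record.value)
        forwarded += 1
    return forwarded
```

The loop itself is trivial; the operational cost lives around it, in deployment, monitoring, and error handling for every topic that needs its own instance.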
And infrastructure aside, this problem is felt down to the lowest implementation levels. Old Kafka clusters don’t support headers, for example. So the teams that want to pass metadata together with the message body must mimic this behavior.
This can confuse a lot of engineers who see headers referenced in the code but don’t get the expected result when they pass them through a CLI.
Changing a client library is no walk in the park either. The newer client versions had a different API, leading to more application-level changes for an infrastructural change.
Designing the migration, upgrading all our libraries, preparing the components we needed, and starting to move the messages took us a total of six months. Half a year in which we rebuilt the entire Kafka client library we used.
This wasn’t in the initial plan but we had no choice. The deeper we got, the more things we had to tear out or change.
And chances are that upgrading the tools as new versions came out wouldn’t have taken much less in total.
But the difference is that we had to drop everything else in that period and focus entirely on that migration.
We had to devote all our attention to improving our tooling. Had this been done incrementally throughout the years, it would’ve been less risky, it would’ve reduced the product’s overall complexity, and the team wouldn’t have had to freeze the rest of its work.
Big architectural changes are always challenging. But they can be much easier if we avoid the unintentional tech debt accumulated by running obsolete tech.
I’m not hellbent on running the latest tech, but ever since this migration, I make sure to run the latest versions of my tools, even if they’re not that trendy.
Because libraries and frameworks tend to solve more problems as they evolve. And there’s a clear cost in terms of complexity for running your own solution.
The same rule applies to other parts of the stack as well. You could write your own SSR and deal with all its intricate details, or you could spin up a Next.js application and focus on domain logic.
I’ve done it once to keep more granular control over our tooling, and every time I dealt with a hydration bug, I wished I had used a framework.