Once upon a time, our team inherited a horrible mess of a legacy project - let’s call it Project Mammoth. It had all the elements of a project with a tech debt problem - spaghetti code, no automated testing, tightly coupled components that were leaking abstractions like sieves, and the list goes on. There was a plan in place to replace the application, but Mammoth needed to stay alive and functional for at least a couple of years until its replacement got built and gradually rolled out.
I was initially disappointed at the prospect of working on a legacy code base for the foreseeable future. But bringing the tech debt on this code base down to a manageable level ended up being one of the most rewarding initiatives I’ve ever led in my career.
This article is part 2 of a three-part series about tech debt:
Part 2: Tools to reduce tech debt in your project (this article)
Part 3: How to keep on top of tech debt in an ongoing project (coming soon)
In this article I share the steps we took to gradually reduce tech debt on Project Mammoth. We went from having to constantly fight fires to actually having some bandwidth for other meaningful work.
Get management buy-in for time
Reducing tech debt in a code base takes time. To convince management to give us time, we presented them with the real-world business impact of leaving Project Mammoth as-is:
Releases happened only once every two weeks because of complicated manual steps, slowing down the pace at which users got access to new features.
Bugs were frequently detected in production, costing us revenue and customer trust.
Applying urgent security updates took longer than advised, leaving the product vulnerable to attacks for longer.
On-call shifts were stressful, affecting team morale.
(In reality, we presented the above reasons with actual numbers rather than using words like “takes longer” or “lost revenue”)
We also presented the tasks we wanted to prioritize, which we had selected based on what would save us the most pain for the least investment.
Management is more likely to allocate time for tech-debt reduction when presented with business reasons like the ones mentioned above, rather than purely tech-focused reasons like “we aren’t using the latest framework”.
Add automated testing
Adding automated tests helped us reduce bugs in production, gave us more confidence in any future refactoring, and brought us closer to continuous integration/deployment.
Ideally, we would have added every type of test from the testing pyramid, but in our case it made the most sense to focus on unit tests (low effort, and the easiest way to cover the most business logic) and API tests. Our application had a backend API and a frontend UI. We would have liked browser-based UI tests that exercised the entire application end to end, but UI tests are higher-maintenance: they are slower to execute and tend to be flaky. As a compromise, we focused on API tests that called our backend APIs directly and verified their responses. We mimicked every interaction our UI had with the backend and wrote an API test for each of them.
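To give a flavor of what such a test can look like, here is a minimal sketch of an API test using JUnit 5 and Java's built-in HttpClient. The base URL, the GET /orders/{id} endpoint, and the asserted fields are all hypothetical; the article doesn't describe Project Mammoth's actual API.

```java
// A sketch of an API test against a hypothetical GET /orders/{id} endpoint.
// The base URL, endpoint, and asserted fields are illustrative only.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

class OrderApiTest {

    // Points at a non-production test environment (hypothetical URL).
    private static final String BASE_URL = "https://mammoth-test.example.com";

    private final HttpClient client = HttpClient.newHttpClient();

    @Test
    void getOrderReturnsOkWithOrderDetails() throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/orders/12345"))
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Verify the same things the UI relies on: the status code and key fields.
        assertEquals(200, response.statusCode());
        assertTrue(response.body().contains("\"orderId\":\"12345\""));
    }
}
```

Because tests like this only go through the HTTP layer, they stay fast and stable while still covering the same backend behavior the UI depends on.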
Make resiliency improvements
One of the most frequent reasons Project Mammoth failed in production was failed calls to external services. We made several changes to improve resiliency:
Retrying calls to external services with exponential backoff and jitter (see the sketch after this list).
Ensuring the minimum possible impact on our side when an external service failed (e.g. failing only a part of an API response).
Caching responses from external services where possible.
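As an illustration of the first item, here is a minimal sketch of a retry helper with exponential backoff and full jitter. The delay values and attempt limit are illustrative assumptions, not the settings we actually tuned for Project Mammoth.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class Retry {

    /**
     * Calls the supplied operation, retrying on failure with exponential
     * backoff plus random jitter. The base delay, cap, and attempt count
     * here are illustrative defaults.
     */
    public static <T> T withBackoff(Callable<T> operation, int maxAttempts) throws Exception {
        long baseDelayMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                // Exponential backoff: 100ms, 200ms, 400ms, ... capped at 5s,
                // with full jitter so concurrent retries don't synchronize.
                long cappedDelayMs = Math.min(5_000, baseDelayMs << (attempt - 1));
                long sleepMs = ThreadLocalRandom.current().nextLong(cappedDelayMs + 1);
                Thread.sleep(sleepMs);
            }
        }
    }

    private Retry() {
    }
}
```

A call site then wraps the flaky external call, for example `Retry.withBackoff(() -> inventoryClient.fetch(itemId), 3)`, where `inventoryClient.fetch` stands in for whichever external call keeps failing.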
Observability - add metrics and alerts
We wanted to be the first ones to know if any part of our application wasn't working as expected in production. This would help us take the right remediation steps and possibly fix issues before users noticed any symptoms.
We started collecting metrics for the following data points:
request counts
response times
error response counts
counts of error responses received from external services, and so on
In addition to this, we set up alarms for key metrics so that our on-call engineer would get paged if one of them exceeded a certain threshold. This reduced manual effort for the on-call engineer: they didn't need to stare at metrics all day, because they would be alerted if something was wrong.
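For a concrete picture of what recording these metrics can look like, here is a minimal sketch using the Micrometer library. Micrometer is an assumption for illustration; the article doesn't name the metrics stack Project Mammoth used, and the metric names and processOrder method are hypothetical.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class OrderMetrics {

    private final Counter requestCount;
    private final Counter errorCount;
    private final Timer responseTime;

    public OrderMetrics(MeterRegistry registry) {
        // Metric names are illustrative; follow whatever naming convention
        // your metrics backend expects.
        this.requestCount = registry.counter("orders.requests");
        this.errorCount = registry.counter("orders.errors");
        this.responseTime = registry.timer("orders.response.time");
    }

    public String handleRequest() {
        requestCount.increment();
        // Time the whole request and count errors so both show up as metrics.
        return responseTime.record(() -> {
            try {
                return processOrder(); // hypothetical business logic
            } catch (RuntimeException e) {
                errorCount.increment();
                throw e;
            }
        });
    }

    private String processOrder() {
        return "ok";
    }
}
```

The alert thresholds themselves live in the monitoring backend (for example, an alarm on the error counter), so the application code only needs to emit the raw measurements.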
Simplify hotfix release and rollback processes
One of the most important aspects of production support is being able to quickly alleviate user pain. If there is a bug in production that is significantly affecting users, there are two strategies you can apply: fix the bug, or roll back the release that introduced it. Rolling back the latest release is usually the better choice since it guarantees removal of the code that introduced the bug. But sometimes this isn't possible, and you need to apply just enough code to fix the bug in production without pulling in unrelated changes that may have landed on your main branch since the last release - this is called a hotfix release.
We figured out the simplest processes that would allow us to roll back and hotfix a release and documented them in great detail. This helped our on-callers greatly by giving them a guide to follow during a production incident, which is usually a very stressful time.
Enable continuous-ish deployment
Project Mammoth had a very manual and cumbersome release process with a long list of manual tests to run before manually tagging a version, building the artifact and pushing it to production. We took the following steps to reduce the manual effort and increase the frequency of releases:
In our efforts to add automated testing, we prioritized automating the tests that were previously run manually before every release. This helped us get rid of the manual testing step, and we were now catching bugs much earlier during the software development life cycle - at development/code-review time rather than right before a release.
We wrote very basic scripts that mimicked the manual release steps, such as tagging the version and building and pushing the artifact.
This allowed us to build a deployment pipeline that would deploy to a non-production environment, run the automated tests, and get all the artifacts ready for a production release. There were some manual steps we couldn't get rid of due to the legacy nature of the code base, but we brought the manual effort for each release down from several hours to just a few button clicks. This encouraged us to release every few days rather than once per week.
Bonus: Add documentation
We took some additional steps to make on-call less stressful for our team:
Created architecture diagrams for the most significant business flows in Project Mammoth and described them in detail.
Created a playbook listing common production incidents and the steps to remediate them.
I hope you can find something useful from this article to apply to your legacy code bases! Let me know what you would add to this list.
❤️ My favorite things this week
I binged the productivity book Make Time by Jake Knapp and John Zeratsky on Audible this week. What a book - it's full of easily applicable, actionable advice for living a fulfilling life with time for what matters to you. I'll be sharing my favorite learnings from it on my LinkedIn this week - make sure you don't miss them!
I've been dabbling in Python since joining Microsoft, because most AI-related projects use it. It has been a mindset shift for a hardcore Java programmer like me, and I've been reading Fluent Python to help. I needed a book that teaches idiomatic Python (rather than one that teaches programming to beginners), and it fits the bill perfectly.