
We Inherited a Codebase That Was Bleeding $38K a Month

Technical debt isn't an abstract concept. It shows up on your AWS bill, in your hiring pipeline, and in the three-hour deploy cycles your team dreads every Thursday. Here's what we found when we audited a real production system, and what we did about it.

Lanos Technologies · 10 min read

What is technical debt?

Technical debt is the accumulated cost of shortcuts, outdated decisions, and deferred maintenance in a software system. Like financial debt, it compounds over time. Small compromises made early to ship faster turn into expensive problems later: slower development velocity, higher infrastructure costs, fragile deployments, and frustrated engineers.

Last year a founder came to us with a problem that sounded simple: "Our product works, but everything takes forever and our AWS bill is insane."

We hear some version of this every few months. The product is live, users are paying, things mostly work. But the engineering team is drowning. Features that should take a week take five. Deploys feel like defusing a bomb. And somewhere in the background, the cloud bill keeps climbing while technical debt quietly eats 30% of revenue.

This particular product was a B2B SaaS platform. Around 2,000 active users, decent revenue, small engineering team of four. The founder had hired an agency to build version one about two years prior, then brought on in-house developers to maintain and extend it.

On the surface, everything looked fine. Under the hood, it was a mess. Not because anyone was incompetent. Because every team had optimized for their moment in time, and nobody had ever stepped back to look at the whole picture.

What we actually found

We spent eight days inside the codebase and infrastructure. Not a cursory review. We read code, traced queries, mapped infrastructure, interviewed the dev team, and went through their deploy process in real time.

Here's what came back.

The database was doing way too much work

The PostgreSQL instance was an RDS db.r5.2xlarge. That's 8 vCPUs and 64GB of RAM. For 2,000 users.

Why? Because the ORM was generating absolute horror-show queries. We found a single API endpoint that was triggering 47 database queries per request. Not a typo. Forty-seven. The classic N+1 problem, but layered three levels deep because the data model had circular relationships that the ORM was eagerly loading.
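
We can't paste their code here, but the shape of the problem is easy to sketch by hand. Strip away the ORM and the endpoint was effectively doing something like this (table and column names are invented for illustration):

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// The N+1 shape: one query for the parent rows, then another query per row,
// and another per nested relation inside that. A modest amount of data fans
// out into dozens of round trips for a single request.
async function getProjectsNPlusOne(accountId: string) {
  const { rows: projects } = await pool.query(
    "SELECT id, name FROM projects WHERE account_id = $1",
    [accountId]
  );

  for (const project of projects) {
    const { rows: tasks } = await pool.query(
      "SELECT id, title FROM tasks WHERE project_id = $1",
      [project.id]
    );

    for (const task of tasks) {
      const { rows: assignees } = await pool.query(
        "SELECT u.id, u.name FROM users u JOIN task_assignees ta ON ta.user_id = u.id WHERE ta.task_id = $1",
        [task.id]
      );
      task.assignees = assignees;
    }

    project.tasks = tasks;
  }

  return projects;
}
```

The ORM version of this reads innocently, just a couple of eager-load options on a model, which is exactly why nobody caught it in review.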

The team's solution had been to keep upgrading the database instance. Every few months, things would slow down, someone would bump the instance size, and the bill would go up. At the time we looked at it, the RDS bill alone was $1,400 a month. For a database that should have been running on something a quarter of that size.

Three copies of the same background job

The system had a job queue for sending emails, processing webhooks, and generating PDF reports. Reasonable. The problem was that over two years, three different developers had each built their own version.

One used a cron job with a raw SQL polling loop. One used Bull on Redis. One used AWS SQS. All three were running in production simultaneously, sometimes processing the same event.

The result: duplicate emails to customers, race conditions on webhook processing, and an SQS bill that nobody could explain because nobody remembered setting it up.

Deploys took three hours (and everyone was afraid of them)

The CI/CD pipeline was a 200-line bash script that someone had written in the first month and never touched again. It pulled from main, built a Docker image, and pushed it to ECR; the last step was SSHing into two EC2 instances by hand to pull the new image and restart the containers.

There were no health checks. No rollback mechanism. No staging environment. The dev team deployed on Thursday afternoons because "if something breaks, we have Friday to fix it before Monday."

When we asked about the last failed deploy, one of the engineers laughed and said, "Which one?"

No observability whatsoever

The application logged to console.log. That's it. No structured logging, no error tracking, no performance monitoring. When something broke in production, the process was: a customer emails support, support tells the dev team, and someone SSHes into the server and greps the logs.

The mean time from "something broke" to "we know about it" was about four hours. Because they only found out when a user complained.

What it was actually costing

We put together a spreadsheet for the founder. Not to be dramatic, but because she needed to see this in business terms, not engineering terms.

The hard infrastructure costs alone were brutal. The oversized RDS instance was running $1,400 a month. Redundant queue infrastructure (SQS, Redis, and the cron poller all doing the same job) added another $600. EC2 instances with no auto-scaling were costing $900 because they were provisioned for peak load 24/7.

Then there was the human cost. The team was spending roughly 20 hours a month dealing with deploy issues. That's $3,000 in engineering time just babysitting a deploy process. Another 40 hours a month went to investigating and fixing bugs that better observability would have caught in minutes. That's $6,000. Duplicate event processing was causing re-sends and manual customer fixes, around $800 a month. And the customer support overhead from all these bugs added another $2,400.

But the biggest number was the one you can't see on any invoice. We called it the feature velocity tax. When your team of four is spending a third of their time fighting the codebase instead of building features, you're paying four salaries but getting the output of two and a half engineers. We estimated that cost at roughly $23,000 per month in lost productivity.

Add it all up and you're looking at roughly $38,000 per month being burned by technical debt: about $2,900 of it in infrastructure, $12,200 in people time, and $23,000 in lost velocity.

The product's monthly revenue was around $120K. So about 30% of revenue was going to decisions nobody had revisited in two years.

What we fixed (and what we didn't)

This is important. We didn't rewrite the product. Rewrites are almost always a mistake. We stabilized the foundation and gave the team room to breathe.

Month 1: Stop the bleeding

Database. We spent three days on query optimization. We added proper indexes, rewrote the six worst endpoints to use raw SQL with joins instead of ORM eager loading, and added connection pooling with PgBouncer. The RDS instance went from db.r5.2xlarge down to db.r5.large. Same performance, one-quarter the cost.
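
As a sketch of the direction rather than their literal query, the endpoint from earlier collapses into a single round trip once the joins and aggregation move into the database:

```ts
import { Pool } from "pg";

// One shared pool, fronted by PgBouncer, so the database sees a handful of
// connections instead of one per in-flight request.
const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: 10 });

// One query instead of dozens: join the tables and let Postgres build the
// nested structure as JSON.
async function getProjects(accountId: string) {
  const { rows } = await pool.query(
    `SELECT p.id,
            p.name,
            COALESCE(
              json_agg(json_build_object('id', t.id, 'title', t.title))
                FILTER (WHERE t.id IS NOT NULL),
              '[]'::json
            ) AS tasks
       FROM projects p
       LEFT JOIN tasks t ON t.project_id = p.id
      WHERE p.account_id = $1
      GROUP BY p.id`,
    [accountId]
  );
  return rows;
}
```

Add the two obvious indexes (one on projects.account_id, one on tasks.project_id) and the need for a db.r5.2xlarge disappears.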

Background jobs. We consolidated everything onto a single Bull queue running on their existing Redis instance. Killed the cron poller and the SQS queue. Duplicate processing stopped immediately.
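
A minimal sketch of the consolidated setup, using Bull's standard API. Queue names and the helper functions are illustrative, not theirs:

```ts
import Queue from "bull";

// Every job type gets a named queue on the same Redis instance.
const emailQueue = new Queue("emails", process.env.REDIS_URL!);
const webhookQueue = new Queue("webhooks", process.env.REDIS_URL!);

// Stand-ins for the real application logic.
async function sendWelcomeEmail(userId: string): Promise<void> {
  console.log("sending welcome email to", userId);
}
async function processWebhook(payload: unknown): Promise<void> {
  console.log("processing webhook", payload);
}

// Producers enqueue work instead of doing it inline (or inventing a new queue).
export async function enqueueWelcomeEmail(userId: string) {
  await emailQueue.add(
    { userId },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 } }
  );
}

// A single worker process owns the consumers, so each event is handled once
// instead of being picked up by three competing systems.
emailQueue.process(async (job) => {
  await sendWelcomeEmail(job.data.userId);
});

webhookQueue.process(async (job) => {
  await processWebhook(job.data);
});
```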

Deploy pipeline. We replaced the bash script with a proper GitHub Actions workflow. Build, test, push to ECR, deploy to ECS Fargate with health checks and automatic rollback. Deploy time went from three hours to twelve minutes. The team started deploying on Tuesdays and Thursdays instead of holding their breath once a week.

Month 2: See what's happening

Error tracking. We set up Sentry with proper source maps and environment tagging. In the first week, Sentry surfaced 340 unhandled errors that the team had never seen because they were only visible in server logs that nobody was reading.
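
The setup itself is small. Roughly this, with the DSN and release values coming from the environment (the source map upload happens in CI and isn't shown here):

```ts
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV, // keeps production noise separate from staging
  release: process.env.GIT_SHA,      // ties each error to the deploy that shipped it
  tracesSampleRate: 0.1,             // sample a slice of requests for performance data
});

// Anything that escapes a handler gets reported with a stack trace and context
// instead of dying silently in a server log.
function riskyOperation() {
  throw new Error("example failure");
}

try {
  riskyOperation();
} catch (err) {
  Sentry.captureException(err);
}
```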

Structured logging. We replaced console.log with Pino, added request IDs and tenant context to every log line, and shipped logs to CloudWatch with proper retention policies. Now when something breaks, you can trace the exact request path in seconds instead of hours.
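
The real setup hangs this off an HTTP middleware, but the core idea is a child logger per request. A sketch, with the field names being ours:

```ts
import pino from "pino";
import { randomUUID } from "crypto";

const logger = pino({ level: "info" });

// One child logger per request: every line it emits carries the same
// requestId and tenantId, so a single CloudWatch query reconstructs the
// whole request path.
function requestLogger(tenantId: string) {
  return logger.child({ requestId: randomUUID(), tenantId });
}

const log = requestLogger("tenant_1234");
log.info({ route: "GET /projects" }, "request started");
log.error({ err: new Error("boom") }, "request failed");
```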

Uptime monitoring. Basic health check endpoints with alerting. The team found out about outages from PagerDuty instead of from customers.
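
The endpoint itself is deliberately boring. Something like this (the path is our convention):

```ts
import express from "express";
import { Pool } from "pg";

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// The uptime monitor and the load balancer both hit this. If the database
// is unreachable, we want a page, not a support email four hours later.
app.get("/healthz", async (_req, res) => {
  try {
    await pool.query("SELECT 1");
    res.status(200).json({ status: "ok" });
  } catch {
    res.status(503).json({ status: "degraded" });
  }
});

app.listen(3000);
```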

Month 3: Pay down the architecture debt

Data model cleanup. We broke the circular relationships in the ORM and restructured the three worst models. This was surgery, not a rewrite. We changed how data was related, not what data existed.
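
The entity names below are invented, but the shape of the change looks like this: stop modelling both directions of every relationship as objects, and keep one side as a plain foreign key that gets loaded explicitly when it's actually needed.

```ts
// Before: each model embeds the other, so loading one drags in the other,
// which drags the first back in again. Eager loading turns that cycle into
// a pile of queries.
interface ClientBefore {
  id: string;
  projects: ProjectBefore[];
}
interface ProjectBefore {
  id: string;
  client: ClientBefore;
  tasks: TaskBefore[];
}
interface TaskBefore {
  id: string;
  project: ProjectBefore;
}

// After: relations point one way, the reverse direction is just an ID, and
// the data itself is untouched. That's the "surgery, not a rewrite" part.
interface Client {
  id: string;
}
interface Project {
  id: string;
  clientId: string;
}
interface Task {
  id: string;
  projectId: string;
}
```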

API layer. We added input validation, proper error responses, and rate limiting. The API had been returning 500 errors with raw stack traces to the client. In production.
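
None of it is exotic. A sketch of the three pieces, using zod and express-rate-limit as our stand-ins rather than a claim about their exact stack:

```ts
import express from "express";
import rateLimit from "express-rate-limit";
import { z } from "zod";

const app = express();
app.use(express.json());

// Rate limiting: 100 requests per minute per IP.
app.use(rateLimit({ windowMs: 60_000, max: 100 }));

// Input validation at the edge, so bad payloads fail with a structured 400
// instead of exploding somewhere deep in the stack.
const createProjectSchema = z.object({
  name: z.string().min(1),
  clientId: z.string().uuid(),
});

app.post("/projects", (req, res) => {
  const parsed = createProjectSchema.safeParse(req.body);
  if (!parsed.success) {
    return res
      .status(400)
      .json({ error: "invalid_request", details: parsed.error.flatten() });
  }
  // ...create the project...
  res.status(201).json({ ok: true });
});

// Last-resort error handler (Express identifies it by the four parameters).
// Log the stack, never send it to the client.
app.use(
  (err: Error, _req: express.Request, res: express.Response, _next: express.NextFunction) => {
    console.error(err); // in the real system this goes through Pino and Sentry
    res.status(500).json({ error: "internal_error" });
  }
);

app.listen(3000);
```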

Documentation. We wrote architecture decision records for the 10 most important technical choices in the system. When the team eventually hires engineer number five, that person won't have to reverse-engineer the entire system from scratch.

What we left alone

The frontend. It was a React app with some rough edges, but it worked and users weren't complaining. Technical debt in the UI is annoying but rarely dangerous. We recommended addressing it in a future sprint but didn't touch it.

The auth system. It was using a third-party provider and working correctly. Not our favourite implementation, but replacing auth is high-risk, low-reward when the current system is functional.

The results after 90 days

The monthly AWS bill dropped from $4,200 to $1,800. Not by switching providers or renegotiating contracts. Just by right-sizing what was already there.

Deploys went from three hours to twelve minutes, and the team went from deploying once a week (on Thursdays, with dread) to three times a week with confidence. When something did go wrong, they found out in under five minutes instead of four hours, because Sentry and proper monitoring replaced the "wait for a customer to complain" system.

The 340 unhandled errors Sentry found in week one? Down to 12 by the end of month three. Engineering time spent firefighting dropped from 40 hours a month to about 8. And feature velocity, measured in story points per sprint, went from 18 to 34. Not because the team got faster. Because they stopped being slowed down.

The founder's reaction after month one: "I didn't know it could be this easy to deploy." That sentence tells you everything about what technical debt actually costs. It doesn't just cost money. It costs confidence. It makes your team dread the thing they should be excited about.

The lesson we keep relearning

Technical debt isn't a moral failing. Nobody wakes up and decides to write bad code. Debt accumulates because priorities shift, teams change, and the decisions that made sense in month three stop making sense in month eighteen.

The problem is that most teams never go back and check. They just keep building on top of a foundation that's slowly sinking.

Here's what we tell every founder we work with: schedule a technical review every six months. Not a rewrite. Not a refactor sprint. Just eight to ten days of someone experienced looking at the whole system and asking, "What made sense then that doesn't make sense now?" This is exactly the kind of work a fractional CTO does in month one of a typical engagement.

The cost of that review is trivial compared to the cost of discovering your database is four times bigger than it needs to be, or that your team is spending 20 hours a month fighting a deploy process that should take 12 minutes.

You don't have to fix everything. You just have to know what you're carrying.


We do technical debt assessments as part of our fractional CTO and consulting work. If your AWS bill keeps going up and your team keeps slowing down, those two things are probably related. Let's look at it together.


Topics: Technical Debt, Architecture, SaaS, Infrastructure, Production
