Engineering

We Needed Event-Driven Architecture. Kafka Was Overkill. So We Used Redis.

A client's notification system was falling apart under load. Instead of reaching for Kafka and adding three months of infrastructure work, we built it with Redis Streams in two weeks. Here's exactly how.

Lanos Technologies · 8 min read

Last year we were working with a B2B platform that had a notification problem. The product sent emails, in-app notifications, and Slack messages as things happened throughout the system. New order. Payment received. Delivery confirmed. About a dozen event types in total.

The original implementation was synchronous. When a user placed an order, the API handler would create the order, then send an email, then create an in-app notification, then post to Slack. All in the same request. If the email provider was slow, the user waited. If Slack's API timed out, the whole request failed and the order didn't get created.

At about 2,000 orders per day, this started breaking. Email provider latency spikes would cascade into API timeouts. Users were seeing 500 errors on actions that should have been instant. The engineering team was spending hours every week investigating "phantom" order failures that were actually notification delivery failures.

They needed event-driven architecture. The question was what to build it with.

Redis Streams vs Kafka: why we chose Redis

The team's first instinct was Kafka. It's the default answer when someone says "event-driven" in a technical conversation, and for good reason. Kafka is an incredible piece of technology.

But Kafka comes with costs that their product couldn't justify.

Running Kafka properly means running a cluster. At minimum, three brokers, each needing real compute and storage. At the time, they were running on a single 4GB Redis instance that was using about 600MB. Adding a Kafka cluster would have tripled their infrastructure footprint for a system handling 2,000 events per day, not per second.

The operational complexity is real too. Kafka needs ZooKeeper or KRaft for consensus. Topics need to be managed. Partitions need to be rebalanced when consumers change. You need someone on the team who understands consumer group semantics, offset management, and partition key design. This team was four engineers, none of whom had Kafka experience.

And the learning curve isn't a one-time cost. Every new engineer who joins the team needs to understand Kafka concepts before they can touch the notification system.

So we asked the obvious question: can Redis do this? They already had Redis running for caching. And the answer was yes.

What are Redis Streams?

Redis Streams is a data structure that was added in Redis 5.0. Think of it as a log. You append entries to it, and consumers read from it in order. It supports consumer groups so multiple workers can share the load. It supports acknowledgment so you know when an entry has been processed. And it supports claiming so failed entries can be retried.

That's basically everything you need for event-driven architecture at this scale.
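
To make the log model concrete, here's a minimal sketch of appending and reading back entries with ioredis. The stream name and fields are just for illustration:

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

// Append two entries; Redis assigns increasing, timestamp-based IDs.
await redis.xadd('demo:events', '*', 'action', 'created', 'orderId', '1');
await redis.xadd('demo:events', '*', 'action', 'paid', 'orderId', '1');

// Read the log back in insertion order, from the oldest entry ('-')
// to the newest ('+'). Each result is an [id, [field, value, ...]] pair.
const entries = await redis.xrange('demo:events', '-', '+');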

What we built

The architecture was simple. When something happens in the system (order created, payment received), the application publishes an event to a Redis Stream. A separate worker process reads from the stream, figures out which notifications to send, and sends them. If a notification fails, the event stays in the pending list and gets retried.

Here's the producer side. Straightforward:

import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL);

async function publishEvent(
  stream: string,
  event: Record<string, string>
) {
  const id = await redis.xadd(
    stream,
    '*',
    ...Object.entries(event).flat()
  );
  return id;
}

// When an order is created
await publishEvent('orders:events', {
  action: 'created',
  orderId: 'abc123',
  amount: '99.00',
  timestamp: Date.now().toString(),
});

The * tells Redis to auto-generate a timestamp-based ID. The event gets appended to the stream and the API handler returns immediately. The user doesn't wait for emails, Slack messages, or anything else. That all happens asynchronously.
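
For context, once publishing replaces the inline sends, the order endpoint reduces to "create, publish, return." This is a rough sketch, with createOrder standing in for the real persistence logic:

async function handleCreateOrder(input: { amount: string }) {
  const order = await createOrder(input);

  // Publish and return. Email, in-app, and Slack delivery happen in
  // the worker, off the request path.
  await publishEvent('orders:events', {
    action: 'created',
    orderId: order.id,
    amount: order.amount,
    timestamp: Date.now().toString(),
  });

  return order;
}

// Placeholder: persist the order however the application already does.
async function createOrder(input: { amount: string }) {
  return { id: 'abc123', amount: input.amount };
}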

Before consumers can read, you set up the consumer group:

async function ensureConsumerGroup(
  stream: string,
  group: string
) {
  try {
    await redis.xgroup(
      'CREATE', stream, group, '0', 'MKSTREAM'
    );
  } catch (err: any) {
    if (!err.message.includes('BUSYGROUP')) throw err;
  }
}

The MKSTREAM flag creates the stream if it doesn't exist yet. The BUSYGROUP error just means the group was already created, which is fine.

Then the consumer loop, plus a small helper to turn the flat field arrays Redis returns into objects:

interface StreamEntry {
  id: string;
  fields: Record<string, string>;
}

// Redis returns fields as a flat [key, value, key, value, ...] array.
function parseFields(flat: string[]): Record<string, string> {
  const fields: Record<string, string> = {};
  for (let i = 0; i < flat.length; i += 2) {
    fields[flat[i]] = flat[i + 1];
  }
  return fields;
}

async function consumeEvents(
  stream: string,
  group: string,
  consumer: string,
  handler: (entry: StreamEntry) => Promise<void>
) {
  while (true) {
    // ioredis types XREADGROUP loosely, so the reply shape is asserted here.
    const results = (await redis.xreadgroup(
      'GROUP', group, consumer,
      'COUNT', 10,
      'BLOCK', 5000,
      'STREAMS', stream, '>'
    )) as [string, [string, string[]][]][] | null;

    if (!results) continue;

    for (const [, entries] of results) {
      for (const [id, fields] of entries) {
        try {
          await handler({ id, fields: parseFields(fields) });
          await redis.xack(stream, group, id);
        } catch (err) {
          console.error(`Failed to process ${id}:`, err);
        }
      }
    }
  }
}

The BLOCK 5000 parameter makes the consumer wait up to five seconds for new entries before returning. This is way more efficient than polling in a tight loop. The > tells Redis to only deliver entries that haven't been delivered to this consumer group yet.

When processing succeeds, we call xack to acknowledge the entry. When it fails, we log the error and move on. The entry stays in the pending list because it was never acknowledged. That's where the retry mechanism comes in.

Handling failures

Messages that fail processing need to be retried. We built a pending entries claimer that runs on a schedule:

async function claimPendingEntries(
  stream: string,
  group: string,
  consumer: string,
  maxRetries: number = 3
) {
  // With start/end/count, XPENDING returns one
  // [id, consumer, idleTimeMs, deliveryCount] tuple per pending entry.
  const pending = (await redis.xpending(
    stream, group,
    '-', '+', 100
  )) as [string, string, number, number][];

  for (const entry of pending) {
    const [id, , idleTime, deliveryCount] = entry;

    // Exhausted its retries: copy the original fields to a dead letter
    // stream, then acknowledge so it leaves the pending list.
    if (deliveryCount >= maxRetries) {
      const fields = await redis.xrange(stream, id, id);
      if (fields.length > 0) {
        await redis.xadd(
          `${stream}:dead-letter`, '*',
          'originalId', id,
          ...fields[0][1]
        );
      }
      await redis.xack(stream, group, id);
      continue;
    }

    // Idle for over a minute: claim it for this consumer. XCLAIM also
    // bumps the delivery count, which is how entries eventually hit
    // the retry limit above.
    if (idleTime > 60000) {
      await redis.xclaim(
        stream, group, consumer, 60000, id
      );
    }
  }
}

If an entry has been delivered three times and still hasn't been acknowledged, it gets moved to a dead letter stream. We monitor that stream and investigate anything that lands there. In practice, it's usually a downstream API that was temporarily down.

If an entry has been pending for more than 60 seconds but hasn't exceeded the retry limit, we claim it and try again. This handles the case where a consumer crashes mid-processing.
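
Putting the pieces together, the worker process is just the consumer loop plus a timer that runs the claimer. This is a simplified sketch of one way to wire it up; the group name, sweep interval, and sendNotifications stub are placeholders:

const STREAM = 'orders:events';
const GROUP = 'notifications';
const CONSUMER = `worker-${process.pid}`;

// Placeholder: route the event to the email, in-app, and Slack senders.
async function sendNotifications(fields: Record<string, string>) {
  console.log('Would send notifications for', fields);
}

async function startWorker() {
  await ensureConsumerGroup(STREAM, GROUP);

  // Periodically sweep the pending list for stuck or exhausted entries.
  setInterval(() => {
    claimPendingEntries(STREAM, GROUP, CONSUMER).catch((err) =>
      console.error('Claim sweep failed:', err)
    );
  }, 30000);

  // Blocks forever, handling new entries as they arrive.
  await consumeEvents(STREAM, GROUP, CONSUMER, async ({ fields }) => {
    await sendNotifications(fields);
  });
}

startWorker().catch((err) => {
  console.error('Worker crashed:', err);
  process.exit(1);
});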

Keeping Redis from growing forever

Streams grow indefinitely unless you trim them. We use the MAXLEN flag with the approximate operator:

await redis.xadd(
  stream, 'MAXLEN', '~', '10000', '*',
  ...fields
);

The ~ tells Redis "keep approximately 10,000 entries." It allows Redis to trim in bulk rather than entry-by-entry, which is significantly more efficient. For this product, 10,000 entries was about two days of events, which was more than enough for debugging.
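
One way to apply this is to fold the flag into the producer from earlier, roughly:

async function publishEvent(
  stream: string,
  event: Record<string, string>
) {
  return redis.xadd(
    stream,
    'MAXLEN', '~', '10000',
    '*',
    ...Object.entries(event).flat()
  );
}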

The monitoring we added

We track three metrics per stream. Stream length tells us how many unprocessed entries are queued up. Pending count tells us how many entries have been delivered but not acknowledged. Consumer lag tells us how far behind each worker is.

async function getStreamMetrics(
  stream: string,
  group: string
) {
  // XINFO replies are flat [name, value, name, value, ...] arrays,
  // so values are picked out by position.
  const info = (await redis.xinfo('STREAM', stream)) as any[];
  const groupInfo = (await redis.xinfo('GROUPS', stream)) as any[];

  return {
    length: info[1],
    groups: groupInfo.map((g: any) => ({
      name: g[1],
      consumers: g[3],
      pending: g[5],
      lastDeliveredId: g[7],
    })),
  };
}

We pipe these into their existing monitoring dashboard. If stream length grows faster than consumers can process, or if pending count stays elevated for more than a few minutes, an alert fires.
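
As a rough illustration, the check itself can be as simple as comparing those numbers against thresholds. The thresholds and the fireAlert function here are hypothetical; in practice this feeds whatever alerting you already have:

// Hypothetical thresholds; tune them to your own traffic.
const MAX_BACKLOG = 5000;
const MAX_PENDING = 100;

// Placeholder: wire into the existing alerting pipeline.
function fireAlert(message: string) {
  console.warn(message);
}

async function checkStreamHealth(stream: string, group: string) {
  const metrics = await getStreamMetrics(stream, group);

  if (metrics.length > MAX_BACKLOG) {
    fireAlert(`${stream}: backlog of ${metrics.length} entries`);
  }

  for (const g of metrics.groups) {
    if (g.name === group && g.pending > MAX_PENDING) {
      fireAlert(`${stream}/${group}: ${g.pending} unacknowledged entries`);
    }
  }
}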

The results

The whole thing took about two weeks to build and deploy. No new infrastructure. The Redis instance they were already paying for had plenty of headroom.

API response times dropped because order creation was no longer waiting for email delivery. The notification system handled failure gracefully because one slow downstream service no longer cascaded into user-facing errors. And the team could reason about the system because it was Redis, which everyone already understood.

Eight months later, they're processing about 8,000 events per day on the same Redis instance without any changes. Memory usage went from 600MB to about 900MB. The dead letter stream catches maybe two or three entries per week, usually when the email provider has a momentary hiccup.

When you should actually use Kafka

Redis Streams aren't a Kafka replacement for every situation. If you need long-term event storage for replay and audit, Kafka retains data more efficiently over time. If you need exactly-once delivery semantics (and your consumers aren't idempotent), Kafka has better support for that. And if you're consistently processing above 50,000 events per second, Kafka's distributed architecture scales more naturally than Redis Streams on a single instance.

But most SaaS products aren't anywhere near those thresholds. We've run Redis Streams in production handling 15,000 events per second on a single instance without issues. If you're under that, and you already have Redis running, you probably don't need another piece of infrastructure in your stack.

Start here. Upgrade when the data tells you to.


We design and ship backend architectures for production SaaS products. If you're thinking about event-driven patterns and wondering whether you actually need Kafka, let's talk.


Topics: Redis, Event-Driven, Architecture, Streams, Backend
