Scalable Anomaly Detection (with Zero Machine Learning)

I gave another talk at Strange Loop this year. It was about our real-time anomaly detection system, called Raju. Here is the abstract:

In a large-scale distributed system, detecting and pinpointing failures gets exponentially harder as the architecture gets more complex. Netflix’s cloud architecture is composed of thousands of services and hundreds of thousands of VMs and containers. Failures can happen at any level and can often cascade quickly. Some cause massive outages across several systems, while others break only one or two. This creates a needle-in-a-haystack problem that requires automated and precise detection. Zuul, as the front door for all of Netflix’s cloud traffic, sees all requests and responses and is ideally positioned to identify and isolate only the broken paths in the maze of microservices.

We leveraged Zuul to stream real-time events for each request-response and built an anomaly detector to automatically identify and alert services in trouble. We scaled this detector to thousands of nodes, handling millions of requests, without a single line of machine learning. Sometimes you need machine learning and sometimes you don’t. Although it’s en vogue to apply machine learning to every problem, it can be more practical and approachable to solve certain problems with old-fashioned math!

In this talk, we’ll discuss how we built this system with stream processing, anomaly detection algorithms, and a rules engine. We will also deep-dive into the anomaly detection algorithm and show how sometimes a simple, elegant algorithm can be just as good as any sophisticated machine learning.

Open Sourcing Zuul 2

Today we officially announced the release of Zuul 2. It’s an exciting day for the team, and the linked blog post highlights some of the other work we’ve been doing.

Here’s a summary of the major features included in the open source release:

Today we are releasing many core features. Here are the ones we’re most excited about:

Server Protocols

  • HTTP/2 — full server support for inbound HTTP/2 connections
  • Mutual TLS — allow for running Zuul in more secure scenarios

Resiliency Features

  • Adaptive Retries — the core retry logic that we use at Netflix to increase our resiliency and availability
  • Origin Concurrency Protection — configurable concurrency limits to protect your origins from getting overloaded and protect other origins behind Zuul from each other

Operational Features

  • Request Passport — track all the lifecycle events for each request, which is invaluable for debugging async requests
  • Status Categories — an enumeration of possible success and failure states for requests that are more granular than HTTP status codes
  • Request Attempts — track proxy attempts and status of each, particularly useful for debugging retries and routing

You can find instructions on getting started on the GitHub wiki. We have a slew of features that we’re working on and will release shortly, so stay tuned.
