The Lambda Architecture

In big-data circles, the Lambda Architecture has become a well-established practice. In short, when faced with a data stream, the Lambda Architecture prescribes that we process the data along two parallel paths:

  • A “store the data and batch process it later” path – the slow lane
  • A near-realtime processing path – the fast lane

It’s usually represented as a diagram in which the incoming stream forks into the two lanes, whose outputs are later merged when the data is queried.
The driving idea here is that your full analytics processing is probably too heavy to run in near-realtime, so you implement a lightweight, simplified set of processing in the fast lane, while your full application logic goes in the slow lane.
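The two-lane split can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; all names here are hypothetical.

```python
class LambdaPipeline:
    """Minimal sketch of the two Lambda lanes (illustrative names)."""

    def __init__(self):
        self.master_dataset = []   # slow lane: immutable store, batch-processed later
        self.realtime_count = 0    # fast lane: cheap incremental view

    def ingest(self, event):
        # Every incoming event takes both paths in parallel.
        self.master_dataset.append(event)   # slow lane: store for later batch work
        self.realtime_count += 1            # fast lane: lightweight approximation

    def batch_recompute(self):
        # Slow lane: the full (possibly expensive) logic, run over the whole dataset.
        return sum(1 for e in self.master_dataset if e.get("valid", True))

    def query(self):
        # A serving layer would merge the batch view with the realtime view;
        # this sketch just returns the fast-lane estimate.
        return self.realtime_count
```

Note that the fast lane answers immediately while the slow lane only catches up when `batch_recompute` runs; that lag is exactly what the fast lane exists to paper over.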

There are certainly some cases where this is necessary. For example, some Machine Learning algorithms are amenable to online learning, where the model is updated incrementally as new data arrives, but many are not. In those cases, where the model must be rebuilt in batch against the whole dataset, a Lambda-style architecture is unavoidable.
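The online/batch distinction can be made concrete with the simplest possible "model," a running mean. The incremental update touches only the newest point, while the batch version must revisit the entire dataset; this is a toy illustration, not a real learning algorithm.

```python
class OnlineMean:
    """Online learning in miniature: the estimate is updated per arrival."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        # Incremental mean update; old data never needs to be revisited.
        self.n += 1
        self.mean += (x - self.mean) / self.n


def batch_mean(data):
    """Batch rebuild: requires the whole dataset on every recompute."""
    return sum(data) / len(data)
```

An algorithm whose update rule cannot be written in this incremental form is the kind that forces a Lambda-style slow lane.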

What’s flawed about all this, at least from my POV, is the presumption that a slow lane is necessary. Streaming data engines have gotten much more capable over the years. Starting with Apache Storm, CEP-style stream processing became much more practical. Then, with Spark Streaming, it became possible to share code between batch and streaming processing, which only added fuel to the Lambda fire.
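The code-sharing idea can be shown without any particular engine: write the business logic once as a per-record function, then drive it from either a batch collection or a streaming iterator. This is a hedged sketch of the pattern, with illustrative names rather than the Spark API.

```python
def enrich(record):
    # Shared business logic: identical for the batch and streaming paths.
    return {**record, "severity": "high" if record["value"] > 10 else "low"}

def process_batch(records):
    # Batch path: materialize the whole result at once.
    return [enrich(r) for r in records]

def process_stream(record_iter):
    # Streaming path: yield enriched records one at a time as they arrive.
    for r in record_iter:
        yield enrich(r)
```

Because both paths call the same `enrich`, the two lanes cannot drift apart in behavior, which removes one of the classic maintenance costs of Lambda.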

In practice I like an attitude I’d call Streaming-First. First, to the extent that it’s practical, implement all of your data processing in the fast lane; I believe that more can be achieved on a streaming basis than people tend to think.

Second, it should almost never be necessary to maintain both a “light” and a “complete” version of the same processing, as prescribed in some versions of the Lambda story. Instead, try this: process, enrich, and store the data in near-realtime, and if some work is prohibitively slow to do while streaming, move only that part into workflow-triggered batch jobs that follow behind and operate on the already-enriched data. This reduces complexity and prepares you well for a future where pretty much everything will be done on a streaming basis.
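The streaming-first arrangement can be sketched as two small functions: the stream does all the enrichment it can afford and persists the result, and a follow-behind batch job later consumes only those already-enriched records. Function names and the toy "enrichment" are assumptions for illustration.

```python
def stream_enrich(event, store):
    # Fast lane: do the affordable enrichment in near-realtime and persist it.
    event["enriched"] = event["value"] * 2   # stand-in for cheap enrichment logic
    store.append(event)
    return event

def follow_behind_batch(store):
    # Triggered later by a workflow; operates only on already-enriched records,
    # doing just the work that was too slow for the stream (here, a full aggregate).
    return sum(e["enriched"] for e in store)
```

The key difference from classic Lambda is that the batch job does not re-implement the streaming logic; it starts from the stream's output and adds only what the stream could not do.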