Introduction
Perhaps I should introduce myself, mmmmh?
I’m Jeff Nadler, and I have been developing software for companies in the Pacific Northwest for 20 years now. In the early days of my career I built applications on traditional relational databases like Oracle and Sybase, programming in C, C++, and Java, along with some data warehousing projects using ROLAP star schemas and MOLAP tools.
For the past 7 years or so I’ve been focused on scalable distributed systems based on NoSQL data stores and “big data” systems. I wish there were a better term than “big data” that adequately communicated the concept, but it’s reasonably easy to define: Big Data is any dataset that is far too large to process on a single server (or cloud instance).
As a result, you will need a distributed system to process the data, and that brings some new challenges but also the opportunity to architect a system that can grow with your business by adding servers (‘horizontal scaling’).
Lately I’ve been working on projects that emphasize streaming big data or “fast data”, where a nonstop stream of data arrives on a scalable message queue like Kafka or AWS’s Kinesis. From there the data is processed by applications that run on Storm, Spark, Flink, or Google Cloud Dataflow, and finally the data is persisted to a suitable data store.
Here’s what the architecture looks like in general:

    data sources → message queue (Kafka / Kinesis) → stream processor (Storm / Spark / Flink / Dataflow) → data store
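To make the pattern concrete, here’s a toy sketch of that produce → process → persist flow using only Python’s standard library. This is purely illustrative: the in-memory queue stands in for Kafka or Kinesis, the worker thread for a Storm/Spark/Flink job, and the dict for the data store. None of the names below are real Kafka or Spark APIs.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream in this toy example


def run_pipeline(events):
    stream = queue.Queue()  # stands in for the scalable message queue
    store = {}              # stands in for the persistent data store

    def process():
        # "Stream processor": count events per key, persisting the counts.
        while True:
            event = stream.get()
            if event is SENTINEL:
                break
            key = event["key"]
            store[key] = store.get(key, 0) + 1

    worker = threading.Thread(target=process)
    worker.start()

    # "Producers" publishing events onto the queue.
    for event in events:
        stream.put(event)
    stream.put(SENTINEL)

    worker.join()
    return store


counts = run_pipeline([{"key": "a"}, {"key": "b"}, {"key": "a"}])
print(counts)  # {'a': 2, 'b': 1}
```

In a real system each box in the diagram is its own horizontally scalable tier, and the queue decouples producers from processors so each can be scaled independently.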
Of course, even with the right architecture for the project, there are still lots of problems to solve. And lots of posts to come.