The previous post in the series:
Apache Beam — From Zero to Hero Pt. 1: Batch Pipelines
In this post we’re going to implement a Streaming Pipeline while covering the rest of Apache Beam’s basic concepts. Let’s begin by explaining what a streaming pipeline is and how it differs from a batch pipeline.
The basic difference between a Batch and a Streaming pipeline is that a batch pipeline runs until it has processed all of its input data (the amount of data is finite), while a streaming pipeline runs forever (until manually stopped, of course), since its input data is unbounded.
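The bounded-versus-unbounded distinction can be illustrated with plain Python, before any Beam code. This is just a sketch with hypothetical source functions, not Beam's API: a batch source is a finite collection the consumer exhausts, while a streaming source is an endless generator the consumer reads from until stopped externally.

```python
import itertools

def bounded_source():
    # Batch input: a finite collection; the pipeline
    # terminates once it has been fully consumed.
    return [1, 2, 3]

def unbounded_source():
    # Streaming input: an endless generator. A real streaming
    # pipeline would consume such a source until it is stopped.
    n = 0
    while True:
        yield n
        n += 1

# A batch "pipeline" finishes on its own:
batch_result = [x * 2 for x in bounded_source()]  # [2, 4, 6]

# A streaming consumer never finishes; here we cut it off
# after five elements purely for illustration:
stream_sample = list(itertools.islice(unbounded_source(), 5))  # [0, 1, 2, 3, 4]
```

In Beam's terms, the first source would become a bounded PCollection and the second an unbounded one; the rest of the post builds a real pipeline over the unbounded kind.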