Image for post
Image for post
Photo by Emile Guillemot on Unsplash

In the previous post I talked about how we can use Python Generators to create simple data pipelines. In this post I am going to introduce the Apache Beam framework for building production grade data pipelines, and build a batch pipeline with it while explaining some of its main concepts.

Introduction

With Apache Beam we can implement complex and scalable data pipelines in a concise, clear manner. Apache Beam has 3 SDKs — Python, Java & Go (we’re going to use Python here). Pipelines written with the framework can run on different platforms via different runners — some of the available runners are Spark, Flink and Google Cloud Dataflow. …


In this post you’ll learn how we can use Python’s Generators feature to create data streaming pipelines. For production grade pipelines we’d probably use a suitable framework like Apache Beam, but this feature is needed to build Apache Beam’s custom components of your own.

The Problem

Let’s look at the following demands for a data pipeline:

Write a small framework that processes an endless event stream of integers. The assumption is that processing will be static and sequential, where each processing unit pass the output to the next processing unit, unless defined differently (e.g. filter, fixed-event-window). …

Ilan Uzan

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store