
My previous posts on Apache Beam:

Apache Beam Batch Pipelines
Apache Beam Streaming Pipelines

While those two posts are written using the Python SDK for brevity, at my company we actually use the Java SDK for performance reasons: in checks I’ve made, the Java SDK tends to be 2–3 times faster than the Python one for our use case.

The Initial Problem — Sharded BQ Tables

Recently I came across an interesting use case for our streaming pipelines. Let’s describe the need first:



In my previous post I wrote about how we can implement polymorphism in Clojure with multimethods. I showed that it’s such a powerful tool that we can do a lot more with it than just model a class hierarchy.

However, we shouldn’t use a sledgehammer to crack a nut. In other words, while we can implement traditional polymorphism (single-type-based dispatch) with multimethods, it doesn’t mean we should. Clojure has a simpler (and more performant; see here) way: Protocols.

The Basics

Protocols replace what we know in OOP languages as interfaces. …



Who says you have to be OOP in order to have polymorphism? In this post I am going to explore how Clojure’s multimethods provide much richer polymorphic capabilities than your standard OOP languages.

The Basics

A Clojure multimethod consists of a dispatching function (defined with the defmulti macro) and one or more methods (defined with the defmethod macro).

Let’s look at the following example:

(defmulti make-sound (fn [x] (:type x))) ;; dispatch on the value of :type
(defmethod make-sound "Dog" [x] "Woof Woof")
(defmethod make-sound "Cat" [x] "Miauuu")

(make-sound {:type "Dog"}) ;; => "Woof Woof"
(make-sound {:type "Cat"}) ;; => "Miauuu"



The previous post in the series:
Apache Beam — From Zero to Hero Pt. 1: Batch Pipelines

In this post we’re going to implement a streaming pipeline while covering the rest of Apache Beam’s basic concepts. Let’s begin by explaining what a streaming pipeline is and how it differs from a batch pipeline.

Introduction

The basic difference between a batch and a streaming pipeline is that a batch pipeline runs until it has finished processing its input data (the amount of data is finite), while a streaming pipeline runs forever (until manually stopped, of course), since its input data is infinite.
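To make the distinction concrete, here’s a minimal sketch in the Python SDK; the bucket path and Pub/Sub topic below are hypothetical placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Batch: a bounded source - the pipeline finishes once the file is fully read.
with beam.Pipeline() as p:
    lines = p | beam.io.ReadFromText("gs://my-bucket/events.txt")

# Streaming: an unbounded source - the pipeline keeps running as events arrive.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    messages = p | beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")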

We…



In the previous post I talked about how we can use Python generators to create simple data pipelines. In this post I am going to introduce Apache Beam, a framework for building production-grade data pipelines, and build a batch pipeline with it while explaining some of its main concepts.

Introduction

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.

With Apache Beam we can implement complex and scalable data pipelines in a concise, clear manner. Apache Beam has three SDKs (Python, Java and Go); we’re going to use Python here. Pipelines…
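As a taste of that conciseness, here’s a minimal sketch of a batch pipeline using the Python SDK (the input values are arbitrary):

import apache_beam as beam

# A complete batch pipeline: create a bounded collection,
# transform each element, and print the results.
with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create([1, 2, 3, 4, 5])
     | "Square" >> beam.Map(lambda x: x * x)
     | "Print" >> beam.Map(print))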


In this post you’ll learn how we can use Python’s generators to create data streaming pipelines. For production-grade pipelines we’d probably use a suitable framework like Apache Beam, but generators are still needed to build your own custom Apache Beam components.

The Problem

Let’s look at the following requirements for a data pipeline:

Write a small framework that processes an endless event stream of integers. The assumption is that processing is static and sequential: each processing unit passes its output to the next processing unit, unless defined differently (e.g. filter, fixed-event-window). …
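To give a flavour of what such a framework can look like, here’s a minimal sketch built on generators; the unit names (filter_even, fixed_event_window) are hypothetical, not the post’s actual implementation:

import itertools

def integers():
    # an endless source of integer events
    yield from itertools.count(1)

def filter_even(stream):
    # a processing unit that only passes even numbers downstream
    return (x for x in stream if x % 2 == 0)

def fixed_event_window(stream, size):
    # a processing unit that groups events into fixed-size windows
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield tuple(window)
            window = []

# Each processing unit consumes the output of the previous one.
pipeline = fixed_event_window(filter_even(integers()), size=3)
print(next(pipeline))  # (2, 4, 6)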

Ilan Uzan
