Why We Use Apache Beam For Our Systematic Trading Data Pipeline

Posted on Jun 03, 2020 by Ajet Luka

In the world of Big Data, there are lots of tools and technologies to choose from. Choosing the “right” one depends on the things that you are building and the problems you are trying to solve.

Trading firms have skilled teams that build, deploy and monitor data pipelines for the organisation and absorb the technical overhead that comes with them. These firms invest in data infrastructure and research because data is at the centre of what they do.

Data pipelines need to be robust and meet the technical requirements set by the organisation, and they also need to be cost-efficient.

These are challenges that a solo systematic trader can have a hard time tackling, especially given that solo traders also need to spend time on other parts of their trading business.

So for a systematic trader, it is very important to choose a technology that makes pipelines easy to deploy and manage and that is also cost-efficient.

Enter Apache Beam…

Apache Beam is a unified programming model for batch and streaming data processing jobs. It supports many runners, including Apache Spark, Apache Flink and Google Cloud Dataflow; the Beam documentation lists all of them.

You can define your pipelines in Java, Python or Go. At the time of writing, the Java SDK is the most mature, with support for more database connections. The Python SDK is being developed rapidly and comes a close second, while the Go SDK is still in the early stages of development.
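Switching runners is mostly a matter of pipeline options rather than code changes. Here is a minimal sketch, assuming the Python SDK. The project, region and bucket values are placeholders rather than our real settings, and `runner_args` and `build_options` are helper names made up for this example.

```python
def runner_args(runner="DirectRunner", project=None, region=None, temp_location=None):
    """Build command-line style pipeline arguments for the chosen runner."""
    args = [f"--runner={runner}"]
    if runner == "DataflowRunner":
        # Dataflow needs to know where to run and where to stage temporary files.
        args += [
            f"--project={project}",
            f"--region={region}",
            f"--temp_location={temp_location}",
        ]
    return args


def build_options(args):
    """Wrap the argument list in a Beam PipelineOptions object."""
    # Imported lazily so runner_args can be used without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(args)


# Local development: the pipeline runs in-process on the DirectRunner.
local_args = runner_args()

# Production: the same pipeline code, submitted to Dataflow (placeholder values).
dataflow_args = runner_args(
    "DataflowRunner",
    project="my-gcp-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
)
```

Passing `build_options(local_args)` or `build_options(dataflow_args)` to `beam.Pipeline(options=...)` is then the only difference between a local test run and a Dataflow job.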

In the Robot Wealth batch data pipeline, we rely on Beam (on the Google Dataflow runner) for:

  • Downloading data from various APIs
  • Loading it to Google Cloud Storage
  • Transforming and enriching the data
  • Calculating features
  • Loading it to BigQuery
  • Data integrity checks

This gives us a scalable data pipeline that is also cost-efficient, because on Dataflow you pay only for the worker resources a job uses while it is running.
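The "transforming and enriching" and "calculating features" steps are typically plain Python functions applied with `beam.Map`. The sketch below is illustrative only: the two features (intraday range and open-to-close return) and the function name `add_features` are assumptions for the example, not the features we compute in production.

```python
def add_features(row):
    """Return a copy of an OHLC row with two illustrative features added."""
    open_, high, low, close = (float(row[k]) for k in ("open", "high", "low", "close"))
    return {
        **row,
        "range": high - low,               # intraday high-low range
        "oc_return": close / open_ - 1.0,  # open-to-close return
    }


def run(rows):
    """Apply add_features to every row in a small local pipeline."""
    # Imported lazily so add_features can be tested without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "Create rows" >> beam.Create(rows)
            | "Add features" >> beam.Map(add_features)
            | "Print" >> beam.Map(print)
        )
```

Calling `run([{"ticker": "RW", "open": 10, "high": 11, "low": 8, "close": 11}])` executes the step locally on the default DirectRunner. Because `add_features` is an ordinary function, it can be unit-tested without a pipeline.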

Here is how our batch data pipeline currently looks:

The great thing about running Beam on Google Cloud is how seamlessly everything works together. In fact, every connection between the technologies used in the pipeline has native support in Beam.
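As an illustration of that native support, reading CSV files from Cloud Storage and appending rows to BigQuery can be done with Beam's built-in connectors. This is a sketch under assumptions, not our production code: the column order, the schema and the `parse_csv_line` helper are hypothetical.

```python
import csv

COLUMNS = ("ticker", "open", "high", "low", "close")  # assumed CSV column order
SCHEMA = "ticker:STRING,open:FLOAT,high:FLOAT,low:FLOAT,close:FLOAT"


def parse_csv_line(line):
    """Turn one CSV line into a dict keyed by column name, with float prices."""
    values = next(csv.reader([line]))
    row = dict(zip(COLUMNS, values))
    for key in COLUMNS[1:]:
        row[key] = float(row[key])
    return row


def run(input_pattern, table, pipeline_args=None):
    """Read CSV files matching input_pattern and append the rows to a BigQuery table."""
    # Imported lazily so parse_csv_line can be tested without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            | "Read from GCS" >> beam.io.ReadFromText(input_pattern, skip_header_lines=1)
            | "Parse CSV" >> beam.Map(parse_csv_line)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                table,
                schema=SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

A call would look something like `run("gs://my-bucket/ohlc/*.csv", "my-project:my_dataset.ohlc", pipeline_args)`, where the bucket and table names are placeholders and `pipeline_args` carries the usual runner, project and temp location settings.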

The most appealing features that make Beam the right choice for our data pipeline are:

  • Autoscaling
  • GCP integration
  • Easy-to-maintain codebase

Apache Beam in Action in a Trading Workflow

Let’s take a look at Beam in action.

In the following code block, we will be defining a data integrity check pipeline that logs an error if OHLC data is faulty.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import logging

def price_integrity(row):
    """Check the integrity of an OHLC row.

    High must be greater than or equal to every other price in the row,
    and low must be less than or equal to every other price.

    Arguments:
        row {dict} -- A dict with keys being the column names and values being the column values

    Returns:
        dict -- A copy of the row with an 'error' key: True if the row has errors, False if the row is good
    """
    # Convert and round each price once (3 decimal places) instead of repeating it per comparison
    open_, high, low, close = (round(float(row[k]), 3) for k in ('open', 'high', 'low', 'close'))

    # The low (high) price must be the lowest (highest) in the row, or equal to it
    low_error = low > min(open_, high, close)
    high_error = high < max(open_, low, close)
    error = low_error or high_error
    if error:
        logging.error('Integrity Check error uploading to datastore ....')

    # Return a new dict rather than mutating the input element, as Beam expects
    return {**row, 'error': error}


def run():
    options = PipelineOptions()

    # The context manager runs the pipeline on exit and waits for it to finish
    with beam.Pipeline(options=options) as p:
        (
            p
            | 'Mimic BigQuery Read' >> beam.Create([{'ticker':'RW','open':10,'high':11,'low':8,'close':10},
                                              {'ticker':'RW','open':10,'high':11,'low':8,'close':11},
                                              {'ticker':'RW','open':11,'high':11,'low':8,'close':9},
                                              {'ticker':'RW','open':10,'high':12,'low':8,'close':10},
                                              {'ticker':'RW','open':10,'high':11,'low':8,'close':17}])
            | 'Row Integrity Check' >> beam.Map(price_integrity)
            | 'Print Results' >> beam.Map(print)
        )


if __name__ == '__main__':
    run()

This is only a small part of our OHLC integrity check. In the production pipeline, we upload the exceptions to Google Datastore and run several other kinds of integrity checks.

Using this made-up dataset, we should see no errors on any row except the last one, where the close is higher than the high. That row fails the integrity check and its error flag is set to True.
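One way to route failing rows to a separate destination, such as an exceptions store, is to split the checked rows with `beam.Partition`. This is a minimal sketch, assuming the rows already carry the `error` flag set by `price_integrity`. The helper `by_error_flag` is hypothetical, and this is not necessarily how our production pipeline does it.

```python
def by_error_flag(row, num_partitions):
    """Partition function: index 0 for good rows, 1 for rows flagged as errors."""
    return 1 if row["error"] else 0


def run(checked_rows):
    """Split already-checked rows into good and bad collections and print each."""
    # Imported lazily so by_error_flag can be tested without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as p:
        rows = p | "Checked rows" >> beam.Create(checked_rows)
        good, bad = rows | "Split by flag" >> beam.Partition(by_error_flag, 2)
        good | "Print good" >> beam.Map(lambda r: print("OK   ", r))
        bad | "Print bad" >> beam.Map(lambda r: print("ERROR", r))
```

In a real pipeline, the two `print` steps would be replaced by whatever sinks you need, for example a BigQuery write for the good rows and an exceptions upload for the bad ones.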

When running this in production, the Dataflow runner will optimize your pipeline and scale it across many workers.

Using Apache Beam, we have no issues scaling this to run on hundreds of millions of rows.
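On Dataflow, scaling behaviour is set through pipeline options rather than in the pipeline code. This is a hedged sketch: `--autoscaling_algorithm` and `--max_num_workers` are Dataflow worker options, while the cap of 20 workers and the helper names are assumptions for illustration, not our production values.

```python
def autoscaling_args(max_workers=20):
    """Pipeline arguments that enable throughput-based autoscaling with a worker cap."""
    return [
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        f"--max_num_workers={max_workers}",
    ]


def build_options(base_args, max_workers=20):
    """Combine existing pipeline arguments with the autoscaling settings."""
    # Imported lazily so autoscaling_args can be used without Beam installed.
    from apache_beam.options.pipeline_options import PipelineOptions

    return PipelineOptions(list(base_args) + autoscaling_args(max_workers))
```

Capping the worker count is a simple way to bound cost while still letting Dataflow add workers when a job has a lot of rows to process.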

Its programming syntax is also intuitive and easy to learn. This allows us to experiment rapidly with our pipeline designs and try out new things.

Summary

In summary, Apache Beam is a great way to run your batch and streaming workflows, with an easy-to-read API, scalable infrastructure and painless integration with the GCP suite of tools.


(2) Comments

June 5, 2020 at 5:35 am

In a prior article you mentioned using Apache Airflow for what appears to be a similar process.
When would you choose one versus the other?

Ajet Luka
June 5, 2020 at 5:51 am

Great question, Ron!

In our experience, you would want to use Airflow for very light jobs, such as scheduling and orchestration of workflows across GCP technologies.

You can see that Google Composer (Airflow) is the first step in our data pipeline. It manages all of the reporting and error handling, and has predefined steps for what to do when something fails (e.g. retry the job).

So Composer can take care of whatever light jobs you might have. However, when you start dealing with millions of rows, it gets very expensive to run things in Composer. Dataflow complements it very well because of its autoscaling and “pay for what you use” pricing.
