Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization

This is the fourth in a multi-part series in which we explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important.

Part 2 provides a walk-through of setting up Keras and Tensorflow for R using either the default CPU-based configuration, or the more complex and involved (but well worth it) GPU-based configuration under the Windows environment.

Part 3 is an introduction to the model building, training and evaluation process in Keras. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.

Click here to download all the code and data used in this post, as well as the Zorro script that generated the data.

In the last post, we trained a densely connected feed forward neural network to forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. We landed on a model that predicted slightly better than random on out of sample data. We also saw in our learning plots that our network started to overfit badly at around 40 epochs. In this post, I’m going to demonstrate some tools to help fight overfitting and push your models further. Let’s get started.

Regularization

Regularization is a commonly used technique for mitigating overfitting of machine learning models, and it can also be applied to deep learning. Regularization essentially constrains the complexity of a network by penalizing larger weights during the training process; that is, it adds a term to the loss function that grows as the weights grow.

Keras implements two common types of regularization:  

  • L1, where the additional cost is proportional to the absolute value of the weight coefficients
  • L2, where the additional cost is proportional to the square of the weight coefficients

These are incredibly easy to implement in Keras: simply pass regularizer_l1(regularization_factor)  or regularizer_l2(regularization_factor)  to the kernel_regularizer  argument in a Keras layer instance (details on how to do this below), where regularization_factor * abs(weight_coefficient)  or regularization_factor * weight_coefficient^2  respectively is added to the total loss, depending on the type of regularization chosen.

Note that in Keras speak, 'kernel'  refers to the weights matrix created by a layer. Regularization can also be applied to the bias terms via the argument bias_regularizer  and the output of a layer by activity_regularizer .

Getting smarter with our learning rate

When we add regularization to a network, we might find that we need to train it for more epochs in order to reach convergence. This implies that the network might benefit from a higher learning rate during early stages of model training.

However, we also know that a network can sometimes benefit from a smaller learning rate at later stages of the training process. Think of the model’s loss as being stuck partway down a valley in the loss surface, bouncing from one side of the valley to the other with each weight update. By reducing the learning rate, we make subsequent weight updates less dramatic, which enables the loss to ‘fall’ further down towards the minimum.

By using another Keras callback, we can automatically adjust our learning rate downwards when training reaches a plateau:
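Something like this sketch should do the trick – the monitor, patience and epsilon  values here are illustrative rather than prescriptive, and note that newer versions of Keras rename epsilon  to min_delta :

    reduce_lr <- callback_reduce_lr_on_plateau(
      monitor = 'val_acc',   # watch validation accuracy
      factor = 0.9,          # multiply the learning rate by 0.9 on a plateau
      patience = 5,          # epochs without improvement before reducing (illustrative)
      epsilon = 0.001        # threshold for measuring a new optimum (illustrative)
    )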

This tells Keras to reduce the learning rate by a factor of 0.9 whenever validation accuracy doesn’t improve for patience  epochs. Also note the epsilon  parameter, which controls the threshold for measuring the new optimum. Setting this to a higher value results in fewer changes to the learning rate. This parameter should be on a scale that is relevant to the metric being tracked, validation accuracy in this case.

Putting it together

Here’s the code for an L2 regularized feed forward network with both  reduce_lr_on_plateau and model_checkpoint callbacks (data import and processing is the same as in the previous post):
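A minimal sketch of that setup, reusing the three-hidden-layer, 150-unit architecture from the last post – the regularization factor, starting learning rate, batch size and epoch count are all assumptions to tune for yourself:

    model <- keras_model_sequential() %>%
      layer_dense(units = 150, activation = 'relu',
                  kernel_regularizer = regularizer_l2(0.001),   # factor assumed
                  input_shape = ncol(x_train)) %>%
      layer_dense(units = 150, activation = 'relu',
                  kernel_regularizer = regularizer_l2(0.001)) %>%
      layer_dense(units = 150, activation = 'relu',
                  kernel_regularizer = regularizer_l2(0.001)) %>%
      layer_dense(units = 1, activation = 'sigmoid')

    model %>% compile(
      loss = 'binary_crossentropy',
      optimizer = optimizer_rmsprop(lr = 0.001),   # higher starting rate (assumed)
      metrics = 'accuracy'
    )

    history <- model %>% fit(
      x_train, y_train,
      epochs = 150, batch_size = 1024,   # both assumed
      validation_data = list(x_val, y_val),
      callbacks = list(
        reduce_lr,   # the reduce-LR-on-plateau callback from above
        callback_model_checkpoint(filepath = 'best_model.h5',
                                  monitor = 'val_acc', save_best_only = TRUE)
      )
    )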

Plotting the training curves now gives us three plots – loss, accuracy and learning rate:

This particular training process resulted in an out of sample accuracy of 53.4%, slightly better than our original unregularized model. You can experiment with more or less regularization, as well as applying regularization to the bias terms and layer outputs.

Dropout

Dropout is another commonly used tool to fight overfitting. Whereas regularization is used throughout the machine learning ecosystem, dropout is specific to neural networks. Dropout is the random zeroing (“dropping out”) of some proportion of a layer’s outputs during training. The theory is that this helps prevent pairs or groups of nodes from learning random relationships that just happen to reduce the network loss on the training set (that is, result in overfitting). Hinton and his colleagues, the discoverers of dropout, showed that it is generally superior to other forms of regularization and improves model performance on a variety of tasks. Read the original paper here.

Dropout is implemented in Keras as its own layer, layer_dropout() , which applies dropout to its inputs (that is, to the outputs of the previous layer in the stack). We need to supply the fraction of outputs to drop out, which we pass via the rate  parameter. In practice, dropout rates between 0.2 and 0.5 are common, but the optimal values for a particular problem and network configuration need to be determined through appropriate cross validation.

At the risk of getting ahead of ourselves, when applying dropout to recurrent architectures (which we’ll explore in a future post), we need to apply the same pattern of dropout at every timestep, otherwise dropout tends to hinder performance rather than enhance it.

Here’s an example of how we build a feed forward network with dropout in Keras:
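A sketch using the same architecture as before, with a dropout layer after each hidden layer (the rate of 0.3 is an assumption to be tuned via cross validation, as discussed above):

    model <- keras_model_sequential() %>%
      layer_dense(units = 150, activation = 'relu', input_shape = ncol(x_train)) %>%
      layer_dropout(rate = 0.3) %>%   # drop 30% of this layer's outputs (assumed rate)
      layer_dense(units = 150, activation = 'relu') %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 150, activation = 'relu') %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 1, activation = 'sigmoid')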

Training the model using the same procedure as we used in the L2-regularized model above, including the reduce learning rate callback, we get the following training curves:

One of the reasons dropout is so useful is that it enables the training of larger networks by reducing their propensity to overfit. Here are the training curves for a similar model, but this time eight layers deep:

Notice that it doesn’t overfit significantly worse than the shallower model. Also notice that it didn’t really learn any new, independent relationships from the data – this is evidenced by the failure to beat the previous model’s validation accuracy. Perhaps 53% is the upper out of sample accuracy limit for this data set and this approach to modeling it.

With dropout, you can also afford to use a larger learning rate. That makes it a good idea to kick off training with a higher learning rate and make use of the reduce_lr_on_plateau  callback, which can decay the rate as learning stalls.

Finally, one important consideration when using dropout is constraining the size of the network weights, particularly when a large learning rate is used early in training. In the Hinton et al. paper linked above, constraining the weights was shown to improve performance in the presence of dropout.

Keras makes that easy thanks to the kernel_constraint  parameter of layer_dense() :
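For example, here’s a sketch using the max-norm constraint (the max_value  of 3 is an assumption; the dropout paper reports that values around 3-4 tend to work well):

    model <- keras_model_sequential() %>%
      layer_dense(units = 150, activation = 'relu',
                  kernel_constraint = constraint_maxnorm(3),   # cap each neuron's weight norm
                  input_shape = ncol(x_train)) %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 150, activation = 'relu',
                  kernel_constraint = constraint_maxnorm(3)) %>%
      layer_dropout(rate = 0.3) %>%
      layer_dense(units = 1, activation = 'sigmoid')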

This model provided an ever-so-slight bump in validation accuracy:

And quite a stunning test-set equity curve:

Interestingly, every experiment I performed in writing this post resulted in a positive out of sample equity curve. The results were all slightly different, even when using the same model setup, which reflects the non-deterministic nature of the training process (two identical networks trained on the same data can result in different weights, depending on the initial, pre-training weights of each network). Some equity curves were better than others, but they were all positive.

Here are some examples:

With L2-weight regularization and no dropout:

With a dropout rate of 0.2 applied at each layer, no regularization, and no weight constraints:

Of course, as mentioned in the last post, the edge of these models disappears when we apply retail spreads and broker commissions, but the frictionless equity curves demonstrate that deep learning, even using a simple feed-forward architecture, can extract predictive information from historical price action, at least for this particular data set, and that tools like regularization and dropout can make a difference to the quality of the model’s predictions.

What’s next?

Before we get into advanced model architectures, in the next unit I’ll show you:

  1. One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
  2. How to interrogate and visualize the training process in real time.

Conclusions

This post demonstrated how to fight overfitting with regularization and dropout using Keras’ sequential model paradigm. While we further refined our previously identified slim edge in predicting the EUR/USD exchange rate’s direction, in practical terms, traders with access to retail spreads and commission will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

Experimentation is encouraged! Download the code and data used in this post here.


Where to from here?

  • To find out why AI is taking off in finance, check out these insights from my days as an AI consultant to the finance industry 
  • If this walk-through was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform
  • If the technical details of neural networks are interesting for you, you might like our introductory article 
  • Be sure to check out Part 1, Part 2, and Part 3 of this series on deep learning applications for trading. 
  • If you’re ready to go deeper and get more practical tips and tricks on building robust trading systems, consider becoming a Robot Wealth member

 

Deep Learning for Trading Part 3: Feed Forward Networks

This is the third in a multi-part series in which we explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up Keras and Tensorflow for R using either the default CPU-based configuration, or the more complex and involved (but well worth it) GPU-based configuration under the Windows environment. Read Part 2 here.

Part 3 is an introduction to the model building, training and evaluation process in Keras. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.

Click here to download all the code and data used in this post.

Now that you can train your deep learning models on a GPU, the fun can really start. By the end of this series, we’ll be building interesting and complex models that predict multiple outputs, handle the sequential and temporal aspects of time series data, and even use custom cost functions that are particularly relevant to financial data. But before we get there, we’ll start with the basics.

In this post, we’ll build our first neural network in Keras, train it, and evaluate it. This will enable us to understand the basic building blocks of Keras, which is a prerequisite for building more advanced models.

Problem Formulation

There are numerous possible ways to formulate a market forecasting problem. For the sake of this example, we will forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. That is, our model will attempt to classify the next hour’s market direction as either up or down.

Data

Our data will consist of hourly EUR/USD exchange rate history obtained from FXCM (IMPORTANT: read the caveats and limitations associated with using past market data to predict the future here). Our data covers the period 2010 to 2017.

Features

Our features will simply consist of a number of variables related to price action:

  • Change in hourly closing price
  • Change in hourly highest price
  • Change in hourly lowest price
  • Distance between the hourly high and close
  • Distance between the hourly low and close
  • Distance between the hourly high and low (the hourly range)

We will use several past values of these variables, as well as the current values, to predict the target. We’ll also include the hour of day as a feature in the hope of capturing intraday seasonality effects. 

Feature scaling

Training of neural networks normally proceeds more efficiently if we scale our input features to force them into a similar range. There are various scaling strategies throughout the deep learning literature (see for example Geoffrey Hinton’s Neural Networks for Machine Learning course), but scaling remains something of an art rather than a one-size-fits-all problem.

The standard approach to scaling involves normalizing the entire data set using the mean and standard deviation of each feature in the training set. This prevents data leakage from the test and validation sets into the training set, which can produce overly optimistic results. The problem with this approach for financial data is that it often results in scaled test or validation data that winds up being way outside the range of the training set. This is related to the problem of non-stationarity of financial data and is a significant issue. After all, if a model is asked to predict on data that is very different to its training data, it is unlikely to produce good results.

One way around this is to scale data relative to the recent past. This ensures that the test and validation data is always on the intended scale. But the downside is that we introduce an additional parameter to our model: the amount of data from the recent past that we use in our scaling function. So we end up introducing another problem to solve an existing one.
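To make the idea concrete, here’s a minimal sketch of scaling a single feature relative to the recent past – the window length k  is exactly the extra parameter discussed above, and this is illustrative rather than the exact scaling used in the Zorro script:

    # z-score each observation using the mean and sd of the previous k values only
    rolling_scale <- function(x, k) {
      n <- length(x)
      out <- rep(NA_real_, n)
      for (i in (k + 1):n) {
        window <- x[(i - k):(i - 1)]
        out[i] <- (x[i] - mean(window)) / sd(window)
      }
      out
    }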

Like I said, feature scaling is something of an art form, particularly when dealing with data as poorly behaved as financial data!

We’ll do our model building and experimentation in R, but first we need to generate our data. There is a Zorro script named ‘keras_data_gen.c’ for creating our targets and scaled features, and for exporting that data to a CSV file in this download link. The script will allow you to code your own features and targets, use different scaling strategies, and generate data for different instruments. Just make the changes, then click ‘Train’ on the Zorro GUI to export the data to file. If you’d prefer to just get your hands on the data used in this post, it’s also available via the download link.

Our target is the direction of the market over a period of one hour, which implies a classification problem. The target exported in the script is the actual dollar amount made or lost by going long the market at 0.01 lots, exclusive of trading costs. We need to convert this to a factor reflecting the market’s movement either up or down. More on this below.

Let’s import our data into R and take a closer look. First, here’s a time series plot of the first ten days of our scaled features:

You can see that our features are roughly on the same scale. Notice the first feature, V1, which corresponds to the hour of the day. It has been scaled using a slightly different approach to the other variables to ensure that the cyclical nature of that variable is maintained. See the code in the download link above for details.

Next, here’s a scatterplot matrix of our variables and target (the first ten days of data only):

Now that we’ve got our data, we’ll see if we can extract any predictive information using deep learning techniques. In this post, we’ll look at fully connected feed-forward networks, which are kind of like the ‘Hello World’ example of deep learning. In later posts, we’ll explore some more interesting networks.

Fully Connected Feed Forward Networks

A fully connected feed forward network is one in which every neuron in a particular layer is connected to every neuron in the subsequent layer, and in which information flows in one direction only, from input to output.

Here’s a schematic of such a network with an input layer, two hidden layers and an output layer consisting of a single neuron (source: datasciencecentral.com):

Input data processing

It makes sense that our network would likely benefit from using not only the features for the current time step, but also a number of prior values as well, in order to predict the target. That means that we need to create features out of lagged values of our raw feature variables.

Thankfully, that’s easily accomplished using base R’s embed()  function, which also automatically drops the NA values which arise in the first \(n\) observations, where \(n\) is the number of lags to use as features. Here’s a function which returns an expanded data set consisting of the current features as well as their lags  most recent lagged values. It assumes that the target is in the final column (and doesn’t embed lagged values of the target) and drops the relevant NA values from the target column.
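A minimal sketch of such a function (the name lag_features  is my own):

    # returns the current features plus their `lags` most recent lagged values;
    # assumes the target occupies the final column and is not lagged
    lag_features <- function(df, lags) {
      features <- as.matrix(df[, -ncol(df)])
      target <- df[, ncol(df)]
      # embed() stacks lag 0, lag 1, ... side by side and automatically
      # drops the first `lags` incomplete rows
      embedded <- embed(features, lags + 1)
      data.frame(embedded, target = target[-(1:lags)])
    }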

Let’s test the function and take a look at its output:
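For example, with a toy data set and two lagged values (output shown as comments):

    toy <- data.frame(V1 = 1:5, V2 = 6:10, target = c(1, -1, 1, 1, -1))
    lag_features(toy, lags = 2)
    #   X1 X2 X3 X4 X5 X6 target
    # 1  3  8  2  7  1  6      1
    # 2  4  9  3  8  2  7      1
    # 3  5 10  4  9  3  8     -1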

You can see that the function returns a new dataset with the current features and their last two lagged values, while the target remains unchanged in the final column. Note that the two rows that wind up with NA values are automatically dropped.

Essentially, this approach makes new features out of lagged values of each feature. But here’s the thing about feed forward networks: they don’t distinguish between more recent values of our features and older values. Obviously the network differentiates between the different features that we create out of lagged values, and has the ability to discern relationships between them, but it doesn’t explicitly factor in the sequential nature of the data.

That’s one of the major limitations of fully connected feed forward networks applied to time series forecasting exercises, and one of the motivators of recurrent architectures, which we will get to soon enough.

Introducing the Keras sequential model

Now that we can process our input data, we can start experimenting with the model building process. The best place to start is Keras’ sequential model, which is essentially a paradigm for constructing deep neural networks, one layer at a time, under the assumption that the network consists of a linear stack of layers and has only a single set of inputs and outputs. You’ll find that this assumption holds for the majority of networks that you build, and it provides a very modular and efficient method of experimenting with such networks. We’ll use the sequential model quite a lot over the coming posts before getting into some more complex models that don’t fit this paradigm.

In Keras, the model building and exploration workflow typically consists of the following steps:

  1. Define the input data and the target. Split the data into training, validation and test sets.
  2. Define a stack of layers that will be used to predict the target from the input. This is the step that defines the network architecture.
  3. Configure the model training process with an appropriate loss function, optimizer and various metrics to be monitored.
  4. Train the model by repeatedly exposing it to the training data and updating the network weights according to the loss function and optimizer chosen in the previous step.
  5. Evaluate the model on the test set.

Let’s go through each step.

Set up our data

Here’s some code for loading and processing our data. It first loads the data set we created with our Zorro script from above and creates a new data set consisting of the current value of each feature, as well as its seven most recent lagged values. That is, we have a total of eight timesteps for each feature. And since we started with 7 features, we have a total of 56 input variables.
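A sketch of that processing, reusing the lag_features()  helper sketched earlier (the file name and the exact split code are assumptions – grab the download for the real thing):

    library(keras)

    data <- read.csv('eurusd_hourly.csv')   # file name assumed

    # current values plus 7 lags = 8 timesteps of each of the 7 features
    full <- lag_features(data, lags = 7)

    # chronological 50/25/25 split -- no random sampling for time series data
    n <- nrow(full)
    train_idx <- 1:floor(0.5 * n)
    val_idx <- (floor(0.5 * n) + 1):floor(0.75 * n)
    test_idx <- (floor(0.75 * n) + 1):n

    x_train <- as.matrix(full[train_idx, -ncol(full)])
    x_val <- as.matrix(full[val_idx, -ncol(full)])
    x_test <- as.matrix(full[test_idx, -ncol(full)])

    # convert the dollar profit/loss target into a binary up/down outcome
    y_train <- as.numeric(full$target[train_idx] > 0)
    y_val <- as.numeric(full$target[val_idx] > 0)
    y_test <- as.numeric(full$target[test_idx] > 0)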

We also split the dataset into a training, validation and testing set. Here, I arbitrarily chose to use 50% of the data for training, 25% for validation and 25% for testing. Note that since the time aspect of our data is critical, we should ensure that our training, validation and testing data are not randomly sampled as is standard procedure in many non-sequential applications. Rather, the training, validation and test sets should come from chronological time periods.

Note that we convert our target into a binary outcome, which enables us to build a classifier.

Recall that we scaled our features at the same time as we generated them, so no need to do any feature scaling here.

Define network architecture

Next we define the stack of layers that will become our model. The syntax might seem quirky at first, but once you’re used to it, you’ll find that you can build and experiment with different architectures very quickly.

The syntax of the sequential model uses the pipeline operator %>%  which you might be familiar with if you use the dplyr  package. In essence, we define a model using the sequential paradigm, and then use the pipeline operator to define the order in which layers are stacked. Here’s an example:
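A sketch consistent with the description that follows, assuming the processed data from earlier (56 input features):

    model <- keras_model_sequential() %>%
      layer_dense(units = 150, activation = 'relu', input_shape = ncol(x_train)) %>%
      layer_dense(units = 150, activation = 'relu') %>%
      layer_dense(units = 150, activation = 'relu') %>%
      layer_dense(units = 1, activation = 'sigmoid')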

This defines a fully connected feed forward network with three hidden layers, each of which consists of 150 neurons with the rectified linear ( 'relu' ) activation function. If you need a refresher on activation functions, check out this post on neural network basics.

layer_dense()  defines a fully connected layer – that is, one in which each input is connected to every neuron in the layer. Note that for the first layer, we need to define the input shape, which is simply the number of features in our data set. We only need to do this on the first layer; each subsequent layer gets its input shape from the output of the prior layer. layer_dense()  has many arguments in addition to the activation function that we specified here, including the weight initialization scheme and various regularization settings. We use the defaults in this example.

Keras implements many other layers, some of which we’ll explore in subsequent posts.

In this example, our network terminates with an output layer consisting of a single neuron with the sigmoid activation function. This activation function converts the output to a value between 0 and 1, which we interpret as the probability associated with the positive class in a binary classification problem (in this case, the value 1, corresponding to an up move).

To get an overview of the model, call summary(model)  and observe the output:

This model architecture could be better described as ‘wide’ as opposed to ‘deep’ and it consists of around 54,000 trainable parameters. This is more than the number of observations in our data set, and has implications for the ability of our network to overfit.

Configure the training process

Configuration of the training process is accomplished via the keras::compile()  function, in which we specify a loss function, an optimizer, and a set of metrics to monitor during training. Keras implements a suite of loss functions, optimizers and metrics out of the box, and in this example we’ll choose some sensible defaults:
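A sketch of that configuration, matching the choices discussed below:

    model %>% compile(
      loss = 'binary_crossentropy',
      optimizer = optimizer_rmsprop(lr = 0.0001),
      metrics = 'accuracy'
    )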

The 'binary_crossentropy'  loss function is standard for binary classifiers and the optimizer_rmsprop()  optimizer is nearly always a good choice. Here we specify a learning rate of 0.0001, but finding a sensible value typically requires some experimentation. Finally, we tell Keras to keep track of our model’s accuracy, as well as the loss, during the training process.

An important consideration regarding loss functions for financial prediction is that the standard loss functions rarely capture the realities of trading. For example, consider a regression model that predicts a price change over some time horizon trained using the mean absolute error of the predictions. Say the model predicted a price change of 20 ticks, but the actual outcome was 10 ticks. In practical trading terms, such an outcome would result in a profit of 10 ticks – not a terrible outcome at all. But that result is treated the same as a prediction of 5 ticks that resulted in an actual outcome of -5 ticks, which would result in a loss of 5 ticks in a trading model. That’s because the loss function is only concerned with the magnitude of the difference between the predicted and actual outcomes – but that doesn’t tell the full story. Clearly, we’d like to penalize the latter error more than the former. To do that, we need to implement our own custom loss functions. I’ll show you how to do that in a later post, but for now it’s important to be cognizant of the limitations of our model training process.

Train the model

We can train our model using keras::fit() , which exposes our model to subsequent batches of training data, updating the network’s weights after each batch. Training progresses for a specified number of epochs and performance is monitored on both the training and validation sets.

We would normally like to stop training at the number of epochs that maximize the model’s performance on the validation set. That is, at the point just before the network starts to overfit. The problem is we can’t know a priori how many training epochs this requires.

To combat this, keras::fit()  implements the concept of a callback, which is simply a function that performs some task at various points throughout the training process. There are a number of callbacks available in Keras out of the box, and it is also possible to implement your own.

In this example we’ll use the model_checkpoint()  callback, which we configure to save the network and its weights at the end of any epoch whose weight update results in improved validation performance. After training is complete, we can then load our best model for evaluation on the test set.

First, here’s how to configure the checkpoint callback (just set up the relevant filepath for your setup):
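A sketch (the file path is a placeholder for your own):

    checkpoint <- callback_model_checkpoint(
      filepath = 'best_model.h5',   # placeholder path
      monitor = 'val_acc',
      save_best_only = TRUE,
      mode = 'max'
    )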

And here’s how to configure keras::fit()  for a short training run of 75 epochs, with the model checkpoint callback:
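A sketch (the batch size is an assumption):

    history <- model %>% fit(
      x_train, y_train,
      epochs = 75,
      batch_size = 1024,   # assumed
      validation_data = list(x_val, y_val),
      callbacks = list(checkpoint)
    )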

After training is complete, we can plot the loss and accuracy of the training and validation sets at each epoch by simply calling plot(history) , which results in the following plot:

We can see that loss on the training set continuously decreases, while accuracy almost continuously increases, as training progresses. That is expected given the power of our network to overfit. But note the small decrease in validation loss and the bump in validation accuracy that we get out to about 40 epochs, before performance stalls.

A validation accuracy of a little under 53% is certainly not the sort of result that would turn heads in the classic applications of deep learning, like image classification. But trading is an interesting application, because we don’t necessarily need that sort of performance to make money. Is a validation accuracy of 53% enough to give us some out of sample profits? Let’s find out by evaluating our model on the test set.

Evaluate the model out of sample

Here’s how to remove the fully trained model, load the model with the highest validation accuracy, and evaluate it on the test set:
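A sketch, assuming the checkpoint file path from above:

    rm(model)                                   # discard the fully trained model
    model <- load_model_hdf5('best_model.h5')   # best model by validation accuracy
    model %>% evaluate(x_test, y_test)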

We end up with a test set accuracy that is only slightly worse than our validation accuracy.

But accuracy is one thing, profitability is another. To assess the profitability of our model on the test set, we need the actual predictions on the test set. We can get the predicted classes via predict_classes() , but I prefer to look at the actual output of the sigmoid function in the final layer of the model. That enables you to use a prediction threshold in your decision making, for example only entering a long trade when the output is greater than 0.6, say.

Here’s how to get the test set predictions and implement some simple, frictionless trading logic that assigns the target as an individual trade’s profit or loss when the prediction is greater than some threshold (equivalent to a buy) and the negative of the target when the prediction is less than 1 minus the threshold (equivalent to a sell):
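A sketch of that logic, assuming the objects from the data processing step above and an illustrative threshold of 0.5:

    preds <- model %>% predict(x_test)    # raw sigmoid outputs in [0, 1]
    threshold <- 0.5                      # illustrative

    # dollar profit/loss target for the test period, from the unconverted column
    target_test <- full$target[test_idx]

    # long above the threshold, short below 1 - threshold, otherwise flat
    pnl <- ifelse(preds > threshold, target_test,
                  ifelse(preds < 1 - threshold, -target_test, 0))

    plot(cumsum(pnl), type = 'l',
         xlab = 'Test observation', ylab = 'Profit ($, 0.01 lots)')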

This results in the following equity curve (the y-axis is measured in dollars of profit from buying and selling the minimum position size of 0.01 lots):

I think that’s quite an amazing equity curve that demonstrates the potential of even a very small edge. However, note that adding typical retail transaction costs would destroy this small edge, which suggests that longer holding periods are more sensible targets, or that higher accuracies are required in practice.

Also note that you might get different results depending on the initial weights used in your network, as the weights aren’t guaranteed to converge to the same values when initialized to different values. If you repeat the training and evaluation process a number of times, you’ll find that validation accuracies in the range of 52-53% occur most of the time, but while most produce profitable out of sample equity curves, the range of performance is actually quite significant. This implies that there might be benefit in combining the predictions of multiple models using ensemble methods.

What’s next?

Before we get into advanced model architectures, in the next unit I’ll show you:

  1. How to fight overfitting and push your models to generalize better.
  2. One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
  3. How to interrogate and visualize the training process in real time.

Conclusions

This post demonstrated how to process multivariate time series data for use in a feed forward neural network, as well as how to construct, train and evaluate such a network using Keras’ sequential model paradigm. While we uncovered a slim edge in predicting the EUR/USD exchange rate, in practical terms, traders with access to retail spreads and commission will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

Experimentation is encouraged! Download the code and data used in this post here.


Where to from here?

  • To find out why AI is taking off in finance, check out these insights from my days as an AI consultant to the finance industry 
  • If this walk-through was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform
  • If the technical details of neural networks are interesting for you, you might like our introductory article 
  • Be sure to check out Part 1 and Part 2 of this series on deep learning applications for trading. 

Deep Learning for Trading Part 2: Configuring TensorFlow and Keras to run on GPU

This is the second in a multi-part series in which we explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up Keras and Tensorflow for R using either the default CPU-based configuration, or the more complex and involved (but well worth it) GPU-based configuration under the Windows environment.

Stay tuned for Part 3 of this series which will be published next week.

CPU vs GPU for Deep Learning

No doubt you know that a computer’s Central Processing Unit (CPU) is its primary computation module. CPUs are designed and optimized for rapid computation on small amounts of data, so elementary arithmetic operations on a few numbers are blindingly fast. However, CPUs tend to struggle when asked to operate on larger amounts of data, for example when performing matrix operations on large arrays. And guess what: the computational nuts and bolts of deep learning are all about such matrix operations. That’s bad news for a CPU.

The rendering of computer graphics relies on these same types of operations, and Graphical Processing Units (GPUs) were developed to optimize and accelerate them. GPUs typically consist of hundreds or even thousands of cores, enabling massive parallelization. This makes GPUs a far more suitable hardware for deep learning than the CPU.

Of course, you can do deep learning on a CPU. And this is fine for small scale research projects or just getting a feel for the technique. But for doing any serious deep learning research, access to a GPU will provide an enormous boost in productivity and shorten the feedback loop considerably. Instead of waiting days for a model to train, you might only have to wait hours. Instead of waiting hours, you’ll only have to wait minutes.

When selecting a GPU for deep learning, the most important characteristic is the memory bandwidth of the unit, not the number of cores as one might expect. That’s because it typically takes more time to read the data from memory than to perform the actual computations on that data! So if you want to do fast deep learning research, be sure to check the memory bandwidth of your GPU. By way of comparison, my (slightly outdated) NVIDIA GTX 970M has a memory bandwidth of around 120 GB/s. The GTX 980Ti clocks in at around 330 GB/s!

Baby Steps: Configuring Keras and TensorFlow to Run on the CPU

If you don’t have access to a GPU, or if you just want to try out some deep learning in Keras before committing to a full-blown deep learning research project, then the CPU installation is the right one for you. It will only take a couple of minutes and a few lines of code, as opposed to an hour or so and a deep dive into your system for the GPU option.

Here’s how to install Keras to run TensorFlow on the CPU.

At the time of writing, the Keras R package could be installed from CRAN, but I preferred to install directly from GitHub. To do so, you need to first install the devtools package, and then do
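    # install devtools first, then Keras straight from GitHub
    install.packages('devtools')
    devtools::install_github('rstudio/keras')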

Then, load the Keras package and make use of the convenient install_keras()  function to install both Keras and TensorFlow:
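    library(keras)
    install_keras()   # installs both Keras and TensorFlow (CPU versions)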

That’s it! You now have the CPU-based versions of Keras and TensorFlow ready to go, which is fine if you are just starting out with deep learning and want to explore it at a high level. If you don’t want the GPU-based versions just yet, then I’m afraid that’s all we have for you until the next post!

Serious Deep Learning: Configuring Keras and TensorFlow to run on a GPU

Installing versions of Keras and TensorFlow compatible with NVIDIA GPUs is a little more involved, but is certainly worth doing if you have the appropriate hardware and intend to do a decent amount of deep learning research. The speed up in model training is really significant.

Here’s how to install and configure the NVIDIA GPU-compatible version of Keras and TensorFlow for R under Windows.

Step 1: What hardware do you have?

First, you need to work out if you have a compatible NVIDIA GPU installed on your Windows machine. To do so, open your NVIDIA Control Panel. Typically, it’s located under C:\Program Files\NVIDIA Corporation\Control Panel Client , but on recent Windows versions you can also find it by right-clicking on the desktop and selecting ‘NVIDIA Control Panel’, like in the screenshot below:

When the control panel opens, click on the System Information link in the lower left corner, circled in the screenshot below:

This will bring up the details of your NVIDIA GPU. Note your GPU’s model name (here mine is a GeForce GTX 970M, which you can see under the ‘Items’ column). While you’re at it, check how your GPU’s memory bandwidth stacks up (remember, this parameter is the limiting factor of the GPU’s speed on deep learning tasks).

 

Step 2: Is your hardware compatible with TensorFlow?

Next, head over to NVIDIA’s GPU documentation, located at https://developer.nvidia.com/cuda-gpus. You’ll need to find your GPU model on this page and work out its Compute Capability Number. This needs to be 3.0 or higher to be compatible with TensorFlow. You can see in the screenshot below that my particular GPU model has a Compute Capability of 5.2, which means that I can use it to train deep learning models in TensorFlow. Hooray for productivity.

In practice, my GPU model is now a few years old and there are much better ones available today. But still, using this GPU provides far shorter model training times than using a CPU.

Step 3: Get CUDA

Next, you’ll need to download and install NVIDIA’s CUDA Toolkit. CUDA is NVIDIA’s parallel computing API that enables programming on the GPU. Thus, it provides the framework for harnessing the massive parallel processing capabilities of the GPU. At the time of writing, the release version of TensorFlow (1.4) was compatible with version 8 of the CUDA Toolkit (NOT version 9, which is the current release), which you’ll need to download via the CUDA archives here.

Step 4: Get your latest driver

You’ll also need to get the latest drivers for your particular GPU from NVIDIA’s driver download page. Download the correct driver for your GPU and then install it.

Step 5: Get cuDNN

Finally, you’ll need to get NVIDIA’s CUDA Deep Neural Network library (cuDNN). cuDNN is essentially a library for deep learning built using the CUDA framework and enables computational tools like TensorFlow to access GPU acceleration. You can read all about cuDNN here. In order to download it, you will need to sign up for an NVIDIA developers account.

Having activated your NVIDIA developers account, you’ll need to download the correct version of cuDNN. The current release of TensorFlow (version 1.4) requires cuDNN version 6. However, the latest version of cuDNN is 7, and it’s not immediately obvious how to acquire version 6. You’ll need to head over to this page, and under the text on ‘What’s New in cuDNN 7?’ click the Download button. After agreeing to some terms and conditions, you’ll then be able to select from numerous versions of cuDNN. Make sure to get the version of cuDNN that is compatible with your version of CUDA (version 8), as there are different sub-versions of cuDNN for each version of CUDA.

Confusing, no? I’ve circled the correct (at the time of writing) cuDNN version in the screenshot below (click for a clearer image):

Once you’ve downloaded the cuDNN zipped file, extract the contents to a directory of your choice.

Step 6: Modify the Windows %PATH%  variable

We also need to add the paths to the CUDA and cuDNN libraries to the Windows %PATH%  variable so that TensorFlow can find them.  To do so, open the Windows Control Panel, then click on System and Security, then System, then Advanced System Settings like in the screenshot below:

Then, when the System Properties window opens, click on Environment Variables. In the new window, under System Variables, select Path and click Edit. Then click New in the Edit Environment Variable window and add the paths to the CUDA and cuDNN libraries. On my machine, I added the following paths (but yours will depend on where they were installed):

  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin
  • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp
  • C:\ProgramData\cuDNN_6_8\bin

Here’s a screenshot of the three windows and the relevant buttons involved in this process (click for a larger image):

Step 7: Install GPU-enabled Keras

Having followed those steps, you’re finally in a position to install Keras and configure it to run TensorFlow on the GPU. From a fresh R or R-Studio session, install the Keras package if you haven’t yet done so, then load it and run install_keras()  with the argument tensorflow = 'gpu' :
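    install.packages('keras')   # if you haven't already
    library(keras)
    install_keras(tensorflow = 'gpu')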

The installation process might take quite some time, but don’t worry, you’ll get that time back and a whole lot more in faster training of your deep learning experiments.

That’s it! Congratulations! You are now ready to perform efficient deep learning research on your GPU! We’ll dive into that in the next unit.

A troubleshooting tip

When I first set this up, I found that Keras was throwing errors that it couldn’t find certain TensorFlow modules. Eventually I worked out that it was because I already had a version of TensorFlow installed in my main conda environment thanks to some Python work I’d done previously. If you have the same problem, explicitly setting the conda environment immediately after loading the Keras package should resolve it:
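A sketch (the environment name r-tensorflow  is an assumption – it’s the environment install_keras()  creates by default):

    library(keras)
    reticulate::use_condaenv('r-tensorflow')   # env name assumed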

Also note that the compatible versions of CUDA and cuDNN may change as new versions of TensorFlow are released. It is worth double checking the correct versions at tensorflow.org.

 



Deep Learning for Trading Part 1: Can it Work?

This is the first in a multi-part series in which we explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow. In this post, we introduce Keras and discuss some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data.

Part 2 provides a walk-through of setting up Keras and Tensorflow for R using either the default CPU-based configuration, or the more complex and involved (but well worth it) GPU-based configuration under the Windows environment.

In the last few years, deep learning has gone from being an interesting but impractical academic pursuit to a ubiquitous technology that touches many aspects of our lives on a daily basis – including in the world of trading. This meteoric rise has been fuelled by a perfect storm of:

  • Frequent breakthroughs in deep learning research which regularly provide better tools for training deep neural networks
  • An explosion in the quantity and availability of data
  • The availability of cheap and plentiful compute power
  • The rise of open source deep learning tools that facilitate both the practical application of the technology and innovative research that drives the field ever forward

Deep learning excels at discovering complex and abstract patterns in data and has proven itself on tasks that have traditionally required the intuitive thinking of the human brain to solve. That is, deep learning is solving problems that have thus far proven beyond the ability of machines.

Therefore, it is incredibly tempting to apply deep learning to the problem of forecasting the financial markets. And indeed, certain research indicates that this approach has potential. For example, the Financial Hacker found an edge in predicting the EUR/USD exchange rate using a deep architecture stacked with an autoencoder. Here at Robot Wealth, we compared the performance of numerous machine learning algorithms on a financial prediction task, and deep learning was the clear outperformer.

Not so fast…

However, as anyone who has used deep learning in a trading application can attest, the problem is not nearly as simple as just feeding some market data to an algorithm and using the predictions to make trading decisions. Some of the common issues that need to be solved include:

  1. Working out a sensible way to frame the forecasting problem, for example as a classification or regression problem.
  2. Scaling data in a way that facilitates training of the deep network.
  3. Deciding on an appropriate network architecture.
  4. Tuning the hyperparameters of the network and optimization algorithm such that the network converges sensibly and efficiently. Depending on the architecture chosen, there might be a couple of dozen hyperparameters that affect the model, which can provide a significant headache.
  5. Coming up with a cost function that is applicable to the problem.
  6. Dealing with the problem of an ever-changing market. Market data tends to be non-stationary, which means that a network trained on historical data might very well prove useless when used with future data.
  7. There may be very little signal in historical market data with respect to the future direction of the market. This makes sense intuitively if you consider that the market is impacted by more than just its historical price and volume. Further, pretty much everyone who trades a particular market will be looking at its historical data and using it in some way to inform their trading decisions. That means that market data alone may not give an individual much of a unique edge.

The first five issues listed above are common to most machine learning problems and their resolution represents a big part of what applied data science is all about. The implication is that while these problems are not trivial, they are by no means deal breakers.

On the other hand, problems 6 and 7 may very well prove to thwart the best attempts at using deep learning to turn past market data into profitable trading signals. No machine learning algorithm or artificial intelligence can make good future predictions if its training data has no relationship to the target being predicted, or if that relationship changes significantly over time.

Said differently, feeding market data to a machine learning algorithm is only useful to the extent that the past is a predictor of the future. And we all know what they say about past performance and future returns.

In deep learning trading systems that I’ve taken to market, I’ve always used additional data, not just historical, regularly sampled price and volume data and transformations thereof. While there does appear to be a slim edge in using deep learning to extract signals from past market data, that edge may not be significant enough to overcome transaction costs. And even if it does, it may not be significant enough to justify the risk and effort required to take it to market. On the other hand, supplementing historical market data with innovative, uncommon data sets has proven more effective – at least in my experience.

In this series of posts, we explore and compare various deep learning tools and techniques in relation to market forecasting using the Keras package. We will do so using only historical market data, so the results need to be interpreted considering the discussion above.

We expect deep learning to uncover a slim edge using historical market data, but the purpose of this analysis is to compare different deep learning tools in relation to market forecasting, not necessarily to build a market-beating trading system. That I leave to you – perhaps you can supplement the models we explore here with some creative or uncommon data or other tools to find a real edge.

What is Keras?

Keras is a high-level API for building and training neural networks. Its strength lies in its ability to facilitate fast and efficient research, which of course is very important for systematic traders, particularly those of the DIY persuasion for whom time is often the limiting factor to success. Keras is easy to learn and its syntax is particularly friendly. Keras also plays nicely with CPUs and GPUs and can integrate with the TensorFlow, Theano and CNTK backends – without limiting the flexibility of those tools. For example, pretty much anything you can implement in raw TensorFlow, you can also implement in Keras, likely at a fraction of the development effort.

Keras is also implemented in R, which means that we can use it directly in any trading algorithm developed on the Zorro Automated Trading Platform, since Zorro has seamless integration with an R session.

What’s next?

In the deep learning experiments that follow in Part 2 and beyond, we’ll use the R implementation of Keras with TensorFlow backend. We’ll be exploring fully connected feedforward networks, various recurrent architectures including the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), and even convolutional neural networks which normally find application in computer vision and image classification.

Stay tuned.