The post The Law of Large Numbers – Part 2 appeared first on Robot Wealth.

This is Part 2 in our **Practical Statistics for Algo Traders** blog series—don’t forget to check out **Part 1** if you haven’t already.

Even if you’ve never heard of it, the Law of Large Numbers is something you understand intuitively, and probably employ in one form or another on an almost daily basis. But human nature is such that we sometimes apply it poorly, often to our great detriment. Interestingly, psychologists have found strong evidence that, despite the intuitiveness and simplicity of the law, humans make **systematic errors** in its application. It turns out that we all tend to make the same mistakes – even trained statisticians who not only should know better, but do!

In 1971, two Israeli psychologists, Amos Tversky and Daniel Kahneman,[1] published *“Belief in the law of small numbers”*, reporting that

> People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics.

So what is this Law of Large Numbers? What are the consequences of a misplaced belief in the law of **small** numbers? And what does it all mean for algo traders? Well, to answer these questions, we first need to talk about burgers.

Put simply, the Law of Large Numbers states that if we select a sample from an entire population, the mean of our sample approaches the mean of the population as we increase our sample size. Said differently, the greater our sample size, the less uncertainty we have regarding our conclusions about the population.
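To make this concrete, here’s a quick simulation sketch (in Python for compactness – this example and its names are mine, not part of the original post) showing the sample mean of fair die rolls settling toward the true mean of 3.5 as the sample grows:

```python
import random

def sample_mean_die(n, seed=42):
    """Average of n simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

# small samples can land far from the true mean of 3.5;
# large samples cluster tightly around it
for n in (10, 100, 10000, 200000):
    print(n, sample_mean_die(n))
```

Run it a few times with different seeds and you’ll see the small-sample averages bounce around wildly while the large-sample averages barely move.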

We all understand this law on an intuitive level. For instance, say you’re looking at reviews of burger joints in your local area. You come across a place that only has two reviews, both of them rating the restaurant 5 out of 5. There’s another place that has an average rating of 4.7, but it has 200 reviews.

You know instinctively that the place rated 4.7 is the more likely of the two to dish up a fantastic burger, even though its average rating is less than the perfect 5 of the first restaurant. That’s the Law of Large Numbers in action.

How many reviews would it take before you started considering that there was a good chance that the first burger joint served better burgers than the second? 5? 10? 100?

Let’s assume that the burger joint with the perfect record after two reviews is actually destined for a long-term average rating of just over 4. We could simulate several thousand reviews whose aggregate characteristics match this assumption with the following R code:

```r
library(ggplot2)

# burger joint destined for a long-term average rating of just over 4 out of 5
probs <- c(0.025, 0.05, 0.125, 0.5, 0.35)
probs <- probs/sum(probs)
ratings <- c(1, 2, 3, 4, 5)
p <- sum(probs*ratings)
reviews <- sample(1:5, 5000, replace=TRUE, prob=probs)
mean(reviews)
ggplot() + aes(reviews) +
  geom_histogram(binwidth=1, col="black", fill="blue", alpha=.75) +
  labs(title="Histogram of Ratings", x="Rating", y="Count")
```

And here’s a histogram of the simulated reviews – note that the rating most often received was 4 out of 5:

The output of the simulation gives us a “population” of reviews for our burger joint. Next we’re interested in the uncertainty associated with a small number of reviews – a “sample” drawn from the “population”. How representative are our samples of the population? In particular, how many reviews do we need in order to reflect the actual mean of 4?

To answer that, we can turn again to simulation. The following R code repeatedly samples the synthetic reviews we created above at various sample sizes, then plots the results in a scatter plot. You can see the true average rating as a black line, and the other restaurant’s 200-review average as a dashed red line:

```r
# average rating for a given number of reviews
num_reviews <- sample(c(1:100), 10000, replace=TRUE)
average_rating <- c()
for(i in c(1:length(num_reviews))) {
  average_rating[i] <- mean(sample(reviews, num_reviews[i], replace=FALSE))
}

# plot
ggplot() + aes(x=num_reviews, y=average_rating) +
  geom_point(color="blue", alpha=0.5) +
  geom_hline(yintercept=p, color="black", size=1, show.legend=TRUE) +
  geom_hline(yintercept=4.7, color="red", linetype="dashed", size=1) +
  labs(title="Convergence of Sample Mean to Population Mean", x="Number of Reviews", y="Average Rating") +
  annotate("text", x=80, y=p-0.1, label="True average rating", size=8) +
  annotate("text", x=85, y=4.9, label="Other restaurant's average rating", size=8)
```

We can see that as the sample size grows, the spread in the average rating decreases, and starts to converge around the true average rating. At a sample size of 100 reviews, we could conceivably end up with an average rating of anywhere between about 3.75 and 4.25.

But look at what happens when our sample size is small! Even with 50 reviews, it’s possible that the sample’s average grossly over- or under-estimates the true average.

We can see that with a sample of 10 reviews, it’s possible to end up with a sample average that exceeds the average of the gourmet, 4.7-star restaurant. And even out to about 25 reviews, we could still end up with an average rating that isn’t readily distinguishable from it.

Even with 100 reviews, there remains some uncertainty around the sample average – far less than with 10 reviews, but it still exists! We run into problems because, according to Kahneman and Tversky, we tend to grossly misjudge this uncertainty, in many cases ignoring it altogether!

Personally, I try to incorporate uncertainty into my thinking about most things in life, not just burgers and trading. But Kahneman and Tversky make the point that even when we do this, we tend to muck it up! A more robust solution is to use a quantitative approach to factoring uncertainty into decision making. Bayesian reasoning is a wonderful paradigm for doing precisely this, but that’s a topic for another time. Here, I merely want to share with you some examples and applications related to trading.

So to conclude our treatise on burger reviews: if we are comparing burger joints under a 5-star review system, eyeballing the scatterplot above suggests we need about 25 reviews of a new restaurant whose (as yet unknown) long-term average is 4 stars before we can be fairly sure that its burgers won’t be quite as tasty as those of our tried and tested 4.7-star Big Kahuna burger joint.

As much as I’m sure you enjoy thinking about the statistics of burger review systems, let’s turn our attention to trading. In particular, I want to show you how our intuition around the law of large numbers can lead us to make bad decisions in our trading, and what to do about it.

High-frequency trading strategies typically have a much higher Sharpe ratio than low frequency strategies, since the variability of returns is generally much higher in the latter. If you had a high-frequency strategy with a Sharpe ratio in the high single digits, you’d only need to see a week or two of negative returns – perhaps less – to be quite sure that your strategy was broken.

But most of us don’t have the capital or infrastructure to realise a high-frequency strategy. Instead, we trade lower frequency strategies and accept that our Sharpe ratios are going to be lower as well. In my experience, a typical non-professional might consider trading a strategy with a Sharpe between about 1.0 and 2.0.

How long does it take to realise such a strategy’s true Sharpe? And how much could that Sharpe vary when measured on samples of various sizes? The answer, which we’ll get to shortly, might surprise you, or even scare you! Because it turns out that a “large” number may or may not be so large, depending on the context. And that lack of context awareness is precisely where we tend to make our most severe errors in the application of the Law of Large Numbers.
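As a rough back-of-envelope (my own aside, not from the original post), there’s a well-known large-sample approximation for the standard error of a measured Sharpe ratio (e.g. Lo, 2002): for iid returns, SE(SR) ≈ sqrt((1 + SR²/2)/n) at the measurement frequency. Here’s a Python sketch using it to estimate how many years of daily data you’d need before a strategy’s measured Sharpe is even distinguishable from zero at the 95% level:

```python
import math

def years_to_distinguish(sharpe_annual, z=1.96):
    """Rough years of daily data needed before the measured Sharpe is
    statistically distinguishable from zero at confidence z, using the
    large-sample approximation SE(SR) ~ sqrt((1 + SR^2/2) / n)."""
    sr_daily = sharpe_annual / math.sqrt(252)
    # smallest n such that z * SE(daily Sharpe) < daily Sharpe
    n_days = (z / sr_daily) ** 2 * (1 + sr_daily ** 2 / 2)
    return n_days / 252

print(years_to_distinguish(1.5))  # roughly 1.7 years
print(years_to_distinguish(0.5))  # roughly 15 years
```

Even a respectable Sharpe 1.5 strategy needs the better part of two years just to separate itself from pure noise – and a Sharpe 0.5 strategy needs over a decade.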

First of all, let’s simulate various realisations of 40 days of trading a strategy with a true Sharpe ratio of 1.5. This is equivalent to around two months of trading.

If we set the strategy’s mean daily return, `mu`, to 0.1%, we can calculate the standard deviation of returns, `sigma`, that results in a true Sharpe of 1.5:

```r
# backtested strategy has a Sharpe of 1.5
# sqrt(252)*mu/sigma = 1.5
mu <- 0.1/100
sigma <- mu*sqrt(252)/1.5
```

And here’s 5,000 realisations of 40 days of trading a strategy with this performance (under the assumption that daily returns are normally distributed, an inaccurate but convenient simplification that won’t detract too much from the point):

```r
N <- 5000
days <- 40
sharpes <- c()
for(i in c(1:N)) {
  daily_returns <- rnorm(days, mu, sigma)
  # sharpe of simulated returns
  sharpes[i] <- sqrt(252)*mean(daily_returns)/sd(daily_returns)
}
# histogram of the simulated 40-day Sharpes
hist(sharpes, col="blue", main="Simulated 40-day Sharpe Ratios")
```

Whoa! The histogram shows that it isn’t inconceivable (in fact it’s quite likely) that our Sharpe 1.5 strategy could give us an annualised Sharpe of -2 or less over a 40-day period!

What would you do if the strategy you’d backtested to a Sharpe of 1.5 had delivered an annualised Sharpe of -2 over the first two months of trading? Would you turn it off? Tinker with it? Maybe adjust a parameter or two?

You should probably do nothing! At least until you’ve assessed the probability of your strategy delivering the actual results, assuming its performance was indeed what you’d backtested it to be. To do that, you can simply count the number of simulated 40-day Sharpes that were less than or equal to -2, and then divide by the number of Sharpes we simulated:

```r
# probability of getting a Sharpe of -2 or less in 40 days
100*sum(sharpes <= -2)/N
```

which works out to about 8.5%.
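As a cross-check (my own aside, not from the original post), we can approximate that probability analytically: for iid normal daily returns, the sampling distribution of the measured Sharpe is itself approximately normal, with the standard error given by the large-sample approximation sqrt((1 + SR²/2)/n). A Python sketch:

```python
import math

def prob_sharpe_below(threshold, true_sharpe=1.5, n_days=40):
    """Normal approximation to the chance of measuring an annualised
    Sharpe below threshold over n_days, given iid normal daily returns."""
    sr_daily = true_sharpe / math.sqrt(252)
    # standard error of the daily Sharpe estimate, annualised
    se = math.sqrt((1 + sr_daily ** 2 / 2) / n_days) * math.sqrt(252)
    z = (threshold - true_sharpe) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF

print(prob_sharpe_below(-2))  # around 0.08
```

The normal approximation lands near 8%, in good agreement with the roughly 8.5% from the simulation.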

Let’s now look at the convergence of our Sharpe ratio to the expected Sharpe as we increase the sample size, just as we did in the burger review example above. Here’s the code:

```r
trading_days <- sample(10:500, 5000, replace=TRUE) # samples of 10-500 trading days
sharpes <- c()
for(i in c(1:length(trading_days))) {
  daily_returns <- rnorm(trading_days[i], mu, sigma)
  sharpes[i] <- sqrt(252)*mean(daily_returns)/sd(daily_returns)
}
ggplot() + aes(x=trading_days, y=sharpes) +
  geom_point(color="blue", alpha=0.5) +
  geom_hline(yintercept=1.5, color="red", linetype="dashed", size=1) +
  labs(title="Convergence of Sample Sharpe to True Sharpe", x="Number of Trading Days", y="Sharpe") +
  annotate("text", x=400, y=1.3, label="True Sharpe", size=4)
```

And the output:

Once again we see the sample uncertainty shrink as we increase the sample size, but this time its magnitude looks much more frightening. Note the uncertainty even after 500 trading days! This implies that our strategy with a long-term Sharpe of 1.5 could conceivably deliver very small or even negative returns over a two-year period.

If you’ve done a lot of backtesting, you probably understand from experience that a strategy with a Sharpe of 1.5 can indeed have drawdowns that last one or two years. So maybe this result doesn’t surprise you that much. But consider how you’d feel and act in real time if you suffered through such a drawdown after going live with this strategy that you’d painstakingly developed. Would you factor the uncertainty of the sample size into your decision making?

The point is that this time the uncertainty really matters. Maybe you don’t care that much if you thought you were getting a 5-star burger, but ended up eating a 4-star offering. You could probably live with that. But what if you were expecting to realise your Sharpe 1.5 strategy, but after 2 years you’d barely broken even?

Returning to our 40 days of unprofitable trading of our allegedly profitable strategy: as mentioned above, there’s an 8.5% chance of getting an annualised Sharpe of -2 from this scenario. Maybe that’s enough to convince you that your strategy is not actually going to deliver a Sharpe of 1.5. Maybe you’d be willing to stick it out until the probability dropped below 5%. It’s up to you, and in my opinion should depend at least to some extent on your prior beliefs about your strategy.[2] For instance, if you had a strong conviction that your strategy was based on a real market anomaly, maybe you’d stick to your guns longer than if you had simply data-mined a pattern in a price chart with no real rationalisation for its profitability. This is an important point, and I’ll touch on it again towards the end of the article.

No doubt you’ve already realised that the backtest itself is unlikely to be a true representation of the strategy’s real performance. Due to its finite history, the backtest itself is just a “sample” from the true “population”! So how much confidence can you have in your backtest anyway?

In the next article, I’ll show you a method for incorporating both our prior beliefs about our strategy’s backtest and the new information from the 40 trading days to construct credible limits on our strategy’s likely true performance. As you might imagine from the scatterplot above, that interval will likely be quite wide, so there’s really no way around acknowledging the inherent uncertainty in the problem of whether or not to continue trading our strategy.

We’ve seen that with small sample sizes, we can observe wild departures from an expected value – particularly with a Sharpe 1.5 strategy. Probably more worrying to many traders is the fact that, as it turns out, even two years of trading might constitute a “small sample”. Depending on your goals and expectations, that’s a long time to be wondering.

So what can be done? Well, there are two main options:

- Only trade strategies with super-high Sharpes that enable statistical uncertainty to shrink quickly.
- Acknowledge that statistical uncertainty is a part of life as a low frequency trader and find other ways to cope with it.

Option 1 isn’t going to be feasible for the people for whom this article is written. So let’s explore option 2.

While statistical approaches often don’t provide definitive answers to the questions that many traders need answered, market experience can at least partially fill the gaps. Above I touched on the idea that if we had a rational basis for a trade, we’d treat the statistical uncertainty around its out-of-sample performance differently than if we had simply data-mined a chart pattern or used an arbitrary technical analysis rule.

Intuition around what tends to work and what doesn’t, believe it or not, actually starts to come with experience in the markets. Of course, even the most savvy market expert gets it wrong a lot, but market experience can certainly tip the balance in your favour. While you’re acquiring this experience, one of the most sensible things you can do is to focus on trades that can be rationalised in some way. That is, trades that you speculate have an economic, financial, structural, behavioural, or some other reason for existing. Sometimes (quite often in my personal experience!) your hypothesis about the basis of the trade will be false, but at least you give yourself a better chance if the trade had a hypothetical reason for being.

Another good idea is to execute a trade using small positions as widely as possible. Of course, no effect will likely “work” across all markets, but many good ideas can be profitable in more than a single product or financial instrument, or traded using a cross-sectional approach. If there really is an edge in the trade, executing it widely increases the chances of realising its profit expectancy, and you get some diversification benefits as well. This idea of scaling a trade across many markets using small position sizing is one of the great benefits of automated trading.

Finally, it’s important to keep an open mind with any trade. Don’t become overly wedded to a particular idea, as it’s very likely that it won’t work forever. Far more likely is that it will work well sometimes, and not so well at other times. The other side of this is that if you remove a component from a portfolio of strategies, it is often a good idea to “keep an eye on it” to see if it comes back (automation can be useful here too). But once again, deciding on whether to remove or reinstate a component is as much art as science.

So what does this look like in real life? Well here’s an example taken from a prop firm that I know very well. The firm has a research and execution team, who design strategies, validate them and implement the production code to get them to market. Then there’s the operations guys who decide at any given time what goes into the portfolio, and where and how big the various strategies are traded. They use some quantitative tools, but they also use a hefty dose of judgement in making these decisions. That judgement is undoubtedly a significant source of alpha for the firm, and the team has over 50 years of combined experience in the markets from which to make these judgements.

These ideas sound sensible enough, but the elephant in the room is the implied reliance on judgement and discretion, which might feel uncomfortable to systematic traders (to be completely honest, up until a couple of years ago, I’d have felt that same discomfort). The problem is, anyone can learn to do statistics, build time-series models, run tests for cointegration, and all the other things that quants do. But good judgement and intuition is much harder to come by, and is generally only won through experience. And that takes time, and many of the lessons are learned through making mistakes.

Here at Robot Wealth HQ, we talk a lot about how we can help our members short-cut this process of gathering experience. Our goal is to pass on not only the technical and quantitative skills, but also the market knowledge and experience that helps us succeed in the markets. We decided that the best way to do that is to develop and trade a portfolio inside the community, where our members can follow along with the research and decision making processes that go into building and running a systematic portfolio. We’re already doing this with a crypto portfolio, and we’re about to get started on our ETF strategies.

Humans tend to make errors of judgement when it comes to drawing conclusions about a sample’s representativeness of the wider population from which it is drawn. In particular, we tend to underestimate the uncertainty of an expected value given a particular sample size. There are times when the implications of these errors of judgement aren’t overly severe, but in a trading context, they can result in disaster. From placing too much faith in a backtest, to tinkering with a strategy before it’s really justified, errors of judgement imply trading losses or missed opportunities.

We also saw that a “significant sample size” (where significant implies large enough that the sample is likely representative of the population) for typical retail level, low-frequency trading strategies can take so much time to acquire that it becomes almost useless in a practical sense. Here at Robot Wealth, we believe that systematic trading is one of those endeavours that requires a breadth of skills and experience, and that success is found where practical statistics and data science skills intersect with market experience.

The need for experience and judgement to complement good analysis skills is one of the most important realisations I had when I moved from amateur trading into the professional space. That experience doesn’t come easily or quickly, but we believe that by demonstrating exactly what we do to build and trade a portfolio, we can help you acquire it as quickly as possible.


The post Practical Statistics for Algo Traders appeared first on Robot Wealth.

Well, you’re not alone. The reality is that classical statistics is difficult, time-consuming and downright confusing. Fundamentally, we use statistics to answer a question – but when we use classical methods to answer it, half the time we forget what question we were seeking an answer to in the first place.

But guess what? There’s another way to get our questions answered without resorting to classical statistics. And it’s one that will generally appeal to the practical, hands-on problem solvers that tend to be attracted to algo trading in the long run.

Specifically, algo traders can leverage their programming skills to get answers to tough statistical questions – without resorting to classical statistics. In the words of Jake VanderPlas, whose awesome PyCon 2016 talk inspired some of the ideas in this post, “if you can write a for loop, you can do statistics.”

In this post and the ones that follow, I want to show you some examples of how simulation and resampling methods lend themselves to intuitive computational solutions to problems that are quite complex when posed in the domain of classical statistics. Let’s get started.

The example that we’ll start with is relatively simple and more for illustrative purposes than something that you’ll use a lot in a trading context. But it sets the scene for what follows and provides a useful place to start getting a sense for the intuition behind the methods I’ll show you later.

You’ve probably heard the story of Ed Thorp and Claude Shannon. The former is a mathematics professor and hedge fund manager; the latter was a mathematician and engineer referred to as “the father of information theory”, and whose discoveries underpin the digital age in which we live today (he’s kind of a big deal).

When they weren’t busy changing the world, these guys would indulge in another great hobby: beating casinos at games of chance. Thorp is known for developing a system of card counting to win at Blackjack. But the story I find even more astonishing is that together, Thorp and Shannon developed the first wearable computer, whose sole purpose was to beat the game of roulette. According to a 2013 article describing the affair,

Roughly the size of a pack of cigarettes, the computer itself had 12 transistors that allowed its wearer to time the revolutions of the ball on a roulette wheel and determine where it would end up. Wires led down from the computer to switches in the toes of each shoe, which let the wearer covertly start timing the ball as it passed a reference mark. Another set of wires led up to an earpiece that provided audible output in the form of musical cues – eight different tones represented octants on the roulette wheel. When everything was in sync, the last tone heard indicated where the person at the table should place their bet. Some of the parts, Thorp says, were cobbled together from the types of transmitters and receivers used for model airplanes.

So what’s all this got to do with hacking statistics? Well, nothing really, except that it provides context for an interesting example. Say we were a pit boss in a big casino, and we’d been watching a roulette player sitting at the table for hours, amassing an unusually large pile of chips. A review of the casino’s closed circuit television revealed that the player had played 150 games of roulette and won 7 of those. What are the chances that the player’s run of good luck is an indication of cheating?

To answer that question, we first need to understand the probabilities of the game of roulette. There are 37 numbers on the roulette wheel (0 to 36), so the probability of choosing the correct number on any given spin is 1 in 37.[3] For a correct guess, the house pays out $36 for every $1 wagered. So the payout is slightly less than the fair odds, which of course ensures that the house wins in the long run.
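To see just how much less, here’s the expected value of a $1 single-number bet spelled out in a few lines (a quick sketch of my own, not from the original post):

```python
p_win = 1 / 37           # single-number bet on a European wheel
payout = 36              # dollars returned per dollar wagered on a win
ev = p_win * payout - 1  # expected profit per $1 bet
print(ev)                # about -0.027: the house keeps roughly 2.7% of every bet
```

That steady -2.7% per spin is the house edge that grinds down every gambler who doesn’t have a wearable computer in their pocket.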

In order to use classical statistics to work out the probability that our player was cheating, we would first need to recognise that our player’s run of good luck can be modelled with the binomial probability distribution:

\[P(X_{wins}) = {{Y}\choose{X}} {P_{win}}^X {P_{loss}}^{Y-X}\]

where \( {{Y}\choose{X}}\) is the number of ways to arrive at \(X\) wins from \(Y\) games and is given by \(\frac{Y!}{X!\,(Y-X)!}\)

Here are some R functions implementing these equations:[2]

```r
f <- function(n) {
  # calculate factorial of n
  if(n == 0) return(1)
  prod(c(1:n))
}

binom <- function(x, y) {
  # calculate number of ways to arrive at x outcomes from y attempts
  f(y)/(f(x)*f(y-x))
}

binom_prob <- function(x, y, p) {
  # calculate the probability of getting x outcomes from y attempts when P(x)=p
  binom(x, y)*p^x*(1-p)^(y-x)
}
```

And here’s how to calculate the probability of winning 7 out of 150 games of roulette:

```r
n_played <- 150
n_won <- 7
p_win <- 1/37
binom_prob(n_won, n_played, p_win)
```

This returns a value of 0.062, which means there is about a 6% chance of winning 7 out of 150 games of roulette.

But wait, we’re not done yet! We’ve actually found the probability of winning *exactly* 7 out of 150 games, but we really want to know the probability of winning *at least* 7 out of 150 games. So we need to sum up the probabilities associated with winning 7, 8, 9, 10, … games. This number is the *p-value*, which is used in statistics to measure the strength of the evidence against the *null hypothesis* – the idea we are trying to *disprove* – in our case, that the player *isn’t* cheating.

Confused? You’re not alone. Classical statistics is full of these double negatives and it’s one of the reasons that it’s so easy to forget what question we were even trying to answer in the first place. Before we come to a simpler approach, here’s a function for calculating the p-value for our roulette player of possibly dubious integrity (or commendable ingenuity, depending on your point of view):

```r
binom_pval <- function(n_won, n_played, p_win) {
  # calculate the p-value of a given result using the binomial probability distribution
  p <- 0
  for(n in c(n_won:n_played)) {
    p <- p + binom_prob(n, n_played, p_win)
  }
  return(p)
}

binom_pval(n_won, n_played, p_win)
```

In our case, the p-value comes out at 0.114, or 11.4%. We should settle on a cutoff p-value *prior* to performing our analysis, below which we reject the null hypothesis that our gambler isn’t cheating. In many fields, a p-value cutoff of 0.05 is used, but I’ve always felt that was somewhat arbitrary. Better, in my opinion, to avoid thinking in such black-and-white terms and consider what a particular p-value means in your specific context.[3]

In any event, our p-value tells us that there is an 11.4% chance that the player could have realised 7 wins from 150 games of roulette by chance alone. You can draw your own conclusions regarding what this means in this particular context, but if I were the pit boss scrutinising this gambler, I’d find it hard to justify throwing them out of the casino.
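Incidentally, both figures are easy to cross-check without hand-rolling the factorial functions. Here’s a quick sketch using Python’s standard library (my own aside, not part of the original post):

```python
from math import comb

def binom_prob(x, y, p):
    # probability of exactly x wins in y games
    return comb(y, x) * p ** x * (1 - p) ** (y - x)

p_win = 1 / 37
exact = binom_prob(7, 150, p_win)                             # P(exactly 7 wins)
pval = sum(binom_prob(k, 150, p_win) for k in range(7, 151))  # P(at least 7 wins)
print(exact, pval)  # about 0.062 and 0.114
```

The numbers agree with the R implementation above to three decimal places.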

Finally, here’s a plot of the probability of winning `n_won` games out of 150, with a vertical line at 7 wins:

```r
# plot distribution
n_won <- c(0:15)
p <- c()
for(n in n_won) {
  p[n+1] <- binom_prob(n, n_played, p_win)
}
plot(n_won, p, type='S', col='blue', main='Probability of n wins from 150 games')
abline(v=7, col='red')
```

You just saw the classic approach to solving what was actually a very simple problem. But if you didn’t know the formula for the binomial probability distribution, it would be hard to know where to start. It’s also very easy to get tripped up with p-values and their confusing double-negative terminology. I think you can probably see some evidence for my claim that we can easily end up forgetting the question we were trying to answer in the first place! And this was a very simple problem – things get *much* worse from here.

The good news is, there’s an easier way. We could watch someone play 150 games of roulette, then write down the number of games they won. We could then watch another 150 games and write down that result. If we did this many times, we would be able to plot a histogram showing the frequency of each result. If we watched many sequences of 150 games, we could expect the observed frequencies to start approaching the true frequencies.

But who has time to watch a few thousand sequences of 150 roulette games? Better to leverage our programming skills and *simulate* a few thousand such sequences.

Here’s a really simple roulette simulator that simulates sequences of roulette games, and returns the number of winning games in each sequence. We can use this simulator to generate sound statistical insights about our gambler.

The great thing about this simulator is that you can build it just by knowing a little about the game of roulette – it doesn’t matter if you’ve never heard of the binomial probability function, you can use the simulator to get robust answers to statistical questions.

```r
# roulette simulator
roulette_sim <- function(num_sequences, num_games) {
  lucky_number <- 12
  games_won_per_sequence <- c()
  for(n in c(1:num_sequences)) {
    spins <- sample(0:36, num_games, replace=TRUE)
    games_won_per_sequence[n] <- sum(spins==lucky_number)
  }
  return(games_won_per_sequence)
}
```

Most of the work is being done in the line `spins <- sample(0:36, num_games, replace=TRUE)`, which we use to simulate a single sequence of `num_games` spins of the roulette wheel. The `sample()` function randomly selects numbers between 0 and 36 `num_games` times and stores the results in the `spins` variable. Then, the line `games_won_per_sequence[n] <- sum(spins==lucky_number)` calculates the number of spins in the sequence that came up with our `lucky_number` and stores the result in the vector `games_won_per_sequence`. I used the number 12 as the `lucky_number` parameter, which is what I would choose if I were forced to choose a lucky number, but any number in the range 0:36 will do, as they all have an equal likelihood of turning up on any given “spin”.

Let’s simulate 10,000 sequences of 150 games and plot the result in a histogram. Simply do:

```r
# plot histogram of simulated 150-game sequences
hist(roulette_sim(10000, 150), col='blue')
```

And you’ll end up with a histogram of games won that looks like this:[4]

Hmmm…the shape of our histogram looks very much like the shape of the binomial distribution that we plotted above using the classic approach. Interesting! Could it be that our simulation is indeed a decent representation of reality?

We can also calculate an empirical p-value from our simulation results by calculating the proportion of times we won at least seven games. Here’s a general function for calculating the empirical p-value, and an example of using it to calculate our gambler’s p-value:

```r
sim_pval <- function(num_sequences, num_games, val) {
  games_won_per_sequence <- roulette_sim(num_sequences, num_games)
  return(sum(games_won_per_sequence >= val)/num_sequences)
}

pval <- sim_pval(10000, 150, 7)
```

When I ran this code, I got a p-value of 11.3%, compared with the 11.4% calculated above using the classic approach. You’ll get a slightly different result every time you run this code, but the more sequences you simulate (the `num_sequences` parameter), the more closely the empirical result will converge to the theoretical one.
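For what it’s worth, the same hack ports directly to any language with a random number generator. Here’s my own Python port (the function names are mine, not from the post), which reproduces the empirical p-value:

```python
import random

def roulette_sim_py(num_sequences, num_games, lucky_number=12, seed=1):
    """Simulate sequences of roulette games; return wins per sequence."""
    rng = random.Random(seed)
    wins = []
    for _ in range(num_sequences):
        spins = [rng.randint(0, 36) for _ in range(num_games)]
        wins.append(sum(s == lucky_number for s in spins))
    return wins

wins = roulette_sim_py(10000, 150)
pval = sum(w >= 7 for w in wins) / len(wins)
print(pval)  # close to the 0.114 analytical value
```

Ten thousand simulated sequences take a second or two and land within a fraction of a percentage point of the analytical answer.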

My intent with this article was to convince you that you can get statistically sound insights without resorting to the complexities of classic statistics. Personally, I find myself going around in circles and expending great energy for little reward when I try to solve a problem with the classic approach. On the other hand, I find that I get real insights and real intuition into a problem through simulation.

Simulation however is just one way you can hack statistics, and it won’t be applicable in all situations. For instance, in this example we happen to have a precise *generative model* for the phenomenon we wish to explore – namely, the probability of winning a game of roulette. In most trading situations, we normally have only data, or at best some assumptions about the underlying generative model. In the follow-up articles, I’ll give you examples of hacks you can apply in your trading research.

Apparently Thorp and Shannon’s roulette computer could predict which *octant* of the wheel the ball would end up in. That means that they could reduce the possible outcomes to five of the thirty-seven total possibilities, increasing their odds of winning from 1/37 to 1/5. So from a sequence of 150 games, Thorp and Shannon might expect to win a staggering 30 times.

If we simulate the probability of Thorp and Shannon winning 30 of 150 games of roulette by chance:

```r
# p-value for Thorp and Shannon
pval <- sim_pval(10000, 150, 30)
```

we end up with a p-value of zero! That is, there is no conceivable possibility of winning 30 of 150 games of roulette by chance alone. In reality, of course, the probability isn’t zero – apparently 10,000 simulations just isn’t enough to detect a single occurrence of this many wins! Resorting to the analytical solution,

```r
# analytical p-value for Thorp and Shannon
pval <- binom_pval(30, 150, 1./37)
```

we find that the probability of 30 winning spins from 150 is 1.2e-17!
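For the curious, that tail probability is easy to reproduce with nothing but the Python standard library. This is my own re-implementation of a binomial survival function, assuming the same signature as the R `binom_pval` used above:

```python
from math import comb

def binom_pval(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of at least k wins in n games."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

pval = binom_pval(30, 150, 1/37)  # on the order of 1e-17
```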

So how did Thorp and Shannon evade detection? Can we assume that the pit bosses back in the 1960s weren’t concerning themselves with the possibility that someone might be cheating? Actually, if you read their story, you find that Thorp and Shannon were plagued by the vagaries of the device itself, dealing with constant breakdowns and malfunctions that limited their ability to really exploit their edge.

Still, it’s a brilliant story and you really have to admire their ingenuity, not to mention their guts in taking on the casinos at their own game.

The post Practical Statistics for Algo Traders appeared first on Robot Wealth.

The post Simulating Variable FX Swaps in Zorro and Python appeared first on Robot Wealth.

This post shows you how to simulate variable FX swaps in both Python and the Zorro trading automation software platform.

The swap (also called the roll) is the cost of financing an FX position. It is typically derived from the central bank interest rate differential of the two currencies in the exchange rate being traded, plus some additional fee for your broker. Most brokers apply it on a daily basis, and typically apply three times the regular amount on a Wednesday to account for the weekend. Swap can be both credited to and debited from a trader’s account, depending on the actual position taken.
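As a back-of-the-envelope illustration of that mechanic, here's a hypothetical Python sketch (the 365-day convention and the `broker_fee` term mirror the code later in this post; actual broker conventions vary):

```python
def daily_swap(notional, base_rate_pct, quote_rate_pct, broker_fee=0.0, is_wednesday=False):
    """Approximate daily swap, in quote-currency units, for a long position:
    the interest rate differential applied pro-rata to the notional, less the
    broker's cut. Wednesday rolls are tripled to cover the weekend."""
    ird = (base_rate_pct - quote_rate_pct) / 100.0
    swap = notional * ird / 365.0 - broker_fee
    return 3 * swap if is_wednesday else swap

# e.g. long 10,000 units with a 2% base rate vs a 1% quote rate
swap = daily_swap(10000, 2.0, 1.0)  # ~0.27 quote-currency units per day
```

Note that with a negative differential (or a large enough broker fee) the swap flips from a credit to a debit, which is the "credited to and debited from" behaviour described above.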

Swap can have a big impact on strategies with long hold periods, such as the typical momentum strategy. Therefore, accurately accounting for it is important in such cases. Zorro’s default swap calculation relies on a constant derived from the Assets List used in the simulation, which is fine for most situations, but might lead to unrealistic results when the hold period is very long.

Here’s some code for simulating historical swaps. It takes historical central bank data from the Bank for International Settlements, via Quandl. I’ve included code for the historical interest rates of the G8 countries – to get others, you just need the relevant Quandl code.

For the Zorro version, you’ll also need Zorro S, as the Quandl bridge is not available in the free version of Zorro. However, at the end of this article, I’ve also included a Python script for downloading the data from Quandl that you can save and then import into your backtesting platform. The advantage of the Zorro version is that you can access the relevant data from within a trading script via direct link to the Quandl API. That’s super convenient and all but eliminates the need to do any data wrangling at all. The advantage of the Python version is that it is completely free, but using the data in a trading script requires a little more messing around.

In order to access data from Quandl within Zorro, you’ll need a Quandl API key (get it from the Quandl website) and enter it in your ZorroFix.ini or Zorro.ini file.

Here’s the Zorro script:

```c
/* Download historical central bank policy rates from Quandl
   and use to calculate historical swaps.
   Zorro's FX swap is interest per day per 10000 units traded,
   in account currency. */

#include <contract.c>

var calculate_roll_long(var base_ir, var quote_ir, var broker_fee)
{
    /* Calculates Zorro roll long in units of quote currency */
    var ird = (base_ir - quote_ir)/100;
    return 10000*ird/365 - broker_fee;
}

var calculate_roll_short(var base_ir, var quote_ir, var broker_fee)
{
    /* Calculates Zorro roll short in units of quote currency */
    var ird = (quote_ir - base_ir)/100;
    return 10000*ird/365 - broker_fee;
}

function run()
{
    set(PLOTNOW);
    PlotWidth = 800;
    PlotHeight1 = 400;
    PlotHeight2 = 250;
    StartDate = 20100101;
    EndDate = 20180630;

    // daily policy rates of major central banks, from Bank for International Settlements, via Quandl
    var usd_ir = dataFromQuandl(1, "%Y-%m-%d,f", "BIS/PD_DUS", 1);
    var jpy_ir = dataFromQuandl(2, "%Y-%m-%d,f", "BIS/PD_DJP", 1);
    var aud_ir = dataFromQuandl(3, "%Y-%m-%d,f", "BIS/PD_DAU", 1);
    var eur_ir = dataFromQuandl(4, "%Y-%m-%d,f", "BIS/PD_DXM", 1);
    var cad_ir = dataFromQuandl(5, "%Y-%m-%d,f", "BIS/PD_DCA", 1);
    var chf_ir = dataFromQuandl(6, "%Y-%m-%d,f", "BIS/PD_DCH", 1);
    var nzd_ir = dataFromQuandl(7, "%Y-%m-%d,f", "BIS/PD_DNZ", 1);
    var gbp_ir = dataFromQuandl(8, "%Y-%m-%d,f", "BIS/PD_DGB", 1);

    // What the broker takes in addition to the interest rate differential
    // Will vary by broker, by pair, and even by direction! Make a conservative assumption.
    var broker_fee = 0.5;

    // EUR/USD roll in AUD example
    asset("EUR/USD");

    // calculate roll long in units of quote currency
    var rl = calculate_roll_long(eur_ir, usd_ir, broker_fee);

    // convert to units of account currency - here the account currency is AUD
    // not required if account currency is the same as the quote currency
    string current_asset = Asset; // store name of currently selected asset
    asset("AUD/USD"); // switch to ACCT_CCY/QUOTE_CCY
    var p = priceClose();
    asset(current_asset); // switch back to original asset
    RollLong = rl/p; // adjust roll long calculation and set Zorro's RollLong variable

    // calculate roll short in units of quote currency
    var rs = calculate_roll_short(eur_ir, usd_ir, broker_fee);

    // convert to units of account currency - here the account currency is AUD
    // not required if account currency is the same as the quote currency
    RollShort = rs/p; // adjust roll short calculation and set Zorro's RollShort variable

    // plot roll in units of account currency
    plot("Roll Long", RollLong, NEW, BLUE);
    plot("Roll Short", RollShort, 0, RED);
}
```

One major thing to remember is that your FX broker won’t charge/pay swaps based on the exact interest rate differential. In practice, they might take some additional fat for themselves, or even adjust their actual swaps on the basis of perceived upside/downside volatility – and these may not even be symmetrical! The short story is that the broker’s cut will vary by broker, FX pair, and even by direction! You can verify that yourself by searching various brokers’ websites for their current swap rates.

So the upshot of all that is that if you want to include an additional broker fee in your simulation, recognise that it will be an estimate, do some research on what brokers are currently charging, and err on the conservative side. In the code above, the broker fee is set via the `broker_fee` variable; you can also set this to zero if you like.

The trickiest part is converting the interest rate differential of the base and quote currencies to Zorro’s `RollLong` and `RollShort` variables – but the advantage is that once you get that right, Zorro will take care of simulating the roll for you – you literally won’t have to do another thing! These variables represent the swap in account currency per 10,000 traded FX units. Most of that conversion is taken care of in the `calculate_roll_long()` and `calculate_roll_short()` functions in the code above. But these functions output the swap in units of the quote currency, so a final conversion into the account currency is still required.

The code also contains an example of that final step: converting the EUR/USD roll for an account denominated in AUD.

Here’s the output of running the script. You can see how the swap for long and short trades has changed over time. At some point in 2014, it became a less expensive proposition to sell the EUR against the USD rather than buy it. You can also see that the value of the swap is constantly changing; that’s because the calculation considers the contemporaneous exchange rate of the account currency (AUD) against the quote currency (USD) of the pair being traded.
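The account-currency conversion the script performs can be sketched in a few lines of Python (hypothetical numbers; the rule is simply to divide a quote-currency swap by the ACCT_CCY/QUOTE_CCY exchange rate):

```python
def to_account_currency(roll_quote, acct_quote_rate):
    """Convert a swap expressed in quote-currency units to account-currency
    units, given the ACCT_CCY/QUOTE_CCY exchange rate (e.g. AUD/USD when the
    quote currency is USD and the account is denominated in AUD)."""
    return roll_quote / acct_quote_rate

# a -0.9 USD roll with AUD/USD at 0.75 costs 1.2 AUD
roll_aud = to_account_currency(-0.9, 0.75)
```

Because the exchange rate in the denominator moves every day, the account-currency swap moves with it – exactly the behaviour visible in the plot.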

Here’s a Python script for downloading the same data set as used above (albeit with a longer history) from Quandl, and a function for calculating the swap. This time, the function calculates the swap per standard FX lot, which is 100,000 units of the quote currency (the Zorro script above calculates the swap per 10,000 units, as required for Zorro’s `RollLong` and `RollShort` variables).

```python
import pandas as pd
import matplotlib.pyplot as plt
import quandl

cad = quandl.get("BIS/PD_DCA")
jpy = quandl.get("BIS/PD_DJP")
chf = quandl.get("BIS/PD_DCH")
aud = quandl.get("BIS/PD_DAU")
gbp = quandl.get("BIS/PD_DGB")
nzd = quandl.get("BIS/PD_DNZ")
eur = quandl.get("BIS/PD_DXM")
usd = quandl.get("BIS/PD_DUS")  # this is the effective fed funds rate

def calculate_rolls(base, quote, broker_fee):
    # interest rate differential per day per standard lot (100,000 units)
    ird = 100000*(base - quote)/(100*365)
    ird.columns = ["IRD"]
    ird.fillna(method="ffill", inplace=True)
    ird["roll_long"] = ird["IRD"] - broker_fee
    ird["roll_short"] = -ird["IRD"] - broker_fee
    return ird
```

Plotting the historical effective fed funds rate, you can see that the data set might have some problems prior to about 1985. You may need to smooth the data or remove outliers to use it effectively.
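One hypothetical way to handle that in pandas – clip implausible values, then smooth with a rolling median (the thresholds and window here are arbitrary assumptions, not recommendations):

```python
import pandas as pd

def clean_rates(rates, lo=0.0, hi=25.0, window=21):
    """Clip rate outliers to [lo, hi], then smooth with a rolling median."""
    return rates.clip(lower=lo, upper=hi).rolling(window, min_periods=1).median()

# a lone spike of 500% gets clipped and then smoothed away by the median
noisy = pd.Series([5.0] * 10 + [500.0] + [5.0] * 10)
smooth = clean_rates(noisy)
```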

```python
ax = usd.plot(grid=True)
ax.legend(["USD Effective Fed Funds Rate"])
```

We can simulate and plot the historical swap of the AUD/CAD exchange rate as follows:

```python
broker_fee = 5  # how much does the broker take per lot of the quote currency?
aud_cad = calculate_rolls(aud, cad, broker_fee)
aud_cad[["roll_long", "roll_short"]].dropna().plot(grid=True)
```

Again, you can see some potential data issues prior to about 1990.

The cost of financing a long-term FX position can have a significant impact on the overall result of the trade. This post demonstrated a simple and inexpensive way to simulate the historical variable financing costs for FX.

*Data is the basis of everything we do as quant traders. Inside the Robot Wealth community, we show our members how to use this and other data for trading systems research in a way that goes much deeper than the basics we touched on here. But data is just one of the many algorithmic trading fundamentals we cover inside Class to Quant. Not only are our members improving their trading performance with our beginner to advanced courses, but together they’re building functioning strategies inside our community as part of our Algo Laboratory projects. If you’re interested and want to find out more, try Class to Quant for 30 days risk free. I’d love to meet you inside.

The post Simulating Variable FX Swaps in Zorro and Python appeared first on Robot Wealth.

The post Fun with the Cryptocompare API appeared first on Robot Wealth.

As nice as the user-interface is, what I really like about Cryptocompare is its API, which provides programmatic access to a wealth of crypto-related data. It is possible to drill down and extract information from individual exchanges, and even to take aggregated price feeds from all the exchanges that Cryptocompare is plugged into – and there are quite a few!

When it comes to interacting with Cryptocompare’s API, there are already some nice Python libraries that take care of most of the heavy lifting for us. For this post, I decided to use a library called `cryptocompare`. Check it out on GitHub here.

You can install the current stable release with `pip install cryptocompare`, but I installed the latest development version directly from GitHub, as only that version had support for minutely price history at the time of writing.

To install the dev version from Git Hub, do:

pip install git+https://github.com/lagerfeuer/cryptocompare.git

This version will limit you to one month’s worth of daily price data and one week’s worth of hourly data. If you’re feeling adventurous, you can install the version that I forked into my own GitHub account and modified to increase those limits. To do that, you’ll need to do:

pip install git+https://github.com/kplongdodd/cryptocompare.git

Now that we’ve got our library of API functions, let’s take a look at what we can do with Cryptocompare!

To get a list of all the coins available on Cryptocompare, we can use the following Python script:

```python
import numpy as np
import pandas as pd
import cryptocompare as cc

# list of coins
coin_list = cc.get_coin_list()
coins = sorted(list(coin_list.keys()))
```

At the time of writing, this returned a list of 2,609 coins! By comparison, there are around 2,800 stocks listed on the New York Stock Exchange.

Let’s focus on the biggest players in crypto-world: the coins with the largest market capitalisation.

We can get price data for a list of coins using the function `cryptocompare.get_price()`, and if we specify `full=True`, the API will return a whole bunch of data for each coin in the list, including last traded price, 24-hour volume, number of coins in circulation, and of course market capitalisation.

Cryptocompare’s API will only allow us to pass it a list of coins that contains no more than 300 characters at any one time. To get around that limitation, we’ll pass lists of 50 coins at a time, until we’ve passed our entire list of all available coins.
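The batching trick described above is just list slicing; a generic helper might look like this (my own convenience function, not part of the `cryptocompare` library):

```python
def chunks(seq, size=50):
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

batches = chunks(list(range(7)), size=3)  # → [[0, 1, 2], [3, 4, 5], [6]]
```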

The API returns a JSON string, which we can interpret as a dictionary in Python. Note that the outermost keys in the resulting dictionary are `'RAW'` and `'DISPLAY'`, which hold the raw data and the data formatted for better display, respectively. In our case, we prefer to work with the raw data, so we’ll keep it and discard the rest.

Here’s the code for accomplishing all that:

```python
# get data for all available coins
# limited to a list containing at most 300 characters
coin_data = {}
for i in range(len(coins)//50 + 1):
    coins_to_get = coins[(50*i):(50*i+50)]
    message = cc.get_price(coins_to_get, curr='USD', full=True)
    coin_data.update(message['RAW'])
```

`coin_data` now contains a whole bunch of dictionaries-within-dictionaries that hold our data. Each outer key corresponds to a coin symbol, and looks like this:

```python
'ZXT': {'USD': {'CHANGE24HOUR': 0, 'CHANGEDAY': 0, 'CHANGEPCT24HOUR': 0,
                'CHANGEPCTDAY': 0, 'FLAGS': '4', 'FROMSYMBOL': 'ZXT',
                'HIGH24HOUR': 2.01e-06, 'HIGHDAY': 2.01e-06, 'LASTMARKET': 'CCEX',
                'LASTTRADEID': '1422076', 'LASTUPDATE': 1491221170,
                'LASTVOLUME': 998, 'LASTVOLUMETO': 0.0020059799999999997,
                'LOW24HOUR': 2.01e-06, 'LOWDAY': 2.01e-06, 'MARKET': 'CCCAGG',
                'MKTCAP': 0, 'OPEN24HOUR': 2.01e-06, 'OPENDAY': 2.01e-06,
                'PRICE': 2.01e-06, 'SUPPLY': 0, 'TOSYMBOL': 'USD',
                'TOTALVOLUME24H': 0, 'TOTALVOLUME24HTO': 0, 'TYPE': '5',
                'VOLUME24HOUR': 0, 'VOLUME24HOURTO': 0, 'VOLUMEDAY': 0,
                'VOLUMEDAYTO': 0}},
```

That `'USD'` key is common to all the coins in `coin_data`, and it specifies the counter-currency in which prices are displayed. That key is going to be troublesome when we turn our dictionary into a more analysis-friendly data structure, like a pandas `DataFrame`, so let’s get rid of it:

```python
# remove 'USD' level
for k in coin_data.keys():
    coin_data[k] = coin_data[k]['USD']
```

Now we can go ahead and create a `DataFrame` from our `coin_data` dictionary and sort it by market capitalisation:

```python
coin_data = pd.DataFrame.from_dict(coin_data, orient='index')
coin_data = coin_data.sort_values('MKTCAP', ascending=False)
```

All good so far, but interrogating this data with `coin_data['MKTCAP'].head(20)` reveals that the coin with the highest market cap is something called AMO:

```
coin_data['MKTCAP'].head(20)
Out[3]:
AMO       1.928953e+13
WBTC*     1.421202e+11
BTC       1.156607e+11
BITCNY    6.769687e+10
ETH       5.327990e+10
NPC       3.108324e+10
XRP       2.222890e+10
XUC       1.644538e+10
BCH       1.623158e+10
EOS       1.103000e+10
VERI      7.803077e+09
PRPS      7.342302e+09
LTC       6.087482e+09
MTN       5.000000e+09
TRX       4.731000e+09
XLM       4.628553e+09
ADA       4.485383e+09
DCN       3.928000e+09
IOT       3.863547e+09
VEN       3.330000e+09
```

Wouldn’t we expect that honour to go to Bitcoin, with symbol BTC? And what about all those other coins that you’ve probably never heard of? What’s going on here?

It turns out that Cryptocompare includes data for coins that haven’t yet gone to ICO, and it appears that in such cases, the market capitalisation calculation is done using the pre-ICO price of the coin, and its total possible supply of coins.

That’s going to skew things quite significantly, so let’s exclude any coins from our list that haven’t traded in the last 24 hours. We can get this information from the `TOTALVOLUME24H` field, which is the total amount the coin has been traded in the last 24 hours against all its trading pairs:

```python
# exclude coins that haven't traded in last 24 hours
# TOTALVOLUME24H is the amount the coin has been traded
# in 24 hours against ALL its trading pairs
coin_data = coin_data[coin_data['TOTALVOLUME24H'] != 0]
```

`coin_data['MKTCAP'].head()` now looks a lot more sensible:

```
coin_data['MKTCAP'].head()
Out[4]:
BTC    1.156607e+11
ETH    5.327990e+10
XRP    2.222890e+10
XUC    1.644538e+10
BCH    1.623158e+10
```

We can get the last month’s historical daily data for the 100 top coins by market cap, stored as a dictionary of DataFrames, by doing the following:

```python
top_coins = coin_data[:100].index
df_dict = {}
for coin in top_coins:
    hist = cc.get_historical_price_day(coin, curr='USD')
    if hist:
        hist_df = pd.DataFrame(hist['Data'])
        hist_df['time'] = pd.to_datetime(hist_df['time'], unit='s')
        hist_df.index = hist_df['time']
        del hist_df['time']
        df_dict[coin] = hist_df
```

And we can access the data for any coin in the dictionary with `df_dict[coin]`, where `coin` is the symbol of the coin we’re interested in, such as ‘BTC’. Now that we have our data, we can do some fun stuff!

You will need to use the version of `cryptocompare` from my GitHub repo (see above) in order to get enough data to reproduce the examples below. In that case, once you’ve downloaded my version, just replace the `get_historical_price_day()` call in the script above with:

hist = cc.get_historical_price_day(coin, curr='USD', limit=2000)

First, let’s pull out all the closing prices from each `DataFrame` in our dictionary:

```python
# pull out closes
closes = pd.DataFrame()
for k, v in df_dict.items():
    closes[k] = v['close']

# re-order by market cap
closes = closes[coin_data.index[:100]]
```

Plot some prices from 2017, an interesting year for cryptocurrency, to say the least:

```python
# some cool stuff we can do with our data
import matplotlib.pyplot as plt
import seaborn as sns

# plot some prices
closes.loc['2017', ['BTC', 'ETH', 'LTC']].plot()
```

Plot some returns series from the same period:

```python
# plot some returns
closes.loc['2017', ['BTC', 'ETH', 'LTC']].pct_change().plot()
```

And a correlation heatmap of returns for a broader set of coins:

```python
# plot correlation matrix
sns.heatmap(closes.loc['2017', ['BTC', 'ETH', 'LTC', 'XRP', 'XUC', 'BCH',
                                'EOS', 'VERI', 'TRX']].pct_change().corr())
```

And finally, a scatterplot matrix showing distributions on the diagonal:

```python
# scatter plot matrix
sns.pairplot(closes.loc['2018', ['BTC', 'ETH', 'XRP', 'VERI', 'LTC']].pct_change().dropna())
```

There’s lots more interesting analysis you can do with data from Cryptocompare, before we even do any backtesting, for example:

- Value of BTC and other major coins traded through the biggest exchanges over time – which exchanges are dominating?
- Top coins traded by fiat currency – do some fiats gravitate towards certain cryptocurrencies?
- Are prices significantly different at the same time across exchanges – that is, are arbitrage opportunities present?5

In this post, I introduced the Cryptocompare API and some convenient Python tools for interacting with it. I also alluded to the depth and breadth of data available: over 2,000 coins, some going back several years, broken down by exchange and even counter-currency. I also showed you some convenient base-Python and pandas data structures for managing and interrogating all that data. In future blog posts, we’ll use this data to backtest some crypto trading strategies.


The post Fun with the Cryptocompare API appeared first on Robot Wealth.

The post ETF Rotation Strategies in Zorro appeared first on Robot Wealth.

Lately our Class to Quant members have been looking to implement rotation-style ETF and equities strategies in Zorro, but just like your old high-school essays, starting is the biggest barrier. These types of strategies typically scan a universe of instruments and select one or more to hold until the subsequent rebalancing period. Zorro is my go-to choice for researching and even executing such strategies: its speed makes scanning even large universes of stocks quick and painless, and its scripting environment facilitates fast prototyping and iteration of the algorithm itself – once you’ve wrestled it for a while (get our free Zorro for Beginners video course here).

I’m going to walk you through a general design paradigm for constructing strategies like this with Zorro, and demonstrate the entire process with a simple rotation algorithm based on Gary Antonacci’s Dual Momentum. By the end you should have the skills needed to build a similar strategy yourself. Let’s begin!

To construct a rotation style strategy in Zorro, we’d follow these general design steps:

- Construct your universe of instruments by adding them to an assets list CSV file. There are examples in Zorro’s History folder, and I’ll put one together for you below.
- Set up your rebalancing period using Zorro’s time and date functions.
- Tell Zorro to reference the asset list you just created using the `assetList` command.
- Loop through each instrument in the list and perform whatever calculations or analysis your strategy requires for the selection of which instruments to hold.
- Compare the results of the calculations/analysis performed in the prior step and construct the positions for the next period.

That’s pretty much it! Of course, the details of each step might differ slightly depending on the algorithm, and you will also need some position sizing and risk management, but in general, following these steps will get you 90% of the way there.

Not happy trading with a 90% complete strategy? No problem, let’s look at what this looks like in practice.

This example is based on Gary Antonacci’s Dual Momentum. We will simplify Gary’s slightly more nuanced version to the following: if US equities outperformed global equities **and** their return was positive, hold US equities. If global equities outperformed US equities and their return was positive, hold global equities. Otherwise, hold short-term bonds.
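That selection rule is simple enough to state as a few lines of Python (a sketch of the logic only – the asset labels are placeholders, and the Zorro script below is what actually gets traded):

```python
def dual_momentum_signal(us_return, global_return):
    """Simplified dual momentum: hold the better-performing equities ETF
    if its formation-period return is positive, otherwise hold bonds."""
    if us_return >= global_return and us_return > 0:
        return "US equities"
    if global_return > us_return and global_return > 0:
        return "global equities"
    return "bonds"

dual_momentum_signal(0.10, 0.05)    # → 'US equities'
dual_momentum_signal(-0.02, -0.08)  # → 'bonds'
```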

Gary has done a mountain of research on Dual Momentum and found that it has outperformed for decades. In particular, it has tended to kick you out of equities during extended bear markets, while still getting you in for most of the bull markets. Check out Gary’s website for more information and consider getting hold of a copy of his book – you can read my review here.

Our simplified version of the strategy will use a universe of three ETFs that track US equities, global equities and short-term bonds. We will use the returns of these ETFs for both generation of our trading signals and actual trading (Gary’s approach is slightly more nuanced than that – again check out his website and book for more details).

Our asset list contains the universe of instruments we wish to scan. In our case, we only need three ETFs. We’ll choose SPY for our US equities instrument, EFA for our global equities and SHY for our bonds ETF.

Zorro’s asset lists are CSV files that contain a bunch of parameters about the trading conditions of each instrument. This information is used in Zorro’s simulations, so it’s important to make it as accurate as possible. In many cases, Zorro can populate these files for us automatically by simply connecting to a broker, but in others, we need to do it manually (explained in our Zorro video course).

Our asset list for this strategy will look like this:

```
Name,Price,Spread,RollLong,RollShort,PIP,PIPCost,MarginCost,Leverage,LotAmount,Commission,Symbol
SPY,269.02,0.1,0,0,0.01,0.01,0,1,1,0.02,
SHY,83.61,0.1,0,0,0.01,0.01,0,1,1,0.02,
EFA,69.44,0.1,0,0,0.01,0.01,0,1,1,0.02,
```

You can see that most of the parameters are actually the same for each instrument, so we can use copy and paste to make the construction of this file less tedious than it would otherwise be. For other examples of such files, just look in Zorro’s History folder.

Save this file as a CSV file called AssetsDM.csv and place it in your History folder (which is where Zorro will go looking for it shortly).

Here we are going to rebalance our portfolio every month. We decided to avoid the precise start/end of the month and rebalance on the third trading day of the month. You can experiment with this parameter to get a feel for how much it affects the strategy.
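If you want to sanity-check that rule outside Zorro, here's one hypothetical way to find the third trading day of each month with pandas (approximating trading days as business days and ignoring exchange holidays):

```python
import pandas as pd

# approximate trading days with business days
days = pd.bdate_range("2020-01-01", "2020-03-31")
s = pd.Series(days, index=days)

# third business day of each month
third = s.groupby([days.year, days.month]).nth(2)
```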

Simply wrap the trading logic in the following `if()` statement:

if(tdm() == 3) { ... }

In the initial run of the script, we want Zorro to reference the newly created asset list. Also, if we don’t have data for these instruments, we want to download it in the initial run. We’ll use Alpha Vantage end-of-day data, which can be accessed directly from within Zorro scripts. These lines of code take care of that for us:

```c
if(is(INITRUN))
{
    assetList("History\\AssetsDM.csv");
    string Name;
    while(Name = loop(Assets))
    {
        assetHistory(Name, FROM_AV);
    }
}
```

Note that this assumes you’ve entered your Alpha Vantage API key in the Zorro.ini or ZorroFix.ini configuration files, which live in Zorro’s base directory. If you don’t have an Alpha Vantage API key head over to the Alpha Vantage website to claim one.

For our dual momentum strategy, we need to know the return of each instrument over the portfolio formation period. So we can loop through each asset in our list, calculate the return, and store it in an array for later use.

If you intend on using Zorro’s optimizer, perform the loop operation using a construct like:

```c
for(i=0; Name=Assets[i]; i++)
{
    ...
}
```

If you don’t intend on using the optimizer, you can safely use the more convenient `while(loop(Assets))` construct.

The reason we don’t use the latter in an optimization run is that the `loop()` function is handled differently in Zorro’s Train mode, and will actually run a separate simulation for each instrument in the loop. This is perfect when we want to trade a particular algorithm across multiple, known instruments – something like a moving average crossover traded on each stock in the S&P 500, where we wanted to optimize the moving average periods separately for each instrument2. But in an algorithm that compares and selects instruments from a universe of instruments, optimizing some parameter set on each one individually wouldn’t make sense.

This is actually a really common mistake when developing these types of strategies in Zorro, but if you understand the behavior of `loop()` in Zorro’s Train mode, it’s one that you probably won’t make again.

Here’s the code for performing the looped return calculations:

```c
if(tdm()==3)
{
    asset_num = 0;
    while(loop(Assets))
    {
        asset(Loop1);
        Returns[asset_num] = (priceClose(0)-priceClose(DAYS))/priceClose(DAYS);
        asset_num++;
    }
    ...
```

Recalling our dual momentum trading logic, we firstly check if US equities outperformed global equities. If so, we then check that their absolute return was positive. If so, then we hold US equities. If global equities outperformed US equities, we check that their absolute return was positive. If so, then we hold global equities. If neither US equities nor global equities had a positive return, we hold bonds.

If you stop and think about that logic, we are really just holding the instrument with the highest return in the formation period, with the added condition that for the equities instruments, they also had a positive absolute return. We could implement that trading logic like so:

```c
// sort returns lowest to highest
int* idx = sortIdx(Returns, asset_num);

// exit any positions where asset is ranked in bottom 2 and is not bonds
int i;
for(i=0; i<2; i++)
{
    asset(Assets[idx[i]]);
    if(Asset != "SHY")
    {
        if(NumOpenLong > 0)
        {
            printf("\nAsset to close: %s", Asset);
            exitLong();
        }
    }
}

// asset to hold
asset(Assets[idx[2]]);

/* check if asset is bonds, if so buy
   if not, if return of highest ranked asset is positive, buy
   otherwise, switch to bonds and buy */
if(Asset == "SHY") // don't apply time series momentum to bonds
{
    enterLong();
}
else if(Returns[idx[2]] > 0) // time-series momentum condition
{
    enterLong();
    asset("SHY");
    exitLong();
}
else // switch to bonds and buy
{
    asset("SHY");
    enterLong();
}
}
```

This is probably the most confusing part of the script, so let’s talk about it in some detail. Firstly, the line `int* idx = sortIdx(Returns, asset_num)` returns an array of the indexes of the `Returns` array, sorted from lowest to highest. Say our `Returns` array held the numbers 4, -2, 2. Our array `idx` would contain 1, 2, 0, because the item at `Returns[1]` is the lowest number, followed by the number at `Returns[2]`, with `Returns[0]` being the highest number. This might seem confusing, but it provides us with a convenient way to access the highest ranked instrument directly from the `Assets` array, which holds the names of the instruments in the order called by our `loop()` function.
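If you're more at home in Python, `numpy.argsort` does exactly what `sortIdx` does here, and reproduces the toy example from the paragraph above:

```python
import numpy as np

returns = np.array([4, -2, 2])
idx = np.argsort(returns)      # indices that would sort `returns` ascending
# idx → [1, 2, 0]: returns[1] is the lowest, returns[0] the highest
highest_ranked = int(idx[-1])  # index of the best performer, i.e. 0
```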

In the first loop, we use this feature to exit any open positions that aren’t the highest ranked asset – provided those lower ranked assets aren’t bonds. Remember, we might want to hold a bond position even if it isn’t the highest ranked asset, so we won’t exit any open bond positions just yet.

Next, we switch to the highest ranked instrument. If that instrument is bonds, we don’t bother checking the absolute return condition (it doesn’t apply to bonds) and go long. If that instrument is one of the equities ETFs, we check the absolute return condition. If that turns out to be true, we enter a long position in that ETF, then switch to bonds and exit any open position we may have been holding.

Finally, if the absolute return condition on our top-ranked equities ETFs wasn’t true, we switch to bonds and enter a long position.

In this case we are simply going to be fully invested with all of our starting capital and any accrued profits in the currently selected instrument. Here’s the code for accomplishing that:

```c
Capital = 10000;
Margin = Capital + WinTotal - LossTotal;
```

Note that this is only possible because we are trading these instruments with no leverage (leverage is defined in the asset list above). If we were using leverage, we’d obviously have to reduce the amount of margin invested in a given position.

Finally, here’s the complete code listing for our simple Dual Momentum algorithm. In order for the script to run, remember to save a copy of the asset list in Zorro’s History folder, and enter your Alpha Vantage API key in the Zorro.ini or ZorroFix.ini configuration files.

```c
/* Dual momentum in Zorro */

#define NUM_ASSETS 3
#define DAYS 252

int asset_num;

function run()
{
    set(PLOTNOW);
    PlotWidth = 1200;
    StartDate = 20040101;
    EndDate = 20170630;
    BarPeriod = 1440;
    LookBack = DAYS;
    MaxLong = 1;

    if(is(INITRUN))
    {
        assetList("History\\AssetsDM.csv");
        string Name;
        while(Name = loop(Assets))
        {
            assetHistory(Name, FROM_AV);
        }
    }

    var Returns[NUM_ASSETS];
    int position_diff;

    if(tdm()==3)
    {
        asset_num = 0;
        while(loop(Assets))
        {
            asset(Loop1);
            Returns[asset_num] = (priceClose(0)-priceClose(DAYS))/priceClose(DAYS);
            asset_num++;
        }

        Capital = 10000;
        Margin = Capital + WinTotal - LossTotal;

        // sort returns lowest to highest
        int* idx = sortIdx(Returns, asset_num);

        // exit any positions where asset is ranked in bottom 2 and is not bonds
        int i;
        for(i=0; i<2; i++)
        {
            asset(Assets[idx[i]]);
            if(Asset != "SHY")
            {
                if(NumOpenLong > 0)
                {
                    printf("\nAsset to close: %s", Asset);
                    exitLong();
                }
            }
        }

        // asset to hold
        asset(Assets[idx[2]]);

        /* check if asset is bonds, if so buy
           if not, if return of highest ranked asset is positive, buy
           otherwise, switch to bonds and buy */
        if(Asset == "SHY") // don't apply time series momentum to bonds
        {
            enterLong();
        }
        else if(Returns[idx[2]] > 0) // time-series momentum condition
        {
            enterLong();
            asset("SHY");
            exitLong();
        }
        else // switch to bonds and buy
        {
            asset("SHY");
            enterLong();
        }
    }
}
```

Over the simulation period, the strategy returns a Sharpe Ratio of 0.52. That's pretty healthy for something that trades so infrequently. In terms of gross returns, the starting capital of $10,000 was almost tripled, and the maximum drawdown was approximately $4,700. One of the main limitations of the strategy is that, by design, it is highly concentrated, taking only a single position at a time.

Here’s the equity curve:

Rotation style strategies* require a slightly different design approach than strategies for which the tradable subset of instruments is static. By following the five broad design principles described here, you can leverage Zorro's speed, power and flexibility to develop these types of strategies. Good luck and happy profits!

*This is just one of the many algorithmic trading fundamentals we cover inside Class to Quant. Not only are our members improving their trading performance with our beginner to advanced courses, but together they’re building functioning strategies inside our community as part of our Algo Laboratory projects. If you’re interested and want to find out more, try Class to Quant for 30 days risk free. I’d love to meet you inside.

**Where to from here?**

- *Check out my review of Gary Antonacci's Dual Momentum, and explore some other variations written in R*
- *Get our free Zorro for Beginners video series, and go from beginner to Zorro trader in just 90 minutes*
- *If you're ready to go deeper and get more practical tips and tricks on building robust trading systems, as well as joining our strong community of traders, check out our flagship offer Class to Quant.*

The post ETF Rotation Strategies in Zorro appeared first on Robot Wealth.

]]>The post Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization appeared first on Robot Wealth.

]]>This is the fourth in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration** under the Windows environment.

Part 3 is an **introduction to the model building, training and evaluation process in Keras**. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.


In the last post, we trained a densely connected feed forward neural network to forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. We landed on a model that predicted slightly better than random on out of sample data. We also saw in our learning plots that our network started to overfit badly at around 40 epochs. In this post, I’m going to demonstrate some tools to help fight overfitting and push your models further. Let’s get started.

Regularization is a commonly used technique for mitigating overfitting in machine learning models, and it can also be applied to deep learning. Regularization essentially constrains the complexity of a network by penalizing larger weights during the training process; that is, by adding a term to the loss function that grows as the weights increase.

Keras implements two common types of regularization:

- L1, where the additional cost is proportional to the **absolute value** of the weight coefficients
- L2, where the additional cost is proportional to the **square** of the weight coefficients

These are incredibly easy to implement in Keras: simply pass regularizer_l1(regularization_factor) or regularizer_l2(regularization_factor) to the kernel_regularizer argument in a Keras layer instance (details on how to do this below). Depending on the type of regularization chosen, either regularization_factor * abs(weight_coefficient) or regularization_factor * weight_coefficient^2 is then added to the total loss for each weight.
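To make the penalty concrete, here's a quick arithmetic sketch in plain Python (illustrative weights and regularization factor, not taken from any trained model):

```python
def l1_penalty(weights, factor):
    """Penalty added to the loss under L1 regularization: factor * sum(|w|)."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor):
    """Penalty added to the loss under L2 regularization: factor * sum(w^2)."""
    return factor * sum(w ** 2 for w in weights)

weights = [0.5, -1.5, 2.0]
base_loss = 0.40  # hypothetical unpenalized loss

# L1 adds 0.001 * (0.5 + 1.5 + 2.0) = 0.004; L2 adds 0.001 * (0.25 + 2.25 + 4.0) = 0.0065
print(round(base_loss + l1_penalty(weights, 0.001), 6))  # 0.404
print(round(base_loss + l2_penalty(weights, 0.001), 6))  # 0.4065
```

Note how L2 punishes the single large weight (2.0) far more heavily than L1 does, which is why it tends to spread weight magnitude across many small coefficients.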

Note that in Keras speak, 'kernel' refers to the weights matrix created by a layer. Regularization can also be applied to the bias terms via the bias_regularizer argument, and to the output of a layer via activity_regularizer.

When we add regularization to a network, we might find that we need to train it for more epochs in order to reach convergence. This implies that the network might benefit from a higher learning rate during early stages of model training.2

However, we also know that a network can sometimes benefit from a smaller learning rate at later stages of the training process. Think of the model's loss as being stuck partway down the walls of the global minimum, bouncing from one side of the loss surface to the other with each weight update. By reducing the learning rate, we make subsequent weight updates less dramatic, which enables the loss to 'fall' further down towards the true global minimum.
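This kind of plateau-based decay is easy to sketch in plain Python (toy metric history and hypothetical parameter values; Keras's actual callback also supports cooldown periods and other options): if the monitored metric hasn't improved by more than a minimum delta for `patience` consecutive epochs, multiply the learning rate by `factor`:

```python
def schedule_lr(metric_history, lr=0.001, factor=0.9, patience=2,
                min_delta=0.005, min_lr=1e-5):
    """Return the learning rate after replaying a metric history (higher = better)."""
    best, wait = float("-inf"), 0
    for metric in metric_history:
        if metric > best + min_delta:
            best, wait = metric, 0           # meaningful improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:             # plateau detected: decay the learning rate
                lr, wait = max(lr * factor, min_lr), 0
    return lr

# Accuracy stalls after epoch 2, so the rate gets cut twice over six epochs.
print(schedule_lr([0.51, 0.52, 0.52, 0.52, 0.52, 0.52]))
```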

By using another Keras callback, we can automatically adjust our learning rate downwards when training reaches a plateau:

reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9, patience = 10, verbose = 1, mode = "auto", epsilon = 0.005, min_lr = 0.00001)

This tells Keras to reduce the learning rate by a factor of 0.9 whenever validation accuracy doesn't improve for patience epochs. Also note the epsilon parameter, which controls the threshold for what counts as a new optimum: setting it to a higher value results in fewer changes to the learning rate. This parameter should be on a scale that is relevant to the metric being tracked (validation accuracy in this case).

Here's the code for an L2-regularized feed forward network with both the reduce_lr_on_plateau and model_checkpoint callbacks (data import and processing is the same as in the previous post):

###### FFN with weight regularization #####
model.reg <- keras_model_sequential()
model.reg %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001), activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001), activation = 'relu') %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001), activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')
summary(model.reg)

model.reg %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr=0.001),
  metrics = c('accuracy')
)

filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model_reg.hdf5" # set up your own filepath
checkpoint <- callback_model_checkpoint(filepath = filepath, monitor = "val_acc", verbose = 1,
  save_best_only = TRUE, save_weights_only = FALSE, mode = "auto")
reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9, patience = 20,
  verbose = 1, mode = "auto", epsilon = 0.005, min_lr = 0.00001)

history.reg <- model.reg %>% fit(
  X_train, Y_train,
  epochs = 100,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint, reduce_lr)
)

# plot training loss and accuracy
plot(history.reg)
max(history.reg$metrics$val_acc)

# load and evaluate best model
rm(model.reg)
model.reg <- keras:::keras$models$load_model(filepath)
model.reg %>% evaluate(X_test, Y_test)

Plotting the training curves now gives us three plots – loss, accuracy and learning rate:

This particular training process resulted in an out of sample accuracy of 53.4%, slightly better than our original unregularized model. You can experiment with more or less regularization, as well as applying regularization to the bias terms and layer outputs.

Dropout is another commonly used tool to fight overfitting. Whereas regularization is used throughout the machine learning ecosystem, dropout is specific to neural networks. Dropout is the random zeroing ("dropping out") of some proportion of a layer's outputs during training. The theory is that this helps prevent pairs or groups of nodes from learning random relationships that just happen to reduce the network loss on the training set (that is, that result in overfitting). Hinton and his colleagues, who introduced dropout, showed that it is generally superior to other forms of regularization and improves model performance on a variety of tasks. Read the original paper here.2

Dropout is implemented in Keras as its own layer, layer_dropout(), which zeroes the proportion of the previous layer's outputs given by its rate parameter. In practice, dropout rates between 0.2 and 0.5 are common, but the optimal value for a particular problem and network configuration needs to be determined through appropriate cross validation.

At the risk of getting ahead of ourselves, when applying dropout to recurrent architectures (which we’ll explore in a future post), we need to apply the same pattern of dropout at every timestep, otherwise dropout tends to hinder performance rather than enhance it.3
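The mechanics of dropout itself can be sketched in a few lines of plain Python (illustrative only; Keras handles this internally via layer_dropout()). During training, each activation is zeroed independently with probability `rate`, and, in the 'inverted dropout' form commonly used, the survivors are scaled by 1/(1-rate) so the expected output is unchanged:

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() > rate else 0.0 for a in activations]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
out = dropout([1.0] * 10, rate=0.3, rng=rng)
print(out)  # roughly 30% zeros; the survivors are scaled up to about 1.43
```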

Here’s an example of how we build a feed forward network with dropout in Keras:

###### FFN with dropout #####
model.drop <- keras_model_sequential()
model.drop %>%
  layer_dense(units = 150, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = 'sigmoid')
summary(model.drop)

model.drop %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr=0.001),
  metrics = c('accuracy')
)

filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model_drop.hdf5" # set up your own filepath
checkpoint <- callback_model_checkpoint(filepath = filepath, monitor = "val_acc", verbose = 1,
  save_best_only = TRUE, save_weights_only = FALSE, mode = "auto")
reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9, patience = 20,
  verbose = 1, mode = "auto", epsilon = 0.005, min_lr = 0.00001)

history.drop <- model.drop %>% fit(
  X_train, Y_train,
  epochs = 150,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint, reduce_lr)
)

# plot training loss and accuracy
plot(history.drop)
max(history.drop$metrics$val_acc)

# load and evaluate best model
rm(model.drop)
model.drop <- keras:::keras$models$load_model(filepath)
model.drop %>% evaluate(X_test, Y_test)

Training the model using the same procedure as we used in the L2-regularized model above, including the reduce learning rate callback, we get the following training curves:

One of the reasons dropout is so useful is that it enables the training of larger networks by reducing their propensity to overfit. Here are the training curves for a similar model, but this time eight layers deep:

Notice that it doesn’t overfit significantly worse than the shallower model. Also notice that it didn’t really learn any new, independent relationships from the data – this is evidenced by the failure to beat the previous model’s validation accuracy. Perhaps 53% is the upper out of sample accuracy limit for this data set and this approach to modeling it.

With dropout, you can also afford to use a larger learning rate, which makes it a good idea to combine the reduce_lr_on_plateau callback with a higher initial learning rate that can be decayed as learning stalls.

Finally, one important consideration when using dropout is constraining the size of the network weights, particularly when a large learning rate is used early in training. In the Hinton paper linked above, the authors found dropout to work best when combined with a cap on the norm of each neuron's incoming weights.

Keras makes that easy thanks to the kernel_constraint parameter of layer_dense():

max_weight_constraint <- 5

model.drop <- keras_model_sequential()
model.drop %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint),
              input_shape = ncol(X_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = 'sigmoid')
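Under the hood, a max-norm constraint simply rescales a weight vector whenever its L2 norm drifts above the chosen cap (in Keras this is applied per-neuron to the incoming weights). A plain-Python sketch with illustrative numbers:

```python
import math

def max_norm(weights, max_value):
    """Scale the weight vector back onto the max-norm ball if it has drifted outside."""
    norm = math.sqrt(sum(w ** 2 for w in weights))
    if norm <= max_value:
        return list(weights)            # already within the cap: leave unchanged
    return [w * max_value / norm for w in weights]

w = [3.0, 4.0]                 # L2 norm 5.0, above a cap of 2.0
print(max_norm(w, 2.0))        # rescaled to norm 2.0 -> [1.2, 1.6]
print(max_norm([0.3, 0.4], 2.0))  # inside the ball: unchanged
```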

This model provided an ever-so-slight bump in validation accuracy:

And quite a stunning test-set equity curve:

# get predictions on test set and plot simple, frictionless PnL
preds <- model.drop %>% predict_proba(X_test)
threshold <- 0.5
trades <- ifelse(preds >= threshold, Y_test_raw, ifelse(preds <= 1-threshold, -Y_test_raw, 0))
plot(cumsum(trades), type='l')

Interestingly, every experiment I performed in writing this post resulted in a positive out of sample equity curve. The results were all slightly different, even when using the same model setup, which reflects the non-deterministic nature of the training process (two identical networks trained on the same data can result in different weights, depending on the initial, pre-training weights of each network). Some equity curves were better than others, but they were all positive.

Here are some examples:

Of course, as mentioned in the last post, the edge of these models disappears when we apply retail spreads and broker commissions, but the frictionless equity curves demonstrate that deep learning, even using a simple feed-forward architecture, can extract predictive information from historical price action, at least for this particular data set, and that tools like regularization and dropout can make a difference to the quality of the model’s predictions.

Before we get into advanced model architectures, in the next unit I’ll show you:

- One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
- How to interrogate and visualize the training process in real time.

This post demonstrated how to fight overfitting with regularization and dropout using Keras' sequential model paradigm. While we further refined our previously identified slim edge in predicting the EUR/USD exchange rate's direction, in practical terms, traders paying retail spreads and commissions will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

**Where to from here?**

- *To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry*
- *If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform*
- *If the **technical details of neural networks** are interesting for you, you might like our introductory article*
- *Be sure to check out Part 1, Part 2, and Part 3 of this series on deep learning applications for trading.*
- *If you're ready to go deeper and get more practical tips and tricks on building robust trading systems, consider becoming a Robot Wealth member.*

The post Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization appeared first on Robot Wealth.

]]>The post Deep Learning for Trading Part 3: Feed Forward Networks appeared first on Robot Wealth.

]]>This is the third in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration** under the Windows environment. Read Part 2 here.

Part 3 is an **introduction to the model building, training and evaluation process in Keras**. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.


Now that you can train your deep learning models on a GPU, the fun can really start. By the end of this series, we’ll be building interesting and complex models that predict multiple outputs, handle the sequential and temporal aspects of time series data, and even use custom cost functions that are particularly relevant to financial data. But before we get there, we’ll start with the basics.

In this post, we’ll build our first neural network in Keras, train it, and evaluate it. This will enable us to understand the basic building blocks of Keras, which is a prerequisite for building more advanced models.

There are numerous possible ways to formulate a market forecasting problem. For the sake of this example, we will forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. That is, our model will attempt to classify the next hour’s market direction as either up or down.

Our data will consist of hourly EUR/USD exchange rate history obtained from FXCM (**IMPORTANT**: read the caveats and limitations associated with using past market data to predict the future here). Our data covers the period 2010 to 2017.

Our features will simply consist of a number of variables related to price action:

- Change in hourly closing price
- Change in hourly highest price
- Change in hourly lowest price
- Distance between the hourly high and close
- Distance between the hourly low and close
- Distance between the hourly high and low (the hourly range)

We will use several past values of these variables, as well as the current values, to predict the target. We'll also include the hour of day as a feature in the hope of capturing intraday seasonality effects.

Training of neural networks normally proceeds more efficiently if we scale our input features to force them into a similar range. There are various scaling strategies throughout the deep learning literature (see for example Geoffrey Hinton's Neural Networks for Machine Learning course), but scaling remains something of an art rather than a one-size-fits-all type of problem.

The standard approach to scaling involves normalizing the *entire* data set using the mean and standard deviation of each feature in the *training* set. This prevents data leakage from the test and validation sets into the training set, which can produce overly optimistic results. The problem with this approach for financial data is that it often results in scaled test or validation data that winds up being way outside the range of the training set. This is related to the problem of non-stationarity of financial data and is a significant issue. After all, if a model is asked to predict on data that is very different to its training data, it is unlikely to produce good results.

One way around this is to scale data relative to the recent past. This ensures that the test and validation data is always on the intended scale. But the downside is that we introduce an additional parameter to our model: the amount of data from the recent past that we use in our scaling function. So we end up introducing another problem to solve an existing one.

Like I said, feature scaling is something of an art form, particularly when dealing with data as poorly behaved as financial data!
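To make the 'scale relative to the recent past' idea concrete, here's a hedged sketch in plain Python (window length and data invented; the post's actual scaling is done in the Zorro script below). Each observation is z-scored against only the trailing `window` values that precede it, so no future information leaks into the scaled value:

```python
import statistics

def rolling_zscore(series, window):
    """Z-score each point against the `window` observations strictly before it."""
    scaled = []
    for i in range(window, len(series)):
        past = series[i - window:i]            # strictly earlier data only
        mu = statistics.mean(past)
        sigma = statistics.pstdev(past)
        scaled.append((series[i] - mu) / sigma if sigma > 0 else 0.0)
    return scaled

price_changes = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0]
print(rolling_zscore(price_changes, window=4))  # first `window` points are consumed as history
```

Note the trade-off mentioned above: `window` is a new free parameter that now has to be chosen and validated.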

We'll do our model building and experimentation in R, but first we need to generate our data. There is a Zorro script named 'keras_data_gen.c' for creating our targets and scaled features, and for exporting that data to a CSV file, in this download link. The script will allow you to code your own features and targets, use different scaling strategies, and generate data for different instruments. Just make the changes, then click 'Train' on the Zorro GUI to export the data to file. If you'd prefer to just get your hands on the data used in this post, it's also available via the download link.

Our target is the direction of the market over a period of one hour, which implies a classification problem. The target exported in the script is the actual dollar amount made or lost by going long the market at 0.01 lots, exclusive of trading costs. We need to convert this to a factor reflecting the market’s movement either up or down. More on this below.

Let’s import our data into R and take a closer look. First, here’s a time series plot of the first ten days of our scaled features:

You can see that our features are roughly on the same scale. Notice the first feature, V1, which corresponds to the hour of the day. It has been scaled using a slightly different approach to the other variables to ensure that the cyclical nature of that variable is maintained. See the code in the download link above for details.

Next, here’s a scatterplot matrix of our variables and target (the first ten days of data only):

Now that we’ve got our data, we’ll see if we can extract any predictive information using deep learning techniques. In this post, we’ll look at fully connected feed-forward networks, which are kind of the like the ‘Hello World’ example of deep learning. In later posts, we’ll explore some more interesting networks.

A fully connected feed forward network is one in which every neuron in a particular layer is connected to every neuron in the subsequent layer, and in which information flows in one direction only, from input to output.

Here’s a schematic of such a network with an input layer, two hidden layers and an output layer consisting of a single neuron (source: datasciencecentral.com):

It makes sense that our network would likely benefit from using not only the features for the current time step, but also a number of prior values as well, in order to predict the target. That means that we need to create features out of lagged values of our raw feature variables.

Thankfully, that's easily accomplished using base R's embed() function, which also automatically drops the NA values that arise in the first n observations, where n is the number of lags used as features. Here's a function that returns an expanded data set consisting of the current features as well as their num_lags lagged values. It assumes that the target is in the final column (and doesn't embed lagged values of the target), and drops the relevant NA values from the target column.

# function for creating features from lagged variables
lag_variables_to_features <- function(data, num_lags=1) {
  d <- embed(data[, -ncol(data)], num_lags+1) # this automatically drops NA, assumes target in last column
  d <- cbind(d, data[(num_lags+1):nrow(data), ncol(data)]) # add column for target, dropping num_lags
  return(d)
}

Let’s test the function and take a look at its output:

# test lagging function
set.seed(503)
dat <- replicate(3, rnorm(10, 0, 1))
dat
#               [,1]        [,2]       [,3]
#  [1,]  0.355125070 -0.42202083  2.2040012
#  [2,] -0.778893409 -0.03744167  0.4128119
#  [3,] -0.757356957 -0.20609016  1.0322519
#  [4,]  2.329800607  2.01835389  0.7804746
#  [5,]  0.283974926 -0.60559854  2.5843431
#  [6,]  1.281025216 -0.28414168  0.2339200
#  [7,] -0.002363249  0.96044445  1.3501947
#  [8,]  1.033770690  0.74774752 -0.4097266
#  [9,] -0.431933268 -0.01286499 -0.3662180
# [10,] -0.342867464 -0.71862991 -1.0912861

dat <- lag_variables_to_features(dat, 2)
dat
#              [,1]        [,2]         [,3]        [,4]         [,5]        [,6]       [,7]
# [1,] -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.355125070 -0.42202083  1.0322519
# [2,]  2.329800607  2.01835389 -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.7804746
# [3,]  0.283974926 -0.60559854  2.329800607  2.01835389 -0.757356957 -0.20609016  2.5843431
# [4,]  1.281025216 -0.28414168  0.283974926 -0.60559854  2.329800607  2.01835389  0.2339200
# [5,] -0.002363249  0.96044445  1.281025216 -0.28414168  0.283974926 -0.60559854  1.3501947
# [6,]  1.033770690  0.74774752 -0.002363249  0.96044445  1.281025216 -0.28414168 -0.4097266
# [7,] -0.431933268 -0.01286499  1.033770690  0.74774752 -0.002363249  0.96044445 -0.3662180
# [8,] -0.342867464 -0.71862991 -0.431933268 -0.01286499  1.033770690  0.74774752 -1.0912861

You can see that the function returns a new dataset with the current features and their last two lagged values, while the target remains unchanged in the final column. Note that the two rows that wind up with NA values are automatically dropped.

Essentially, this approach makes new features out of lagged values of each feature. But here's the thing about feed forward networks: they don't distinguish between more recent values of our features and older values. Obviously the network differentiates between the different features that we create out of lagged values, and has the ability to discern relationships between them, but it doesn't explicitly account for the sequential nature of the data.

That’s one of the major limitations of fully connected feed forward networks applied to time series forecasting exercises, and one of the motivators of recurrent architectures, which we will get to soon enough.

Now that we can process our input data, we can start experimenting with the model building process. The best place to start is Keras’ sequential model, which is essentially a paradigm for constructing deep neural networks, one layer at a time, under the assumption that the network consists of a linear stack of layers and has only a single set of inputs and outputs. You’ll find that this assumption holds for the majority of networks that you build, and it provides a very modular and efficient method of experimenting with such networks. We’ll use the sequential model quite a lot over the coming posts before getting into some more complex models that don’t fit this paradigm.

In Keras, the model building and exploration workflow typically consists of the following steps:

- Define the input data and the target. Split the data into training, validation and test sets.
- Define a stack of layers that will be used to predict the target from the input. This is the step that defines the network architecture.
- Configure the model training process with an appropriate loss function, optimizer and various metrics to be monitored.
- Train the model by repeatedly exposing it to the training data and updating the network weights according to the loss function and optimizer chosen in the previous step.
- Evaluate the model on the test set.

Let’s go through each step.

Here's some code for loading and processing our data. It first loads the data set we created with our Zorro script above, then creates a new data set consisting of the current value of each feature as well as its seven most recent lagged values. That is, we have a total of eight timesteps for each feature. And since we started with 7 features, we have a total of 56 input variables.

We also split the dataset into a training, validation and testing set. Here, I arbitrarily chose to use 50% of the data for training, 25% for validation and 25% for testing. Note that since the time aspect of our data is critical, we should ensure that our training, validation and testing data are not randomly sampled as is standard procedure in many non-sequential applications. Rather, the training, validation and test sets should come from chronological time periods.

Note that we convert our target into a binary outcome, which enables us to build a classifier.

Recall that we scaled our features at the same time as we generated them, so no need to do any feature scaling here.

## load, process and split data ##

# load
path <- "C:/Users/Kris/Data/"
XY <- read.csv(paste0(path, 'EURUSD_L_2010_2017.csv'), header = F)
XY <- as.matrix(XY)

# create lags
lags <- 7
proc <- lag_variables_to_features(XY, lags)

# split into training, validation and test sets
train_length <- floor(0.5*nrow(proc))
val_length <- floor(0.25*nrow(proc))
X_train <- proc[1:train_length, -ncol(proc)]
Y_train_raw <- proc[1:train_length, ncol(proc)]
Y_train <- ifelse(Y_train_raw > 0, 1, 0)
X_val <- proc[(train_length+1):(train_length+val_length), -ncol(proc)]
Y_val_raw <- proc[(train_length+1):(train_length+val_length), ncol(proc)]
Y_val <- ifelse(Y_val_raw > 0, 1, 0)
X_test <- proc[(train_length+val_length+1):nrow(proc), -ncol(proc)]
Y_test_raw <- proc[(train_length+val_length+1):nrow(proc), ncol(proc)]
Y_test <- ifelse(Y_test_raw > 0, 1, 0)

Next we define the stack of layers that will become our model. The syntax might seem quirky at first, but once you’re used to it, you’ll find that you can build and experiment with different architectures very quickly.

The syntax of the sequential model uses the pipeline operator %>%, which you might be familiar with if you use the dplyr package. In essence, we define a model using the sequential paradigm, and then use the pipeline operator to define the order in which layers are stacked. Here's an example:

model <- keras_model_sequential()
model %>%
  layer_dense(units = 150, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

This defines a fully connected feed forward network with three hidden layers, each of which consists of 150 neurons with the rectified linear ('relu') activation function. If you need a refresher on activation functions, check out this post on neural network basics.

layer_dense() defines a fully connected layer, that is, one in which each input is connected to every neuron in the layer. Note that for the first layer, we need to define the input shape, which is simply the number of features in our data set. We only need to do this on the first layer; each subsequent layer gets its input shape from the output of the prior layer.

layer_dense() has many arguments in addition to the activation function that we specified here, including the weight initialization scheme and various regularization settings. We use the defaults in this example.

Keras implements many other layers, some of which we’ll explore in subsequent posts.

In this example, our network terminates with an output layer consisting of a single neuron with the sigmoid activation function. This activation function converts the output to a value between 0 and 1, which we interpret as the probability associated with the positive class in a binary classification problem (in this case, the value 1, corresponding to an up move).
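The sigmoid squashing is worth seeing concretely. A quick sketch in plain Python (the actual computation happens inside Keras, of course):

```python
import math

def sigmoid(x):
    """Map any real-valued activation to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.0))          # 0.5: the model is maximally uncertain
print(sigmoid(4.0) > 0.9)    # strongly positive activation: confident 'up'
print(sigmoid(-4.0) < 0.1)   # strongly negative activation: confident 'down'
```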

To get an overview of the model, call summary(model) and observe the output:

___________________________________________________________________________
Layer (type)                     Output Shape                   Param #
===========================================================================
dense_1 (Dense)                  (None, 150)                    8550
___________________________________________________________________________
dense_2 (Dense)                  (None, 150)                    22650
___________________________________________________________________________
dense_3 (Dense)                  (None, 150)                    22650
___________________________________________________________________________
dense_4 (Dense)                  (None, 1)                      151
===========================================================================
Total params: 54,001
Trainable params: 54,001
Non-trainable params: 0
___________________________________________________________________________

This model architecture could be better described as ‘wide’ as opposed to ‘deep’ and it consists of around 54,000 trainable parameters. This is more than the number of observations in our data set, and has implications for the ability of our network to overfit.
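The parameter counts in the summary are easy to reproduce by hand: each dense layer has inputs × units weights plus units biases. A quick sketch in plain Python for the 56-input network above:

```python
def dense_params(layer_sizes):
    """Parameter count of a dense stack: (inputs * units + units) per layer."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# 56 inputs (7 features x 8 timesteps), three hidden layers of 150, one output neuron:
# 8550 + 22650 + 22650 + 151
print(dense_params([56, 150, 150, 150, 1]))  # 54001
```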

Configuration of the training process is accomplished via the `keras::compile()` function, in which we specify a loss function, an optimizer, and a set of metrics to monitor during training. Keras implements a suite of loss functions, optimizers and metrics out of the box, and in this example we’ll choose some sensible defaults:

```
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.0001),
  metrics = c('accuracy')
)
```

The `'binary_crossentropy'` loss function is standard for binary classifiers, and the `rmsprop()` optimizer is nearly always a good choice. Here we specify a learning rate of 0.0001, but finding a sensible value typically requires some experimentation. Finally, we tell Keras to keep track of our model’s accuracy, as well as the loss, during the training process.
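To make the loss function concrete, here’s a base-R sketch of binary cross-entropy (Keras computes this internally; this snippet is purely illustrative):

```r
# Binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)], averaged over samples
binary_crossentropy <- function(y, p) {
  eps <- 1e-7  # clip predictions to avoid log(0)
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# A confident correct prediction incurs little loss...
binary_crossentropy(1, 0.9)   # ~0.105
# ...while a confident wrong one is heavily penalized
binary_crossentropy(1, 0.1)   # ~2.303
```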

An important consideration regarding loss functions for financial prediction is that the standard loss functions rarely capture the realities of trading. For example, consider a regression model that predicts a price change over some time horizon, trained using the mean absolute error of the predictions. Say the model predicted a price change of 20 ticks, but the actual outcome was 10 ticks. In practical trading terms, such an outcome would result in a profit of 10 ticks – not a terrible outcome at all. But that result is treated the same as a prediction of 5 ticks followed by an actual outcome of -5 ticks, which would produce a loss of 5 ticks in a trading model. That’s because the loss function is only concerned with the magnitude of the difference between the predicted and actual outcomes – and that doesn’t tell the full story. Clearly, we’d like to penalize the latter error more than the former. To do that, we need to implement our own custom loss functions. I’ll show you how to do that in a later post, but for now it’s important to be cognizant of the limitations of our model training process.
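To illustrate the idea, here’s one possible shape such a loss could take. This is a standalone base-R sketch, not the custom Keras loss covered in the later post: it scales up the error whenever the predicted and actual price changes disagree in sign, and the penalty factor of 2 is an arbitrary choice.

```r
# Illustrative asymmetric loss: absolute error, scaled up when the
# predicted and actual price changes have opposite signs
sign_aware_loss <- function(predicted, actual, wrong_sign_penalty = 2) {
  err <- abs(predicted - actual)
  ifelse(sign(predicted) != sign(actual), wrong_sign_penalty * err, err)
}

sign_aware_loss(20, 10)   # same sign: loss 10
sign_aware_loss(5, -5)    # opposite sign: loss 20, despite a same-sized error
```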

We can train our model using `keras::fit()`, which exposes our model to subsequent batches of training data, updating the network’s weights after each batch. Training progresses for a specified number of epochs, and performance is monitored on both the training and validation sets.

We would normally like to stop training at the number of epochs that maximizes the model’s performance on the validation set – that is, at the point just before the network starts to overfit. The problem is that we can’t know the optimal number of epochs in advance.

To combat this, `keras::fit()` implements the concept of a callback, which is simply a function that performs some task at various points throughout the training process. There are a number of callbacks available in Keras out of the box, and it is also possible to implement your own.

In this example we’ll use the `callback_model_checkpoint()` callback, which we configure to save the network and its weights at the end of any epoch whose weight update results in improved validation performance. After training is complete, we can then load our best model for evaluation on the test set.

First, here’s how to configure the checkpoint callback (just set up the relevant filepath for your setup):

```
filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model.hdf5"
checkpoint <- callback_model_checkpoint(
  filepath = filepath,
  monitor = "val_acc",
  verbose = 1,
  save_best_only = TRUE,
  save_weights_only = FALSE,
  mode = "auto"
)
```

And here’s how to configure `keras::fit()` for a short training run of 75 epochs, with the model checkpoint callback:

```
history <- model %>% fit(
  X_train, Y_train,
  epochs = 75,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint)
)
```

After training is complete, we can plot the loss and accuracy of the training and validation sets at each epoch by simply calling `plot(history)`, which results in the following plot:

We can see that loss on the training set continuously decreases, while accuracy almost continuously increases, as training progresses. That is expected, given the power of our network to overfit. But note the small decrease in validation loss, and the bump in validation accuracy, that we also get out to about 40 epochs before performance stalls.

A validation accuracy of a little under 53% is certainly not the sort of result that would turn heads in the classic applications of deep learning, like image classification. But trading is an interesting application, because we don’t necessarily need that sort of performance to make money. Is a validation accuracy of 53% enough to give us some out-of-sample profits? Let’s find out by evaluating our model on the test set.

Here’s how to remove the fully trained model, load the model with the highest validation accuracy and evaluate it on the test set, with the output shown below the code:

```
rm(model)
model <- keras:::keras$models$load_model(filepath)
model %>% evaluate(X_test, Y_test)

# output:
# 12004/12004 [==============================] - 2s 197us/step
# $loss
# [1] 0.691
# $acc
# [1] 0.523
```

We end up with a test set accuracy that is only slightly worse than our validation accuracy.

But accuracy is one thing; profitability is another. To assess the profitability of our model on the test set, we need its actual predictions. We can get the predicted classes via `predict_classes()`, but I prefer to look at the actual output of the sigmoid function in the final layer of the model. That enables you to use a prediction threshold in your decision making, for example only entering a long trade when the output is greater than, say, 0.6.

Here’s how to get the test set predictions and implement some simple, frictionless trading logic that assigns the target as an individual trade’s profit or loss when the prediction is greater than some threshold (equivalent to a buy), and the negative of the target when the prediction is less than 1 minus the threshold (equivalent to a sell):

```
preds <- model %>% predict_proba(X_test)
threshold <- 0.5
trades <- ifelse(preds >= threshold, Y_test_raw,
                 ifelse(preds <= 1 - threshold, -Y_test_raw, 0))
plot(cumsum(trades), type = 'l')
```

This results in the following equity curve (the y-axis is measured in dollars of profit from buying and selling the minimum position size of 0.01 lots):

I think that’s quite an amazing equity curve that demonstrates the potential of even a very small edge. However, note that adding typical retail transaction costs would destroy this small edge, which suggests that longer holding periods are more sensible targets, or that higher accuracies are required in practice.
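As a rough sanity check on that claim, you can subtract a per-trade cost before accumulating the equity curve. The cost figure and the simulated trade vector below are stand-ins for illustration only, not results from the model:

```r
set.seed(42)
# Stand-in for the model's per-trade P&L vector (dollars per trade);
# the original analysis computes `trades` from the test-set predictions
trades <- rnorm(1000, mean = 0.02, sd = 1)

cost_per_trade <- 0.05  # hypothetical round-trip cost in the same units
net_trades <- ifelse(trades != 0, trades - cost_per_trade, 0)

# Compare gross and net cumulative profit
c(gross = sum(trades), net = sum(net_trades))
```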

Also note that you might get different results depending on the initial weights used in your network, as the weights aren’t guaranteed to converge to the same values when initialized to different values. If you repeat the training and evaluation process a number of times, you’ll find that validation accuracies in the range of 52-53% occur most of the time, but while most produce profitable out of sample equity curves, the range of performance is actually quite significant. This implies that there might be benefit in combining the predictions of multiple models using ensemble methods.


Before we get into advanced model architectures, in the next unit I’ll show you:

- How to fight overfitting and push your models to generalize better.
- One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
- How to interrogate and visualize the training process in real time.

This post demonstrated how to process multivariate time series data for use in a feed forward neural network, as well as how to construct, train and evaluate such a network using Keras’ sequential model paradigm. While we uncovered a slim edge in predicting the EUR/USD exchange rate, in practical terms, traders with access to retail spreads and commission will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

**Where to from here?**

- *To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry.*
- *If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform.*
- *If the **technical details of neural networks** are interesting for you, you might like our introductory article.*
- *Be sure to check out Part 1 and Part 2 of this series on deep learning applications for trading.*

The post Deep Learning for Trading Part 3: Feed Forward Networks appeared first on Robot Wealth.


This is the second in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration**, under the Windows environment.

Stay tuned for Part 3 of this series which will be published next week.

No doubt you know that a computer’s Central Processing Unit (CPU) is its primary computation module. CPUs are designed and optimized for rapid computation on small amounts of data, and as such, elementary arithmetic operations on a few numbers are blindingly fast. However, CPUs tend to struggle when asked to operate on larger amounts of data, for example when performing matrix operations on large arrays. And guess what: **the computational nuts and bolts of deep learning are all about such matrix operations.** That’s bad news for a CPU.

The rendering of computer graphics relies on these same types of operations, and Graphical Processing Units (GPUs) were developed to optimize and accelerate them. GPUs typically consist of hundreds or even thousands of cores, enabling massive parallelization. **This makes GPUs a far more suitable hardware for deep learning than the CPU.**

Of course, you can do deep learning on a CPU. And this is fine for small scale research projects or just getting a feel for the technique. But for doing any serious deep learning research, access to a GPU will provide an enormous boost in productivity and shorten the feedback loop considerably. Instead of waiting days for a model to train, you might only have to wait hours. Instead of waiting hours, you’ll only have to wait minutes.

When selecting a GPU for deep learning, the most important characteristic is the **memory bandwidth** of the unit, not the number of cores as one might expect. That’s because it typically takes more time to read the data from memory than to perform the actual computations on that data! So if you want to do fast deep learning research, be sure to check the memory bandwidth of your GPU. By way of comparison, my (slightly outdated) NVIDIA GTX 970M has a memory bandwidth of around 120 GB/s. The GTX 980Ti clocks in at around 330 GB/s!

If you don’t have access to a GPU, or if you just want to try out some deep learning in Keras before committing to a full-blown deep learning research project, then the CPU installation is the right one for you. It will only take a couple of minutes and a few lines of code, as opposed to an hour or so and a deep dive into your system for the GPU option.

Here’s how to install Keras to run TensorFlow on the CPU.

At the time of writing, the Keras R package could be installed from CRAN, but I preferred to install directly from GitHub. To do so, you need to first install the `devtools` package, and then do:

devtools::install_github("rstudio/keras")

Then, load the Keras package and make use of the convenient `install_keras()` function to install both Keras and TensorFlow:

```
library(keras)
install_keras()
```

That’s it! You now have the CPU-based versions of Keras and TensorFlow ready to go, which is fine if you are just starting out with deep learning and want to explore it at a high level. If you don’t want the GPU-based versions just yet, then I’m afraid that’s all we have for you until the next post!

Installing versions of Keras and TensorFlow compatible with NVIDIA GPUs is a little more involved, but is certainly worth doing if you have the appropriate hardware and intend to do a decent amount of deep learning research. The speed-up in model training is *really* significant.

Here’s how to install and configure the NVIDIA GPU-compatible version of Keras and TensorFlow for R under Windows.

First, you need to work out whether you have a compatible NVIDIA GPU installed on your Windows machine. To do so, open your NVIDIA Control Panel. Typically, it’s located under `C:\Program Files\NVIDIA Corporation\Control Panel Client`, but on recent Windows versions you can also find it by right-clicking on the desktop and selecting ‘NVIDIA Control Panel’, as in the screenshot below:

When the control panel opens, click on the System Information link in the lower left corner, circled in the screenshot below:

This will bring up the details of your NVIDIA GPU. Note your GPU’s model name (here mine is a GeForce GTX 970M, which you can see under the ‘Items’ column). While you’re at it, check how your GPU’s memory bandwidth stacks up (remember, this parameter is the limiting factor of the GPU’s speed on deep learning tasks).

Next, head over to NVIDIA’s GPU documentation, located at https://developer.nvidia.com/cuda-gpus. You’ll need to find your GPU model on this page and work out its Compute Capability Number. This needs to be 3.0 or higher to be compatible with TensorFlow. You can see in the screenshot below that my particular GPU model has a Compute Capability of 5.2, which means that I can use it to train deep learning models in TensorFlow. Hooray for productivity.

In practice, my GPU model is now a few years old and there are much better ones available today. But still, using this GPU provides far superior model training times than using a CPU.

Next, you’ll need to download and install NVIDIA’s CUDA Toolkit. CUDA is NVIDIA’s parallel computing API that enables programming on the GPU. Thus, it provides the framework for harnessing the massive parallel processing capabilities of the GPU. At the time of writing, the release version of TensorFlow (1.4) was compatible with version 8 of the CUDA Toolkit (**NOT version 9**, which is the current release), which you’ll need to download via the CUDA archives here.

You’ll also need to get the latest drivers for your particular GPU from NVIDIA’s driver download page. Download the correct driver for your GPU and then install it.

Finally, you’ll need to get NVIDIA’s CUDA Deep Neural Network library (cuDNN). cuDNN is essentially a library for deep learning built using the CUDA framework and enables computational tools like TensorFlow to access GPU acceleration. You can read all about cuDNN here. In order to download it, you will need to sign up for an NVIDIA developers account.

Having activated your NVIDIA developers account, you’ll need to download the correct version of cuDNN. **The current release of TensorFlow (version 1.4) requires cuDNN version 6**. However, the latest version of cuDNN is 7, and it’s not immediately obvious how to acquire version 6. You’ll need to head over to this page, and under the text on ‘What’s New in cuDNN 7?’ click the Download button. After agreeing to some terms and conditions, you’ll then be able to select from numerous versions of cuDNN. Make sure to get the version of cuDNN that is compatible with your version of CUDA (version 8), as there are different sub-versions of cuDNN for each version of CUDA.

Confusing, no? I’ve circled the correct (at the time of writing) cuDNN version in the screenshot below (click for a clearer image):

Once you’ve downloaded the cuDNN zipped file, extract the contents to a directory of your choice.

**The %PATH% variable**

We also need to add the paths to the CUDA and cuDNN libraries to the Windows `%PATH%` variable so that TensorFlow can find them. To do so, open the Windows Control Panel, then click on

Then, when the System Properties window opens, click on

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\libnvvp
C:\ProgramData\cuDNN_6_8\bin
```

Here’s a screenshot of the three windows and the relevant buttons involved in this process (click for a larger image):

Having followed those steps, you’re finally in a position to install Keras and configure it to run TensorFlow on the GPU. From a fresh R or RStudio session, install the Keras package if you haven’t yet done so, then load it and run `install_keras()` with the argument `tensorflow = 'gpu'`:

```
devtools::install_github("rstudio/keras")
library(keras)
install_keras(tensorflow = 'gpu')
```

The installation process might take quite some time, but don’t worry, you’ll get that time back and a whole lot more in faster training of your deep learning experiments.

**That’s it! Congratulations! You are now ready to perform efficient deep learning research on your GPU! We’ll dive into that in the next unit.**

When I first set this up, I found that Keras was throwing errors that it couldn’t find certain TensorFlow modules. Eventually I worked out that it was because I already had a version of TensorFlow installed in my main conda environment thanks to some Python work I’d done previously. If you have the same problem, explicitly setting the conda environment immediately after loading the Keras package should resolve it:

```
library(keras)
use_condaenv("r-tensorflow")
```

Also note that the compatible versions of CUDA and cuDNN may change as new versions of TensorFlow are released. It is worth double checking the correct versions at tensorflow.org.

**Where to from here?**

- *If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform.*
- *If the **technical details of neural networks** are interesting for you, you might like our introductory article.*
- *To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry.*

The post Deep Learning for Trading Part 2: Configuring TensorFlow and Keras to run on GPU appeared first on Robot Wealth.


This is the first in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration**, under the Windows environment.

In the last few years, deep learning has gone from being an interesting but impractical academic pursuit to a ubiquitous technology that touches many aspects of our lives on a daily basis – including in the world of trading. This meteoric rise has been fuelled by a perfect storm of:

- Frequent breakthroughs in deep learning research which regularly provide better tools for training deep neural networks
- An explosion in the quantity and availability of data
- The availability of cheap and plentiful compute power
- The rise of open source deep learning tools that facilitate both the practical application of the technology *and* innovative research that drives the field ever forward

Deep learning excels at discovering complex and abstract patterns in data and has proven itself on tasks that have traditionally required the intuitive thinking of the human brain to solve. That is, **deep learning is solving problems that have thus far proven beyond the ability of machines**.

Therefore, it is incredibly tempting to **apply deep learning to the problem of forecasting the financial markets**. And indeed, certain research indicates that this approach has potential. For example, the Financial Hacker found an edge in predicting the EUR/USD exchange rate using a deep architecture stacked with an autoencoder. Here at Robot Wealth, we compared the performance of numerous machine learning algorithms on a financial prediction task, and deep learning was the clear outperformer.

However, as anyone who has used deep learning in a trading application can attest, the problem is not nearly as simple as just feeding some market data to an algorithm and using the predictions to make trading decisions. Some of the common issues that need to be solved include:

- Working out a sensible way to **frame the forecasting problem**, for example as a classification or regression problem.
- **Scaling data** in a way that facilitates training of the deep network.
- Deciding on an appropriate **network architecture**.
- **Tuning the hyperparameters of the network and optimization algorithm** such that the network converges sensibly and efficiently. Depending on the architecture chosen, there might be a couple of dozen hyperparameters that affect the model, which can provide a significant headache.
- **Coming up with a cost function** that is applicable to the problem.
- Dealing with the problem of an ever-changing market. **Market data tends to be non-stationary**, which means that a network trained on historical data might very well prove useless when used with future data.
- There may be **very little signal** in historical market data with respect to the future direction of the market. This makes sense intuitively if you consider that the market is impacted by more than just its historical price and volume. Further, pretty much everyone who trades a particular market will be looking at its historical data and using it in some way to inform their trading decisions. That means that market data alone may not give an individual much of a unique edge.

The first five issues listed above are common to most machine learning problems and their resolution represents a big part of what applied data science is all about. The implication is that while these problems are not trivial, they are by no means deal breakers.
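On the data-scaling point, one common approach (sketched below with stand-in data; scaling choices are problem-dependent) is to standardize features using statistics computed from the training set only, so that no information from the validation or test periods leaks into training:

```r
# Standardize using training-set statistics only, to avoid lookahead bias
set.seed(1)
train <- matrix(rnorm(200, mean = 5, sd = 3), ncol = 2)  # stand-in features
test  <- matrix(rnorm(40,  mean = 5, sd = 3), ncol = 2)

mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

train_scaled <- scale(train, center = mu, scale = sigma)
test_scaled  <- scale(test,  center = mu, scale = sigma)  # same parameters

round(colMeans(train_scaled), 10)  # ~0 by construction
```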

On the other hand, problems 6 and 7 may very well prove to thwart the best attempts at using deep learning to turn past market data into profitable trading signals. **No machine learning algorithm or artificial intelligence can make good future predictions if its training data has no relationship to the target being predicted**, or if that relationship changes significantly over time.

Said differently, feeding market data to a machine learning algorithm is only useful to the extent that the past is a predictor of the future. And we all know what they say about past performance and future returns.

In deep learning trading systems that I’ve taken to market, I’ve always used additional data, not just historical, regularly sampled price and volume data and transformations thereof. While there does appear to be a slim edge in using deep learning to extract signals from past market data, that edge may not be significant enough to overcome transaction costs. And even if it does, it may not be significant enough to justify the risk and effort required to take it to market. On the other hand, supplementing historical market data with innovative, uncommon data sets has proven more effective – at least in my experience.

In this series of posts, we explore and compare various deep learning tools and techniques in relation to market forecasting using the Keras package. We will do so using only historical market data, so the results need to be interpreted considering the discussion above.

We expect deep learning to uncover a slim edge using historical market data, but the purpose of this analysis is to compare different deep learning tools in relation to market forecasting, not necessarily to build a market-beating trading system. That I leave to you – perhaps you can supplement the models we explore here with some creative or uncommon data or other tools to find a real edge.

Keras is a high-level API for building and training neural networks. Its strength lies in its ability to facilitate fast and efficient research, which of course is very important for systematic traders, particularly those of the DIY persuasion for whom time is often the limiting factor to success. Keras is easy to learn and its syntax is particularly friendly. Keras also plays nicely with CPUs and GPUs and can integrate with the TensorFlow, Theano and CNTK backends – without limiting the flexibility of those tools. For example, pretty much anything you can implement in raw TensorFlow, you can also implement in Keras, likely at a fraction of the development effort.

Keras is also implemented in R, which means that we can use it directly in any trading algorithm developed on the Zorro Automated Trading Platform, since Zorro has seamless integration with an R session.

In the deep learning experiments that follow in Part 2 and beyond, we’ll use the R implementation of Keras with TensorFlow backend. We’ll be exploring fully connected feedforward networks, various recurrent architectures including the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM), and even convolutional neural networks which normally find application in computer vision and image classification.

Stay tuned.

The post Deep Learning for Trading Part 1: Can it Work? appeared first on Robot Wealth.


Amongst value managers, I saw scepticism replaced by a sense of anxiety over being late to the party. The first question I was asked by nearly every value manager I met over the last year or so was: “what is everyone else doing with machine learning?”

This sense of FOMO is arising now because general knowledge of the potential of machine learning has reached a critical mass amongst the decision makers and management across the industry.

Despite the seclusion inherent in our industry, where ‘secret sauce’ is closely guarded, the fruits of the labour of the early adopters are gaining ever-increasing public exposure, shifting the perception of the technology from ‘potential’ to ‘proven’.

In short, finance is catching up to the many other industries where this technology is already in common use.

When my consulting company first started applying and recommending machine learning solutions to financial problems, we encountered mixed attitudes from the industry. While a few were enthusiastic adopters who could see the potential, the attitude that machine learning was less than useful – even dangerous – and dismissals of the technology as ‘voodoo science’ were incredibly common.

**Surprisingly, these attitudes often came from other quant researchers. **

Within the quant community, I’ve witnessed first-hand this attitude gradually giving way to one of recognition of machine learning as a useful tool. I’ve even noted some folks who decried the approach now calling themselves ‘machine learning experts’ on their business cards and LinkedIn profiles. Times really have changed, and they changed in an astonishingly short space of time.

More recently, I’ve seen an *even more* significant change among market participants.

Amid the growing consensus that alpha is discoverable in alternative data, our own work and the work of others suggests that alpha from such sources may be uncorrelated with traditional factors like value and momentum. Perhaps, for the time being at least, they can coexist and even provide new dimensions of diversification.

Alpha generation has always been about information advantage – either having access to uncommon insights gained through ingenuity *or* common insights acted upon before everyone else.

Machine learning and artificial intelligence is simply the modern evolution of a repeating historical pattern in the context of today’s big data world. For example, interpreting satellite imagery of a retailer’s car park reveals insight about its sales figures before they are released to the market. Deriving sentiment from Twitter or Weibo and relating it to an asset’s returns provides an uncommon insight gained through ingenuity.

**Artificial intelligence excels at tasks like these to the point that such AI is rapidly becoming a commodity.**

As the pool of data (be it alternative, big, structured or unstructured) continues its exponential growth, machine learning and artificial intelligence tools will increasingly be adopted for processing and unravelling it – simply because they are the best tools for the job.

JP Morgan believes there will come a time when they are the *only* tools for the job.

**My experience tells me that that time has already arrived – fund managers who are slow to the party would do well to get on board to not only build competitive advantage, but to maintain what they’ve already got.**

*Have you witnessed a shift in the way machine learning and artificial intelligence is viewed and used in the finance industry? I’d love to hear about other people’s experiences in the comments.*

The post From Potential to Proven: Why AI is Taking Off in the Finance World appeared first on Robot Wealth.

Robot Wealth Members can access the script that produced these results via the Strategies and Tools section of their dashboard.