The post Backtesting Bias: Feels Good, Until You Blow Up appeared first on Robot Wealth.

Knowing exactly what causes exploitable inefficiencies would make predicting market behaviour and building profitable trading strategies a fairly cushy gig, right?

If you’re an engineer or scientist reading this, you are probably nodding along, hoping I’ll say the financial markets show some kind of *domino effect* for capitalists; that you can model them with the kinds of analytical methods you’d throw at a construction project or a petri dish. But unfortunately, trying to shoehorn the markets into formulas is a futile exercise… like stuffing Robot Wealth’s frontman Kris into a suit.

Since the markets aren’t strictly deterministic, testing your new and exciting strategy ideas is a bit tricky. We’d all love to know for sure whether our ideas will be profitable before we throw real money at them. But, since you can’t realistically apply mathematical equations to your strategy to derive its future performance, you’ll need to resort to the next best thing — **experimentation.**

By experimenting with your trading strategy during development, you’re left wide open to some *fatal errors* when assessing its potential future performance. These can, and will, cost you time, money and many, many headaches.

**In this post, you’re going to learn how to identify and avoid the more common, often expensive simulation biases so you can build more robust, more profitable systematic trading strategies. Let’s go!**

As far as I know, you’ve got two options when it comes to testing systematic trading strategies:

- You can find out the **actual** performance of your strategy
- You can find out the **likely** performance of your strategy

The first option simply involves throwing real money at a live version of your strategy and seeing if it does anything exciting. The second is **testing your strategy**, usually on past data, before following option 1 later down the pipeline. If, like us, you *don’t* enjoy needlessly throwing away wads of your capital, you’ll first test your strategy via simulation, assessing whether it is likely to perform as well in the future as it did in the past.

In financial trading, such a simulation of past performance is called a **backtest.**

Despite what you’ll read online in what I’ll call *retail trading folklore*, backtesting your strategies is not as simple as aligning signals with entry and exit prices and summoning the results. Simply doing a backtest is one thing, but gaining accurate, actionable data that will help you keep more of your capital is another.

Such simplistic approaches will undoubtedly lead to *some* **backtest bias**.

In this post, we’re going to focus on a few ways that biases in your **development methodology** can hold you back from successful trading. Many of these effects are subtle yet profound: they can and will creep into your development process and can have disastrous effects on strategy performance.

If you could travel back in time, you’d probably be quite tactical about your “creative” work wherever you landed. Google? Your idea. *Predicting* future events to become a local deity? Hold my feather crown.

In trading, you might want to hold off on all that. Look-ahead bias is introduced by allowing future knowledge to affect your decisions around historical scenarios or events. As a trader running backtests, this bias impacts your trade decisions by **acting upon knowledge that would not have been available** at the time the original trade decision was taken.

What does this look like in practice?

A popular example is executing an intra-day trade on the basis of the day’s closing price, when that closing price is not actually known until the end of the day.

Even if you use backtesting software that’s designed against look-ahead bias, you need to be careful. A subtle but potentially serious mistake is to use the entire simulation period to calculate a trade parameter (for example, a portfolio optimization parameter) which is then retrospectively applied at the beginning of the simulation.

This error is so common that you must always double check for it. And triple check if your backtest looks really good.
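To make that full-sample parameter mistake concrete, here’s a minimal sketch (all data and names are hypothetical) that sizes positions from a volatility estimate computed two ways: once over the whole simulation, and once using only data that would have been available at each point in time:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical daily returns with a volatility regime change halfway through
returns = pd.Series(np.concatenate([
    rng.normal(0, 0.01, 250),  # quiet first year
    rng.normal(0, 0.03, 250),  # volatile second year
]))

# WRONG: a volatility estimate computed over the whole simulation,
# then applied retrospectively from day one (look-ahead bias)
lookahead_size = 0.01 / returns.std()

# RIGHT: only use data available at the time of each decision
causal_vol = returns.expanding(min_periods=20).std().shift(1)
causal_size = 0.01 / causal_vol

# Early in the sample, the look-ahead version already "knows" about
# the volatile regime to come, so it sizes positions very differently
print(round(lookahead_size, 2), round(causal_size.iloc[50], 2))
```

In the quiet first year, the look-ahead version trades much smaller than any honest simulation would have, which quietly flatters the backtest.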

If you run a backtest producing annual returns in the ballpark of **thousands of percent**, don’t quit your day job – you’ve likely stumbled across overfitting bias. Apart from being comedic firewood for your favourite FX forum, these backtests are useless for real, systematic trading purposes.

Check out the following plots to see overfitting in action. The blue squares in the figure below are an artificially generated quadratic function with some noise added to distort the underlying signal. The lines represent various models fitted to the data points. The red line is a linear regression line; the green, blue and orange lines are quadratic, cubic and quartic functions respectively. Apart from the linear regression line, these all do a decent job of modelling the data, for this region in the parameter space.

The pink line is a high-order polynomial regression: notice that it fits this data best of all:

But do these models hold up out-of-sample? What I’m really asking is, how well do they generalize to data that was not used in the model-fitting process?

Well, the next plot shows the performance of the quadratic, cubic and quartic functions in a new region of the observed variable space, meaning an out-of-sample data set. In this case, the quadratic function is *clearly the best performer*, and we know that it most closely matches the underlying generating function – **this is an example of a well-fit model.**

The other models do a pretty crummy job of predicting the value of the function for this new, unseen region of parameter space, even though they looked pretty attractive on the in-sample data.

The best model on the in-sample data set, the high-order polynomial, does a terrible job of modeling this out-of-sample region. In fact, in order to see it, we have to look at a completely different portion of the y-axis, and even use a logarithmic scale to make sense of it:

This model is predicting hugely negative values of our function when we know that it could *never* generate a single negative value (thanks to the quadratic term in the underlying function). The function looks nothing like a quadratic function: it is more like a hyperbolic function. Or bad modern art.
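The same effect is easy to reproduce numerically. This sketch (my own toy setup, not the exact data behind the figures) fits polynomials of increasing degree to a noisy quadratic and compares in-sample and out-of-sample error:

```python
import numpy as np

rng = np.random.default_rng(0)

# The "true" underlying signal is y = x^2, observed with noise
x_in = np.linspace(0, 5, 30)                  # in-sample region
y_in = x_in ** 2 + rng.normal(0, 2, x_in.size)
x_out = np.linspace(5, 10, 30)                # out-of-sample region
y_out = x_out ** 2 + rng.normal(0, 2, x_out.size)

def mse(degree):
    """Fit a polynomial in-sample; return (in-sample, out-of-sample) MSE."""
    coefs = np.polyfit(x_in, y_in, degree)    # fit on in-sample data only
    in_err = np.mean((np.polyval(coefs, x_in) - y_in) ** 2)
    out_err = np.mean((np.polyval(coefs, x_out) - y_out) ** 2)
    return in_err, out_err

for d in (1, 2, 3, 4, 9):
    in_err, out_err = mse(d)
    print(f"degree {d}: in-sample MSE {in_err:10.1f}, out-of-sample MSE {out_err:14.1f}")
```

The in-sample error can only fall as you add parameters; the out-of-sample error of the high-order fit explodes, exactly like the magenta line in the plots.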

This misrepresentation of the underlying process is a classic example of overfitting, and it’ll have you banging your head against the wall a lot in your early days of algo trading. In fact, you’ll face this problem *every day* in your strategy development; you just learn how to eat its punches with experience.

Overfitting bias affects strategies that are tested on in-sample data. The same data is used to optimize and then test the strategy. Common sense will tell you that a strategy will perform well on the data with which it was optimized – *that’s the whole point of optimization! *What’s more, exhaustively searching the parameter space and choosing a local performance maximum will undoubtedly lead to overfitting and failure in an out-of-sample test.

It is crucial to understand that the purpose of the in-sample data set is *not* to measure the performance of a strategy. The in-sample data is used to **develop** the strategy and find parameter values that may be suitable. At best, you should consider the in-sample results to be an optimistic estimate.

*Avoid using the in-sample results to benchmark likely future performance of a strategy.*

Look again at the figures above. The model with the best in-sample performance was the high-order polynomial shown by the pink line. That said, its out-of-sample performance was as enviable as stepping on a plug. The quadratic, cubic and quartic models all performed reasonably well in the in-sample test, but the quadratic model was the clear star performer in the out-of-sample test. Obviously, you can infer little about performance on unseen data using in-sample testing.

Here’s the really insidious part…

When you fit a model (a trading strategy) to a noisy data set (and financial data is a rave), **you risk fitting your model to the noise, rather than the underlying signal.** The underlying signal is the anomaly or price effect that you believe provides profitable trading opportunities, and this signal is what you are actually trying to capture with your model.

Noise gets between you and the money. It’s a random process, and it’s unlikely to repeat itself exactly the same way. If you fit your model to the noise, you’ll end up with a random model. Unless you enjoy paying for your broker’s 12oz rib-eye steak, this isn’t something you should ever trade.

So what’s the overarching lesson from all this?

Well, in-sample data is only useful in the following ways:

- Finding out whether a strategy can be profitable and under what conditions
- Determining which parameters have a significant impact on performance
- Determining sensible ranges over which parameters might be optimized
- Debugging the strategy, that is, ensuring trades are being entered as expected

Given the topic of this post, you’ll notice something missing from that list: *measuring the performance of a trading strategy.*

Any estimate of performance you derive from an in-sample test is plagued with overfitting bias and is likely to be an optimistic estimate – unless your entire development process is watertight…but that’s a story for another time.

The solution to overfitting bias is adopting a sensible approach to the markets and strategy development. This includes:

- Keeping strategies simple: the fewer fittable parameters, the better.
- Favouring trades that can be rationalised in a sentence over blindly data mining for trading rules.
- Optimising for robustness, not in-sample performance (more on this later).
- Avoiding the temptation to be precise in your model specification: market data is noisy and fickle, and any signal is weak.
- Avoiding trades that will, at best, marginally cover retail trading costs, such as scalping.

**Like an annoying uncle who won’t stop shouting at the 6 o’clock news, I’ll repeat myself: don’t use in-sample data to measure your strategy’s performance unless you have been very very very careful!**

Like taxes, data-mining bias is unavoidable. So, rather than spending an eternity trying to eliminate it entirely, just be aware of it and accept that, generally, your strategies won’t perform as well in the markets as they did in your simulations.

You’ll commonly introduce data-mining bias when selecting the best performer from a bunch of strategy variants, variables or markets to continue developing. If you’re persistent enough in trying strategies and markets, you’ll eventually find one that performs well simply due to luck.

Think of it this way. Say you develop a trend following strategy in FX. The strategy crushes its backtest on EUR/USD but flops on USD/JPY. Any sensible person would trade the EUR/USD market, right? Sure, but you’ve just introduced selection bias into your process. Now, your estimate of the strategy’s performance is upwardly biased. Should you throw the strategy in the bin? Not necessarily. Maybe it performed well on EUR/USD specifically for good reason. But nevertheless, some selection bias has crept into your development process.

There are statistical tests to account for data mining bias, including comparing the performance of the strategy with a distribution of random performances. You can find examples of this in the Robot Wealth Advanced Course.
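One generic sketch of such a test (this is an illustration of the idea, not the specific tests from the course): compare the strategy’s Sharpe ratio against a null distribution built by randomly reordering its positions over the same returns, which destroys any timing skill while keeping the position distribution intact. All data here is simulated:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical daily market returns and a strategy's daily positions (+1/0/-1)
market = rng.normal(0.0002, 0.01, 1000)
positions = rng.choice([-1, 0, 1], size=1000)
strategy_returns = positions * market

def sharpe(r):
    # Annualised Sharpe ratio, assuming a zero risk-free rate
    return np.mean(r) / np.std(r) * np.sqrt(252)

observed = sharpe(strategy_returns)

# Null distribution: same positions, randomly reordered
null = np.array([
    sharpe(rng.permutation(positions) * market) for _ in range(1000)
])

# p-value: how often does a random ordering look at least this good?
p_value = np.mean(null >= observed)
print(f"observed Sharpe {observed:.2f}, p-value {p_value:.2f}")
```

A low p-value suggests the performance isn’t easily explained by luck; a high one is a strong hint you’ve been data mining.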

You can also use out-of-sample data, but this quickly becomes problematic as we only have a finite amount of historical data on which to develop.

So your best bet to overcome selection bias is simply adopting a sensible, measured approach to strategy development, as outlined above.

As a rule of thumb, you want to build **robust trading strategies** that exploit real market anomalies or inefficiencies. What’s more, you want to do this with an approach grounded in **simplicity.** The more complex your approach, the more likely you are to fall into the traps we’ve talked about above. Either way, it’s surprisingly easy to find strategies that appear to do well, but whose performance turns out to be due to luck or randomness. That’s part of the game.

You have probably noticed that I introduced the concept of a “sensible approach to strategy development”. But what does that look like? We’ve covered it at a conceptual level, which is useful in its own right. That said, we think that teaching our approach in detail is much better accomplished through our Algo Bootcamps. In Bootcamp, you can learn first-hand by watching us apply the process in real time and participating in the development of real trading strategies. There are some very smart people inside our community, too, who can keep you on the right track.

This blog post originally appears as an excerpt from our *Algorithmic* *Trading with Zorro* course, which will be available in the coming weeks.


The post Momentum Is Dead! Long Live Momentum! appeared first on Robot Wealth.

As you might expect, we found evidence suggesting that risk premia are time-varying. If we could somehow predict this variation, we could use that prediction to adjust the weightings of our portfolio and quite probably improve the strategy’s performance.

This might sound simple enough, but we actually found compelling evidence both *for* and *against* our ability to time risk premia returns.

We’re always telling our Bootcamp participants that developing trading and investment strategies requires the considered balancing of evidence in the face of uncertainty. In this case, we decided that there was enough evidence to suggest that we could weakly predict time-varying risk premia returns, at least to the extent that slight weight adjustments in accordance with these predictions might provide value.

The strategy was already decent enough, so we were loath to add additional complexity that could bite us later. There was compelling evidence that our predictions could add value. But there was also a troubling deterioration in the quality of these predictions over time. In the end, we added only a very slight weight adjustment on the basis of these predictions.

Why am I telling you all this?

Well, I am *really* curious as to whether you would have made the same decision as we did. In this post, I’ll provide a bunch of our findings and let you make up your own mind. The best decision for us at the time was to only incorporate a very small timing aspect in our risk premia strategy and move on to something else. But I don’t think everyone would agree. This stuff is always context-dependent, and we all have a different context, but still, I’d love to hear what you would have done in the comments.

As I mentioned above, our strategy was already looking quite decent before we started exploring ways to time the market. Here’s a long-term backtest, before costs (many of our ETFs weren’t around for the entirety of this backtest, so we had to create synthetic asset data from indexes, mutual funds, and other relevant sources):

The strategy had a backtested Sharpe ratio of 1.22 and a Compound Annual Growth Rate (CAGR) of 6.6%. If we could lever it up 2x costlessly (which of course we can’t), we could bump the CAGR up to over 12%:

Over the same period, the S&P500 delivered a CAGR of around 8.3% at a Sharpe of approximately 0.6.
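For reference, the CAGR and Sharpe figures quoted throughout this post can be computed from a return series as follows. This is a generic sketch; the exact conventions (zero risk-free rate, 252 trading days per year) are my assumptions:

```python
import numpy as np

def cagr(equity, periods_per_year=252):
    """Compound annual growth rate from an equity curve."""
    years = (len(equity) - 1) / periods_per_year
    return (equity[-1] / equity[0]) ** (1 / years) - 1

def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio, assuming a zero risk-free rate."""
    return np.mean(returns) / np.std(returns) * np.sqrt(periods_per_year)

# Example: a constant 0.03% daily return compounded over ten years
daily = np.full(252 * 10, 0.0003)
equity = np.cumprod(np.concatenate([[1.0], 1 + daily]))
print(f"CAGR: {cagr(equity):.1%}")
```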

Two anomalies seem to pop up over and over again in the markets: *momentum* and *value*. The AQR paper *Value and Momentum Everywhere* is a good summary for the uninitiated. Essentially, the authors demonstrate a momentum and value effect *within* every asset class they look at, as well as *across* asset classes. This suggests that we might be able to use relative momentum and value rankings across the assets in our risk premia universe as a simple prediction of future returns.

We ended up ignoring the value effect for now (we ran out of time, and the strategy was good enough to get into the market at the end of the Bootcamp, but we’ll likely revisit this in the future), and instead focused on the momentum effect across our risk premia universe.

The thing with momentum is that we don’t really know exactly what it is or how to calculate it. So we deferred to the simplest approach we could think of to estimate it: the rate of change of price over some formation period.

We performed a classic rank-based factor analysis by:

- Calculating our momentum estimate.
- Ranking each of our assets according to this estimate.
- Looking at subsequent returns over some holding period for each rank.

So our momentum analysis is really subject to two parameters: the formation period used in the momentum estimate, and the holding period used to assess the momentum factor’s relationship with future returns.
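The three steps above can be sketched in pandas. The data, universe size and column names here are hypothetical; only the mechanics of the rank-based analysis are the point:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical monthly prices for an 8-asset universe
dates = pd.date_range("2000-01-01", periods=240, freq="MS")
prices = pd.DataFrame(
    np.cumprod(1 + rng.normal(0.005, 0.04, (240, 8)), axis=0),
    index=dates, columns=[f"asset_{i}" for i in range(8)],
)

formation, hold = 6, 3  # months

# 1. Momentum estimate: rate of change over the formation period
momentum = prices.pct_change(formation)

# 2. Cross-sectional rank each month (1 = highest momentum)
ranks = momentum.rank(axis=1, ascending=False)

# 3. Mean forward return over the holding period, for each rank
fwd_return = prices.pct_change(hold).shift(-hold)
mean_by_rank = (
    pd.DataFrame({"rank": ranks.stack(), "fwd": fwd_return.stack()})
    .groupby("rank")["fwd"].mean()
)
print(mean_by_rank)
```

On real data with a momentum effect, the mean forward return tends to decline as you move down the ranks; on this random data it won’t, which is itself a useful sanity check.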

We looked at all combinations of 1, 3, 6, 9 and 12 month formation periods and 1, 3, 6, 9 and 12 month hold periods. We found a clear and persistent momentum effect, at least on average over the whole sample.

Here’s a selection of factor rank bar plots showing the mean future return by momentum rank (1 representing the highest momentum, 8 the lowest):

We actually saw some sort of momentum effect in every combination of formation and holding period that we looked at.

Next, we tried to quantify the strength of this cross-sectional momentum effect. We did that by looking at the difference in annualised returns between the top and bottom *n* assets by momentum rank (again, for a combination of formation and holding periods).

Here are some plots that show the difference in annualised returns between the top and the bottom *n* assets by momentum rank. The formation period (in months) is on the x-axis. The holding period (in months) is on the y-axis. The colour represents the magnitude of outperformance of the top-ranked asset.

First, for *n = 2:*

And for *n = 4*:

We see that pretty much across the board, *assets with higher recent momentum tend to outperform those with lower recent momentum*, again on average over the whole sample of our data.

We can also see that the effect is *greater the shorter the holding period.* This is unsurprising, but from a strategy development point of view is somewhat disappointing, because the shorter the holding period, the more frequently we’d need to adjust our positions to capitalise on the effect and the higher our cost of trading. Nothing comes for free, apparently.

To summarise our findings to this point, we see a strong momentum effect for formation periods of 3 to 12 months. And the effect is stronger for shorter holding periods.

You might, therefore, be convinced (and indeed many of our Bootcamp participants were) that we should only hold the assets in our risk premia universe with the highest 3-12 month momentum.

But so far we’ve only looked at the *mean* momentum outperformance over the entire 20-year data set. Markets are dynamic and noisy, and looking at summary statistics like the mean can hide important information.

Therefore, before we made any decisions, we looked at the consistency of momentum outperformance over time.

Here are some plots of 3-month and 6-month momentum outperformance over time for formation periods 3 to 12.

The dots represent mean outperformance of top-ranked assets over the holding period *annualised over the given year*. The lines are LOESS curves.

These plots suggest that the momentum train has been running out of steam for a number of years now. That is, we see a clear decline in the cross-sectional momentum effect over the sample period.

The momentum effect over the *whole sample* is significant. But the *decaying performance* suggests caution in trading the effect aggressively.

At this point, many of our Bootcamp participants were wondering why we’d still bother looking at momentum given the clear decaying performance in the previous charts.

We were still taking it seriously at this point, though, because it has worked *exceptionally* well for *as long as we have history available.* But there is a real question as to whether the increased turnover and potential reduction in diversification as we rotate into high-momentum assets is justified given the decaying performance.

Here’s a backtest for a strategy which, every month:

- ranks each asset according to its trailing six-month returns
- selects the top four assets and weights each in inverse proportion to its volatility over the previous three months
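The rebalance rule above can be sketched as a weight-construction function. Inputs and names here are hypothetical, and real-world details like trade timing and costs are omitted:

```python
import numpy as np
import pandas as pd

def momentum_inverse_vol_weights(monthly_returns: pd.DataFrame,
                                 top_n: int = 4,
                                 formation: int = 6,
                                 vol_window: int = 3) -> pd.Series:
    """Weights for one rebalance date, from a trailing window of monthly returns."""
    # Rank assets by trailing formation-period return
    trailing = (1 + monthly_returns.tail(formation)).prod() - 1
    winners = trailing.nlargest(top_n).index

    # Weight each winner in inverse proportion to its recent volatility
    vol = monthly_returns[winners].tail(vol_window).std()
    inv_vol = 1 / vol
    return inv_vol / inv_vol.sum()

# Hypothetical example: 8 assets, a year of monthly returns
rng = np.random.default_rng(3)
rets = pd.DataFrame(rng.normal(0.005, 0.03, (12, 8)),
                    columns=[f"asset_{i}" for i in range(8)])
weights = momentum_inverse_vol_weights(rets)
print(weights.round(3))
```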

This backtest has a before-cost CAGR of 8.9% at a Sharpe ratio of 1.14. This is a higher return than our baseline strategy at a similar, though slightly lower, Sharpe ratio – probably due to a reduction in diversification.

We can get some insight into what this strategy is doing by looking at its asset weights over time:

Compare this to the asset weights of our baseline strategy:

Chalk and cheese. The momentum strategy has higher returns and better drawdown control. Its lower Sharpe comes by way of increased concentration (reduced diversification), and it turns out that it has over *5x the turnover*.

We weren’t overly impressed by this trade-off. Specifically, we weren’t sold on the idea that there’s enough evidence to convince us to run a momentum strategy at the expense of diversification (what do you think? Let us know in the comments). However, despite its decay over the last decade or two, the historic momentum outperformance is *remarkable*. You won’t see a much bigger anomaly than that. We could therefore certainly entertain overweighting assets with high relative momentum and underweighting those with low relative momentum, based on the evidence we’ve seen to date.

Intuitively, we prefer a more subtle way to incorporate the momentum effect, one that adjusts portfolio weights slightly based on our estimate of relative (cross-sectional) momentum. That way, we’re always holding *some* of each asset in our universe, but we might be underweight when an asset class has been underperforming relative to the others.

It’s possible to get super-complicated with this (Black-Litterman, Bootstrapping, etc.). Knowing that any improvement is likely to be marginal above our already-decent strategy, we decided not to try anything too complicated here. We simply adjusted our baseline asset weights slightly depending on the relative momentum factor.
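One hypothetical way to implement such a tilt (our actual adjustment scheme isn’t spelled out here, so treat this purely as an illustration): scale each baseline weight by a factor linear in momentum rank, then renormalise so the weights still sum to one:

```python
import pandas as pd

def tilt_weights(baseline: pd.Series, momentum_rank: pd.Series,
                 tilt: float = 0.1) -> pd.Series:
    """Tilt baseline weights toward high-momentum assets.

    momentum_rank: 1 = highest momentum. tilt = 0 leaves weights unchanged;
    larger values push more weight toward high-momentum assets.
    """
    n = len(baseline)
    # Linear score in [-1, 1]: +1 for the top-ranked asset, -1 for the bottom
    score = 1 - 2 * (momentum_rank - 1) / (n - 1)
    tilted = baseline * (1 + tilt * score)
    return tilted / tilted.sum()

# Hypothetical four-asset baseline with equal weights
baseline = pd.Series(0.25, index=["stocks", "bonds", "gold", "commodities"])
ranks = pd.Series([1, 3, 2, 4], index=baseline.index)
print(tilt_weights(baseline, ranks).round(4))
```

Every asset keeps a non-zero weight, so we retain exposure to all the risk premia while leaning slightly toward recent winners.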

Here’s how that backtests:

This gives a CAGR of 7.4% at a Sharpe of 1.3. Here are the asset weights:

The portfolio is consistently well diversified and we’ve increased returns and Sharpe ratio over the baseline strategy. However, it turns over about 2x more than the baseline strategy. We feel that this is a much more attractive trade-off. Do you?

In our recent Bootcamp, we took a deep dive on the momentum effect and tried to make a sensible decision about incorporating it into our risk premia strategy. The evidence for the momentum effect includes:

- A wealth of empirical evidence in favour of momentum over many years
- On average, a clear and persistent momentum effect (noting that *when looking at averages much detail is hidden*)
- On average, clear outperformance of top-ranked assets over bottom-ranked

The evidence against includes:

- Outperformance is more pronounced for shorter hold periods, which implies more frequent rebalancing and higher costs
- Recent deterioration across the board

How can we weigh up this evidence in the context of our risk premia strategy? Here’s a summary of our thought process:

- We are confident that exposure to risk premia is a good idea that is rewarded over the long term.
- Every time we are not in the market we are giving up exposure to those risk premia.
- So we need to be pretty confident in our timing ability to get out of the market.
- We are not *that* confident.
- We can see an obvious and clear momentum effect (at least in the past).
- This effect has deteriorated in the past decade or two.
- We give up quite a lot to access the momentum effect if we make binary decisions to get in or out of an asset. Specifically, we give up exposure to certain risk premia at certain times, and we give up the diversification benefit to our portfolio variance.
- We also increase turnover significantly.
- We can, to an extent, have our cake and eat it too by adjusting baseline asset weights based on the cross-sectional momentum factor, rather than making binary in-out decisions.
- Using this approach, we give up some of the momentum effect, but we retain the benefits of diversification as well as constant exposure to the risk premia.

That thought process seems to logically suggest that the momentum adjustment approach makes the most sense in the context of our risk premia strategy. This also implies that we need to give some thought to *how* we implement these adjustments.

In the end, we decided that a simple adjustment to the baseline weights based on our estimate of the momentum factor is sufficient – it affords us simple and easy access to the momentum effect without compromising our exposure to risk premia or the benefits of diversification. Previously, we alluded to some more complex approaches for adjusting these weights, such as Black-Litterman, which would probably allow us to squeeze out a couple more drops of performance.

But that’s not the best use of our time given the bigger picture of our broader trading operation. First and foremost, we’re not building a risk premia strategy – we’re helping our members and Bootcamp participants build out their trading capability. At the early stages, we stand to gain *a lot* from adding additional edges to our portfolio. We would probably gain *something* from a more complex momentum tilt on our risk premia strategy, but it’s going to be nowhere near as beneficial as diversifying across strategies. So we opted for a simple approach that gets us into the market and hot on the trail of active alpha strategies to add to the portfolio.

This bigger picture will change. When our portfolio is more mature, it will likely make a lot of sense to revisit the risk premia strategy and try to squeeze a little more out. There might well come a time when this is our biggest or most sensible opportunity. But that time isn’t now.

- Our strategy is based on exposure to risk premia for the long term.
- We think we might be able to gain some benefit from trying to time our risk premia exposures.
- Momentum has been a remarkable anomaly for a long time.
- Its performance has deteriorated for a decade or two.
- Strategy design is all about weighing evidence in the face of uncertainty.
- Context matters both at the strategy level and the bigger picture trading operation level.
- For our specific context, we found a way to incorporate momentum timing into our risk premia strategy with sensible trade-offs.

One of the most fun things about independent trading is not only weighing the strategy-level evidence that you collect yourself, but deciding what it actually means for *your* specific situation. No one can tell you what the right answer is – partly because it doesn’t exist, and partly because everyone’s context is different. *You* have to make a decision at some point and take action based on *your* best judgment. It’s the ultimate exercise in backing yourself and taking responsibility for your own decisions. That’s also why I think trading isn’t for everyone – not everyone is comfortable taking on that level of responsibility. But if you do, then trading is the best game in town.

The tricky part about weighing evidence and making smart trading decisions in the face of uncertainty is that *it takes experience to do it well*. You get that experience by getting kicked around in the markets for a few years – which isn’t particularly enjoyable or financially rewarding. In our Bootcamp program, we pass on the experience and intuition that we fought hard for over many years, minus the battle scars that we picked up along the way.

If that sounds like something you could benefit from, join the waiting list for our next Bootcamp program.


The post Harvesting Risk Premia appeared first on Robot Wealth.

*This article is part of a series derived from our most recent Algo Boot Camp, in which we developed a strategy for harvesting risk premia. We have allocated proprietary capital to the strategy, and many of our members are trading it too.*

*In our Boot Camps we develop trading strategies in collaboration with the Robot Wealth community over an 8-week period. The Boot Camp format is proving incredibly useful for teaching our members how to research, develop, think about the markets and execute real trading strategies. They get to watch us do it every step of the way, and see every decision we make.*

*In our next Boot Camp, we’ll be developing a portfolio of active FX strategies. Find out more about Robot Wealth’s Algo Boot Camps, including how you can be a part of the next one, here.*

Trading and investing doesn’t have to be complicated. Check out this chart:

The blue line shows returns from US Stocks from 1900 to today. That’s a 48,000x increase in nominal value.

The yellow line shows returns from US Bonds from 1900 to today. That’s a 300x increase in nominal value.

So it’s pretty obvious what we need to do in order to make money in the markets. Assuming I have a fairly long investment horizon, I buy the stocks, I buy the bonds. I go to the beach.

But of course…it’s not quite that simple. Unless you’re a robot.

If you are a human with normal human fears, feelings, and lifestyle and income uncertainties, then we can’t discuss the rewards of buying stocks and bonds without discussing the risks.

The reason that stocks tend to go up in value over the long run is that they have a tendency to go down in value – sometimes quite considerably – in the short and medium term.

Look again at the chart above. Notice the logarithmic y-axis. That’s the best way to look at long term asset prices. But it does tend to misrepresent what the experience of holding US stock exposure over that period would actually have been like.

Check out this chart, which takes the blip in the red square, corresponding to the GFC, and plots the S&P 500 in dollar terms.

That 50% decline looks benign in the long-term chart, but how would you really feel if your million-dollar stock portfolio was suddenly worth $500k?

Obviously, it’s not very much fun to watch half of your asset value crumble in front of your eyes.

It bears repeating: the reason stocks go up in the long term is that they tend to go down (sometimes violently) in the short and medium term. That’s a highly unattractive quality for an investment asset – so holders demand some kind of *reward* or *premium* for taking on that risk.

But it’s not just stocks. Any asset whose fundamental value is dependent on uncertain factors – or “risk” – tends to increase in value over the long term, more than the interest you would receive on the same amount of money. Rather than saying that investors are compensated for investing in particular assets, we instead say that investors are compensated for *taking on risk* – hence the concept of “risk premia.”

Under this paradigm, investing becomes an exercise in risk management. And good risk management requires a decent understanding of the risks being taken, coupled with some intuition around why reward should flow to the investor for taking on a particular risk.

If this sounds weird, consider that pretty much any investment you might make is based around you anticipating some reward or payoff, knowing that there’s some level of risk involved. For instance, say you purchase a government bond. In this case, you know with a fairly high level of certainty what the reward will be at maturity. The risks that you bear in making this investment are the chance of the government defaulting, as well as the volatility in the price of the bond between the purchase time and maturity (this is risky in the sense that if you needed to liquidate prior to maturity, volatility exposes you to the risk of making a loss on the sale).

If you instead invested in a stock, you may have a much less certain idea of the expected reward. In addition, the risks associated with stock investing are usually greater than buying bonds – just look at the historical volatility of stock indexes compared with bond markets.

The different risk-reward profiles of these investments (including their uncertainty) should give pause to the investor to consider their approach. Is one investment superior to the other? Should you put all your eggs in one basket? Is there an optimal allocation into both investments?

These questions are really the crux of risk premia investing and no doubt you can see that an understanding of the risks associated with each investment is key to any investment decision.

In a practical sense, being long risk premia means buying and holding assets that are exposed to various *risk factors.*

A risk factor is simply a class of risks that explain (or partially explain) the reward associated with buying and holding an asset. One model of common risk factors might include:

- **real interest rates**: the risk of exposure to changing inflation-adjusted interest rates – in simple terms, this consists of the risk of incurring opportunity cost, and all investable assets carry this risk
- **inflation**: the risk that cash received from an investment won’t be worth what you thought it would, thanks to prices rising relative to the value of cash
- **credit**: the risk that a counter-party is unable to meet the terms of an agreement
- **liquidity**: the risk that there won’t be a counter-party to whom to sell your asset without incurring significant costs
- **growth**: the risk of uncertainty in economic growth, and macroeconomic conditions changing unexpectedly
- **political**: risks associated with changing regulation and political instability

We can think of a particular asset as being composed of various risk factors. For instance a US government bond is mostly going to be exposed to inflation risk. It carries little to no credit risk, since the US Treasury is almost sure to pay you back. A stock, on the other hand, is going to be exposed to all sorts of risk including economic, political, inflation and liquidity risk.
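To make the decomposition concrete, here’s a toy sketch in R. The exposure numbers are purely illustrative assumptions for the sake of the example, not estimates of any real asset’s factor loadings:

```r
# Toy "assets as risk factors" decomposition: rows are assets, columns are
# risk factors, entries are illustrative exposure weights (made-up numbers)
exposures <- matrix(
  c(0.00, 0.90, 0.00, 0.10,   # US government bond: mostly inflation risk
    0.30, 0.20, 0.30, 0.20),  # stock: spread across several risk factors
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("us_gov_bond", "stock"),
    c("credit", "inflation", "growth", "liquidity")
  )
)
exposures
```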

Here’s a chart that demonstrates this concept of “assets as risk factors:”

Being “long risk premia” is equivalent to being long some combination of risk factors. *But what is the optimal combination?* Does an optimal combination even exist from a trader’s perspective? We explored these questions throughout the Boot Camp, but we can start the discussion by thinking about the different conditions that generally give rise to premia for taking on different types of risk.

Asking “when are different risks rewarded?” is the same as asking “when do different assets go up in value?”

If you think about it, it makes sense that different risks tend to be rewarded under different conditions. For instance, during a market crash or recession, bonds have tended to outperform in the past. That’s another way of saying that taking on inflation risk is rewarded. During an equities bull market, taking on economic risk is rewarded through rising stock prices and dividends.

Here’s a summary of the types of risk that have generally been rewarded under various market conditions:

h/t: GestaltU – Dynamic Asset Allocation for Practitioners

| Years | Environment | Factors Most Rewarded |
|---|---|---|
| 1980 – 1991 | Post-Inflation | Inflation |
| 1992 – 1999 | Equity Bull Market | Credit, Growth, Liquidity |
| 2000 – 2003 | Tech Collapse | Real Interest Rates, Liquidity |
| 2004 – 2007 | Equity Bull Market | Growth, Political |
| 2008 – 2011 | GFC | Real Interest Rates |
| 2011 – 2018 | Long Equity Recovery | Real Interest Rates, Credit |

As with all things related to the markets, hindsight is a wonderful thing. Anyone can look back and work out which risks were rewarded in the past. The real trick is predicting which risks will be rewarded in the future.

Of course, no one has a crystal ball, so there’s always uncertainty around our forecasts. *Often we are more interested in managing this uncertainty than we are in absolute returns.*

Therefore, the goal often becomes to construct a portfolio of various risk factors in pursuit of a good trade-off between future reward and uncertainty.

There are two broad approaches to constructing portfolios of risk factors:

The first is strategic allocation: constructing a portfolio that aims to deliver a certain level of performance regardless of the prevailing conditions. The Bridgewater All Weather fund is an example of this approach:

The investment objective and policy of the Fund are to provide attractive returns with relatively limited risks, with no material bias to perform better or worse in any particular type of economic environment. The portfolio is expected to perform approximately as well in rising or falling inflation periods, or in periods of strong or weak economic growth.

- The All Weather Story, Bridgewater, 2016

Such portfolios will typically have proportionately significant dollar exposure to long- and intermediate-term government bonds, smaller dollar exposure to equities, and a minor allocation to gold and possibly other commodities. But many variations on this theme exist.

The significant exposure to low-volatility, positive-carry fixed income assets tends to give these sorts of portfolios a relatively smooth performance curve, at the expense of the additional upside that’s possible from exposure to equities.
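As a minimal sketch of what such a permanent allocation looks like in code (the weights and returns below are made-up numbers for illustration, not Bridgewater’s actual allocation or any forecast):

```r
# Illustrative fixed weights for an all-weather-style portfolio (assumed)
weights <- c(stocks = 0.30, long_bonds = 0.40, intermediate_bonds = 0.15,
             gold = 0.075, commodities = 0.075)

# Hypothetical annual returns for each asset class (assumed)
asset_returns <- c(stocks = 0.08, long_bonds = 0.04, intermediate_bonds = 0.03,
                   gold = 0.02, commodities = 0.01)

# With fixed weights, the portfolio return is just the weighted sum
portfolio_return <- sum(weights * asset_returns)
portfolio_return
```

Note the dominant dollar weight in fixed income, which is what smooths the performance curve.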

The second is tactical allocation: moving into and out of various risk exposures based on some signal or forecast. The well-known Dual Momentum strategy is a simple, yet extreme, example of this approach, as it shifts the entire allocation between US equities, international equities and government bonds. Most variants of tactical allocation instead involve re-weighting the portfolio’s allocation to be overweight certain assets at certain times, while still maintaining some allocation to other factors.
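A sketch of that all-or-nothing switching logic might look like the following. This is an assumed simplification of the Dual Momentum rules, not the strategy’s exact published specification:

```r
# Simplified dual-momentum allocation signal (assumed logic): inputs are
# trailing 12-month returns for US equities, international equities and T-bills
dual_momentum_signal <- function(us_ret_12m, intl_ret_12m, tbill_ret_12m) {
  # absolute momentum: hold equities only if they beat the risk-free return
  if (max(us_ret_12m, intl_ret_12m) <= tbill_ret_12m) {
    return("bonds")
  }
  # relative momentum: otherwise pick the stronger equity market
  if (us_ret_12m >= intl_ret_12m) "us_equities" else "intl_equities"
}

dual_momentum_signal(0.12, 0.08, 0.02)    # "us_equities"
dual_momentum_signal(-0.05, -0.10, 0.02)  # "bonds"
```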

Many variations on the tactical allocation theme exist, and a significant proportion of the funds management industry is based on this approach.

But here’s the thing about tactical allocation: it’s *hard*. The premise of this approach is that a skilled manager can outperform a permanent allocation using clever timing and selection of factors. But using this approach, it’s all too easy to mis-time active decisions and wind up with something that underperforms a more strategic, low-turnover portfolio. So you can end up spending a lot of time and effort for little, or even negative, reward.

Throughout the recent Boot Camp, we explored both approaches and developed an algorithm to manage our own long risk premia portfolio which combines both of these approaches. In the next few blog articles, we’ll share with you some of the insights that we gained along the way.

The post Harvesting Risk Premia appeared first on Robot Wealth.


This is Part 2 in our **Practical Statistics for Algo Traders** blog series—don’t forget to check out **Part 1** if you haven’t already.

Even if you’ve never heard of it, the Law of Large Numbers is something that you understand intuitively, and probably employ in one form or another on an almost daily basis. But human nature is such that we sometimes apply it poorly, often to great detriment. Interestingly, psychologists have found strong evidence that, despite the intuitiveness and simplicity of the law, humans make **systematic errors** in its application. It turns out that we all tend to make the same mistakes – even trained statisticians who not only should know better, but do!

In 1971, two Israeli psychologists, Amos Tversky and Daniel Kahneman, published *“Belief in the Law of Small Numbers”*, reporting that

People have erroneous intuitions about the laws of chance. In particular, they regard a sample randomly drawn from a population as highly representative, that is, similar to the population in all essential characteristics.

So what is this Law of Large Numbers? What are the consequences of a misplaced belief in the law of **small** numbers? And what does it all mean for algo traders? Well, to answer these questions, we first need to talk about burgers.

Put simply, the Law of Large Numbers states that if we select a sample from an entire population, the mean of our sample approaches the mean of the population as we increase our sample size. Said differently, the greater our sample size, the less uncertainty we have regarding our conclusions about the population.

We all understand this law on an intuitive level. For instance, say you’re looking at reviews of burger joints in your local area. You come across a place that only has two reviews, both of them rating the restaurant 5 out of 5. There’s another place that has an average rating of 4.7, but it has 200 reviews.

You know instinctively that the place rated 4.7 is the most likely of the two to dish up a fantastic burger, even though its average rating is less than the perfect 5 of the first restaurant. That’s the law of large numbers in action.

How many reviews would it take before you started considering that there was a good chance that the first burger joint served better burgers than the second? 5? 10? 100?

Let’s assume that the burger joint with the perfect record after two reviews is actually destined for a long-term average rating of just over 4. We could simulate several thousand reviews whose aggregate characteristics match this assumption with the following R code:

library(ggplot2)

# burger joint destined for a long-term average rating of around 4 out of 5
probs <- c(0.025, 0.05, 0.125, 0.5, 0.35)
probs <- probs/sum(probs)
ratings <- c(1, 2, 3, 4, 5)
p <- sum(probs*ratings)

reviews <- sample(1:5, 5000, replace=TRUE, prob=probs)
mean(reviews)

ggplot() +
  aes(reviews) +
  geom_histogram(binwidth=1, col="black", fill="blue", alpha=.75) +
  labs(title="Histogram of Ratings", x="Rating", y="Count")

And here’s a histogram of the simulated reviews – note that the rating most often received was 4 out of 5:

The output of the simulation gives us a “population” of reviews for our burger joint. Next we’re interested in the uncertainty associated with a small number of reviews – a “sample” drawn from the “population”. How representative are our samples of the population? In particular, how many reviews do we need in order to reflect the actual mean of 4?

To answer that, we can turn again to simulation. The following R code samples the synthetic reviews we created above repeatedly for various sample sizes, and then plots the results in a scatter plot. You can see the true average rating as a black line, and the other restaurant’s 200-review average as a dashed red line:

# average rating for a given number of reviews
num_reviews <- sample(c(1:100), 10000, replace=TRUE)
average_rating <- c()
for(i in c(1:length(num_reviews))) {
  average_rating[i] <- mean(sample(reviews, num_reviews[i], replace=FALSE))
}

# plot
ggplot() +
  aes(x=num_reviews, y=average_rating) +
  geom_point(color="blue", alpha=0.5) +
  geom_hline(yintercept=p, color="black", size=1, show.legend=TRUE) +
  geom_hline(yintercept=4.7, color="red", linetype="dashed", size=1) +
  labs(title="Convergence of Sample Mean to Population Mean", x="Number of Reviews", y="Average Rating") +
  annotate("text", x=80, y=p-0.1, label="True average rating", size=8) +
  annotate("text", x=85, y=4.9, label="Other restaurant's average rating", size=8)

We can see that as the sample size grows, the spread in the average rating decreases, and starts to converge around the true average rating. At a sample size of 100 reviews, we could conceivably end up with an average rating of anywhere between about 3.75 and 4.25.

But look at what happens when our sample size is small! Even with 50 reviews, it’s possible that the sample’s average grossly over- or under-estimates the true average.

We can see that with a sample of 10 reviews, it’s possible that we end up with a sample average that exceeds the average of the gourmet, 4.7-star restaurant. But even out to about 25 reviews, we could still end up with an average rating that isn’t clearly distinguishable from the other restaurant’s.

Even with 100 reviews, there is still some uncertainty around the sample average – much less than with 10 reviews, but the point is that it still exists! We run into problems because, according to Kahneman and Tversky, we tend to grossly misjudge this uncertainty, in many cases ignoring it altogether!

Personally, I try to incorporate uncertainty into my thinking about most things in life, not just burgers and trading. But Kahneman and Tversky make the point that even when we do this, we tend to muck it up! A more robust solution is to use a quantitative approach to factoring uncertainty into decision making. Bayesian reasoning is a wonderful paradigm for doing precisely this, but that’s a topic for another time. Here, I merely want to share with you some examples and applications related to trading.

So to conclude our treatise on burger reviews: if we are comparing burger joints under a 5-star review system, eyeballing the scatterplot above suggests we need about 25 reviews for a new restaurant whose (at the time unknown) long-term average is 4 stars before we can be fairly sure that its burgers won’t be quite as tasty as our tried and tested 4.7-star Big Kahuna burger joint.
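We can put a number on that eyeball estimate by resampling from the simulated reviews and asking how often a sample of n reviews matches or beats the incumbent’s 4.7 average. This is a self-contained version of the earlier simulation; the 10,000-trial count is an arbitrary choice:

```r
# How often does a sample of n reviews from the "true ~4.05" restaurant
# average 4.7 or better?
set.seed(42)
probs <- c(0.025, 0.05, 0.125, 0.5, 0.35)
probs <- probs / sum(probs)
reviews <- sample(1:5, 5000, replace = TRUE, prob = probs)

prob_beats <- function(n, trials = 10000) {
  # fraction of resampled n-review averages that reach 4.7 or more
  mean(replicate(trials, mean(sample(reviews, n, replace = FALSE)) >= 4.7))
}

sapply(c(5, 10, 25, 50), prob_beats)  # the probability shrinks rapidly with n
```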

As much as I’m sure you enjoy thinking about the statistics of burger review systems, let’s turn our attention to trading. In particular, I want to show you how our intuition around the law of large numbers can lead us to make bad decisions in our trading, and what to do about it.

High-frequency trading strategies typically have a much higher Sharpe ratio than low frequency strategies, since the variability of returns is generally much higher in the latter. If you had a high-frequency strategy with a Sharpe ratio in the high single digits, you’d only need to see a week or two of negative returns – perhaps less – to be quite sure that your strategy was broken.

But most of us don’t have the capital or infrastructure to realise a high-frequency strategy. Instead, we trade lower frequency strategies and accept that our Sharpe ratios are going to be lower as well. In my experience, a typical non-professional might consider trading a strategy with a Sharpe between about 1.0 and 2.0.

How long does it take to realise such a strategy’s true Sharpe? And how much could that Sharpe vary when measured on samples of various sizes? The answer, which we’ll get to shortly, might surprise you, or even scare you! Because it turns out that a “large” number may or may not be so large, depending on the context. And that lack of context awareness is precisely where we tend to make our most severe errors in applying the Law of Large Numbers.
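Before simulating, a back-of-envelope calculation hints at the answer. Assuming i.i.d. normal daily returns, the number of trading days needed before a strategy’s mean return is statistically distinguishable from zero at roughly 95% confidence is about (1.96 √252 / SR)²:

```r
# Rough days-to-detect a non-zero edge at ~95% confidence, assuming
# i.i.d. normal daily returns and an annualised Sharpe ratio of annual_sharpe
days_to_detect <- function(annual_sharpe, z = 1.96) {
  ceiling((z * sqrt(252) / annual_sharpe)^2)
}

days_to_detect(8)    # HFT-like Sharpe: about 16 trading days
days_to_detect(1.5)  # retail-level Sharpe: about 431 trading days
```

So at a Sharpe of 1.5, even this crude test needs getting on for two years of data.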

First of all, let’s simulate various realisations of 40 days of trading a strategy with a true Sharpe ratio of 1.5. This is equivalent to around two months of trading.

If we set the strategy’s mean daily return, mu, to 0.1%, we can calculate the standard deviation of returns, sigma, that results in a true Sharpe of 1.5:

# backtested strategy has a Sharpe of 1.5
# sqrt(252)*mu/sigma = 1.5
mu <- 0.1/100
sigma <- mu*sqrt(252)/1.5

And here’s 5,000 realisations of 40 days of trading a strategy with this performance (under the assumption that daily returns are normally distributed, an inaccurate but convenient simplification that won’t detract too much from the point):

N <- 5000
days <- 40
sharpes <- c()
for(i in c(1:N)) {
  daily_returns <- rnorm(days, mu, sigma)
  # sharpe of simulated returns
  sharpes[i] <- sqrt(252)*mean(daily_returns)/sd(daily_returns)
}

Whoa! The histogram shows that it isn’t inconceivable (in fact it’s quite likely) that our Sharpe 1.5 strategy could give us an annualised Sharpe of -2 or less over a 40-day period!

What would you do if the strategy you’d backtested to a Sharpe of 1.5 had delivered an annualised Sharpe of -2 over the first two months of trading? Would you turn it off? Tinker with it? Maybe adjust a parameter or two?

You should probably do nothing! At least until you’ve assessed the probability of your strategy delivering the actual results, assuming its performance was indeed what you’d backtested it to be. To do that, you can just sum up the number of simulated 40-day Sharpes that were less than or equal to -2, and then divide by the number of Sharpes we simulated:

# probability of getting a Sharpe of -2 or less in 40 days
100*sum(sharpes <= -2)/N

which works out to about 8.5%.

Let’s now look at the convergence of our Sharpe ratio to the expected Sharpe as we increase the sample size, just as we did in the burger review example above. Here’s the code:

trading_days <- sample(10:500, 5000, replace=TRUE) # samples of 10-500 trading days
sharpes <- c()
for(i in c(1:length(trading_days))) {
  daily_returns <- rnorm(trading_days[i], mu, sigma)
  sharpes[i] <- sqrt(252)*mean(daily_returns)/sd(daily_returns)
}

ggplot() +
  aes(x=trading_days, y=sharpes) +
  geom_point(color="blue", alpha=0.5) +
  geom_hline(yintercept=1.5, color="red", linetype="dashed", size=1) +
  labs(title="Convergence of Sample Sharpe to True Sharpe", x="Number of Trading Days", y="Sharpe") +
  annotate("text", x=85, y=1.4, label="True Sharpe", size=4)

And the output:

Once again we see the sample uncertainty shrink as we increase the sample size, but this time its magnitude looks much more frightening. Note the uncertainty even after 500 trading days! This implies that our strategy with a long-term Sharpe of 1.5 could conceivably deliver very small or even negative returns over a two-year period.

If you’ve done a lot of backtesting, you probably understand from experience that a strategy with a Sharpe of 1.5 can indeed have drawdowns that last one or two years. So maybe this result doesn’t surprise you that much. But consider how you’d feel and act in real time if you suffered through such a drawdown after going live with this strategy that you’d painstakingly developed. Would you factor the uncertainty of the sample size into your decision making?
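Under the same normal-returns assumption used in the simulations above, we can estimate how often a true-Sharpe-1.5 strategy delivers a losing result over two years (504 trading days):

```r
# Fraction of 10,000 simulated two-year periods in which a Sharpe-1.5
# strategy loses money overall (mean daily return < 0 implies Sharpe < 0)
mu <- 0.1/100
sigma <- mu * sqrt(252) / 1.5
set.seed(123)
neg_frac <- mean(replicate(10000, mean(rnorm(504, mu, sigma)) < 0))
neg_frac  # around 2%
```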

The point is that this time the uncertainty really matters. Maybe you don’t care that much if you thought you were getting a 5-star burger, but ended up eating a 4-star offering. You could probably live with that. But what if you were expecting to realise your Sharpe 1.5 strategy, but after 2 years you’d barely broken even?

Returning to our 40 days of unprofitable trading of our allegedly profitable strategy: as mentioned above, there’s an 8.5% chance of getting an annualised Sharpe of -2 from this scenario. Maybe that’s enough to convince you that your strategy is not actually going to deliver a Sharpe of 1.5. Maybe you’d be willing to stick it out until the probability dropped below 5%. It’s up to you, and in my opinion should depend at least to some extent on your prior beliefs about your strategy. For instance, if you had a strong conviction that your strategy was based on a real market anomaly, maybe you’d stick to your guns longer than if you had simply data-mined a pattern in a price chart with no real rationalisation for its profitability. This is an important point, and I’ll touch on it again towards the end of the article.

No doubt you’ve already realised that the backtest itself is unlikely to be a true representation of the strategy’s real performance. Due to its finite history, the backtest itself is just a “sample” from the true “population”! So how much confidence can you have in your backtest anyway?

In the next article, I’ll show you a method for incorporating both our prior beliefs about our strategy’s backtest and the new information from the 40 trading days to construct credible limits on our strategy’s likely true performance. As you might imagine from the scatterplot above, that interval will likely be quite wide, so there’s really no way around acknowledging the inherent uncertainty in the problem of whether or not to continue trading our strategy.

We’ve seen that with small sample sizes, we can observe wild departures from an expected value – particularly with a Sharpe 1.5 strategy. Worryingly for many traders, it turns out that even two years of trading might constitute a “small sample”. Depending on your goals and expectations, that’s a long time to be left wondering.

So what can be done? Well, there are two main options:

- Only trade strategies with super-high Sharpes that enable statistical uncertainty to shrink quickly.
- Acknowledge that statistical uncertainty is a part of life as a low frequency trader and find other ways to cope with it.

Option 1 isn’t going to be feasible for the people for whom this article is written. So let’s explore option 2.

While statistical approaches often don’t provide definitive answers to the questions that many traders need answered, market experience can at least partially fill the gaps. Above I touched on the idea that if we had a rational basis for a trade, we’d treat the statistical uncertainty around its out-of-sample performance differently than if we had simply data-mined a chart pattern or used an arbitrary technical analysis rule.

Intuition around what tends to work and what doesn’t, believe it or not, actually starts to come with experience in the markets. Of course, even the most savvy market expert gets it wrong a lot, but market experience can certainly tip the balance in your favour. While you’re acquiring this experience, one of the most sensible things you can do is to focus on trades that can be rationalised in some way. That is, trades that you speculate have an economic, financial, structural, behavioural, or some other reason for existing. Sometimes (quite often in my personal experience!) your hypothesis about the basis of the trade will be false, but at least you give yourself a better chance if the trade had a hypothetical reason for being.

Another good idea is to execute a trade using small positions as widely as possible. Of course, no effect will likely “work” across all markets, but many good ideas can be profitable in more than a single product or financial instrument, or traded using a cross-sectional approach. If there really is an edge in the trade, executing it widely increases the chances of realising its profit expectancy, and you get some diversification benefits as well. This idea of scaling a trade across many markets using small position sizing is one of the great benefits of automated trading.
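The diversification arithmetic behind this idea is simple. If the same small edge is deployed in N markets with roughly uncorrelated returns and equal sizing (a strong assumption in practice), the combined Sharpe grows roughly with the square root of N:

```r
# Approximate Sharpe of an equal-weight portfolio of n uncorrelated
# strategies, each with the same standalone Sharpe (idealised assumption)
portfolio_sharpe <- function(single_sharpe, n_markets) {
  single_sharpe * sqrt(n_markets)
}

portfolio_sharpe(0.5, c(1, 4, 16))  # 0.5 1.0 2.0
```

In reality, cross-market correlations mean the benefit is smaller than this idealised square-root scaling, but the direction of the effect holds.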

Finally, it’s important to keep an open mind with any trade. Don’t become overly wedded to a particular idea, as it’s very likely that it won’t work forever. Far more likely is that it will work well sometimes, and not so well at other times. The other side of this is that if you remove a component from a portfolio of strategies, it is often a good idea to “keep an eye on it” to see if it comes back (automation can be useful here too). But once again, deciding on whether to remove or reinstate a component is as much art as science.

So what does this look like in real life? Well here’s an example taken from a prop firm that I know very well. The firm has a research and execution team, who design strategies, validate them and implement the production code to get them to market. Then there’s the operations guys who decide at any given time what goes into the portfolio, and where and how big the various strategies are traded. They use some quantitative tools, but they also use a hefty dose of judgement in making these decisions. That judgement is undoubtedly a significant source of alpha for the firm, and the team has over 50 years of combined experience in the markets from which to make these judgements.

These ideas sound sensible enough, but the elephant in the room is the implied reliance on judgement and discretion, which might feel uncomfortable to systematic traders (to be completely honest, up until a couple of years ago, I’d have felt that same discomfort). The problem is, anyone can learn to do statistics, build time-series models, run tests for cointegration, and all the other things that quants do. But good judgement and intuition are much harder to come by, and are generally only won through experience. And that takes time, and many of the lessons are learned through making mistakes.

Here at Robot Wealth HQ, we talk a lot about how we can help our members short-cut this process of gathering experience. Our goal is to pass on not only the technical and quantitative skills, but also the market knowledge and experience that helps us succeed in the markets. We decided that the best way to do that is to develop and trade a portfolio inside the community, where our members can follow along with the research and decision making processes that go into building and running a systematic portfolio. We’re already doing this with a crypto portfolio, and we’re about to get started on our ETF strategies.

Humans tend to make errors of judgement when it comes to drawing conclusions about a sample’s representativeness of the wider population from which it is drawn. In particular, we tend to underestimate the uncertainty of an expected value given a particular sample size. There are times when the implications of these errors of judgement aren’t overly severe, but in a trading context, they can result in disaster. From placing too much faith in a backtest, to tinkering with a strategy before it’s really justified, errors of judgement imply trading losses or missed opportunities.

We also saw that a “significant sample size” (where significant implies large enough that the sample is likely representative of the population) for typical retail level, low-frequency trading strategies can take so much time to acquire that it becomes almost useless in a practical sense. Here at Robot Wealth, we believe that systematic trading is one of those endeavours that requires a breadth of skills and experience, and that success is found where practical statistics and data science skills intersect with market experience.

The need for experience and judgement to complement good analysis skills is one of the most important realisations I had when I moved from amateur trading into the professional space. That experience doesn’t come easily or quickly, but we believe that by demonstrating exactly what we do to build and trade a portfolio, we can help you acquire it as quickly as possible.

The post The Law of Large Numbers – Part 2 appeared first on Robot Wealth.

The post Practical Statistics for Algo Traders appeared first on Robot Wealth.

Well, you’re not alone. The reality is that classical statistics is difficult, time-consuming and downright confusing. Fundamentally, we use statistics to answer a question – but when we use classical methods to answer it, half the time we forget what question we were seeking an answer to in the first place.

But guess what? There’s another way to get our questions answered without resorting to classical statistics. And it’s one that will generally appeal to the practical, hands-on problem solvers that tend to be attracted to algo trading in the long run.

Specifically, algo traders can leverage their programming skills to get answers to tough statistical questions – without resorting to classical statistics. In the words of Jake VanderPlas, whose awesome PyCon 2016 talk inspired some of the ideas in this post, “if you can write a for loop, you can do statistics.”

In this post and the ones that follow, I want to show you some examples of how simulation and resampling methods lend themselves to intuitive computational solutions to problems that are quite complex when posed in the domain of classical statistics. Let’s get started.

The example that we’ll start with is relatively simple and more for illustrative purposes than something that you’ll use a lot in a trading context. But it sets the scene for what follows and provides a useful place to start getting a sense for the intuition behind the methods I’ll show you later.

You’ve probably heard the story of Ed Thorp and Claude Shannon. The former is a mathematics professor and hedge fund manager; the latter was a mathematician and engineer referred to as “the father of information theory”, and whose discoveries underpin the digital age in which we live today (he’s kind of a big deal).

When they weren’t busy changing the world, these guys would indulge in another great hobby: beating casinos at games of chance. Thorp is known for developing a system of card counting to win at Blackjack. But the story I find even more astonishing is that together, Thorp and Shannon developed the first wearable computer, whose sole purpose was to beat the game of roulette. According to a 2013 article describing the affair,

Roughly the size of a pack of cigarettes, the computer itself had 12 transistors that allowed its wearer to time the revolutions of the ball on a roulette wheel and determine where it would end up. Wires led down from the computer to switches in the toes of each shoe, which let the wearer covertly start timing the ball as it passed a reference mark. Another set of wires led up to an earpiece that provided audible output in the form of musical cues – eight different tones represented octants on the roulette wheel. When everything was in sync, the last tone heard indicated where the person at the table should place their bet. Some of the parts, Thorp says, were cobbled together from the types of transmitters and receivers used for model airplanes.

So what’s all this got to do with hacking statistics? Well, nothing really, except that it provides context for an interesting example. Say we were a pit boss in a big casino, and we’d been watching a roulette player sitting at the table for hours, amassing an unusually large pile of chips. A review of the casino’s closed circuit television revealed that the player had played 150 games of roulette and won 7 of those. What are the chances that the player’s run of good luck is an indication of cheating?

To answer that question, we first need to understand the probabilities of the game of roulette. There are 37 numbers on the roulette wheel (0 to 36), so the probability of choosing the correct number on any given spin is 1 in 37. For a correct guess, the house pays out $36 for every $1 wagered – slightly less than the fair odds of 37 to 1, which of course ensures that the house wins in the long run.

In order to use classical statistics to work out the probability that our player was cheating, we would firstly need to recognise that our player’s run of good luck could be modelled with the binomial probability distribution:

\[P(X_{wins}) = {{Y}\choose{X}} {P_{win}}^X {P_{loss}}^{Y-X}\]

where \( {{Y}\choose{X}}\) is the number of ways to arrive at \(X\) wins from \(Y\) games and is given by \(\frac{Y!}{X!(Y-X)!}\)

Here are some R functions for implementing these equations:

f <- function(n) {
  # calculate factorial of n
  if(n == 0) return(1)
  return(prod(c(1:n)))
}

binom <- function(x, y) {
  # calculate number of ways to arrive at x outcomes from y attempts
  return(f(y)/(f(x)*f(y-x)))
}

binom_prob <- function(x, y, p) {
  # calculate the probability of getting x outcomes from y attempts when P(x)=p
  return(binom(x, y)*p^x*(1-p)^(y-x))
}

And here’s how to calculate the probability of winning 7 out of 150 games of roulette:

n_played <- 150
n_won <- 7
p_win <- 1/37

binom_prob(n_won, n_played, p_win)

This returns a value of 0.062, which means there is about a 6% chance of winning 7 out of 150 games of roulette.

But wait, we’re not done yet! We’ve actually found the probability of winning *exactly* 7 out of 150 games, but we really want to know the probability of winning *at least* 7 out of 150 games. So we actually need to sum up the probabilities associated with winning 7, 8, 9, 10, and so on, up to 150 games. This number is the *p-value*, which is used in statistics to measure the validity of the *null hypothesis* – the idea we are trying to *disprove* – in our case, that the player *isn’t* cheating.

Confused? You’re not alone. Classical statistics is full of these double negatives and it’s one of the reasons that it’s so easy to forget what question we were even trying to answer in the first place. Before we come to a simpler approach, here’s a function for calculating the p-value for our roulette player of possibly dubious integrity (or commendable ingenuity, depending on your point of view):

```r
binom_pval <- function(n_won, n_played, p_win) {
  "calculate the p-value of a given result using binomial probability distribution"
  p <- 0
  for(n in c(n_won:n_played)) {
    p <- p + binom_prob(n, n_played, p_win)
  }
  return(p)
}

binom_pval(n_won, n_played, p_win)
```

In our case, the p-value comes out at 0.114, or 11.4%. We should settle on a cutoff p-value *prior* to performing our analysis, below which we reject the null hypothesis that our gambler isn't cheating. In many fields, a p-value cutoff of 0.05 is used, but I've always felt that was somewhat arbitrary. Better, in my opinion, to avoid thinking in such black and white terms and consider what a particular p-value means in your specific context.

In any event, our p-value tells us that there is an 11.4% chance that the player could have realised 7 wins from 150 games of roulette by chance alone. You can draw your own conclusions regarding what this means in this particular context, but if I were the pit boss scrutinising this gambler, I’d find it hard to justify throwing them out of the casino.
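If you'd rather not write the summation loop yourself, SciPy's binomial survival function gives the tail probability directly; again, this is a cross-check I've added, not part of the original R code:

```python
from scipy.stats import binom

# P(at least 7 wins) = P(X > 6), i.e. the survival function evaluated at 6
p_value = binom.sf(6, 150, 1/37)
print(round(p_value, 3))  # 0.114
```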

Finally, here’s a plot of the probability of winning

n_wongames out of 150, with a vertical line at 7 wins:

```r
# plot distribution
n_won <- c(0:15)
p <- c()
for(n in n_won) {
  p[n+1] <- binom_prob(n, n_played, p_win)
}
plot(n_won, p, type='S', col='blue', main='Probability of n wins from 150 games')
abline(v=7, col='red')
```

You just saw the classic approach to solving what was actually a very simple problem. But if you didn’t know the formula for the binomial probability distribution, it would be hard to know where to start. It’s also very easy to get tripped up with p-values and their confusing double-negative terminology. I think you can probably see some evidence for my claim that we can easily end up forgetting the question we were trying to answer in the first place! And this was a very simple problem – things get *much* worse from here.

The good news is, there’s an easier way. We could watch someone play 150 games of roulette, then write down the number of games they won. We could then watch another 150 games and write down that result. If we did this many times, we would be able to plot a histogram showing the frequency of each result. If we watched many sequences of 150 games, we could expect the observed frequencies to start approaching the true frequencies.

But who has time to watch a few thousand sequences of 150 roulette games? Better to leverage our programming skills and *simulate* a few thousand such sequences.

Here’s a really simple roulette simulator that simulates sequences of roulette games, and returns the number of winning games in each sequence. We can use this simulator to generate sound statistical insights about our gambler.

The great thing about this simulator is that you can build it just by knowing a little about the game of roulette – it doesn’t matter if you’ve never heard of the binomial probability function, you can use the simulator to get robust answers to statistical questions.

```r
# roulette simulator
roulette_sim <- function(num_sequences, num_games) {
  lucky_number <- 12
  games_won_per_sequence <- c()
  for(n in c(1:num_sequences)) {
    spins <- sample(0:36, num_games, replace=TRUE)
    games_won_per_sequence[n] <- sum(spins==lucky_number)
  }
  return(games_won_per_sequence)
}
```

Most of the work is being done in the line `spins <- sample(0:36, num_games, replace=TRUE)`, which we are using to simulate a single sequence of `num_games` spins of the roulette wheel. The `sample()` function randomly selects numbers between 0 and 36 `num_games` times and stores the results in the `spins` variable. Then, the line `games_won_per_sequence[n] <- sum(spins==lucky_number)` calculates the number of spins in the sequence that came up with our `lucky_number` and stores the result in the vector `games_won_per_sequence`. I used the number 12 as the `lucky_number` parameter, which is what I would choose if I were forced to choose a lucky number, but any number in the range 0:36 will do, as they all have an equal likelihood of turning up in any given "spin".

Let’s simulate 10,000 sequences of 150 games and plot the result in a histogram. Simply do:

```r
# plot histogram of simulated 150-game sequences
hist(roulette_sim(10000, 150), col='blue')
```

And you'll end up with a histogram of games won that looks like this:

Hmmm…the shape of our histogram looks very much like the shape of the binomial distribution that we plotted above using the classic approach. Interesting! Could it be that our simulation is indeed a decent representation of reality?

We can also calculate an empirical p-value from our simulation results by calculating the proportion of times we won at least seven games. Here’s a general function for calculating the empirical p-value, and an example of using it to calculate our gambler’s p-value:

```r
sim_pval <- function(num_sequences, num_games, val) {
  games_won_per_sequence <- roulette_sim(num_sequences, num_games)
  return(sum(games_won_per_sequence >= val)/num_sequences)
}

pval <- sim_pval(10000, 150, 7)
```

When I ran this code, I got an empirical p-value of 0.113, compared with the p-value of 0.114 calculated above using the classic approach. You'll get a slightly different result every time you run this code, but the more sequences you simulate (the `num_sequences` parameter), the more the empirical result will converge to the theoretical one.
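For readers who prefer Python, here's my own vectorised translation of the simulator using NumPy (the function name and default seed are illustrative, not from the original):

```python
import numpy as np

def roulette_sim_np(num_sequences, num_games, lucky_number=12, seed=42):
    """Vectorised roulette simulator: wins per sequence of num_games spins."""
    rng = np.random.default_rng(seed)
    spins = rng.integers(0, 37, size=(num_sequences, num_games))  # 0..36 inclusive
    return (spins == lucky_number).sum(axis=1)

wins = roulette_sim_np(10_000, 150)
p_value = (wins >= 7).mean()  # empirical P(at least 7 wins), close to the analytical 0.114
```

Fixing the seed makes the run reproducible; drop it to draw a fresh sample each time.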

My intent with this article was to convince you that you can get statistically sound insights without resorting to the complexities of classic statistics. Personally, I find myself going around in circles and expending great energy for little reward when I try to solve a problem with the classic approach. On the other hand, I find that I get real insights and real intuition into a problem through simulation.

Simulation however is just one way you can hack statistics, and it won't be applicable in all situations. For instance, in this example we happen to have a precise *generative model* for the phenomenon we wish to explore – namely, the probability of winning a game of roulette. In most trading situations, we normally have only data, or at best some assumptions about the underlying generative model. In the follow-up articles, I'll give you examples of hacks you can apply in your trading research.

Apparently Thorp and Shannon's roulette computer could predict which *octant* of the wheel the ball would end up in. That means they could reduce the possible outcomes to five numbers out of the thirty-seven total possibilities, increasing their odds of winning from 1/37 to 1/5. From a sequence of 150 games, Thorp and Shannon might therefore expect to win a staggering 30 times.

If we simulate the probability of Thorp and Shannon winning 30 of 150 games of roulette by chance:

```r
# p-value for Thorp and Shannon
pval <- sim_pval(10000, 150, 30)
```

we end up with a p-value of zero! That is, there is no conceivable possibility of winning 30 of 150 games of roulette by chance alone. In reality, of course the real probability isn’t zero, but apparently 10,000 simulations isn’t enough to detect a single occurrence of this many wins! Resorting to the analytical solution,

```r
# analytical p-value for Thorp and Shannon
pval <- binom_pval(30, 150, 1./37)
```

we find that the probability of at least 30 winning spins from 150 is about 1.2e-17!
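You can confirm that figure without the hand-rolled factorials; this SciPy one-liner is an added cross-check, not part of the original R workflow:

```python
from scipy.stats import binom

# P(at least 30 wins in 150 spins) at the fair 1/37 win rate
p_thorp = binom.sf(29, 150, 1/37)
print(p_thorp)  # on the order of 1e-17
```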

So how did Thorp and Shannon evade detection? Can we assume that the pit bosses back in the 1960s weren’t concerning themselves with the possibility that someone might be cheating? Actually, if you read their story, you find that Thorp and Shannon were plagued by the vagaries of the device itself, dealing with constant breakdowns and malfunctions that limited their ability to really exploit their edge.

Still, it’s a brilliant story and you really have to admire their ingenuity, not to mention their guts in taking on the casinos at their own game.

The post Practical Statistics for Algo Traders appeared first on Robot Wealth.

The post Simulating Variable FX Swaps in Zorro and Python appeared first on Robot Wealth.

This post shows you how to simulate variable FX swaps in both Python and the Zorro trading automation software platform.

The swap (also called the roll) is the cost of financing an FX position. It is typically derived from the central bank interest rate differential of the two currencies in the exchange rate being traded, plus some additional fee for your broker. Most brokers apply it on a daily basis, and typically apply three times the regular amount on a Wednesday to account for the weekend. Swap can be both credited to and debited from a trader’s account, depending on the actual position taken.
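To make those mechanics concrete, here's a rough Python sketch of a daily swap accrual built from the interest rate differential, including the triple charge on a Wednesday. The function name and the flat per-lot broker fee are my own illustrative assumptions, not any broker's actual formula:

```python
def daily_swap(base_rate, quote_rate, lots=1.0, broker_fee=5.0, is_wednesday=False):
    """Approximate daily swap per standard lot (100,000 units), in quote currency.

    base_rate / quote_rate are annual central bank rates in percent;
    broker_fee is an assumed flat daily charge per lot.
    """
    ird = 100_000 * lots * (base_rate - quote_rate) / 100 / 365
    swap = ird - broker_fee
    return 3 * swap if is_wednesday else swap  # weekend financing rolled into Wednesday

# e.g. long a pair whose base currency pays 2.0% and quote currency pays 0.5%
print(round(daily_swap(2.0, 0.5), 2))  # -0.89: the fee swamps the small differential
```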

Swap can have a big impact on strategies with long hold periods, such as the typical momentum strategy. Therefore, accurately accounting for it is important in such cases. Zorro’s default swap calculation relies on a constant derived from the Assets List used in the simulation, which is fine for most situations, but might lead to unrealistic results when the hold period is very long.

Here's some code for simulating historical swaps. It takes historical central bank data from the Bank for International Settlements, via Quandl. I've included code for the historical interest rates of the G8 countries – to get others, you just need the relevant Quandl code.

For the Zorro version, you’ll also need Zorro S, as the Quandl bridge is not available in the free version of Zorro. However, at the end of this article, I’ve also included a Python script for downloading the data from Quandl that you can save and then import into your backtesting platform. The advantage of the Zorro version is that you can access the relevant data from within a trading script via direct link to the Quandl API. That’s super convenient and all but eliminates the need to do any data wrangling at all. The advantage of the Python version is that it is completely free, but using the data in a trading script requires a little more messing around.

In order to access data from Quandl within Zorro, you’ll need a Quandl API key (get it from the Quandl website) and enter it in your ZorroFix.ini or Zorro.ini file.

Here’s the Zorro script:

```c
/* Download historical central bank policy rates from Quandl
   and use to calculate historical swaps.
   Zorro's FX swap is interest per day per 10000 units traded,
   in account currency. */

#include <contract.c>

var calculate_roll_long(var base_ir, var quote_ir, var broker_fee)
{
	/* Calculates Zorro roll long in units of quote currency */
	var ird = (base_ir - quote_ir)/100;
	return 10000*ird/365 - broker_fee;
}

var calculate_roll_short(var base_ir, var quote_ir, var broker_fee)
{
	/* Calculates Zorro roll short in units of quote currency */
	var ird = (quote_ir - base_ir)/100;
	return 10000*ird/365 - broker_fee;
}

function run()
{
	set(PLOTNOW);
	PlotWidth = 800;
	PlotHeight1 = 400;
	PlotHeight2 = 250;
	StartDate = 20100101;
	EndDate = 20180630;

	// daily policy rates of major central banks,
	// from the Bank for International Settlements, via Quandl
	var usd_ir = dataFromQuandl(1, "%Y-%m-%d,f", "BIS/PD_DUS", 1);
	var jpy_ir = dataFromQuandl(2, "%Y-%m-%d,f", "BIS/PD_DJP", 1);
	var aud_ir = dataFromQuandl(3, "%Y-%m-%d,f", "BIS/PD_DAU", 1);
	var eur_ir = dataFromQuandl(4, "%Y-%m-%d,f", "BIS/PD_DXM", 1);
	var cad_ir = dataFromQuandl(5, "%Y-%m-%d,f", "BIS/PD_DCA", 1);
	var chf_ir = dataFromQuandl(6, "%Y-%m-%d,f", "BIS/PD_DCH", 1);
	var nzd_ir = dataFromQuandl(7, "%Y-%m-%d,f", "BIS/PD_DNZ", 1);
	var gbp_ir = dataFromQuandl(8, "%Y-%m-%d,f", "BIS/PD_DGB", 1);

	// What the broker takes in addition to the interest rate differential.
	// Will vary by broker, by pair, and even by direction! Make a conservative assumption.
	var broker_fee = 0.5;

	// EUR/USD roll in AUD example
	asset("EUR/USD");

	// calculate roll long in units of quote currency
	var rl = calculate_roll_long(eur_ir, usd_ir, broker_fee);

	// convert to units of account currency - here the account currency is AUD
	// not required if account currency is the same as the quote currency
	string current_asset = Asset; // store name of currently selected asset
	asset("AUD/USD");             // switch to ACCT_CCY/QUOTE_CCY
	var p = priceClose();
	asset(current_asset);         // switch back to original asset
	RollLong = rl/p;              // adjust and set Zorro's RollLong variable

	// calculate roll short in units of quote currency
	var rs = calculate_roll_short(eur_ir, usd_ir, broker_fee);

	// convert to units of account currency, as above
	RollShort = rs/p;             // adjust and set Zorro's RollShort variable

	// plot roll in units of account currency
	plot("Roll Long", RollLong, NEW, BLUE);
	plot("Roll Short", RollShort, 0, RED);
}
```

One major thing to remember is that your FX broker won’t charge/pay swaps based on the exact interest rate differential. In practice, they might take some additional fat for themselves, or even adjust their actual swaps on the basis of perceived upside/downside volatility – and these may not even be symmetrical! The short story is that the broker’s cut will vary by broker, FX pair, and even by direction! You can verify that yourself by searching various brokers’ websites for their current swap rates.

So the upshot of all that is that if you want to include an additional broker fee in your simulation, recognise that it will be an estimate, do some research on what brokers are currently charging, and err on the conservative side. In the code above, the broker fee is set via the `broker_fee` variable; you can also set it to zero if you like.

The trickiest part is converting the interest rate differential of the base and quote currencies to Zorro's `RollLong` and `RollShort` variables – but the advantage is that once you get that right, Zorro will take care of simulating the roll for you – you literally won't have to do another thing! These variables represent the swap in account currency per 10,000 traded FX units. Most of that conversion is taken care of in the `calculate_roll_long()` and `calculate_roll_short()` functions in the code above. But these functions output the swap in units of the quote currency, so it still needs to be converted to the account currency whenever the two differ.

The code also contains an example of converting the EUR/USD roll for an account denominated in AUD, using the contemporaneous AUD/USD price.

Here’s the output of running the script. You can see how the swap for long and short trades has changed over time. At some point in 2014, it became a less expensive proposition to sell the EUR against the USD rather than buy it. You can also see that the value of the swap is constantly changing; that’s because the calculation considers the contemporaneous exchange rate of the account currency (AUD) against the quote currency (USD) of the pair being traded.

Here's a Python script for downloading the same data set as used above (albeit with a longer history) from Quandl, and a function for calculating the swap. This time, the function calculates the swap per standard FX lot, which is 100,000 units of the quote currency (the Zorro script above calculates the swap per 10,000 units, which is required for Zorro's `RollLong` and `RollShort` variables).

```python
import pandas as pd
import matplotlib.pyplot as plt
import quandl

# daily central bank policy rates from the Bank for International Settlements
cad = quandl.get("BIS/PD_DCA")
jpy = quandl.get("BIS/PD_DJP")
chf = quandl.get("BIS/PD_DCH")
aud = quandl.get("BIS/PD_DAU")
gbp = quandl.get("BIS/PD_DGB")
nzd = quandl.get("BIS/PD_DNZ")
eur = quandl.get("BIS/PD_DXM")
usd = quandl.get("BIS/PD_DUS")  # this is the effective fed funds rate

def calculate_rolls(base, quote, broker_fee):
    # daily interest rate differential per standard lot
    ird = 100000*(base - quote)/(100*365)
    ird.columns = ["IRD"]
    ird.fillna(method="ffill", inplace=True)
    ird["roll_long"] = ird["IRD"] - broker_fee
    ird["roll_short"] = -ird["IRD"] - broker_fee
    return ird
```

Plotting the historical effective fed funds rate, you can see that the data set might have some problems prior to about 1985. You may need to smooth the data or remove outliers to use it effectively.
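One simple way to tame those early-history outliers (my own suggestion, not something the original analysis does) is a rolling-median filter, which pandas makes easy; the window and threshold here are illustrative choices:

```python
import pandas as pd

def despike(series, window=5, threshold=3.0):
    """Replace points that sit far from a rolling median with the median itself.

    window and threshold are illustrative; tune them to the data.
    """
    med = series.rolling(window, center=True, min_periods=1).median()
    dev = (series - med).abs()
    mad = dev.rolling(window, center=True, min_periods=1).median()
    return series.where(dev <= threshold * (mad + 1e-9), med)

# usage on the BIS data, e.g.: usd_clean = despike(usd.iloc[:, 0])
```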

```python
ax = usd.plot(grid=True)
ax.legend(["USD Effective Fed Funds Rate"])
```

We can simulate and plot the historical swap of the AUD/CAD exchange rate as follows:

```python
broker_fee = 5  # how much does the broker take per lot of the quote currency?
aud_cad = calculate_rolls(aud, cad, broker_fee)
aud_cad[["roll_long", "roll_short"]].dropna().plot(grid=True)
```

Again, you can see some potential data issues prior to about 1990.

The cost of financing a long-term FX position can have a significant impact on the overall result of the trade. This post demonstrated a simple and inexpensive way to simulate the historical variable financing costs for FX.

Data is the basis of everything we do as quant traders. Inside the Robot Wealth community, we show our members how to use this and other data for trading systems research in a way that goes much deeper than the basics we touched on here. But data is just one of the many algorithmic trading fundamentals we cover inside Class to Quant. Not only are our members improving their trading performance with our beginner to advanced courses, but together they're building functioning strategies inside our community as part of our Algo Laboratory projects. If you're interested and want to find out more, try Class to Quant for 30 days risk free. I'd love to meet you inside.


The post Fun with the Cryptocompare API appeared first on Robot Wealth.

As nice as the user-interface is, what I really like about Cryptocompare is its API, which provides programmatic access to a wealth of crypto-related data. It is possible to drill down and extract information from individual exchanges, and even to take aggregated price feeds from all the exchanges that Cryptocompare is plugged into – and there are quite a few!

When it comes to interacting with Cryptocompare's API, there are already some nice Python libraries that take care of most of the heavy lifting for us. For this post, I decided to use a library called `cryptocompare`. Check it out on GitHub here.

You can install the current stable release by doing `pip install cryptocompare`, but I installed the latest development version direct from GitHub, as only that version had support for minutely price history at the time of writing.

To install the dev version from GitHub, do:

```
pip install git+https://github.com/lagerfeuer/cryptocompare.git
```

This version will limit you to one month's worth of daily price data and one week's worth of hourly data. If you're feeling adventurous, you can install the version that I forked into my own GitHub account and modified to increase those limits. To do that, you'll need to do:

```
pip install git+https://github.com/kplongdodd/cryptocompare.git
```

Now that we’ve got our library of API functions, let’s take a look at what we can do with Cryptocompare!

To get a list of all the coins available on Cryptocompare, we can use the following Python script:

```python
import numpy as np
import pandas as pd
import cryptocompare as cc

# list of coins
coin_list = cc.get_coin_list()
coins = sorted(list(coin_list.keys()))
```

At the time of writing, this returned a list of 2,609 coins! By comparison, there are around 2,800 stocks listed on the New York Stock Exchange.

Let’s focus on the biggest players in crypto-world: the coins with the largest market capitalisation.

We can get price data for a list of coins using the function `cryptocompare.get_price()`, and if we specify `full=True`, the API will return a whole bunch of data for each coin in the list, including last traded price, 24-hour volume, number of coins in circulation, and of course market capitalisation.

Cryptocompare’s API will only allow us to pass it a list of coins that contains no more than 300 characters at any one time. To get around that limitation, we’ll pass lists of 50 coins at a time, until we’ve passed our entire list of all available coins.

The API returns a JSON string, which we can interpret as a dictionary in Python. Note that the outermost keys in the resulting dictionary are `'RAW'` and `'DISPLAY'`, which hold the raw data and the data formatted for display, respectively. In our case, we prefer to work with the raw data, so we'll keep it and discard the rest.

Here’s the code for accomplishing all that:

```python
# get data for all available coins
coin_data = {}
for i in range(len(coins)//50 + 1):
    # limited to a list containing at most 300 characters
    coins_to_get = coins[(50*i):(50*i+50)]
    message = cc.get_price(coins_to_get, curr='USD', full=True)
    coin_data.update(message['RAW'])
```

`coin_data` now contains a whole bunch of dictionaries-within-dictionaries that hold our data. Each outer key corresponds to a coin symbol, and looks like this:

```python
'ZXT': {'USD': {'CHANGE24HOUR': 0, 'CHANGEDAY': 0, 'CHANGEPCT24HOUR': 0,
                'CHANGEPCTDAY': 0, 'FLAGS': '4', 'FROMSYMBOL': 'ZXT',
                'HIGH24HOUR': 2.01e-06, 'HIGHDAY': 2.01e-06, 'LASTMARKET': 'CCEX',
                'LASTTRADEID': '1422076', 'LASTUPDATE': 1491221170, 'LASTVOLUME': 998,
                'LASTVOLUMETO': 0.0020059799999999997, 'LOW24HOUR': 2.01e-06,
                'LOWDAY': 2.01e-06, 'MARKET': 'CCCAGG', 'MKTCAP': 0,
                'OPEN24HOUR': 2.01e-06, 'OPENDAY': 2.01e-06, 'PRICE': 2.01e-06,
                'SUPPLY': 0, 'TOSYMBOL': 'USD', 'TOTALVOLUME24H': 0,
                'TOTALVOLUME24HTO': 0, 'TYPE': '5', 'VOLUME24HOUR': 0,
                'VOLUME24HOURTO': 0, 'VOLUMEDAY': 0, 'VOLUMEDAYTO': 0}},
```

That `'USD'` key is common to all the coins in `coin_data` and it specifies the counter-currency in which prices are displayed. That key is going to be troublesome when we turn our dictionary into a more analysis-friendly data structure, like a pandas `DataFrame`, so let's get rid of it:

```python
# remove 'USD' level
for k in coin_data.keys():
    coin_data[k] = coin_data[k]['USD']
```

Now we can go ahead and create a `DataFrame` from our `coin_data` dictionary and sort it by market capitalisation:

```python
coin_data = pd.DataFrame.from_dict(coin_data, orient='index')
coin_data = coin_data.sort_values('MKTCAP', ascending=False)
```

All good so far, but interrogating this data by doing `coin_data['MKTCAP'].head(20)` reveals that the coin with the highest market cap is something called AMO:

```python
coin_data['MKTCAP'].head(20)
Out[3]:
AMO       1.928953e+13
WBTC*     1.421202e+11
BTC       1.156607e+11
BITCNY    6.769687e+10
ETH       5.327990e+10
NPC       3.108324e+10
XRP       2.222890e+10
XUC       1.644538e+10
BCH       1.623158e+10
EOS       1.103000e+10
VERI      7.803077e+09
PRPS      7.342302e+09
LTC       6.087482e+09
MTN       5.000000e+09
TRX       4.731000e+09
XLM       4.628553e+09
ADA       4.485383e+09
DCN       3.928000e+09
IOT       3.863547e+09
VEN       3.330000e+09
```

Wouldn’t we expect that honour to go to Bitcoin, with symbol BTC? And what about all those other coins that you’ve probably never heard of? What’s going on here?

It turns out that Cryptocompare includes data for coins that haven’t yet gone to ICO, and it appears that in such cases, the market capitalisation calculation is done using the pre-ICO price of the coin, and its total possible supply of coins.

That's going to skew things quite significantly, so let's exclude any coins from our list that haven't traded in the last 24 hours. We can get this information from the `TOTALVOLUME24H` field, which is the total amount the coin has been traded in 24 hours against all its trading pairs:

```python
# exclude coins that haven't traded in last 24 hours
# TOTALVOLUME24H is the amount the coin has been traded
# in 24 hours against ALL its trading pairs
coin_data = coin_data[coin_data['TOTALVOLUME24H'] != 0]
```

`coin_data['MKTCAP'].head()` now looks a lot more sensible:

```python
coin_data['MKTCAP'].head()
Out[4]:
BTC    1.156607e+11
ETH    5.327990e+10
XRP    2.222890e+10
XUC    1.644538e+10
BCH    1.623158e+10
```

We can get the last month’s historical daily data for the 100 top coins by market cap, stored as a dictionary of DataFrames, by doing the following:

```python
top_coins = coin_data[:100].index
df_dict = {}
for coin in top_coins:
    hist = cc.get_historical_price_day(coin, curr='USD')
    if hist:
        hist_df = pd.DataFrame(hist['Data'])
        hist_df['time'] = pd.to_datetime(hist_df['time'], unit='s')
        hist_df.index = hist_df['time']
        del hist_df['time']
        df_dict[coin] = hist_df
```

And we can access the data for any coin in the dictionary by doing `df_dict[coin]`, where `coin` is the symbol of the coin we're interested in, such as 'BTC'. Now that we have our data, we can do some fun stuff!

You will need to use the version of `cryptocompare` from my GitHub repo (see above) in order to get enough data to reproduce the examples below. In that case, once you've downloaded my version, just replace the `get_historical_price_day()` call in the script above with:

```python
hist = cc.get_historical_price_day(coin, curr='USD', limit=2000)
```

First, let's pull out all the closing prices from each `DataFrame` in our dictionary:

```python
# pull out closes
closes = pd.DataFrame()
for k, v in df_dict.items():
    closes[k] = v['close']

# re-order by market cap
closes = closes[coin_data.index[:100]]
```

Plot some prices from 2017, an interesting year for cryptocurrency, to say the least:

```python
# some cool stuff we can do with our data
import matplotlib.pyplot as plt
import seaborn as sns

# plot some prices
closes.loc['2017', ['BTC', 'ETH', 'LTC']].plot()
```

Plot some returns series from the same period:

```python
# plot some returns
closes.loc['2017', ['BTC', 'ETH', 'LTC']].pct_change().plot()
```

Plot a correlation matrix of returns for a selection of the top coins:

```python
# plot correlation matrix
sns.heatmap(closes.loc['2017', ['BTC', 'ETH', 'LTC', 'XRP', 'XUC',
                                'BCH', 'EOS', 'VERI', 'TRX']].pct_change().corr())
```

And finally, a scatterplot matrix showing distributions on the diagonal:

```python
# scatter plot matrix
sns.pairplot(closes.loc['2018', ['BTC', 'ETH', 'XRP', 'VERI', 'LTC']].pct_change().dropna())
```

There’s lots more interesting analysis you can do with data from Cryptocompare, before we even do any backtesting, for example:

- Value of BTC and other major coins traded through the biggest exchanges over time – which exchanges are dominating?
- Top coins traded by fiat currency – do some fiats gravitate towards certain cryptocurrencies?
- Are prices significantly different at the same time across exchanges – that is, are arbitrage opportunities present?
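As a taste of that last idea, here's a sketch of flagging same-timestamp price discrepancies between two feeds. The two series here are synthetic stand-ins for per-exchange closes you'd pull from the API, and the function name and threshold are my own:

```python
import pandas as pd

def price_gaps(ex_a, ex_b, threshold=0.01):
    """Timestamps where two aligned close series diverge by more than
    threshold (as a fraction of the second feed's price)."""
    aligned = pd.concat([ex_a, ex_b], axis=1, keys=["a", "b"]).dropna()
    gap = (aligned["a"] - aligned["b"]).abs() / aligned["b"]
    return gap[gap > threshold]

# synthetic stand-ins for two per-exchange close series
idx = pd.date_range("2018-01-01", periods=4, freq="h")
a = pd.Series([100.0, 101.0, 102.0, 103.0], index=idx)
b = pd.Series([100.0, 101.0, 100.0, 103.0], index=idx)
print(price_gaps(a, b))  # flags the single 2% divergence
```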

In this post, I introduced the Cryptocompare API and some convenient Python tools for interacting with it. I also alluded to the depth and breadth of data available: over 2,000 coins, some going back several years, broken down by exchange and even counter-currency. I also showed you some convenient base-Python and pandas data structures for managing and interrogating all that data. In future blog posts, we’ll use this data to backtest some crypto trading strategies.



The post ETF Rotation Strategies in Zorro appeared first on Robot Wealth.

Lately our Class to Quant members have been looking to implement rotation-style ETF and equities strategies in Zorro, but just like your old high-school essays, starting is the biggest barrier. These types of strategies typically scan a universe of instruments and select one or more to hold until the subsequent rebalancing period. Zorro is my go-to choice for researching and even executing such strategies: its speed makes scanning even large universes of stocks quick and painless, and its scripting environment facilitates fast prototyping and iteration of the algorithm itself – once you've wrestled it for a while (get our free Zorro for Beginners video course here).

I’m going to walk you through a general design paradigm for constructing strategies like this with Zorro, and demonstrate the entire process with a simple rotation algorithm based on Gary Antonacci’s Dual Momentum. By the end you should have the skills needed to build a similar strategy yourself. Let’s begin!

To construct a rotation style strategy in Zorro, we’d follow these general design steps:

- Construct your universe of instruments by adding them to an assets list CSV file. There are examples in Zorro’s History folder, and I’ll put one together for you below.
- Set up your rebalancing period using Zorro’s time and date functions.
- Tell Zorro to reference the asset list you just created using the `assetList` command.
- Loop through each instrument in the list and perform whatever calculations or analysis your strategy requires for the selection of which instruments to hold.
- Compare the results of the calculations/analysis performed in the prior step and construct the positions for the next period.

That’s pretty much it! Of course, the details of each step might differ slightly depending on the algorithm, and you will also need some position sizing and risk management, but in general, following these steps will get you 90% of the way there.

Not happy trading with a 90% complete strategy? No problem, let’s look at what this looks like in practice.

This example is based on Gary Antonacci's Dual Momentum. We will simplify Gary's slightly more nuanced version to the following: if US equities outperformed global equities **and** their return was positive, hold US equities. If global equities outperformed US equities and their return was positive, hold global equities. Otherwise, hold short-term bonds.

Gary has done a mountain of research on Dual Momentum and found that it has outperformed for decades. In particular, it has tended to kick you out of equities during extended bear markets, while still getting you in for most of the bull markets. Check out Gary’s website for more information and consider getting hold of a copy of his book – you can read my review here.

Our simplified version of the strategy will use a universe of three ETFs that track US equities, global equities and short-term bonds. We will use the returns of these ETFs for both generation of our trading signals and actual trading (Gary’s approach is slightly more nuanced than that – again check out his website and book for more details).

Our asset list contains the universe of instruments we wish to scan. In our case, we only need three ETFs. We’ll choose SPY for our US equities instrument, EFA for our global equities and SHY for our bonds ETF.
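The selection rule is simple enough to sketch in a few lines of Python before we touch Zorro; this is my own illustrative translation of the simplified rule, with the three tickers as labels:

```python
def dual_momentum_pick(us_return, global_return, us="SPY", world="EFA", bonds="SHY"):
    """Relative momentum picks US vs global equities; absolute momentum
    falls back to bonds when the winner's return is not positive."""
    if us_return >= global_return:
        return us if us_return > 0 else bonds
    return world if global_return > 0 else bonds

print(dual_momentum_pick(0.08, 0.03))    # SPY
print(dual_momentum_pick(0.02, 0.05))    # EFA
print(dual_momentum_pick(-0.04, -0.01))  # SHY
```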

Zorro’s asset lists are CSV files that contain a bunch of parameters about the trading conditions of each instrument. This information is used in Zorro’s simulations, so it’s important to make it as accurate as possible. In many cases, Zorro can populate these files for us automatically by simply connecting to a broker, but in others, we need to do it manually (explained in our Zorro video course).

Our asset list for this strategy will look like this:

```
Name,Price,Spread,RollLong,RollShort,PIP,PIPCost,MarginCost,Leverage,LotAmount,Commission,Symbol
SPY,269.02,0.1,0,0,0.01,0.01,0,1,1,0.02,
SHY,83.61,0.1,0,0,0.01,0.01,0,1,1,0.02,
EFA,69.44,0.1,0,0,0.01,0.01,0,1,1,0.02,
```

You can see that most of the parameters are actually the same for each instrument, so we can use copy and paste to make the construction of this file less tedious than it would otherwise be. For other examples of such files, just look in Zorro’s History folder.

Save this file as a CSV file called AssetsDM.csv and place it in your History folder (which is where Zorro will go looking for it shortly).

Here we are going to rebalance our portfolio every month. We decided to avoid the precise start/end of the month and rebalance on the third trading day of the month. You can experiment with this parameter to get a feel for how much it affects the strategy.
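If you want to sanity-check that rebalance rule outside Zorro, here's a rough pandas equivalent of "third trading day of the month". It uses plain business days and ignores exchange holidays, so treat it as an approximation:

```python
import pandas as pd

def third_trading_days(start, end):
    """Approximate the third trading day of each month using business days."""
    bdays = pd.bdate_range(start, end)
    s = pd.Series(bdays, index=bdays)
    # take the third business day (index 2) within each (year, month) group
    return s.groupby([s.index.year, s.index.month]).nth(2).tolist()

print(third_trading_days("2018-01-01", "2018-03-31"))  # Jan 3, Feb 5, Mar 5 2018
```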

Simply wrap the trading logic in the following `if()` statement:

```c
if(tdm() == 3)
{
  ...
}
```

In the initial run of the script, we want Zorro to reference the newly created asset list. Also, if we don’t have data for these instruments, we want to download it in the initial run. We’ll use Alpha Vantage end-of-day data, which can be accessed directly from within Zorro scripts. These lines of code take care of that for us:

```c
if(is(INITRUN))
{
  assetList("History\\AssetsDM.csv");
  string Name;
  while(Name = loop(Assets))
  {
    assetHistory(Name, FROM_AV);
  }
}
```

Note that this assumes you’ve entered your Alpha Vantage API key in the Zorro.ini or ZorroFix.ini configuration files, which live in Zorro’s base directory. If you don’t have an Alpha Vantage API key head over to the Alpha Vantage website to claim one.

For our dual momentum strategy, we need to know the return of each instrument over the portfolio formation period. So we can loop through each asset in our list, calculate the return, and store it in an array for later use.

If you intend on using Zorro's optimizer, perform the loop operation using a construct like:

```c
for(i=0; Name=Assets[i]; i++)
{
  ...
}
```

If you don't intend on using the optimizer, you can safely use the more convenient `while(loop(Assets))` construct.

The reason we don't use the latter in an optimization run is that the `loop()` function is handled differently in Zorro's Train mode, and will actually run a separate simulation for each instrument in the loop. This is perfect when we want to trade a particular algorithm across multiple, known instruments – something like a moving average crossover traded on each stock in the S&P 500, where we want to optimize the moving average periods separately for each instrument. But in an algorithm that compares and selects instruments from a universe, optimizing a parameter set on each instrument individually wouldn't make sense.

This is a really common mistake when developing these types of strategies in Zorro, but if you understand the behavior of `loop()` in Zorro's Train mode, it's one you probably won't make again.

Here’s the code for performing the looped return calculations:

```c
if(tdm() == 3)
{
  asset_num = 0;
  while(loop(Assets))
  {
    asset(Loop1);
    Returns[asset_num] = (priceClose(0) - priceClose(DAYS)) / priceClose(DAYS);
    asset_num++;
  }
  ...
```

Recalling our dual momentum trading logic, we first check whether US equities outperformed global equities. If so, we then check that their absolute return was positive; if it was, we hold US equities. If global equities outperformed US equities, we check that their absolute return was positive; if so, we hold global equities. If neither US equities nor global equities had a positive return, we hold bonds.
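That branching reduces to a small decision function. Here's a Python sketch of the selection rule (the function name and the `returns` dictionary are illustrative, not part of the Zorro script):

```python
def dual_momentum_pick(returns, us="SPY", world="EFA", bonds="SHY"):
    """Relative momentum picks the stronger equity ETF; absolute
    momentum then requires its formation-period return to be
    positive, otherwise we fall back to bonds."""
    leader = us if returns[us] >= returns[world] else world
    return leader if returns[leader] > 0 else bonds

pick = dual_momentum_pick({"SPY": 0.08, "EFA": 0.05, "SHY": 0.01})  # -> "SPY"
```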

If you stop and think about that logic, we are really just holding the instrument with the highest return in the formation period, with the added condition that for the equities instruments, they also had a positive absolute return. We could implement that trading logic like so:

```c
// sort returns lowest to highest
int* idx = sortIdx(Returns, asset_num);

// exit any positions where asset is ranked in bottom 2 and is not bonds
int i;
for(i = 0; i < 2; i++)
{
  asset(Assets[idx[i]]);
  if(Asset != "SHY")
  {
    if(NumOpenLong > 0)
    {
      printf("\nAsset to close: %s", Asset);
      exitLong();
    }
  }
}

// asset to hold
asset(Assets[idx[2]]);

/* check if asset is bonds, if so buy
   if not, if return of highest ranked asset is positive, buy
   otherwise, switch to bonds and buy */
if(Asset == "SHY")
{
  // don't apply time series momentum to bonds
  enterLong();
}
else if(Returns[idx[2]] > 0) // time-series momentum condition
{
  enterLong();
  asset("SHY");
  exitLong();
}
else
{
  // switch to bonds and buy
  asset("SHY");
  enterLong();
}
}
```

This is probably the most confusing part of the script, so let's talk about it in some detail. Firstly, the line

```c
int* idx = sortIdx(Returns, asset_num);
```

returns an array of the indexes of the `Returns` array, sorted from lowest to highest. Say our `Returns` array held the numbers 4, -2, 2. Our array `idx` would contain 1, 2, 0, because the item at `Returns[1]` is the lowest number, followed by the number at `Returns[2]`, with `Returns[0]` being the highest number. This might seem confusing, but it provides a convenient way to access the highest ranked instrument directly from the `Assets` array, which holds the names of the instruments in the order called by our `loop()` function.
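Zorro's `sortIdx()` is just an index sort (an "argsort"). The same idea in a Python sketch:

```python
def sort_idx(values):
    """Indices that would sort `values` ascending -- a Python
    analogue of Zorro's sortIdx()."""
    return sorted(range(len(values)), key=lambda i: values[i])

idx = sort_idx([4, -2, 2])  # -> [1, 2, 0]
```

With this, `idx[-1]` (the last index) always points at the highest-return instrument, just as `idx[2]` does in the three-asset Zorro script.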

In lines 5-17, we first use this feature to exit any open positions that aren't the highest ranked asset – provided those lower ranked assets aren't bonds. Remember, we might want to hold a bond position even if it isn't the highest ranked asset, so we won't exit any open bond positions just yet.

Next, in line 20, we switch to the highest ranked instrument. If that instrument is bonds, we don't bother checking the absolute return condition (it doesn't apply to bonds) and go long. If that instrument is one of the equities ETFs, we check the absolute return condition. If that turns out to be true, we enter a long position in that ETF, then switch to bonds and exit any open position we may have been holding.

Finally, if the absolute return condition on our top-ranked equities ETFs wasn’t true, we switch to bonds and enter a long position.

In this case we are simply going to be fully invested with all of our starting capital and any accrued profits in the currently selected instrument. Here’s the code for accomplishing that:

```c
Capital = 10000;
Margin = Capital + WinTotal - LossTotal;
```

Note that this is only possible because we are trading these instruments with no leverage (leverage is defined in the asset list above). If we were using leverage, we’d obviously have to reduce the amount of margin invested in a given position.
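To make that concrete, here's a hedged Python sketch of the sizing rule (the function name and leverage handling are illustrative, not part of the Zorro script): with leverage greater than 1, only a fraction of the desired notional exposure needs to be posted as margin.

```python
def margin_for(capital, win_total, loss_total, leverage=1):
    """Margin to commit when fully investing current equity.
    With leverage > 1, only notional / leverage must be posted."""
    equity = capital + win_total - loss_total
    return equity / leverage
```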

Finally, here’s the complete code listing for our simple Dual Momentum algorithm. In order for the script to run, remember to save a copy of the asset list in Zorro’s History folder, and enter your Alpha Vantage API key in the Zorro.ini or ZorroFix.ini configuration files.

```c
/* Dual momentum in Zorro */

#define NUM_ASSETS 3
#define DAYS 252

int asset_num;

function run()
{
  set(PLOTNOW);
  PlotWidth = 1200;
  StartDate = 20040101;
  EndDate = 20170630;
  BarPeriod = 1440;
  LookBack = DAYS;
  MaxLong = 1;

  if(is(INITRUN))
  {
    assetList("History\\AssetsDM.csv");
    string Name;
    while(Name = loop(Assets))
    {
      assetHistory(Name, FROM_AV);
    }
  }

  var Returns[NUM_ASSETS];
  int position_diff;

  if(tdm() == 3)
  {
    asset_num = 0;
    while(loop(Assets))
    {
      asset(Loop1);
      Returns[asset_num] = (priceClose(0) - priceClose(DAYS)) / priceClose(DAYS);
      asset_num++;
    }

    Capital = 10000;
    Margin = Capital + WinTotal - LossTotal;

    // sort returns lowest to highest
    int* idx = sortIdx(Returns, asset_num);

    // exit any positions where asset is ranked in bottom 2 and is not bonds
    int i;
    for(i = 0; i < 2; i++)
    {
      asset(Assets[idx[i]]);
      if(Asset != "SHY")
      {
        if(NumOpenLong > 0)
        {
          printf("\nAsset to close: %s", Asset);
          exitLong();
        }
      }
    }

    // asset to hold
    asset(Assets[idx[2]]);

    /* check if asset is bonds, if so buy
       if not, if return of highest ranked asset is positive, buy
       otherwise, switch to bonds and buy */
    if(Asset == "SHY")
    {
      // don't apply time series momentum to bonds
      enterLong();
    }
    else if(Returns[idx[2]] > 0) // time-series momentum condition
    {
      enterLong();
      asset("SHY");
      exitLong();
    }
    else
    {
      // switch to bonds and buy
      asset("SHY");
      enterLong();
    }
  }
}
```

Over the simulation period, the strategy returns a Sharpe Ratio of 0.52. That's pretty healthy for something that trades so infrequently. In terms of gross returns, the starting capital of $10,000 was almost tripled, and the maximum drawdown was approximately $4,700. One of the main limitations of the strategy is that, by design, it is highly concentrated, taking only a single position at a time.
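If you want to sanity-check headline figures like these yourself, both can be computed from the equity series. A Python sketch (illustrative only; Zorro computes these internally, and conventions vary, e.g. the annualization factor and risk-free rate):

```python
def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe of per-period returns, assuming a zero
    risk-free rate and using the sample standard deviation."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / var ** 0.5 * periods_per_year ** 0.5

def max_drawdown(equity):
    """Largest peak-to-trough fall of an equity curve, in dollars."""
    peak, worst = float("-inf"), 0.0
    for e in equity:
        peak = max(peak, e)
        worst = max(worst, peak - e)
    return worst
```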

Here’s the equity curve:

Rotation style strategies* require a slightly different design approach than strategies for which the tradable subset of instruments is static. By following the five broad design principles described here, you can leverage Zorro's speed, power and flexibility to develop these types of strategies. Good luck and happy profits!

*This is just one of the many algorithmic trading fundamentals we cover inside Class to Quant. Not only are our members improving their trading performance with our beginner to advanced courses, but together they’re building functioning strategies inside our community as part of our Algo Laboratory projects. If you’re interested and want to find out more, try Class to Quant for 30 days risk free. I’d love to meet you inside.

**Where to from here?**

*Check out my review of Gary Antonacci’s Dual Momentum, and explore some other variations written in R*

- Get our free Zorro for Beginners video series, and go from beginner to Zorro trader in just 90 minutes
*If you’re ready to go deeper and get more practical tips and tricks on building robust trading systems, as well as joining our strong community of traders, check out our flagship offer Class to Quant.*

The post ETF Rotation Strategies in Zorro appeared first on Robot Wealth.

]]>The post Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization appeared first on Robot Wealth.

]]>This is the fourth in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration** under the Windows environment.

Part 3 is an **introduction to the model building, training and evaluation process in Keras**. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.


In the last post, we trained a densely connected feed forward neural network to forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. We landed on a model that predicted slightly better than random on out of sample data. We also saw in our learning plots that our network started to overfit badly at around 40 epochs. In this post, I’m going to demonstrate some tools to help fight overfitting and push your models further. Let’s get started.

Regularization is a commonly used technique to mitigate overfitting of machine learning models, and it applies equally to deep learning. Regularization constrains the complexity of a network by penalizing larger weights during the training process: a term is added to the loss function that grows as the weights increase.
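In other words, the training loss becomes the task loss plus a weight penalty. A Python sketch of the idea (an illustration, not Keras internals):

```python
def penalized_loss(base_loss, weights, l1=0.0, l2=0.0):
    """Total loss = task loss + L1/L2 penalties on the weights.
    Larger weights inflate the loss, so training favours small ones."""
    penalty = sum(l1 * abs(w) + l2 * w ** 2 for w in weights)
    return base_loss + penalty
```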

Keras implements two common types of regularization:

- L1, where the additional cost is proportional to the **absolute value** of the weight coefficients
- L2, where the additional cost is proportional to the **square** of the weight coefficients

These are incredibly easy to implement in Keras: simply pass `regularizer_l1(regularization_factor)` or `regularizer_l2(regularization_factor)` to the `kernel_regularizer` argument of a Keras layer instance (details on how to do this below). Depending on the type of regularization chosen, `regularization_factor * abs(weight_coefficient)` or `regularization_factor * weight_coefficient^2` is added to the total loss.

Note that in Keras speak, 'kernel' refers to the weights matrix created by a layer. Regularization can also be applied to the bias terms via the `bias_regularizer` argument, and to the output of a layer via `activity_regularizer`.

When we add regularization to a network, we might find that we need to train it for more epochs in order to reach convergence. This implies that the network might benefit from a higher learning rate during the early stages of model training.

However, we also know that sometimes a network can benefit from a smaller learning rate at later stages of the training process. Think of the model's loss as being stuck partway down towards the global minimum, bouncing from one side of the loss surface to the other with each weight update. By reducing the learning rate, we make the subsequent weight updates less dramatic, which enables the loss to 'fall' further down towards the true global minimum.
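The plateau logic itself is simple to sketch. Here's a minimal Python version of the idea behind such a learning-rate scheduler (parameter names mirror the Keras callback for familiarity, but this is not its implementation):

```python
class PlateauScheduler:
    """Cut the learning rate by `factor` when the monitored metric
    hasn't improved by more than `epsilon` for `patience` epochs."""
    def __init__(self, lr, factor=0.9, patience=10, epsilon=0.005, min_lr=1e-5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.epsilon, self.min_lr = epsilon, min_lr
        self.best, self.wait = float("-inf"), 0

    def on_epoch_end(self, val_acc):
        if val_acc > self.best + self.epsilon:
            # meaningful improvement: reset the patience counter
            self.best, self.wait = val_acc, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr
```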

By using another Keras callback, we can automatically adjust our learning rate downwards when training reaches a plateau:

```r
reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9,
                                           patience = 10, verbose = 1, mode = "auto",
                                           epsilon = 0.005, min_lr = 0.00001)
```

This tells Keras to reduce the learning rate by a factor of 0.9 whenever validation accuracy doesn't improve for `patience` epochs. Also note the `epsilon` parameter, which controls the threshold for measuring a new optimum. Setting this to a higher value results in fewer changes to the learning rate. This parameter should be on a scale relevant to the metric being tracked, validation accuracy in this case.

Here's the code for an L2-regularized feed forward network with both `reduce_lr_on_plateau` and `model_checkpoint` callbacks (data import and processing is the same as in the previous post):

```r
###### FFN with weight regularization #####
model.reg <- keras_model_sequential()
model.reg %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001),
              activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001),
              activation = 'relu') %>%
  layer_dense(units = 150, kernel_regularizer = regularizer_l2(0.001),
              activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model.reg)

model.reg %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)

filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model_reg.hdf5" # set up your own filepath
checkpoint <- callback_model_checkpoint(filepath = filepath, monitor = "val_acc",
                                        verbose = 1, save_best_only = TRUE,
                                        save_weights_only = FALSE, mode = "auto")

reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9,
                                           patience = 20, verbose = 1, mode = "auto",
                                           epsilon = 0.005, min_lr = 0.00001)

history.reg <- model.reg %>% fit(
  X_train, Y_train,
  epochs = 100,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint, reduce_lr)
)

# plot training loss and accuracy
plot(history.reg)
max(history.reg$metrics$val_acc)

# load and evaluate best model
rm(model.reg)
model.reg <- keras:::keras$models$load_model(filepath)
model.reg %>% evaluate(X_test, Y_test)
```

Plotting the training curves now gives us three plots – loss, accuracy and learning rate:

This particular training process resulted in an out of sample accuracy of 53.4%, slightly better than our original unregularized model. You can experiment with more or less regularization, as well as applying regularization to the bias terms and layer outputs.

Dropout is another commonly used tool to fight overfitting. Whereas regularization is used throughout the machine learning ecosystem, dropout is specific to neural networks. Dropout is the random zeroing ("dropping out") of some proportion of a layer's outputs during training. The theory is that this helps prevent pairs or groups of nodes from learning random relationships that just happen to reduce the network loss on the training set (that is, result in overfitting). Hinton and his colleagues, who introduced dropout, showed that it is generally superior to other forms of regularization and improves model performance on a variety of tasks. Read the original paper here.
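The mechanics are easy to sketch in Python. This is "inverted" dropout, as most modern frameworks implement it: surviving activations are scaled up by 1/(1-rate) so the layer's expected output is unchanged, and at inference time no units are dropped at all:

```python
import random

def dropout(activations, rate, rng=random.Random(42)):
    """Zero each activation with probability `rate`; scale the
    survivors by 1/(1 - rate) ('inverted' dropout, training only)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

out = dropout([1.0] * 1000, rate=0.3)  # roughly 30% of entries become 0.0
```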

Dropout is implemented in Keras as its own layer, `layer_dropout()`, which drops outputs according to its `rate` parameter. In practice, dropout rates between 0.2 and 0.5 are common, but the optimal values for a particular problem and network configuration need to be determined through appropriate cross validation.

At the risk of getting ahead of ourselves: when applying dropout to recurrent architectures (which we'll explore in a future post), we need to apply the same pattern of dropout at every timestep, otherwise dropout tends to hinder performance rather than enhance it.

Here’s an example of how we build a feed forward network with dropout in Keras:

```r
###### FFN with dropout #####
model.drop <- keras_model_sequential()
model.drop %>%
  layer_dense(units = 150, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = 'sigmoid')

summary(model.drop)

model.drop %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.001),
  metrics = c('accuracy')
)

filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model_drop.hdf5" # set up your own filepath
checkpoint <- callback_model_checkpoint(filepath = filepath, monitor = "val_acc",
                                        verbose = 1, save_best_only = TRUE,
                                        save_weights_only = FALSE, mode = "auto")

reduce_lr <- callback_reduce_lr_on_plateau(monitor = "val_acc", factor = 0.9,
                                           patience = 20, verbose = 1, mode = "auto",
                                           epsilon = 0.005, min_lr = 0.00001)

history.drop <- model.drop %>% fit(
  X_train, Y_train,
  epochs = 150,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint, reduce_lr)
)

# plot training loss and accuracy
plot(history.drop)
max(history.drop$metrics$val_acc)

# load and evaluate best model
rm(model.drop)
model.drop <- keras:::keras$models$load_model(filepath)
model.drop %>% evaluate(X_test, Y_test)
```

Training the model using the same procedure as we used in the L2-regularized model above, including the reduce learning rate callback, we get the following training curves:

One of the reasons dropout is so useful is that it enables the training of larger networks by reducing their propensity to overfit. Here’s the training curves for a similar model but this time eight layers deep:

Notice that it doesn’t overfit significantly worse than the shallower model. Also notice that it didn’t really learn any new, independent relationships from the data – this is evidenced by the failure to beat the previous model’s validation accuracy. Perhaps 53% is the upper out of sample accuracy limit for this data set and this approach to modeling it.

With dropout, you can also afford to use a larger learning rate, which means it's a good idea to make use of the `reduce_lr_on_plateau` callback: kick off training with a higher learning rate, which can always be decayed as learning stalls.

Finally, one important consideration when using dropout is constraining the size of the network weights, particularly when a large learning rate is used early in training. In the Hinton et al. paper, this was done by constraining the maximum norm of each unit's incoming weight vector. Keras makes that easy thanks to the `kernel_constraint` parameter of `layer_dense()`:

```r
max_weight_constraint <- 5

model.drop <- keras_model_sequential()
model.drop %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint),
              input_shape = ncol(X_train)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 150, activation = 'relu',
              kernel_constraint = constraint_maxnorm(max_value = max_weight_constraint)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = 'sigmoid')
```

This model provided an ever-so-slight bump in validation accuracy:

And quite a stunning test-set equity curve:

```r
# get predictions on test set and plot simple, frictionless PnL
preds <- model.drop %>% predict_proba(X_test)
threshold <- 0.5
trades <- ifelse(preds >= threshold, Y_test_raw,
                 ifelse(preds <= 1 - threshold, -Y_test_raw, 0))
plot(cumsum(trades), type = 'l')
```

Interestingly, every experiment I performed in writing this post resulted in a positive out of sample equity curve. The results were all slightly different, even when using the same model setup, which reflects the non-deterministic nature of the training process (two identical networks trained on the same data can result in different weights, depending on the initial, pre-training weights of each network). Some equity curves were better than others, but they were all positive.

Here are some examples:

Of course, as mentioned in the last post, the edge of these models disappears when we apply retail spreads and broker commissions, but the frictionless equity curves demonstrate that deep learning, even using a simple feed-forward architecture, can extract predictive information from historical price action, at least for this particular data set, and that tools like regularization and dropout can make a difference to the quality of the model’s predictions.

Before we get into advanced model architectures, in the next unit I’ll show you:

- One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
- How to interrogate and visualize the training process in real time.

This post demonstrated how to fight overfitting with regularization and dropout using Keras’ sequential model paradigm. While we further refined our previously identified slim edge in predicting the EUR/USD exchange rate’s direction, in practical terms, traders with access to retail spreads and commission will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

**Where to from here?**

- *To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry*
- *If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform*
- *If the **technical details of neural networks** are interesting for you, you might like our introductory article*
- *Be sure to check out Part 1, Part 2, and Part 3 of this series on deep learning applications for trading.*
- *If you're ready to go deeper and get more practical tips and tricks on building robust trading systems, consider becoming a Robot Wealth member.*

The post Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization appeared first on Robot Wealth.

]]>The post Deep Learning for Trading Part 3: Feed Forward Networks appeared first on Robot Wealth.

]]>This is the third in a multi-part series in which we **explore and compare various deep learning tools and techniques for market forecasting using Keras and TensorFlow**.

In Part 1, we introduced Keras and discussed some of the major obstacles to using deep learning techniques in trading systems, including a warning about attempting to extract meaningful signals from historical market data. If you haven’t read that article, it is highly recommended that you do so before proceeding, as the context it provides is important. Read Part 1 here.

Part 2 provides a walk-through of setting up **Keras and TensorFlow for R** using either the default **CPU-based configuration**, or the more complex and involved (but well worth it) **GPU-based configuration** under the Windows environment. Read Part 2 here.

Part 3 is an **introduction to the model building, training and evaluation process in Keras**. We train a simple feed forward network to predict the direction of a foreign exchange market over a time horizon of one hour and assess its performance.


Now that you can train your deep learning models on a GPU, the fun can really start. By the end of this series, we’ll be building interesting and complex models that predict multiple outputs, handle the sequential and temporal aspects of time series data, and even use custom cost functions that are particularly relevant to financial data. But before we get there, we’ll start with the basics.

In this post, we’ll build our first neural network in Keras, train it, and evaluate it. This will enable us to understand the basic building blocks of Keras, which is a prerequisite for building more advanced models.

There are numerous possible ways to formulate a market forecasting problem. For the sake of this example, we will forecast the direction of the EUR/USD exchange rate over a time horizon of one hour. That is, our model will attempt to classify the next hour’s market direction as either up or down.

Our data will consist of hourly EUR/USD exchange rate history obtained from FXCM (**IMPORTANT**: read the caveats and limitations associated with using past market data to predict the future here). Our data covers the period 2010 to 2017.

Our features will simply consist of a number of variables related to price action:

- Change in hourly closing price
- Change in hourly highest price
- Change in hourly lowest price
- Distance between the hourly high and close
- Distance between the hourly low and close
- Distance between the hourly high and low (the hourly range)

We will use several past values of these variables, as well as the current values, to predict the target. We'll also include the hour of day as a feature in the hope of capturing intraday seasonality effects.
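Given hourly OHLC bars, those six price-action features can be sketched as follows (a pure-Python illustration; the actual features are generated by the Zorro script in the download link):

```python
def bar_features(prev_close, prev_high, prev_low, high, low, close):
    """The six price-action features for one hourly bar."""
    return {
        "d_close": close - prev_close,    # change in closing price
        "d_high": high - prev_high,       # change in highest price
        "d_low": low - prev_low,          # change in lowest price
        "high_close": high - close,       # distance between high and close
        "low_close": close - low,         # distance between low and close
        "range": high - low,              # the hourly range
    }
```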

Training of neural networks normally proceeds more efficiently if we scale our input features to force them into a similar range. There are various scaling strategies throughout the deep learning literature (see for example Geoffrey Hinton's Neural Networks for Machine Learning course), but scaling remains something of an art rather than a one-size-fits-all problem.

The standard approach to scaling involves normalizing the *entire* data set using the mean and standard deviation of each feature in the *training* set. This prevents data leakage from the test and validation sets into the training set, which can produce overly optimistic results. The problem with this approach for financial data is that it often results in scaled test or validation data that winds up being way outside the range of the training set. This is related to the problem of non-stationarity of financial data and is a significant issue. After all, if a model is asked to predict on data that is very different to its training data, it is unlikely to produce good results.
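A Python sketch of that standard approach, fitting the statistics on the training set only (illustrative, using population statistics):

```python
def fit_scaler(train_col):
    """Mean and standard deviation of one feature, from training data only."""
    n = len(train_col)
    mean = sum(train_col) / n
    std = (sum((x - mean) ** 2 for x in train_col) / n) ** 0.5
    return mean, std

def scale(col, mean, std):
    """Normalize any split using TRAINING-set statistics, so no
    information leaks from validation/test into training."""
    return [(x - mean) / std for x in col]
```

Note how a test value far beyond the training range (say 10 against a training set of 1-5) scales to roughly 5 standard deviations, exactly the "way outside the range" problem described above.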

One way around this is to scale data relative to the recent past. This ensures that the test and validation data is always on the intended scale. But the downside is that we introduce an additional parameter to our model: the amount of data from the recent past that we use in our scaling function. So we end up introducing another problem to solve an existing one.
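A sketch of that rolling alternative, where `window` is exactly the extra parameter the text mentions (an illustration, not the Zorro script's scaling code):

```python
def rolling_scale(series, window):
    """Scale each point by the mean/std of the previous `window`
    observations; the first `window` points have no history and
    are returned as None."""
    out = [None] * len(series)
    for t in range(window, len(series)):
        hist = series[t - window:t]
        mean = sum(hist) / window
        std = (sum((x - mean) ** 2 for x in hist) / window) ** 0.5
        out[t] = (series[t] - mean) / std if std > 0 else 0.0
    return out
```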

Like I said, feature scaling is something of an art form, particularly when dealing with data as poorly behaved as financial data!

We’ll do our model building and experimentation in R, but first we need to generate our data. There is a Zorro script named ‘keras_data_gen.c’ for creating our targets and scaled features, and for exporting that data to a CSV file in this download link. The script will allow you to code your own features and targets, use different scaling strategies, and generate data for different instruments. Just make the changes, then click ‘Train’ on the Zorro GUI to export the data to file. If you’d prefer to just get your hands on the data used in this post, it’s also available in that same link, as is all the R code used in this post.

Our target is the direction of the market over a period of one hour, which implies a classification problem. The target exported in the script is the actual dollar amount made or lost by going long the market at 0.01 lots, exclusive of trading costs. We need to convert this to a factor reflecting the market’s movement either up or down. More on this below.
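That conversion is a one-liner. A Python sketch (the function name is illustrative; the R code in the download link does the equivalent):

```python
def to_direction(pnl_per_bar):
    """1 if a long position made money over the next hour
    (market moved up), else 0."""
    return [1 if p > 0 else 0 for p in pnl_per_bar]
```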

Let’s import our data into R and take a closer look. First, here’s a time series plot of the first ten days of our scaled features:

You can see that our features are roughly on the same scale. Notice the first feature, V1, which corresponds to the hour of the day. It has been scaled using a slightly different approach to the other variables to ensure that the cyclical nature of that variable is maintained. See the code in the download link above for details.

Next, here’s a scatterplot matrix of our variables and target (the first ten days of data only):

Now that we’ve got our data, we’ll see if we can extract any predictive information using deep learning techniques. In this post, we’ll look at fully connected feed-forward networks, which are kind of the like the ‘Hello World’ example of deep learning. In later posts, we’ll explore some more interesting networks.

A fully connected feed forward network is one in which every neuron in a particular layer is connected to every neuron in the subsequent layer, and in which information flows in one direction only, from input to output.

Here’s a schematic of such a network with an input layer, two hidden layers and an output layer consisting of a single neuron (source: datasciencecentral.com):

It makes sense that our network would likely benefit from using not only the features for the current time step, but also a number of prior values as well, in order to predict the target. That means that we need to create features out of lagged values of our raw feature variables.

Thankfully, that's easily accomplished using base R's `embed()` function, which also automatically drops the NA values that arise in the first \(n\) observations, where \(n\) is the number of lags used as features. Here's a function which returns an expanded data set consisting of the current features as well as their `num_lags` lagged values. It assumes that the target is in the final column (and doesn't embed lagged values of the target), and drops the relevant NA values from the target column.

```r
# function for creating features from lagged variables
lag_variables_to_features <- function(data, num_lags = 1) {
  d <- embed(data[, -ncol(data)], num_lags + 1) # this automatically drops NA, assumes target in last column
  d <- cbind(d, data[(num_lags + 1):nrow(data), ncol(data)]) # add column for target, dropping num_lags
  return(d)
}
```

Let’s test the function and take a look at its output:

```r
# test lagging function
set.seed(503)
dat <- replicate(3, rnorm(10, 0, 1))
dat
#               [,1]        [,2]       [,3]
#  [1,]  0.355125070 -0.42202083  2.2040012
#  [2,] -0.778893409 -0.03744167  0.4128119
#  [3,] -0.757356957 -0.20609016  1.0322519
#  [4,]  2.329800607  2.01835389  0.7804746
#  [5,]  0.283974926 -0.60559854  2.5843431
#  [6,]  1.281025216 -0.28414168  0.2339200
#  [7,] -0.002363249  0.96044445  1.3501947
#  [8,]  1.033770690  0.74774752 -0.4097266
#  [9,] -0.431933268 -0.01286499 -0.3662180
# [10,] -0.342867464 -0.71862991 -1.0912861

dat <- lag_variables_to_features(dat, 2)
dat
#              [,1]        [,2]         [,3]        [,4]         [,5]        [,6]       [,7]
# [1,] -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.355125070 -0.42202083  1.0322519
# [2,]  2.329800607  2.01835389 -0.757356957 -0.20609016 -0.778893409 -0.03744167  0.7804746
# [3,]  0.283974926 -0.60559854  2.329800607  2.01835389 -0.757356957 -0.20609016  2.5843431
# [4,]  1.281025216 -0.28414168  0.283974926 -0.60559854  2.329800607  2.01835389  0.2339200
# [5,] -0.002363249  0.96044445  1.281025216 -0.28414168  0.283974926 -0.60559854  1.3501947
# [6,]  1.033770690  0.74774752 -0.002363249  0.96044445  1.281025216 -0.28414168 -0.4097266
# [7,] -0.431933268 -0.01286499  1.033770690  0.74774752 -0.002363249  0.96044445 -0.3662180
# [8,] -0.342867464 -0.71862991 -0.431933268 -0.01286499  1.033770690  0.74774752 -1.0912861
```

You can see that the function returns a new dataset with the current features and their last two lagged values, while the target remains unchanged in the final column. Note that the two rows that wind up with NA values are automatically dropped.

Essentially, this approach makes new features out of lagged values of each feature. But here’s the thing about feed forward networks: they don’t distinguish between more recent and older values of our features. The network does, of course, differentiate between the various features we create out of lagged values, and it can discern relationships between them, but it doesn’t explicitly account for the sequential nature of the data.

That’s one of the major limitations of fully connected feed forward networks applied to time series forecasting exercises, and one of the motivators of recurrent architectures, which we will get to soon enough.

Now that we can process our input data, we can start experimenting with the model building process. The best place to start is Keras’ sequential model, which is essentially a paradigm for constructing deep neural networks, one layer at a time, under the assumption that the network consists of a linear stack of layers and has only a single set of inputs and outputs. You’ll find that this assumption holds for the majority of networks that you build, and it provides a very modular and efficient method of experimenting with such networks. We’ll use the sequential model quite a lot over the coming posts before getting into some more complex models that don’t fit this paradigm.

In Keras, the model building and exploration workflow typically consists of the following steps:

- Define the input data and the target. Split the data into training, validation and test sets.
- Define a stack of layers that will be used to predict the target from the input. This is the step that defines the network architecture.
- Configure the model training process with an appropriate loss function, optimizer and various metrics to be monitored.
- Train the model by repeatedly exposing it to the training data and updating the network weights according to the loss function and optimizer chosen in the previous step.
- Evaluate the model on the test set.

Let’s go through each step.

Here’s some code for loading and processing our data. It first loads the data set we created with our Zorro script above, then creates a new data set consisting of the current value of each feature as well as its seven most recent lagged values. That is, we have eight timesteps for each feature, and since we started with 7 features, we have a total of 56 input variables.

We also split the dataset into a training, validation and testing set. Here, I arbitrarily chose to use 50% of the data for training, 25% for validation and 25% for testing. Note that since the time aspect of our data is critical, we should ensure that our training, validation and testing data are not randomly sampled as is standard procedure in many non-sequential applications. Rather, the training, validation and test sets should come from chronological time periods.

Note that we convert our target into a binary outcome, which enables us to build a classifier.

Recall that we scaled our features at the same time as we generated them, so no need to do any feature scaling here.

```r
## load, process and split data ##

# load
path <- "C:/Users/Kris/Data/"
XY <- read.csv(paste0(path, 'EURUSD_L_2010_2017.csv'), header = F)
XY <- as.matrix(XY)

# create lags
lags <- 7
proc <- lag_variables_to_features(XY, lags)

# split into training, validation and test sets
train_length <- floor(0.5 * nrow(proc))
val_length <- floor(0.25 * nrow(proc))

X_train <- proc[1:train_length, -ncol(proc)]
Y_train_raw <- proc[1:train_length, ncol(proc)]
Y_train <- ifelse(Y_train_raw > 0, 1, 0)

X_val <- proc[(train_length+1):(train_length+val_length), -ncol(proc)]
Y_val_raw <- proc[(train_length+1):(train_length+val_length), ncol(proc)]
Y_val <- ifelse(Y_val_raw > 0, 1, 0)

X_test <- proc[(train_length+val_length+1):nrow(proc), -ncol(proc)]
Y_test_raw <- proc[(train_length+val_length+1):nrow(proc), ncol(proc)]
Y_test <- ifelse(Y_test_raw > 0, 1, 0)
```

Next we define the stack of layers that will become our model. The syntax might seem quirky at first, but once you’re used to it, you’ll find that you can build and experiment with different architectures very quickly.

The syntax of the sequential model uses the pipe operator `%>%`, which you might be familiar with if you use the `dplyr` package. In essence, we define a model using the sequential paradigm, and then use the pipe operator to define the order in which layers are stacked. Here’s an example:

```r
model <- keras_model_sequential()
model %>%
  layer_dense(units = 150, activation = 'relu', input_shape = ncol(X_train)) %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dense(units = 150, activation = 'relu') %>%
  layer_dense(units = 1, activation = 'sigmoid')
```

This defines a fully connected feed forward network with three hidden layers, each consisting of 150 neurons with the rectified linear (`'relu'`) activation function. If you need a refresher on activation functions, check out this post on neural network basics.

`layer_dense()` defines a fully connected layer – that is, one in which each input is connected to every neuron in the layer. Note that for the first layer, we need to define the input shape, which is simply the number of features in our data set. We only need to do this for the first layer; each subsequent layer infers its input shape from the output of the prior layer.

`layer_dense()` has many arguments in addition to the activation function specified here, including the weight initialization scheme and various regularization settings. We use the defaults in this example.

Keras implements many other layers, some of which we’ll explore in subsequent posts.

In this example, our network terminates with an output layer consisting of a single neuron with the sigmoid activation function. This activation function converts the output to a value between 0 and 1, which we interpret as the probability associated with the positive class in a binary classification problem (in this case, the value 1, corresponding to an up move).

To get an overview of the model, call `summary(model)` and observe the output:

```
____________________________________________________________________________
Layer (type)                     Output Shape                  Param #
============================================================================
dense_1 (Dense)                  (None, 150)                   8550
____________________________________________________________________________
dense_2 (Dense)                  (None, 150)                   22650
____________________________________________________________________________
dense_3 (Dense)                  (None, 150)                   22650
____________________________________________________________________________
dense_4 (Dense)                  (None, 1)                     151
============================================================================
Total params: 54,001
Trainable params: 54,001
Non-trainable params: 0
____________________________________________________________________________
```

This model architecture would better be described as ‘wide’ than ‘deep’, and it consists of around 54,000 trainable parameters. That’s more than the number of observations in our data set, which has implications for our network’s ability to overfit.
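As a quick sanity check (a back-of-envelope sketch, not part of the training workflow), we can reproduce those parameter counts by hand: a dense layer has one weight per input–neuron pair plus one bias per neuron.

```r
# reproduce the parameter counts reported by summary(model)
dense_params <- function(n_in, n_units) n_in * n_units + n_units  # weights + biases

inputs <- 56                          # 7 features x 8 timesteps
layer1 <- dense_params(inputs, 150)   # 56*150 + 150 = 8550
layer2 <- dense_params(150, 150)      # 150*150 + 150 = 22650
layer3 <- dense_params(150, 150)      # 22650
output <- dense_params(150, 1)        # 150*1 + 1 = 151

layer1 + layer2 + layer3 + output     # 54001 total trainable parameters
```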

Configuration of the training process is accomplished via the `compile()` function, in which we specify a loss function, an optimizer, and a set of metrics to monitor during training. Keras implements a suite of loss functions, optimizers and metrics out of the box, and in this example we’ll choose some sensible defaults:

```r
model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.0001),
  metrics = c('accuracy')
)
```

The `'binary_crossentropy'` loss function is standard for binary classifiers, and the `rmsprop` optimizer is nearly always a good choice. Here we specify a learning rate of 0.0001, but finding a sensible value typically requires some experimentation. Finally, we tell Keras to track our model’s accuracy, as well as its loss, during training.

An important consideration regarding loss functions for financial prediction is that the standard loss functions rarely capture the realities of trading. For example, consider a regression model that predicts a price change over some time horizon, trained on the mean absolute error of its predictions. Say the model predicted a price change of 20 ticks, but the actual outcome was 10 ticks. In practical trading terms, such an outcome would result in a profit of 10 ticks – not a terrible outcome at all. But that result is treated the same as a prediction of 5 ticks followed by an actual outcome of -5 ticks, which would produce a loss of 5 ticks in a trading model. That’s because the loss function is concerned only with the magnitude of the difference between the predicted and actual outcomes – and that doesn’t tell the full story. Clearly, we’d like to penalize the latter error more heavily than the former. To do that, we need to implement our own custom loss functions. I’ll show you how in a later post, but for now it’s important to be cognizant of the limitations of our model training process.
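As a preview of that idea (a minimal sketch under my own assumptions – the function name and the doubled penalty are illustrative choices, not the implementation from the later post), Keras accepts any function of `y_true` and `y_pred` built from its backend operations as a custom loss:

```r
library(keras)

# hypothetical sign-aware loss for a regression model of price changes:
# errors where prediction and outcome disagree in sign are penalized twice as heavily
sign_aware_mae <- function(y_true, y_pred) {
  wrong_side <- k_cast(k_less(y_true * y_pred, 0), 'float32')  # 1 when signs disagree
  k_mean(k_abs(y_true - y_pred) * (1 + wrong_side))
}

# used like any built-in loss:
# model %>% compile(loss = sign_aware_mae, optimizer = optimizer_rmsprop(lr = 0.0001))
```

An error made on the wrong side of the market is penalized twice as heavily as one made on the right side, which aligns the loss a little better with trading outcomes.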

We can train our model using `fit()`, which exposes the model to successive batches of training data, updating the network’s weights after each batch. Training progresses for a specified number of epochs, and performance is monitored on both the training and validation sets.

We would normally like to stop training at the number of epochs that maximizes the model’s performance on the validation set – that is, at the point just before the network starts to overfit. The problem is that we can’t know this number of epochs ahead of time.

To combat this, `fit()` implements the concept of a callback, which is simply a function that performs some task at various points throughout the training process. A number of callbacks are available in Keras out of the box, and it is also possible to implement your own.
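For instance (an alternative to the checkpointing approach used below, included only as an illustration), Keras ships with an early-stopping callback that halts training once the monitored metric stops improving for a given number of epochs:

```r
library(keras)

# stop training if validation loss fails to improve for 10 consecutive epochs
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 10)

# passed to fit() in the same way as any other callback:
# model %>% fit(..., callbacks = list(early_stop))
```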

In this example we’ll use the `callback_model_checkpoint()` callback, which we configure to save the network and its weights at the end of any epoch whose weight updates result in improved validation performance. After training is complete, we can then load our best model for evaluation on the test set.

First, here’s how to configure the checkpoint callback (just set up the relevant filepath for your setup):

```r
filepath <- "C:/Users/Kris/Research/DeepLearningForTrading/model.hdf5"
checkpoint <- callback_model_checkpoint(
  filepath = filepath,
  monitor = "val_acc",
  verbose = 1,
  save_best_only = TRUE,
  save_weights_only = FALSE,
  mode = "auto"
)
```

And here’s how to configure `fit()` for a short training run of 75 epochs, with the model checkpoint callback:

```r
history <- model %>% fit(
  X_train, Y_train,
  epochs = 75,
  batch_size = nrow(X_train),
  validation_data = list(X_val, Y_val),
  shuffle = TRUE,
  callbacks = list(checkpoint)
)
```

After training is complete, we can plot the loss and accuracy on the training and validation sets at each epoch by simply calling `plot(history)`, which results in the following plot:

We can see that loss on the training set continuously decreases while accuracy almost continuously increases as training progresses. That is expected given the power of our network to overfit. But note the small decrease in validation loss and the bump in validation accuracy that we also get out to about 40 epochs before stalling.

A validation accuracy of a little under 53% is certainly not the sort of result that would turn heads in the classic applications of deep learning, like image classification. But trading is an interesting application, because we don’t necessarily need that sort of performance to make money. Is a validation accuracy of 53% enough to give us some out-of-sample profits? Let’s find out by evaluating our model on the test set.

Here’s how to remove the fully trained model, load the model with the highest validation accuracy and evaluate it on the test set, with the output shown below the code:

```r
rm(model)
model <- keras:::keras$models$load_model(filepath)  # equivalently: load_model_hdf5(filepath)
model %>% evaluate(X_test, Y_test)

# output:
# 12004/12004 [==============================] - 2s 197us/step
# $loss
# [1] 0.691
#
# $acc
# [1] 0.523
```

We end up with a test set accuracy that is only slightly worse than our validation accuracy.

But accuracy is one thing; profitability is another. To assess the profitability of our model on the test set, we need its actual predictions on that set. We could get the predicted classes via `predict_classes()`, but I prefer to look at the raw output of the sigmoid function in the final layer. That enables you to use a prediction threshold in your decision making – for example, only entering a long trade when the output is greater than 0.6, say.

Here’s how to get the test set predictions and implement some simple, frictionless trading logic: when the prediction exceeds some threshold (equivalent to a buy), the trade’s profit or loss is the target itself; when the prediction is below 1 minus the threshold (equivalent to a sell), it is the negative of the target:

```r
preds <- model %>% predict_proba(X_test)
threshold <- 0.5
trades <- ifelse(preds >= threshold, Y_test_raw,
                 ifelse(preds <= 1 - threshold, -Y_test_raw, 0))
plot(cumsum(trades), type = 'l')
```

This results in the following equity curve (the y-axis is measured in dollars of profit from buying and selling the minimum position size of 0.01 lots):

I think that’s quite an amazing equity curve that demonstrates the potential of even a very small edge. However, note that adding typical retail transaction costs would destroy this small edge, which suggests that longer holding periods are more sensible targets, or that higher accuracies are required in practice.
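To put a number on that “very small edge” (a back-of-envelope sketch assuming symmetric win and loss sizes and no transaction costs – simplifications for illustration, not properties of our model): with win rate p and average move size w, the expected profit per trade is p·w − (1−p)·w = (2p − 1)·w.

```r
# expected profit per trade for symmetric payoffs and no transaction costs
expected_edge <- function(p, avg_move = 1) (2 * p - 1) * avg_move

expected_edge(0.53)  # ~0.06: roughly 6% of the average move per trade
expected_edge(0.50)  # 0: a coin flip has no edge
```

Costs eat directly into that 6%, which is why our frictionless result above is fragile.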

Also note that you might get different results depending on the initial weights used in your network, as the weights aren’t guaranteed to converge to the same values when initialized to different values. If you repeat the training and evaluation process a number of times, you’ll find that validation accuracies in the range of 52-53% occur most of the time, but while most produce profitable out of sample equity curves, the range of performance is actually quite significant. This implies that there might be benefit in combining the predictions of multiple models using ensemble methods.
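A minimal sketch of that ensemble idea (assuming a hypothetical `build_model()` helper that wraps the architecture and compile steps from above – the helper and the choice of five models are my own illustrative assumptions, not results from the post):

```r
library(keras)

# hypothetical helper wrapping the architecture and compile steps from above
build_model <- function() {
  model <- keras_model_sequential()
  model %>%
    layer_dense(units = 150, activation = 'relu', input_shape = ncol(X_train)) %>%
    layer_dense(units = 150, activation = 'relu') %>%
    layer_dense(units = 150, activation = 'relu') %>%
    layer_dense(units = 1, activation = 'sigmoid')
  model %>% compile(loss = 'binary_crossentropy',
                    optimizer = optimizer_rmsprop(lr = 0.0001),
                    metrics = c('accuracy'))
  model
}

# train an ensemble of 5 networks and average their predicted probabilities
n_models <- 5
all_preds <- sapply(1:n_models, function(i) {
  model <- build_model()  # fresh random weight initialization each time
  model %>% fit(X_train, Y_train, epochs = 75, batch_size = nrow(X_train),
                validation_data = list(X_val, Y_val), verbose = 0)
  model %>% predict_proba(X_test)
})
ensemble_preds <- rowMeans(all_preds)  # average probability across the ensemble
```

Averaging the sigmoid outputs before thresholding tends to smooth out the run-to-run variation that different weight initializations produce.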

Before we get into advanced model architectures, in the next unit I’ll show you:

- How to fight overfitting and push your models to generalize better.
- One of the more cutting edge architectures to get the most out of a densely connected feed forward network.
- How to interrogate and visualize the training process in real time.

This post demonstrated how to process multivariate time series data for use in a feed forward neural network, and how to construct, train and evaluate such a network using Keras’ sequential model paradigm. While we uncovered a slim edge in predicting the EUR/USD exchange rate, traders paying retail spreads and commissions will want to consider longer holding times to generate more profit per trade, or will need a more performant model to make money with this approach.

**Where to from here?**

- To find out **why AI is taking off in finance**, check out these insights from my days as an AI consultant to the finance industry.
- If this **walk-through** was useful for you, you might like to check out another how-to article on running trading algorithms on Google Cloud Platform.
- If the **technical details of neural networks** interest you, you might like our introductory article.
- Be sure to check out Part 1 and Part 2 of this series on deep learning applications for trading.

The post Deep Learning for Trading Part 3: Feed Forward Networks appeared first on Robot Wealth.

Robot Wealth Members can access the script that produced these results via the Strategies and Tools section of their dashboard.