# Evolving Thoughts on Data Mining

Several years ago, I wrote about some experimentation I’d done with data mining for predictive features from financial data. The article has had several tens of thousands of views and nearly 100 comments.

I *think* the popularity of the article lay in its demonstration of various tools and modeling frameworks for doing data mining in R (it didn’t generate any alpha, so it can’t have been that). To that end, I’ve updated the data, code, and output, and added it to our GitHub repository. You can view the updated article here and find the code and data here.

Re-reading the article, it was apparent that my thinking had moved on quite significantly in just a few short years.

Back when I originally wrote this article, there was a commonly held idea that a newly-hyped approach to predictive modeling known as *machine learning* could discern predictive patterns in market data. A quick search on SSRN will turn up dozens of examples of heroic attempts at this very thing, many of which have been downloaded thousands of times.

Personally, I spent more hours than I care to count on this approach. And while I learned an absolute ton, I can also say that *nothing* that I trade today emerged from such a data-mining exercise. A large-scale data-mining exercise *contributed* to *one* of our strategies, but it was supported by a ton of careful analysis.

Over the years since I first wrote the article, a realisation has dawned on me:

Trading is very hard, and these techniques don’t really help that much with the hardest part.

I think, in general, the trading and investment community has had a similar awakening.

*OK, so what’s the “hardest part” of trading?*

Operational issues of running a trading business aside, the hardest part of trading is maximising the probability that the edges you trade continue to pay off in the future.

Of course, we can never be entirely sure about anything in the markets. They change. Edges come and go. There’s always anxiety that an edge isn’t really an edge at all, that it’s simply a statistical mirage. There is uncertainty everywhere.

Perhaps the most honest goal of the quantitative researcher is to **reduce this uncertainty as far as reasonably possible.**

Unfortunately (or perhaps fortunately, if you take the view that if it were easy, everyone would do it), reducing this uncertainty takes a lot of work and more than a little market nouse.

In the practical world of our own trading, we do this in a number of ways centred on detailed and careful analysis. Through data analysis, we try to answer questions like:

- Does the edge make sense from a structural, economic, financial, or behavioural perspective?
- Is there a reason for it to exist that I can explain in terms of taking on risk or operational overhead that others don’t want, or providing a service?
- Is it stable through time?
- Does it show up in the assets that I’d expect it to, given my explanation for why it exists?
- What else could explain it? Have I isolated the effect from things we already know about?
- What other edges can I trade with this one to diversify my risk?
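As one small illustration of the "is it stable through time?" question, here is a minimal sketch that splits a return series into consecutive sub-periods and checks how many of them were profitable. The return numbers and the function names `subperiod_means` and `fraction_positive` are hypothetical, made up purely for the example; a real analysis would use your own data and go a lot further than this.

```python
from statistics import mean

def subperiod_means(returns, n_periods):
    """Split a return series into consecutive equal-sized chunks and average each one."""
    size = len(returns) // n_periods
    if size == 0:
        raise ValueError("need at least one observation per sub-period")
    # Observations left over after n_periods * size are dropped.
    return [mean(returns[k * size:(k + 1) * size]) for k in range(n_periods)]

def fraction_positive(values):
    """Share of values that are strictly greater than zero."""
    return sum(v > 0 for v in values) / len(values)

# Hypothetical daily strategy returns, for illustration only.
returns = [0.002, -0.001, 0.003, 0.001, -0.002, 0.004,
           -0.002, 0.001, -0.001, 0.002, 0.003, -0.004]

means = subperiod_means(returns, n_periods=4)
print("sub-period means:", [round(m, 5) for m in means])
print("fraction of sub-periods with a positive mean:", fraction_positive(means))
```

An edge whose payoff is concentrated in one or two sub-periods deserves more scrutiny than one that shows up fairly evenly, although a check like this is only a starting point.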

In the world of machine learning and data mining, “reducing uncertainty” involves accounting for data mining bias (the tendency to eventually find things that look good if you look at enough combinations). There are statistical tests for data-mining bias, which, if being generous, offer plausible-sounding statistical tools for validating data mining efforts. However, I’m not here to be generous to myself and can admit that the appeal of such tools, at least for me, lay in the promise of avoiding the really hard work of careful analysis. *I don’t need to do the analysis, because a statistical test can tell me how certain my edge is!*
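To make data-mining bias concrete, here is a minimal simulation sketch. Every "strategy" in it is pure noise with zero true edge, yet the best one looks better and better as more candidates are tried. The function name `best_random_sharpe`, the 1% daily volatility, and the 252-day sample are assumptions chosen for illustration, not anything from the original analysis.

```python
import random
import statistics

def best_random_sharpe(n_strategies, n_days=252, seed=0):
    """Return the highest annualised Sharpe ratio among pure-noise strategies."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_strategies):
        # Daily returns with no edge at all: mean 0, 1% daily volatility.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = statistics.mean(returns) / statistics.stdev(returns) * 252 ** 0.5
        best = max(best, sharpe)
    return best

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} candidates -> best in-sample Sharpe {best_random_sharpe(n):.2f}")
```

Because the same seed is reused, each larger run contains the smaller runs' candidates, so the reported best can only stay the same or go up as the search widens. None of these strategies has any edge; the "winner" is simply the luckiest draw.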

But what a double-edged sword such avoidance turns out to be.

If you’ve ever tried to trade a data-mined strategy, regardless of what your statistical test for data-mining bias told you, you know that it’s a constant battle with your anxiety and uncertainty. Because you haven’t done the work to understand the edge, it’s impossible to just leave it alone. You’re constantly adjusting, wondering, and looking for answers *after the fact*. It turns into an endless cycle – and I’ve *personally* seen it play out at all levels from beginner independent traders through to relatively sophisticated and mature professional trading firms.

The real tragedy about being on this endless cycle is that it short-circuits the one thing that is most effective at reducing uncertainty, at least at the level of your overall portfolio – finding new edges to trade.

This reality leads me to an approach for adding a new trade to our portfolio:

- Do the work to reduce the uncertainty to the extent possible. You don’t want to trade just *anything*, you want to trade high-probability edges that you understand deeply.
- Trade it at a size that can’t hurt you at the portfolio level if you’re wrong – and we will all be wrong from time to time.
- Leave it alone and go look for something else to trade.
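One simple way to think about "a size that can’t hurt you" is as a loss budget: decide how much of the portfolio you are prepared to lose on this trade if you are wrong, assume a worst-case loss on the capital allocated to it, and size accordingly. The sketch below is just that arithmetic. The function name `max_allocation` and the example numbers are hypothetical, and this is not a description of how we actually size positions.

```python
def max_allocation(portfolio_loss_budget, assumed_worst_case_loss):
    """Largest portfolio fraction to allocate so that a worst-case loss stays within budget.

    portfolio_loss_budget: tolerable loss as a fraction of the whole portfolio (e.g. 0.02).
    assumed_worst_case_loss: assumed loss as a fraction of the capital allocated (e.g. 0.40).
    """
    if portfolio_loss_budget < 0:
        raise ValueError("portfolio_loss_budget must be non-negative")
    if not 0 < assumed_worst_case_loss <= 1:
        raise ValueError("assumed_worst_case_loss must be in (0, 1]")
    # Never allocate more than 100% of the portfolio.
    return min(1.0, portfolio_loss_budget / assumed_worst_case_loss)

if __name__ == "__main__":
    # Hypothetical inputs: tolerate losing 2% of the portfolio, assume the trade could lose 40%.
    allocation = max_allocation(0.02, 0.40)
    print(f"allocate at most {allocation:.1%} of the portfolio")
```

The worst-case assumption is the weak link here: the less you understand the edge, the more conservative that number should be.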

The third point is infinitely more palatable if you’ve done the work and understand the things you’re already trading.

Having said all that, I’m not about to abandon machine learning and other statistical tools. They absolutely have their place, but it’s worth thinking about the relative importance of what to concentrate on and what we spend our time on.

At one extreme, we might think that market insight and quantitative analysis (what we’d call “feature engineering” in machine learning speak) is the most important thing and that we should spend all our time there.

However, the problem with this approach is that there are effective and well-understood techniques (for example PCA, lasso regression, and others) that will very much help with modeling and analysis. Understanding these tools well enough to know what they are and when they might help greatly enhances your effectiveness as a quantitative researcher.
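To give a flavour of what a tool like lasso regression does, here is a minimal sketch on synthetic data: ten candidate features, only the first of which actually drives the target. The lasso penalty shrinks coefficients and sets the uninformative ones to exactly zero. The implementation is a bare-bones coordinate descent written for illustration; the function names, the penalty `alpha=0.2`, and the data-generating process are all assumptions. In practice you would reach for an established library rather than rolling your own.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, returning 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / 2n) * ||y - X b||^2 + alpha * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of feature j with the residual that excludes feature j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta

def make_data(n_obs=500, n_features=10, true_coef=0.6, seed=1):
    """Synthetic data: standard-normal features, target driven by feature 0 plus noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_obs)]
    y = [true_coef * row[0] + rng.gauss(0.0, 1.0) for row in X]
    return X, y

if __name__ == "__main__":
    X, y = make_data()
    beta = lasso_coordinate_descent(X, y, alpha=0.2)
    for j, b in enumerate(beta):
        print(f"feature {j}: {b:+.3f}")
```

Note that the surviving coefficient is shrunk below its true value of 0.6. That is the price of the penalty, and it is one reason to understand the tool before leaning on its output.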

On the other extreme, we might think that spending all our time on machine learning, data mining and statistical tests is appropriate. This is akin to owning a top-notch toolkit for servicing a car, but not knowing anything about cars, and leads to the endless cycle of patching things up mentioned above.

## (4) Comments

Great insight, makes lots of sense also in other areas where people tried to apply ML. The process starts with an idea (“edge”) then modeling it – the hardest part. Data is just the “oil”.

Thanks for sharing, Kris.

Hi Kris,

I am a big fan of the thought evolution. We used to think that ‘financial engineering’ was an awkward neologism, but it pales in comparison with its newer counterpart ‘data science’. There is also a very worrying shift towards completely data-driven hypotheses and data-driven decision making, where in fact it should always be ‘story-driven’ hypotheses and data-justified decision making. Data and models should inform decisions but not be the key drivers of both hypotheses and decisions.

The worry here – and potentially the reason why your first post was so successful – is that this thinking amplifies the Dunning-Kruger effect. Anyone with sufficient technical data analysis skills is now able to analyse any phenomena in any field without the requirement of having what is casually referred to as ‘domain knowledge’. As you point out, this quickly becomes apparent in finance, where the system is both non-stationary and very noisy. Your final point sums it up: tools should always be viewed as tools, no more and also no less.

Thanks, Emlyn

Emlyn: I think your comment is great, but I have a caveat: if those data-driven systems persist, is this not proof, or at least a hint, that they are making money?

Thanks for the comment, Emlyn. I very much agree – data analysis skills are a key ingredient, but so too (even more so) is domain knowledge.