Why You Can’t Tell if Your Strategy “Stopped Working” (Statistically Speaking)

Traders love the illusion of precision. A few bad weeks go by, and you think, “Let’s run a t-test and see if the strategy stopped working.” It sounds rigorous. It isn’t.

Imagine a strategy that, in truth, earns 10% per year with 20% volatility – roughly the S&P’s long-term profile. We’ll simulate five years of daily returns, about 1,260 observations, from a geometric Brownian motion with those parameters.

Now the world changes. For the next month, 21 trading days, the strategy’s expected return drops to zero, but volatility stays at 20%.
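Here is a minimal sketch of that setup in Python, treating daily returns as i.i.d. normals (a simplification of GBM that ignores the Itô correction; parameter names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)

MU, SIGMA = 0.10, 0.20       # annual drift and volatility of the "live" strategy
DT = 1 / 252                 # one trading day in years

# Five years of daily returns while the edge is alive (~1,260 observations)
n_before = 5 * 252
before = rng.normal(MU * DT, SIGMA * np.sqrt(DT), n_before)

# One month (21 trading days) after the edge dies: zero drift, same volatility
n_after = 21
after = rng.normal(0.0, SIGMA * np.sqrt(DT), n_after)
```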

We’d like to detect that change. The question: Can you statistically prove the edge is gone?

A t-test compares these two samples (before and after) and asks if their means differ significantly.
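With scipy, the comparison takes a couple of lines (a sketch reusing the before and after samples from above; your numbers will differ run to run):

```python
from scipy import stats

# Welch's t-test: do the mean daily returns of the two periods differ?
t_stat, t_pval = stats.ttest_ind(before, after, equal_var=False)

# Kolmogorov–Smirnov: does the whole return distribution differ?
ks_stat, ks_pval = stats.ks_2samp(before, after)

print(f"Welch t-test: t = {t_stat:.2f}, p = {t_pval:.2f}")
print(f"KS test:      D = {ks_stat:.2f}, p = {ks_pval:.2f}")
```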

Five years of data vs. one month.

That’s n₁ ≈ 1260 vs. n₂ ≈ 21.

Noise ≈ 20% / √252 ≈ 1.26% daily.

The expected daily drift for a 10% annual return is just 0.10 / 252 ≈ 0.04% per day. That's our signal. The noise is roughly thirty times larger. In other words, your daily Sharpe ratio is 0.04 / 1.26 ≈ 0.03 – a vanishingly small signal-to-noise ratio.

So even if the edge disappeared entirely, you’d barely notice in a month.

From the simulation:

Test                      p-value   Interpretation
Welch t-test              0.12      Not significant
Kolmogorov–Smirnov test   0.37      Not significant (even worse)

The t-test finds nothing. The KS test, which looks at the whole distribution, not just the mean, finds even less. The supposed “collapse in performance” doesn’t even register as a blip in the statistics.

That’s the problem: volatility dominates everything. The mean shift you’re trying to detect (0.04%/day) is microscopic relative to the daily noise (±1%). Twenty-one days is simply not enough data to estimate a mean that small with any precision.

The t-stat came out around –1.6, which corresponds to roughly a 12% chance under the null of equal means. Even if you doubled the sample size – two months of underperformance – the p-value would still hover around 0.06. You'd need a multi-month drought before statistics would admit the obvious.

The funny part: in that same run, the “dead” strategy’s one-month realised return was +7.5%.

That’s right. A zero-drift month beat 93% of all months during the prior five years of positive-drift data.

Why?

A 20% annual volatility means roughly 5.8% standard deviation per month.

Even with zero drift, a one-standard-deviation up move is +5.8%, and a 1.3σ move is +7.5% – something that happens about 10% of the time purely by chance.

Meanwhile, in the “good” regime (10% annual drift), the expected monthly gain is only +0.8%. A +7.5% outlier month easily beats the vast majority of historical months.
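As a sanity check on that arithmetic (a sketch using scipy's normal distribution; a "month" here is 21 trading days):

```python
import numpy as np
from scipy import stats

SIGMA_M = 0.20 * np.sqrt(21 / 252)   # ~5.8% monthly standard deviation
MU_M = 0.10 * 21 / 252               # ~0.8% expected monthly gain in the good regime

# Probability of a month >= +7.5% when the true drift is zero
p_fluke = stats.norm.sf(0.075, loc=0.0, scale=SIGMA_M)
print(f"P(month >= +7.5% | zero drift) ≈ {p_fluke:.0%}")   # roughly 10%
```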

So, you end up with a bizarre headline:

“Our strategy just lost its edge, but had its second-best month ever.”

Noise does that.

To see if the phenomenon was a fluke, I ran 3,000 simulations of the same setup.
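The experiment looks roughly like this (a sketch of one reading of the setup – monthly returns as non-overlapping 21-day sums, with the dead month ranked against the prior sixty months; both details are assumptions on my part):

```python
import numpy as np

rng = np.random.default_rng(7)
SIGMA_D, MU_D = 0.20 / np.sqrt(252), 0.10 / 252   # daily noise and drift
N_RUNS = 3_000

percentiles = []
for _ in range(N_RUNS):
    # Five years of "alive" daily returns, grouped into 60 non-overlapping months
    history = rng.normal(MU_D, SIGMA_D, 60 * 21).reshape(60, 21).sum(axis=1)
    # One "dead" month: zero drift, same volatility
    dead_month = rng.normal(0.0, SIGMA_D, 21).sum()
    # Fraction of historical months the dead month beat
    percentiles.append((dead_month > history).mean())

percentiles = np.array(percentiles)
print(f"Median fraction of history beaten: {np.median(percentiles):.0%}")
print(f"Runs beating >80% of history:      {(percentiles > 0.80).mean():.1%}")
print(f"Runs beating >90% of history:      {(percentiles > 0.90).mean():.1%}")
```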

Across runs:

  • The median one-month return (true zero drift) was roughly –0.06%, but the 90th percentile was +7.4%.
  • In half of all runs, the zero-drift month beat at least 45% of the historical months.
  • In 16% of runs, it beat more than 80% of the prior months.
  • In 7.5% of runs, it beat more than 90%.

So, one in every thirteen "dead" months looks like a top-decile success. Statistically, that's unremarkable. Psychologically, it's devastating – because you'll tell yourself the system recovered, tweak nothing, and then (probably) spend the next quarter losing money.

Some traders, knowing the t-test’s weakness, pivot to “non-parametric” tests like the Kolmogorov-Smirnov. It compares the cumulative distributions directly, not the means. Surely that’s more robust?

No.

When two normal distributions differ only slightly in mean but have the same variance, the KS test has less power than the t-test. It’s designed to catch shape differences – fat tails, variance shifts, asymmetry – not small mean drifts. With n₂ = 21, it’s practically blind.
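You can see the power gap directly with a small simulation (a sketch with deliberately stylised parameters – a half-sigma mean shift and 50 observations per sample – chosen so the difference is visible at all; at our actual signal-to-noise ratio both tests sit near the 5% false-positive floor):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, shift, trials = 50, 0.5, 2_000   # stylised: 0.5-sigma mean shift, 50 obs per sample

t_rejects = ks_rejects = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(shift, 1.0, n)
    t_rejects += stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05
    ks_rejects += stats.ks_2samp(a, b).pvalue < 0.05

print(f"t-test rejection rate:  {t_rejects / trials:.2f}")   # noticeably higher...
print(f"KS-test rejection rate: {ks_rejects / trials:.2f}")  # ...than KS for a pure mean shift
```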

In our case, the KS p-value was 0.37. The test confidently says “nothing to see here.” It’s technically correct.

Here's the deeper problem. The tools we use – t-tests, p-values, Sharpe ratios – were designed for large-sample, low-noise situations. Financial returns are the opposite: small signals, fat tails, short samples.

When you apply a test that needs thousands of points to reject the null at 95% confidence, you’ll never detect regime shifts in real time. The market will move on long before the statistic does.
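To put a number on that, here is a sketch (same parameters as the earlier simulation) of how often a 5%-level Welch test flags the regime change when the "dead" period delivers ordinary zero-drift noise, for increasingly long dead windows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
SIGMA_D, MU_D = 0.20 / np.sqrt(252), 0.10 / 252
N_BEFORE, TRIALS = 1_260, 2_000

for n_after in (21, 63, 252, 1_260):   # 1 month, 1 quarter, 1 year, 5 years of dead data
    hits = 0
    for _ in range(TRIALS):
        before = rng.normal(MU_D, SIGMA_D, N_BEFORE)
        after = rng.normal(0.0, SIGMA_D, n_after)
        hits += stats.ttest_ind(before, after, equal_var=False).pvalue < 0.05
    print(f"{n_after:5d} dead days -> flagged in {hits / TRIALS:.0%} of runs")
```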

A five-year window may contain your “true” performance, but it’s useless for diagnosing the present. A one-month drought tells you nothing. A three-month one tells you almost nothing.

The conclusion isn’t that tests are bad—it’s that the problem is mis-specified. The null hypothesis “mean return hasn’t changed” is almost never the right one. Markets evolve, but slowly and noisily. No binary test will save you.

Practical Interpretation

When a strategy underperforms for a few weeks, you face two equally dangerous errors:

  1. Type I error – You think it’s dead when it’s just noise. You abandon a still-valid edge.
  2. Type II error – You think it’s noise when it’s actually dead. You keep bleeding capital.

Classical statistics tries to balance those. Trading doesn't care: the costs are asymmetric. Type II errors cost you more, because capital decays geometrically, not linearly.

So, the sensible response isn’t to chase significance but to control exposure. Cut risk when the environment looks hostile, but don’t fool yourself that a p-value will tell you when to quit.

If you want to formalise this intuition, think Bayesian: update your belief about the strategy’s drift each day. The posterior distribution will drift toward zero if recent returns are weak, but uncertainty will remain large. The proper decision rule is probabilistic, not binary.

Even better, you can encode prior scepticism – say, most strategies decay over time – and let data modify that belief. The output isn’t “dead or alive,” but “probability the drift > 0.” You can then size down continuously rather than panic after a failed t-test.
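A minimal sketch of what that update could look like – assuming a normal prior on the daily drift centred at zero (the scepticism) and treating volatility as known, which keeps the posterior conjugate:

```python
import numpy as np
from scipy import stats

SIGMA_D = 0.20 / np.sqrt(252)            # daily volatility, treated as known
PRIOR_MEAN, PRIOR_SD = 0.0, 0.10 / 252   # sceptical prior: drift centred on zero,
                                         # the "good" 0.04%/day drift is ~1 prior sd away

def update_drift_belief(returns, mu0=PRIOR_MEAN, tau0=PRIOR_SD, sigma=SIGMA_D):
    """Conjugate normal-normal update of the daily drift, volatility known."""
    n = len(returns)
    post_precision = 1 / tau0**2 + n / sigma**2
    post_mean = (mu0 / tau0**2 + np.sum(returns) / sigma**2) / post_precision
    post_sd = np.sqrt(1 / post_precision)
    return post_mean, post_sd

# Example: one month of genuinely zero-drift returns barely moves a sceptical prior
rng = np.random.default_rng(3)
recent = rng.normal(0.0, SIGMA_D, 21)
mu_post, sd_post = update_drift_belief(recent)
print(f"P(drift > 0 | last month) ≈ {stats.norm.sf(0.0, loc=mu_post, scale=sd_post):.2f}")
```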

But that’s another post.

Every quantitative trader eventually learns this the hard way: statistics are lagging indicators. They confirm what you already know, long after it’s actionable.

A strategy doesn’t announce its death with a p-value. It fades, subtly, while your t-statistic wobbles somewhere between 0.5 and 1.2. By the time a 5-year backtest fails a significance test, you’ve lost more in opportunity cost than you saved by being “rigorous.”

Markets are too noisy for clean statistical detection. The right question isn’t “Has my edge stopped working?” but “Given recent evidence, how much do I trust it now?”

The answer is always probabilistic, never definitive, which is precisely why trading is hard – and why so many seek comfort in meaningless tests.
