Backtesting Methodology & Statistics
Testable rules, bias control, sample size, expectancy, drawdown — how to run a manual backtest you can actually trust.
In short
A trustworthy manual backtest has four controls: written if-then rules (set before the test), hidden-future replay (kills hindsight bias), every valid setup taken (no cherry-picking), and enough trades (100+). Then judge the result by expectancy, (win% × avg win) − (loss% × avg loss), not by win rate: expectancy is what pays your bills.
Rules You Can Actually Test
The test starts before the chart opens. Write each rule in if-then form, specific enough that a stranger would take the same trades: "IF price closes above the prior swing high on the 15-minute chart AND the 1-hour trend filter is long, THEN enter long at the close, stop below the breakout candle's low, target 2R." If a rule needs your judgment to interpret ("strong momentum", "clean structure"), either define it operationally or accept that you're testing your discretion too — which is legitimate in manual backtesting, as long as the discretionary element is consistent and journaled.
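One way to check that a rule is specific enough is to write it as a function that either returns an order or returns nothing. The sketch below encodes the example rule above; the names `Setup`, `Order`, `long_breakout_rule` and the `stop_buffer` parameter are hypothetical, and the prices are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Setup:
    close: float             # close of the 15-minute breakout candle
    candle_low: float        # low of that same candle
    prior_swing_high: float  # most recent swing high before the candle
    h1_trend_long: bool      # state of the 1-hour trend filter at the close


@dataclass
class Order:
    entry: float
    stop: float
    target: float


def long_breakout_rule(s: Setup, stop_buffer: float = 0.0,
                       reward_r: float = 2.0) -> Optional[Order]:
    """Return an Order if the written conditions are met, else None."""
    if s.close > s.prior_swing_high and s.h1_trend_long:
        entry = s.close
        stop = s.candle_low - stop_buffer  # stop below the breakout candle's low
        risk = entry - stop
        if risk <= 0:
            return None  # no measurable risk, so no valid 2R target
        return Order(entry=entry, stop=stop, target=entry + reward_r * risk)
    return None


# Hypothetical prices: close above the swing high with the filter long.
order = long_breakout_rule(Setup(1.1050, 1.1030, 1.1045, True))
# Same candle, but the 1-hour filter is not long: no trade.
skipped = long_breakout_rule(Setup(1.1050, 1.1030, 1.1045, False))
```

If a condition cannot be expressed as a field like these, that is usually the part of the rule that still depends on discretion.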
The Four Biases That Fake Results
- Hindsight bias. On a scrolled chart you cannot unsee what happened next; every "obvious" setup is obvious only in retrospect. The fix is structural, not willpower: replay with future data hidden.
- Look-ahead bias. Using information that wasn't available at decision time — the classic manual version is switching to a higher timeframe mid-replay and seeing candles that hadn't closed yet. Use tools that keep all timeframes synchronized to the replay clock.
- Cherry-picking. Skipping valid setups that "feel wrong" and counting only the photogenic ones. Rule: if it meets the written conditions, it's a trade, and it goes in the log.
- Rule drift. Quietly loosening or tightening conditions mid-test as results come in. Ideas for improvements go on a list; the current test finishes under the rules it started with, and the improved version gets its own test.
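The look-ahead control can be stated precisely: at any replay time, a higher-timeframe candle is usable only if it has already closed. The following is a minimal sketch of that filter, not a description of how any particular replay tool works; `Candle` and `visible_candles` are hypothetical names and the timestamps are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Candle:
    open_time: datetime
    duration: timedelta
    close: float

    @property
    def close_time(self) -> datetime:
        return self.open_time + self.duration


def visible_candles(candles: List[Candle], replay_clock: datetime) -> List[Candle]:
    """Keep only candles that had fully closed at the replay clock."""
    return [c for c in candles if c.close_time <= replay_clock]


# Hypothetical 1-hour candles opening at 09:00 and 10:00.
h1 = [
    Candle(datetime(2024, 1, 2, 9, 0), timedelta(hours=1), 1.1000),
    Candle(datetime(2024, 1, 2, 10, 0), timedelta(hours=1), 1.1020),
]
# At 10:30 the 10:00 candle is still forming, so only the 09:00 candle counts.
visible = visible_candles(h1, datetime(2024, 1, 2, 10, 30))
```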
Overfitting Without Code
Overfitting isn't just a quant problem. Run six variations of a strategy over the same three months of data, keep the best one, and you've manually overfitted: the "winner" won partly by fitting that window's noise. Controls: test variations on different data windows; validate the chosen version out-of-sample (a period it has never seen); and distrust any edge that needs precise parameter values to survive — robust edges degrade gracefully when you nudge the inputs.
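The out-of-sample check can be sketched in a few lines: choose the variation using only the earlier trades, then look at how that same variation did on trades it was not chosen on. The function name `pick_and_validate`, the variant labels and the R results below are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def pick_and_validate(variants: Dict[str, List[float]],
                      in_sample_count: int) -> Tuple[str, float, float]:
    """Pick the best variant on the first trades, report it on the rest.

    Each list holds per-trade results in R, in chronological order.
    Both the in-sample and out-of-sample parts must be non-empty.
    """
    def in_sample_mean(name: str) -> float:
        return mean(variants[name][:in_sample_count])

    best = max(variants, key=in_sample_mean)
    out_of_sample = variants[best][in_sample_count:]
    return best, in_sample_mean(best), mean(out_of_sample)


# Invented results: variant A looks best on the first 7 trades only.
variants = {
    "A": [2, -1, 2, -1, 2, 2, -1, -1, -1, -1],
    "B": [-1, 2, -1, -1, 2, -1, 2, 2, -1, 2],
}
best, in_mean, out_mean = pick_and_validate(variants, in_sample_count=7)
```

In this made-up data the in-sample winner loses on every out-of-sample trade, which is the pattern the paragraph above warns about.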
Expectancy and the Breakeven Table
Expectancy combines how often you win with how much you win or lose per trade: E = (win rate × average win) − (loss rate × average loss), where loss rate = 1 − win rate and average loss is taken as a positive size. It must be positive after the full cost stack (spread, swap, commission), or the strategy has no edge. The flip side is the breakeven win rate for a given reward-to-risk ratio R, which is 1 / (1 + R) before costs:
| Reward : Risk | Breakeven win rate | Meaning |
|---|---|---|
| 0.5 : 1 | 66.7% | Scalps with tight targets need very high accuracy |
| 1 : 1 | 50% | Coin-flip accuracy just breaks even — before costs |
| 1.5 : 1 | 40% | A common discretionary sweet spot |
| 2 : 1 | 33.3% | Wrong twice per win and still profitable |
| 3 : 1 | 25% | Trend-following territory: rare wins, big ones |
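Remember that costs shift every breakeven upward. At 1:1 with costs equal to 10% of the target charged on every trade, a win nets 0.9R and a loss costs 1.1R, so true breakeven is 55%, not 50%. Even if the cost is deducted from winners only, breakeven is still ~52.6%.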
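The table values and the cost adjustment follow from setting expectancy to zero and solving for the win rate. Here is a minimal sketch, assuming costs are expressed in R (multiples of the risked amount); `breakeven_win_rate` and the `cost_on_losses` switch are hypothetical names, and whether costs also enlarge losses depends on how your stops and fills are measured.

```python
def breakeven_win_rate(reward_to_risk: float, cost_r: float = 0.0,
                       cost_on_losses: bool = True) -> float:
    """Win rate at which expectancy is exactly zero.

    cost_r is the per-trade cost in R. With cost_on_losses=True the cost
    is charged on every trade; with False it only reduces winning trades.
    """
    net_win = reward_to_risk - cost_r
    net_loss = 1.0 + (cost_r if cost_on_losses else 0.0)
    if net_win <= 0:
        raise ValueError("costs consume the whole target; no breakeven exists")
    # Solve p * net_win - (1 - p) * net_loss = 0 for p.
    return net_loss / (net_win + net_loss)


no_cost = breakeven_win_rate(2.0)                                      # 1/3
every_trade = breakeven_win_rate(1.0, cost_r=0.1)                      # 1.1 / 2.0
winners_only = breakeven_win_rate(1.0, cost_r=0.1, cost_on_losses=False)  # 1 / 1.9
```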
The Metrics That Matter (and One That Lies)
- Expectancy (above): the headline number.
- Maximum drawdown: the worst peak-to-trough equity decline. This is the number that decides whether you'd have kept trading the strategy; it's also the number prop firms test you on.
- Longest losing streak: if the backtest contains 8 consecutive losses, expect live trading to produce a streak at least that long sooner or later. Decide now whether you can execute through it.
- Profit factor: gross wins ÷ gross losses; above ~1.3 after costs is respectable for a discretionary system.
- Win rate: the one that lies when read alone (see the table above). Useful only next to R:R.
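All of these can be computed from a single column of per-trade results in R. The sketch below shows one straightforward way to do it; the function names are hypothetical and the sample results are invented.

```python
from typing import List


def max_drawdown(results_r: List[float]) -> float:
    """Worst peak-to-trough decline of cumulative R (starting equity = 0)."""
    equity = peak = worst = 0.0
    for r in results_r:
        equity += r
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def longest_losing_streak(results_r: List[float]) -> int:
    """Largest number of consecutive trades with a negative result."""
    longest = current = 0
    for r in results_r:
        current = current + 1 if r < 0 else 0
        longest = max(longest, current)
    return longest


def profit_factor(results_r: List[float]) -> float:
    """Gross wins divided by gross losses (infinite if there are no losses)."""
    gross_win = sum(r for r in results_r if r > 0)
    gross_loss = -sum(r for r in results_r if r < 0)
    return float("inf") if gross_loss == 0 else gross_win / gross_loss


results = [2, -1, -1, 2, -1, -1, -1, 2]  # invented sample, in R
dd = max_drawdown(results)
streak = longest_losing_streak(results)
pf = profit_factor(results)
```

Results should be entered net of costs if the metrics are meant to reflect what the strategy would actually have paid.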
Reading the Equity Curve
Plot cumulative R over the trade sequence and read its shape: steady rise with shallow dips is the ideal; a flat-then-spike curve means a few outlier wins carry everything (fragile — retest without the best three trades); long plateaus locate the regimes where the edge disappears (check what the market was doing there); and a curve that rises early then decays often signals rule drift or a regime change mid-test. The curve is also where drawdown stops being a number and becomes felt experience — useful calibration before risking money.
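The curve itself is just a running sum, and the "retest without the best three trades" check is a one-line variation on it. This is a minimal sketch with hypothetical function names and invented results.

```python
from itertools import accumulate
from typing import List


def equity_curve(results_r: List[float]) -> List[float]:
    """Cumulative R after each trade, in trade order."""
    return list(accumulate(results_r))


def total_without_best(results_r: List[float], n: int = 3) -> float:
    """Total R after removing the n largest results (fragility check)."""
    if n <= 0:
        return sum(results_r)
    return sum(sorted(results_r)[:-n])


results = [-1, -1, 5, -1, -1, -1, 6, -1, 4, -1]  # invented sample, in R
curve = equity_curve(results)
total = sum(results)
stripped = total_without_best(results, n=3)
```

Here the total is positive only because of three outlier wins; removing them turns the result negative, which is the fragile flat-then-spike case described above.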
The Minimal Journal
Per trade: date/time (in the data's server clock), instrument, direction, entry, stop, target, exit, result in R, setup tag, plus cost columns (spread if not in the data, swap nights, commission). Tools with built-in trade tracking — replay backtesters like StrategyTune record P&L, win/loss and streaks automatically — cover the numbers; keep the setup tags and notes yourself, because they're what you'll learn from when you review which setups actually carried the edge.
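If you keep the journal yourself, the field list above maps directly onto one record per trade. The following sketch writes such records to CSV using only the standard library; `JournalRow`, its field names and the sample trade are hypothetical, and the layout is one possible template rather than a required one.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class JournalRow:
    timestamp: str        # date/time in the data's server clock
    instrument: str
    direction: str        # "long" or "short"
    entry_price: float
    stop_price: float
    target_price: float
    exit_price: float
    setup_tag: str
    spread: float = 0.0   # only if not already in the data
    swap_nights: int = 0
    commission: float = 0.0
    notes: str = ""

    def result_r(self) -> float:
        """Result in R before costs; assumes entry and stop differ."""
        risk = abs(self.entry_price - self.stop_price)
        if self.direction == "long":
            move = self.exit_price - self.entry_price
        else:
            move = self.entry_price - self.exit_price
        return move / risk


def to_csv(rows: List[JournalRow]) -> str:
    """Serialize rows, adding a computed result_r column."""
    buf = io.StringIO()
    names = [f.name for f in fields(JournalRow)] + ["result_r"]
    writer = csv.DictWriter(buf, fieldnames=names)
    writer.writeheader()
    for row in rows:
        writer.writerow({**asdict(row), "result_r": round(row.result_r(), 2)})
    return buf.getvalue()


# Invented example trade.
row = JournalRow("2024-01-02 10:15", "EURUSD", "long",
                 1.1050, 1.1030, 1.1090, 1.1090, "breakout")
text = to_csv([row])
```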
Test Across Regimes
Every strategy has weather it likes. A breakout system tested only on trending months will flatter itself; the honest test spans trending, ranging and news-driven periods, with results segmented by regime. If the edge lives entirely in one regime, that's not failure — it's a finding: the strategy needs a regime filter, and the backtest just told you what it is.
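Segmenting by regime only requires that each logged trade carries a regime tag. A minimal sketch, assuming the tags were assigned by hand during the test; the function name, the tag labels and the results are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def expectancy_by_regime(
    trades: Iterable[Tuple[str, float]]
) -> Dict[str, Tuple[int, float]]:
    """Map each regime tag to (trade count, mean result in R)."""
    buckets = defaultdict(list)
    for regime, result_r in trades:
        buckets[regime].append(result_r)
    return {regime: (len(rs), mean(rs)) for regime, rs in buckets.items()}


# Invented sample of (regime tag, result in R).
trades = [
    ("trending", 2), ("trending", 2), ("trending", -1), ("trending", 2),
    ("ranging", -1), ("ranging", -1), ("ranging", 2), ("ranging", -1),
    ("news", -1), ("news", -1),
]
by_regime = expectancy_by_regime(trades)
```

Keeping the trade count next to each mean matters: a regime with only a handful of trades says little either way.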
Frequently Asked Questions
How many trades is statistically significant for a backtest?
There is no magic number, but practical thresholds are: 50 trades to spot a clearly broken strategy, 100 for a meaningful win rate and average R:R, 200+ before results plausibly reflect edge rather than variance. Below 50, streaks dominate and any conclusion is premature.
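One way to see why small samples are unreliable is to put a confidence interval around the observed win rate. The sketch below uses the Wilson score interval, a standard choice for proportions; it is an illustration added here, not the source of the thresholds above, and it treats trades as independent, which real trades may not be. The name `wilson_interval` is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, trades: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for the true win rate."""
    if trades <= 0 or not 0 <= wins <= trades:
        raise ValueError("need trades > 0 and 0 <= wins <= trades")
    p = wins / trades
    denom = 1 + z * z / trades
    center = (p + z * z / (2 * trades)) / denom
    half = z * math.sqrt(p * (1 - p) / trades + z * z / (4 * trades * trades)) / denom
    return center - half, center + half


# The same observed 50% win rate at three sample sizes.
low_50, high_50 = wilson_interval(25, 50)
low_100, high_100 = wilson_interval(50, 100)
low_200, high_200 = wilson_interval(100, 200)
```

At 100 trades an observed 50% win rate is still consistent with a true rate anywhere from roughly 40% to 60%, and the interval narrows only slowly as the sample grows.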
What is a good win rate for a trading strategy?
Win rate alone means nothing without the risk-reward ratio. A 40% win rate is excellent at 2:1 reward-to-risk (breakeven is 33.3%) and disastrous at 0.5:1 (breakeven is 66.7%). Judge strategies by expectancy — the combination of win rate and average win/loss size — not by win rate.
What is the expectancy formula?
Expectancy = (win rate × average win) − (loss rate × average loss). Example: 45% winners averaging +15 pips, 55% losers averaging −8 pips gives 0.45×15 − 0.55×8 = +2.35 pips per trade before costs. Positive expectancy after costs is the minimum bar for a tradeable strategy.
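The worked example can be checked with a small helper. This is a minimal sketch; `expectancy` is a hypothetical name, the average loss is passed as a positive size, and the 1-pip cost in the second call is an invented figure for illustration.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float,
               cost_per_trade: float = 0.0) -> float:
    """Expected result per trade, in the same unit as avg_win and avg_loss.

    avg_loss is a positive magnitude; cost_per_trade is charged on every trade.
    """
    if not 0.0 <= win_rate <= 1.0:
        raise ValueError("win_rate must be between 0 and 1")
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss - cost_per_trade


before_costs = expectancy(0.45, 15, 8)                      # about +2.35 pips
after_costs = expectancy(0.45, 15, 8, cost_per_trade=1.0)   # about +1.35 pips
```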
Can manual backtests be trusted, given human bias?
Yes, if the process controls the biases: replay with the future hidden (kills hindsight bias), written if-then rules set before the test (limits interpretation drift), taking every valid setup (prevents cherry-picking), and logging every trade (makes the result auditable). Without those controls, no — the result is a story, not a test.
Deep Dives in This Series
Writing Entry/Exit Rules You Can Actually Test
From vibe to if-then rules.
Hindsight Bias: Why Scrolling Charts Isn’t Backtesting
The bias replay tools exist to kill.
Overfitting in Manual Backtesting
Yes, it happens without code too.
Look-Ahead Bias and How Replay Prevents It
Using information you wouldn’t have had.
Backtest Metrics That Matter
Expectancy, R:R, drawdown, profit factor.
Reading an Equity Curve Like a Risk Manager
What the shape of the curve tells you.
Journaling Backtest Trades
A minimal template that captures what matters.
Testing Across Trending, Ranging & News Regimes
One regime is not a backtest.
Practice This in a Free Replay Tool
StrategyTune replays real bid/ask tick data for 70+ instruments in the browser — free, no registration, no downloads. Place simulated trades and see your stats build.