Backtesting Methodology & Statistics: Rules, Biases, Sample Size, Expectancy

In short

A trustworthy manual backtest has four controls: written if-then rules (set before the test), hidden-future replay (kills hindsight bias), every valid setup taken (no cherry-picking), and enough trades (100+). Then judge the result by expectancy — (win% × avg win) − (loss% × avg loss) — not win rate. Expectancy, not win rate, pays your bills.

Rules You Can Actually Test

The test starts before the chart opens. Write each rule in if-then form, specific enough that a stranger would take the same trades: "IF price closes above the prior swing high on the 15-minute chart AND the 1-hour trend filter is long, THEN enter long at the close, stop below the breakout candle's low, target 2R." If a rule needs your judgment to interpret ("strong momentum", "clean structure"), either define it operationally or accept that you're testing your discretion too — which is legitimate in manual backtesting, as long as the discretionary element is consistent and journaled.
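One way to check that a rule is specific enough is to write it as a function that either returns an order or returns nothing. The sketch below encodes the example rule above. The names `Candle`, `Order`, `breakout_long` and the `stop_buffer` parameter are illustrative assumptions: the rule says only "below the breakout candle's low", so the distance below the low has to be chosen and written down too.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candle:
    open: float
    high: float
    low: float
    close: float


@dataclass(frozen=True)
class Order:
    direction: str
    entry: float
    stop: float
    target: float


def breakout_long(candle: Candle, prior_swing_high: float,
                  h1_trend_is_long: bool, stop_buffer: float,
                  reward_to_risk: float = 2.0) -> Optional[Order]:
    """Return an Order when every written condition is met, else None."""
    # IF the 15-minute candle closes above the prior swing high
    # AND the 1-hour trend filter is long ...
    if candle.close <= prior_swing_high or not h1_trend_is_long:
        return None
    # ... THEN enter long at the close, stop below the candle's low, target 2R.
    entry = candle.close
    stop = candle.low - stop_buffer  # stop_buffer is an assumed, explicit choice
    risk = entry - stop
    if risk <= 0:
        return None
    return Order("long", entry, stop, entry + reward_to_risk * risk)


# Hypothetical prices, for illustration only.
signal = breakout_long(Candle(1.1030, 1.1055, 1.1020, 1.1050),
                       prior_swing_high=1.1040, h1_trend_is_long=True,
                       stop_buffer=0.0005)
print(signal)
```

If a condition cannot be expressed as an argument to a function like this, it is probably one of the discretionary elements that needs journaling.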

The Four Biases That Fake Results

Hindsight bias. On a scrolled chart you cannot unsee what happened next; every "obvious" setup is obvious only in retrospect. The fix is structural, not willpower: replay with future data hidden.
Look-ahead bias. Using information that wasn't available at decision time — the classic manual version is switching to a higher timeframe mid-replay and seeing candles that hadn't closed yet. Use tools that keep all timeframes synchronized to the replay clock.
Cherry-picking. Skipping valid setups that "feel wrong" and counting only the photogenic ones. Rule: if it meets the written conditions, it's a trade, and it goes in the log.
Rule drift. Quietly loosening or tightening conditions mid-test as results come in. Ideas for improvements go on a list; the current test finishes under the rules it started with, and the improved version gets its own test.
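The look-ahead control can be stated precisely: at any replay time, a candle on any timeframe is visible only if it had already closed. The sketch below illustrates that check. `Candle`, `visible_candles` and the sample timestamps are hypothetical, and a real replay tool may handle partially formed candles differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Candle:
    open_time: datetime
    duration: timedelta
    close: float

    @property
    def close_time(self) -> datetime:
        return self.open_time + self.duration


def visible_candles(candles, replay_clock):
    """Keep only the candles that had fully closed at the replay clock."""
    return [c for c in candles if c.close_time <= replay_clock]


hour = timedelta(hours=1)
h1 = [
    Candle(datetime(2024, 3, 4, 9, 0), hour, 1.1040),
    Candle(datetime(2024, 3, 4, 10, 0), hour, 1.1062),  # still forming at 10:30
]
seen = visible_candles(h1, replay_clock=datetime(2024, 3, 4, 10, 30))
print([c.open_time.strftime("%H:%M") for c in seen])
```

At a replay clock of 10:30 only the 09:00 candle passes the filter, because the 10:00 candle does not close until 11:00.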

Overfitting Without Code

Overfitting isn't just a quant problem. Run six variations of a strategy over the same three months of data, keep the best one, and you've manually overfitted: the "winner" won partly by fitting that window's noise. Controls: test variations on different data windows; validate the chosen version out-of-sample (a period it has never seen); and distrust any edge that needs precise parameter values to survive — robust edges degrade gracefully when you nudge the inputs.
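A small simulation makes the selection effect visible. In the sketch below, six "variations" are pure coin flips at 1:1 with no edge at all, the best of the six is kept, and it is then run once more on fresh trades. All numbers are assumptions chosen for illustration. The variations are also treated as independent, which is a simplification, since real variations tested on the same window are correlated.

```python
import random
from statistics import mean


def total_r(n_trades, win_rate, reward_to_risk, rng):
    """Total result in R of n_trades independent trades risking 1R each."""
    return sum(reward_to_risk if rng.random() < win_rate else -1.0
               for _ in range(n_trades))


def best_of_variations(n_variations, n_trades, rng):
    """Keep the best in-sample variation, then test it once out-of-sample."""
    # Every variation has zero true edge: 50% winners at 1:1.
    in_sample = [total_r(n_trades, 0.5, 1.0, rng) for _ in range(n_variations)]
    out_of_sample = total_r(n_trades, 0.5, 1.0, rng)
    return max(in_sample), out_of_sample


rng = random.Random(7)  # fixed seed so the run is repeatable
runs = [best_of_variations(6, 40, rng) for _ in range(2000)]
avg_best_in_sample = mean(r[0] for r in runs)
avg_out_of_sample = mean(r[1] for r in runs)
print(f"best of six, in-sample:      {avg_best_in_sample:+.1f}R on average")
print(f"same pick, out-of-sample:    {avg_out_of_sample:+.1f}R on average")
```

The in-sample average comes out clearly positive even though no variation has an edge, while the out-of-sample average sits near zero. That gap is what out-of-sample validation is meant to expose.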

Expectancy and the Breakeven Table

Expectancy is the only number that combines how often you win with how much: E = (win rate × average win) − (loss rate × average loss). It must be positive after the full cost stack — spread, swap, commission — or the strategy has no edge. The flip side is the breakeven win rate for a given reward-to-risk ratio:

Reward : Risk	Breakeven win rate	Meaning
0.5 : 1	66.7%	Scalps with tight targets need very high accuracy
1 : 1	50%	Coin-flip accuracy just breaks even — before costs
1.5 : 1	40%	A common discretionary sweet spot
2 : 1	33.3%	Wrong twice per win and you still break even
3 : 1	25%	Trend-following territory: rare wins, big ones

Remember that costs shift every breakeven upward. Costs are paid on winners and losers alike, so at 1:1 with costs equal to 10% of the target a win nets 0.9R and a loss costs 1.1R: true breakeven is 1.1 ÷ 2.0 = 55%, not 50%.
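The breakeven column can be reproduced with one line of algebra: setting expectancy to zero gives a breakeven win rate of loss ÷ (win + loss). The sketch below adds a cost term under the assumption that the same cost, expressed in R, is paid on every trade whether it wins or loses. `breakeven_win_rate` and `cost_in_r` are illustrative names.

```python
def breakeven_win_rate(reward_to_risk: float, cost_in_r: float = 0.0) -> float:
    """Win rate at which expectancy is zero, with risk fixed at 1R.

    cost_in_r is the round-trip cost per trade in R, assumed to be charged
    on winners and losers alike.
    """
    net_win = reward_to_risk - cost_in_r   # what a winner actually pays out
    net_loss = 1.0 + cost_in_r             # what a loser actually costs
    if net_win <= 0:
        raise ValueError("costs consume the whole target; no win rate breaks even")
    return net_loss / (net_win + net_loss)


for rr in (0.5, 1.0, 1.5, 2.0, 3.0):
    gross = breakeven_win_rate(rr)
    net = breakeven_win_rate(rr, cost_in_r=0.1 * rr)  # cost = 10% of the target
    print(f"{rr:>3} : 1   before costs {gross:6.1%}   after costs {net:6.1%}")
```

Under this assumption the cost penalty is heaviest where targets are small relative to risk, which is one reason tight-target scalping is so sensitive to spread and commission.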

The Metrics That Matter (and One That Lies)

Expectancy (above): the headline number.
Maximum drawdown: the worst peak-to-trough equity decline. This is the number that decides whether you'd have kept trading the strategy; it's also the number prop firms test you on.
Longest losing streak: if the backtest contains 8 consecutive losses, expect live trading to produce a streak at least that long sooner or later. Decide now whether you can execute through it.
Profit factor: gross wins ÷ gross losses; above ~1.3 after costs is respectable for a discretionary system.
Win rate: the one that lies when read alone (see the table above). Useful only next to R:R.
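All five metrics can be computed from a single list of per-trade results in R. The sketch below is one minimal way to do it. The function name `summarize` and the sample results are made up for illustration, and drawdown is measured in R from a starting equity of zero.

```python
def summarize(results_r):
    """Compute the five metrics from per-trade results in R (after costs)."""
    n = len(results_r)
    wins = [r for r in results_r if r > 0]
    losses = [r for r in results_r if r < 0]
    gross_loss = -sum(losses)

    # Maximum drawdown: largest drop from a running equity peak.
    equity = peak = max_drawdown = 0.0
    for r in results_r:
        equity += r
        peak = max(peak, equity)
        max_drawdown = max(max_drawdown, peak - equity)

    # Longest run of consecutive losing trades.
    streak = longest_streak = 0
    for r in results_r:
        streak = streak + 1 if r < 0 else 0
        longest_streak = max(longest_streak, streak)

    return {
        "expectancy_r": sum(results_r) / n,
        "max_drawdown_r": max_drawdown,
        "longest_losing_streak": longest_streak,
        "profit_factor": sum(wins) / gross_loss if gross_loss else float("inf"),
        "win_rate": len(wins) / n,
    }


# Hypothetical 2R-target system: ten trades, four winners.
sample = [2, -1, -1, 2, -1, -1, -1, 2, 2, -1]
for name, value in summarize(sample).items():
    print(f"{name:>22}: {value:.2f}")
```

In this sample the win rate is only 40%, yet expectancy is +0.2R per trade and the profit factor is about 1.33, which is the sense in which win rate misleads when read alone.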

Reading the Equity Curve

Plot cumulative R over the trade sequence and read its shape: steady rise with shallow dips is the ideal; a flat-then-spike curve means a few outlier wins carry everything (fragile — retest without the best three trades); long plateaus locate the regimes where the edge disappears (check what the market was doing there); and a curve that rises early then decays often signals rule drift or a regime change mid-test. The curve is also where drawdown stops being a number and becomes felt experience — useful calibration before risking money.
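The fragility check mentioned above (retest without the best three trades) is easy to automate once results are logged in R. The sketch below builds the cumulative-R curve and recomputes the total with the top trades removed. `equity_curve`, `without_best` and the sample sequence are illustrative assumptions.

```python
from itertools import accumulate


def equity_curve(results_r):
    """Cumulative R after each trade, in trade order."""
    return list(accumulate(results_r))


def without_best(results_r, k=3):
    """The same trade sequence with the k largest results removed."""
    ranked = sorted(range(len(results_r)), key=lambda i: results_r[i], reverse=True)
    dropped = set(ranked[:k])
    return [r for i, r in enumerate(results_r) if i not in dropped]


# Hypothetical flat-then-spike sequence: three outliers carry the total.
results = [-1, -1, 1, -1, 8, -1, 1, -1, 6, -1, 1, 5]
print("curve:              ", equity_curve(results))
print("total:              ", sum(results))
print("total without top 3:", sum(without_best(results)))
```

Here the full sequence ends at +16R, but without its three best trades it ends at −3R. A strategy whose total changes sign under this check depends on catching outliers, which is worth knowing before trading it.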

The Minimal Journal

Per trade: date/time (in the data's server clock), instrument, direction, entry, stop, target, exit, result in R, setup tag, plus cost columns (spread if not in the data, swap nights, commission). Tools with built-in trade tracking — replay backtesters like StrategyTune record P&L, win/loss and streaks automatically — cover the numbers; keep the setup tags and notes yourself, because they're what you'll learn from when you review which setups actually carried the edge.
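For anyone keeping the journal in a spreadsheet or CSV file, the fields above map naturally onto one record per trade. The sketch below is one possible layout, not a prescribed schema. The class `JournalEntry`, its field names, the `to_csv` helper and the sample trade are all assumptions, and `result_r` is computed before costs so the cost columns stay separate.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class JournalEntry:
    timestamp: str        # in the data's server clock
    instrument: str
    direction: str        # "long" or "short"
    entry: float
    stop: float
    target: float
    exit: float
    setup_tag: str
    spread_cost: float = 0.0   # only if spread is not already in the data
    swap_nights: int = 0
    commission: float = 0.0
    notes: str = ""

    def result_r(self) -> float:
        """Gross result in multiples of initial risk (before costs)."""
        risk = abs(self.entry - self.stop)
        if risk == 0:
            raise ValueError("entry and stop are equal; risk is undefined")
        move = self.exit - self.entry
        return (move if self.direction == "long" else -move) / risk


def to_csv(entries) -> str:
    """Serialize journal entries to CSV text, adding a result_r column."""
    columns = [f.name for f in fields(JournalEntry)] + ["result_r"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for e in entries:
        writer.writerow({**asdict(e), "result_r": round(e.result_r(), 2)})
    return buffer.getvalue()


# Hypothetical trade, for illustration only.
trade = JournalEntry("2024-03-04 10:15", "EURUSD", "long",
                     entry=1.1050, stop=1.1015, target=1.1120, exit=1.1120,
                     setup_tag="breakout", notes="clean retest before entry")
print(to_csv([trade]))
```

Keeping `setup_tag` as its own column is what later makes it possible to group results by setup.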

Test Across Regimes

Every strategy has weather it likes. A breakout system tested only on trending months will flatter itself; the honest test spans trending, ranging and news-driven periods, with results segmented by regime. If the edge lives entirely in one regime, that's not failure — it's a finding: the strategy needs a regime filter, and the backtest just told you what it is.
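Segmenting by regime is a group-by on the journal. The sketch below assumes each trade has already been labeled with a regime by hand. The function name, the labels and the results are made up for illustration.

```python
from collections import defaultdict


def expectancy_by_regime(trades):
    """Map each regime label to (trade count, average result in R)."""
    buckets = defaultdict(list)
    for regime, result_r in trades:
        buckets[regime].append(result_r)
    return {regime: (len(rs), sum(rs) / len(rs)) for regime, rs in buckets.items()}


# Hypothetical (regime, result in R) pairs from a journal.
trades = [
    ("trending", 2), ("trending", -1), ("trending", 2),
    ("ranging", -1), ("ranging", -1), ("ranging", 2),
    ("news", -1), ("news", -1),
]
for regime, (count, avg_r) in expectancy_by_regime(trades).items():
    print(f"{regime:>9}: {count} trades, {avg_r:+.2f}R per trade")
```

Each segment has far fewer trades than the full test, so per-regime averages are noisier than the headline figure. A segment with only a handful of trades suggests where to test more, not a conclusion.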

Frequently Asked Questions

How many trades make a backtest statistically significant?

There is no magic number, but practical thresholds are 50 trades to spot a clearly broken strategy, 100 for a meaningful win rate and average R:R, and 200+ before results plausibly reflect edge rather than variance. Below 50, streaks dominate and any conclusion is premature.
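A confidence interval on the win rate shows why these thresholds are plausible. The sketch below uses the Wilson score interval, a standard interval for a binomial proportion, at roughly 95% confidence. Applying it here is an illustration rather than the source of the thresholds above, and it treats trades as independent and looks only at win rate, not at the size of wins and losses.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a win rate (z = 1.96 is roughly 95%)."""
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# The same observed 40% win rate at three sample sizes.
for wins, n in ((20, 50), (40, 100), (80, 200)):
    low, high = wilson_interval(wins, n)
    print(f"{n:>3} trades: 40% observed, plausible range {low:.1%} to {high:.1%}")
```

For a 2:1 system, where breakeven is 33.3%, the lower bound for an observed 40% win rate is roughly 31% at 100 trades and only just clears breakeven, at about 33.5%, at 200 trades.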

What is a good win rate for a trading strategy?

Win rate alone means nothing without the risk-reward ratio. A 40% win rate is excellent at 2:1 reward-to-risk (breakeven is 33.3%) and disastrous at 0.5:1 (breakeven is 66.7%). Judge strategies by expectancy — the combination of win rate and average win/loss size — not by win rate.

What is the expectancy formula?

Expectancy = (win rate × average win) − (loss rate × average loss). Example: 45% winners averaging +15 pips, 55% losers averaging −8 pips gives 0.45×15 − 0.55×8 = +2.35 pips per trade before costs. Positive expectancy after costs is the minimum bar for a tradeable strategy.
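The worked example above translates directly into a small function. The `cost_per_trade` parameter and the 1.5-pip figure in the second call are assumptions added for illustration; they are not numbers from the example.

```python
def expectancy(win_rate: float, avg_win: float, avg_loss: float,
               cost_per_trade: float = 0.0) -> float:
    """Average result per trade.

    avg_win and avg_loss are both positive magnitudes in the same unit
    (pips, R, or currency); cost_per_trade is charged on every trade.
    """
    loss_rate = 1.0 - win_rate
    return win_rate * avg_win - loss_rate * avg_loss - cost_per_trade


before_costs = expectancy(0.45, avg_win=15, avg_loss=8)
after_costs = expectancy(0.45, avg_win=15, avg_loss=8, cost_per_trade=1.5)
print(f"before costs: {before_costs:+.2f} pips per trade")
print(f"after an assumed 1.5-pip cost: {after_costs:+.2f} pips per trade")
```

The first call reproduces the +2.35 pips from the example. With the assumed 1.5-pip cost the figure falls to +0.85 pips, still positive but with most of the edge gone.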

Can manual backtests be trusted, given human bias?

Yes, if the process controls the biases: replay with the future hidden (kills hindsight bias), written if-then rules set before the test (limits interpretation drift), taking every valid setup (prevents cherry-picking), and logging every trade (makes the result auditable). Without those controls, no — the result is a story, not a test.
