Obsidic projects every tennis match through a system of interconnected models, from player-specific serve dynamics to full point-by-point simulation. Here's a comprehensive look at how the engine works and why we built it this way.
Tennis is structurally elegant for probabilistic modeling. A match decomposes into a hierarchy of nested contests: points build into games, games into sets, sets into the match. At every level, the outcome depends primarily on two numbers: each player's probability of winning a point on serve. This recursive structure means that if you can accurately estimate serve dynamics, you can derive everything else through simulation.
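To make the recursion concrete, here is a minimal sketch (not the production engine) of the standard closed-form probability that a server holds a game, given only the probability `p` of winning a single point on serve:

```python
def game_win_prob(p: float) -> float:
    """Probability the server holds a game, given per-point win probability p.

    Standard closed form: win to love/15/30, or reach deuce (3-3)
    and then win from deuce, where deuce resolves as p^2 / (1 - 2pq).
    """
    q = 1.0 - p
    before_deuce = p**4 * (1 + 4*q + 10*q**2)  # win the game 4-0, 4-1, or 4-2
    reach_deuce = 20 * p**3 * q**3             # C(6,3) ways to arrive at 3-3
    from_deuce = p**2 / (1 - 2*p*q)            # win two straight, geometric series
    return before_deuce + reach_deuce * from_deuce

# A modest edge at the point level amplifies at the game level:
# p = 0.62 per point corresponds to roughly 0.78 per game.
```

The same compounding happens again from games to sets and sets to the match, which is why small errors in the serve estimate matter so much.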
But "everything else" hides substantial complexity. The same serve percentage plays differently on clay versus grass, against a returner ranked 5th versus 50th, at altitude in Bogotá versus sea level in Miami, and in a tiebreak versus a comfortable service hold. Surface, opponent quality, form, fatigue, and tactical matchups all shape what happens when the ball leaves the server's hand.
Any model that collapses this complexity into a single formula is leaving information on the table. Our approach is a system of specialized models (player profiling, surface-aware rating systems, and a dedicated machine learning engine) connected through thousands of simulated points. The projection should emerge from the simulation, not precede it.
Each daily slate passes through a sequence of stages. The pipeline runs before matches begin, ingesting the latest player statistics, ratings, and live odds to produce projections for every match on the card.
The output is a full probability distribution for every match. Win probabilities, projected total games with standard deviations, over/under probabilities at the book's exact line, first-set winner probabilities, straight-sets probabilities, and tiebreak likelihood, all emerging from the same set of simulations, all internally consistent.
The model is trained on one of the most comprehensive open-source tennis datasets available, covering ATP and WTA matches from 2016 to the present, totaling over 46,400 matches across both tours. Each match record includes detailed serve and return statistics at the point-aggregate level.
Live odds are sourced from a multi-bookmaker aggregator covering 12+ US and European sportsbooks. The system selects the best available odds across all bookmakers for each market, ensuring the model is always evaluating against the sharpest available price.
Every tournament is tagged with metadata that feeds into the model: surface type (hard, clay, grass), altitude, court speed characteristics, draw size, and match format (best-of-three vs best-of-five). Tournament prestige level provides implicit context about draw quality and competitive intensity.
| Source | Coverage | Used For |
|---|---|---|
| Match Database | 46,400+ matches, 2016–present | Training, profiles, ratings |
| Odds Aggregator | Live ATP/WTA odds, 12+ bookmakers | Edge calculation, market comparison |
| Tournament Config | All ATP/WTA/Challenger venues | Surface, altitude, court speed |
Every player in the system (681 ATP and 654 WTA players with active profiles) is tracked through two parallel systems: a surface-aware rating system and a detailed statistical profile.
Each player maintains multiple ratings: an overall strength rating and separate ratings for each surface (hard, clay, grass). After each completed match, ratings are updated with adjustments proportional to the magnitude of the result. An upset at a Grand Slam moves ratings significantly more than an expected result at a Challenger.
The key insight is surface transfer. A player's effective rating on any surface isn't purely their surface-specific record. It incorporates their overall strength and performance on related surfaces, weighted by how much data exists. Hard and grass courts share more characteristics (fast pace, low bounce) than either shares with clay, so ratings transfer more between those surfaces. This prevents the model from treating every surface transition as a complete unknown.
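The surface-transfer idea can be sketched as a weighted blend. The similarity weights and the pseudo-match constant `k` below are illustrative assumptions, not the production values:

```python
# Illustrative cross-surface similarity (assumption: hard and grass
# transfer more to each other than either does to clay).
SIMILARITY = {
    ("hard", "grass"): 0.7, ("grass", "hard"): 0.7,
    ("hard", "clay"): 0.4, ("clay", "hard"): 0.4,
    ("grass", "clay"): 0.3, ("clay", "grass"): 0.3,
}

def effective_rating(overall: float,
                     surface_ratings: dict[str, float],
                     surface_matches: dict[str, int],
                     surface: str,
                     k: int = 20) -> float:
    """Blend surface-specific, related-surface, and overall ratings,
    weighted by how much data exists on each surface."""
    # Weight on the player's own record for this surface grows with sample size.
    n = surface_matches.get(surface, 0)
    own_w = n / (n + k)

    # Pool related surfaces, weighted by similarity and their sample sizes.
    pooled, pooled_w = 0.0, 0.0
    for other, rating in surface_ratings.items():
        if other == surface:
            continue
        m = surface_matches.get(other, 0)
        w = SIMILARITY.get((surface, other), 0.0) * m / (m + k)
        pooled += w * rating
        pooled_w += w
    related = pooled / pooled_w if pooled_w else overall

    # Remaining weight falls back through related surfaces to overall strength.
    rest = 1.0 - own_w
    return (own_w * surface_ratings.get(surface, overall)
            + rest * 0.5 * related
            + rest * 0.5 * overall)
```

With zero grass matches, the grass rating is driven entirely by related surfaces and overall strength; as grass matches accumulate, the surface-specific record takes over.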
Beyond ratings, each player has a comprehensive statistical profile covering serve and return performance across multiple time horizons. Recent form is weighted more heavily than historical performance, while longer track records provide stability; the exact weighting between the two was determined empirically through backtesting.
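One common way to weight recent form more heavily is exponential decay by match age. The half-life below is an illustrative assumption, not the backtested value:

```python
from datetime import date

def recency_weighted_mean(observations: list[tuple[date, float]],
                          today: date,
                          half_life_days: float = 180.0) -> float:
    """Exponentially down-weight older matches.

    A match half_life_days old counts half as much as one played today.
    """
    num = den = 0.0
    for played, value in observations:
        age = (today - played).days
        w = 0.5 ** (age / half_life_days)
        num += w * value
        den += w
    return num / den

# A serve-hold rate of 60% a year ago and 70% two weeks ago:
obs = [(date(2024, 1, 1), 0.60), (date(2024, 12, 1), 0.70)]
# The recent match dominates the weighted average.
```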
Not every player has deep data. A qualifier with a handful of matches on clay shouldn't be profiled purely from those results. We use a sample-size regression framework: every observed statistic is blended toward the tour average for that surface, with the blend ratio determined by how much data is available. As matches accumulate, individual performance increasingly dominates; with thin samples, the model conservatively falls back to population baselines.
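The sample-size regression amounts to treating the tour average as a batch of pseudo-matches. A minimal sketch, with an illustrative pseudo-match count `k`:

```python
def regressed_stat(observed: float, n: int, tour_avg: float, k: int = 30) -> float:
    """Blend an observed statistic toward the tour average.

    With n matches of evidence, the observed value gets weight n / (n + k);
    k acts like k pseudo-matches of the tour average (illustrative value).
    """
    w = n / (n + k)
    return w * observed + (1 - w) * tour_avg

# A qualifier with 3 clay matches at a 70% hold rate is pulled most of the
# way back to a 62% tour baseline; a veteran with 300 matches barely moves.
```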
The core of the system is a dedicated machine learning model that predicts serve dynamics for each player in a given match context. This is the single most important estimation in tennis modeling. From serve dynamics, the entire match structure can be derived through simulation.
The model ingests 113 engineered features spanning player serve and return metrics, rating differentials, matchup characteristics, tournament context, recency signals, and derived interaction terms. It learns the non-linear relationships between these features: the way a big server performs differently against an elite returner than a weak one, or how a clay specialist's serve changes on grass depending on altitude and court speed.
Rather than applying uniform adjustments ("clay reduces serve effectiveness by X%"), the ML engine discovers player-specific and context-specific patterns from the data. A left-handed server with a particular serve profile faces different challenges against different return styles on different surfaces. The model captures these interaction effects without needing them to be explicitly engineered.
The training process follows strict temporal separation. The model only ever trains on data from before the prediction date. There is no data leakage: every feature is computed from information available before the match begins. Rolling windows explicitly exclude the current match. Regularization and early stopping prevent overfitting to training noise.
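The temporal-separation discipline can be sketched as a walk-forward split: every training index strictly precedes the cutoff, every test index follows it. This is a generic illustration, not the production training harness:

```python
from datetime import date

def walk_forward_splits(match_dates: list[date], cutoffs: list[date]):
    """Yield (train_idx, test_idx) pairs for each cutoff date.

    Training data strictly precedes the cutoff; the test window runs
    from the cutoff to the next one. No future match ever leaks into
    the training set for its own prediction date.
    """
    for i, cutoff in enumerate(cutoffs):
        end = cutoffs[i + 1] if i + 1 < len(cutoffs) else date.max
        train = [j for j, d in enumerate(match_dates) if d < cutoff]
        test = [j for j, d in enumerate(match_dates) if cutoff <= d < end]
        yield train, test
```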
This is where everything converges. Armed with calibrated serve estimates for both players, the simulator plays out 10,000 complete matches point by point, respecting the full hierarchical structure of tennis scoring.
Each simulation follows the actual rules of tennis scoring. Points build into games with deuce rules, games build into sets with tiebreaks at 6-6, and sets build into matches (best of 3 or 5). The server alternates every game, and the tiebreak follows its own serve rotation. Every single point is resolved independently based on the server's estimated serve dynamics against the specific returner.
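The scoring hierarchy above can be sketched end to end in a few functions. This is a simplified illustration (it approximates the serve rotation into the set following a tiebreak), not the production simulator:

```python
import random

def sim_game(p_server: float, rng: random.Random) -> bool:
    """True if the server holds: first to 4 points, win by 2 (deuce implicit)."""
    s = r = 0
    while max(s, r) < 4 or abs(s - r) < 2:
        if rng.random() < p_server:
            s += 1
        else:
            r += 1
    return s > r

def sim_tiebreak(pa: float, pb: float, a_serves: bool, rng: random.Random) -> bool:
    """True if player A wins: first to 7, win by 2.
    Server changes after the 1st point, then every 2 points."""
    a = b = n = 0
    while max(a, b) < 7 or abs(a - b) < 2:
        p_a_wins = pa if a_serves else 1 - pb  # chance A wins this point
        if rng.random() < p_a_wins:
            a += 1
        else:
            b += 1
        n += 1
        if n % 2 == 1:
            a_serves = not a_serves
    return a > b

def sim_set(pa: float, pb: float, a_serves: bool, rng: random.Random):
    """Returns (a_won, total_games). The 6-6 tiebreak counts as one game."""
    a = b = 0
    while True:
        if a == 6 and b == 6:
            return sim_tiebreak(pa, pb, a_serves, rng), 13
        held = sim_game(pa if a_serves else pb, rng)
        a_won_game = held == a_serves
        a += a_won_game
        b += not a_won_game
        a_serves = not a_serves
        if max(a, b) >= 6 and abs(a - b) >= 2:
            return a > b, a + b

def sim_match(pa: float, pb: float, best_of: int = 3, rng=None) -> dict:
    """Simulate one match; pa/pb are each player's serve-point win probability."""
    rng = rng or random.Random()
    need = best_of // 2 + 1
    a_sets = b_sets = total_games = 0
    a_serves = True
    while max(a_sets, b_sets) < need:
        won, g = sim_set(pa, pb, a_serves, rng)
        a_sets += won
        b_sets += not won
        total_games += g
        if g % 2 == 1:  # rough next-set first server (sketch; real rule differs after tiebreaks)
            a_serves = not a_serves
    return {"a_won": a_sets > b_sets, "sets": (a_sets, b_sets), "total_games": total_games}
```

Running this thousands of times yields the same kind of outputs described above: win rates, total-games distributions, and set-score frequencies, all from one consistent point-level process.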
From 10,000 simulated matches, we aggregate full probability distributions for every metric the market cares about: win probabilities, total games, set scores, first-set outcomes, and tiebreak frequency.
Raw simulation outputs are rarely perfectly calibrated. The model may systematically overpredict favorites, underestimate tiebreak frequency, or project total games that drift from reality. A dedicated calibration layer corrects these biases using parameters optimized against historical results, with separate calibration for ATP and WTA given their structural differences.
The calibration parameters were determined through extensive backtesting, not chosen by intuition. Each parameter was optimized to minimize the gap between projected and observed outcomes across thousands of matches.
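To illustrate the idea (not the production calibration layer), one minimal calibration family shrinks raw probabilities toward 0.5 by a single factor fit against historical outcomes; the grid range and loss here are illustrative choices:

```python
def shrink(p: float, alpha: float) -> float:
    """Pull a raw probability toward 0.5; alpha=1 leaves it unchanged,
    alpha<1 tempers an overconfident model."""
    return 0.5 + alpha * (p - 0.5)

def fit_alpha(raw_probs: list[float], outcomes: list[int]) -> float:
    """Grid-search the shrink factor minimizing the Brier score
    (mean squared error between probability and 0/1 outcome)."""
    def brier(a: float) -> float:
        return sum((shrink(p, a) - y) ** 2
                   for p, y in zip(raw_probs, outcomes)) / len(outcomes)
    return min((a / 100 for a in range(50, 151)), key=brier)
```

In this sketch, a model that predicts 90% favorites who only win 70% of the time gets pulled back to calibrated probabilities; fitting separate factors per tour mirrors the separate ATP/WTA calibration described above.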
The model has been through an extensive, rigorous backtest. These are genuine out-of-sample results, not in-sample fits.
The model performs differently across surfaces, reflecting the varying levels of predictability inherent to each surface type. Grass courts, where serve dominates and the better server tends to win, show the strongest signal.
| Segment | Accuracy | Matches |
|---|---|---|
| ATP Tour | 64.7% | 1,266 |
| WTA Tour | 66.6% | 1,115 |
| Best-of-5 | 71.8% | 252 |
| Best-of-3 | 64.9% | 2,129 |
The WTA tour shows slightly higher accuracy than ATP (66.6% vs 64.7%), likely reflecting the WTA's wider spread of talent making favorites more reliably predictable. Best-of-5 accuracy jumps to 71.8% because the longer format reduces variance and allows the stronger player to emerge more consistently.
Beyond winner prediction, the simulation engine tracks accuracy across derivative markets such as total games, first-set winner, and straight sets.
After the model produces calibrated probabilities, they're compared against the best available live odds across multiple bookmakers to identify where the model disagrees with the market. When the model sees a player's chances as meaningfully higher than the odds imply, that's a potential edge.
The system converts posted odds to implied probabilities, then compares them against the model's calibrated output. The difference (model probability minus implied probability) is the edge. Positive edge means the model thinks the outcome is more likely than the market's price suggests.
Every detected edge is assigned a confidence tier based on the magnitude of the discrepancy. The tiered rating system helps distinguish between marginal edges that might not survive market movement and substantial edges where the model has high conviction.
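The edge calculation itself is simple arithmetic. The sketch below assumes American-style odds; the tier thresholds are illustrative placeholders, not the production cutoffs:

```python
def implied_prob(american_odds: int) -> float:
    """Convert American odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def edge_tier(model_prob: float, american_odds: int) -> tuple[float, str]:
    """Edge = model probability minus the market's implied probability.
    Positive edge means the model rates the outcome as more likely
    than the price suggests. Tier thresholds are illustrative."""
    edge = model_prob - implied_prob(american_odds)
    if edge >= 0.08:
        tier = "A"
    elif edge >= 0.05:
        tier = "B"
    elif edge >= 0.03:
        tier = "C"
    else:
        tier = "none"
    return edge, tier

# A -150 favorite implies 60%; a model probability of 68% is an 8-point edge.
```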
The system evaluates multiple markets for every match on the card:
| Market | What It Means |
|---|---|
| Moneyline | Model thinks a player wins more often than odds imply |
| Total Over | Model projects more games than the book's line |
| Total Under | Model projects fewer games than the book's line |
| Straight Sets | Model expects a dominant performance |
| 1st Set Winner | Model favors a player in the opening set |
Every model has blind spots. Acknowledging them is as important as explaining what the model does well. Here are the areas where our tennis system is most likely to be wrong or imprecise.
The model treats serve dynamics as constant throughout a match. In reality, players adjust tactics mid-match, fatigue shifts serving patterns, and momentum creates streaks that a static estimate can't capture. A player who struggles in the first set but typically raises their level will be undervalued by pre-match projections.
We have no real-time injury data. A player nursing a hip injury who plays at 80% capacity will be projected at their full statistical level. The model detects declining form through recency-weighted statistics, but acute injuries that haven't yet affected results are invisible.
When a player moves to a surface they've rarely played on, the model relies heavily on regression toward the tour average and rating transfer from related surfaces. For established players, this works well. For young players with thin records on a new surface, projections carry more uncertainty.
Retired matches are excluded from scoring because they don't reflect the full competitive outcome, but the model can't predict which matches will end in retirement. A player trending toward retirement due to fitness concerns may still be projected normally.
The tennis betting market is less liquid than major US sports, which means odds can be softer, but it also means they can move quickly. The odds captured at pipeline runtime may differ from the odds available later. A detected edge at 8% may have narrowed considerably by the time you see it.