Obsidic projects every tennis match through a system of interconnected models, from player-specific serve dynamics to full point-by-point simulation. Here's a comprehensive look at how the engine works and why we built it this way.
Tennis is structurally elegant for probabilistic modeling. A match decomposes into a hierarchy of nested contests: points build into games, games into sets, sets into the match. At every level, the outcome depends primarily on two numbers: each player's probability of winning a point on serve. This recursive structure means that if you can accurately estimate serve dynamics, you can derive everything else through simulation.
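To make the recursion concrete, here is a minimal sketch (not the production engine) of the standard closed-form probability that a server holds a game, given only the probability `p` of winning a single point on serve:

```python
def game_win_prob(p: float) -> float:
    """Probability the server holds a game, given per-point win probability p.

    Standard closed form: win to love/15/30, or reach deuce (3-3)
    and then win from deuce, where deuce resolves as p^2 / (1 - 2pq).
    """
    q = 1.0 - p
    before_deuce = p**4 * (1 + 4*q + 10*q**2)  # win the game 4-0, 4-1, or 4-2
    reach_deuce = 20 * p**3 * q**3             # C(6,3) ways to arrive at 3-3
    from_deuce = p**2 / (1 - 2*p*q)            # win two straight, geometric series
    return before_deuce + reach_deuce * from_deuce

# A modest edge at the point level amplifies at the game level:
# p = 0.62 per point corresponds to roughly 0.78 per game.
```

The same compounding happens again from games to sets and sets to the match, which is why small errors in the serve estimate matter so much.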
But "everything else" hides substantial complexity. The same serve percentage plays differently on clay versus grass, against a returner ranked 5th versus 50th, at altitude in Bogotá versus sea level in Miami, and in a tiebreak versus a comfortable service hold. Surface, opponent quality, form, fatigue, and tactical matchups all shape what happens when the ball leaves the server's hand.
Any model that collapses this complexity into a single formula is leaving information on the table. Our approach is a system of specialized models (player profiling, surface-aware rating systems, and a dedicated machine learning engine) connected through thousands of simulated points. The projection should emerge from the simulation, not precede it.
Each daily slate passes through a sequence of stages. The pipeline runs before matches begin, ingesting the latest player statistics, ratings, and live odds to produce projections for every match on the card.
The output is a full probability distribution for every match. Win probabilities, projected total games with standard deviations, over/under probabilities at the book's exact line, first-set winner probabilities, straight-sets probabilities, and tiebreak likelihood, all emerging from the same set of simulations, all internally consistent.
The model is trained on one of the most comprehensive open-source tennis datasets available, covering ATP and WTA matches from 2016 to the present, totaling over 46,400 matches across both tours. Each match record includes detailed serve and return statistics at the point-aggregate level.
Live odds are sourced from a multi-bookmaker aggregator covering 12+ US and European sportsbooks. The system selects the best available odds across all bookmakers for each market, ensuring the model is always evaluating against the sharpest available price.
Every tournament is tagged with metadata that feeds into the model: surface type (hard, clay, grass), altitude, court speed characteristics, draw size, and match format (best-of-three vs best-of-five). Tournament prestige level provides implicit context about draw quality and competitive intensity.
| Source | Coverage | Used For |
|---|---|---|
| Match Database | 46,400+ matches, 2016–present | Training, profiles, ratings |
| Odds Aggregator | Live ATP/WTA odds, 12+ bookmakers | Edge calculation, market comparison |
| Tournament Config | All ATP/WTA/Challenger venues | Surface, altitude, court speed |
Every player in the system (681 ATP and 654 WTA players with active profiles) is tracked through two parallel systems: a surface-aware rating system and a detailed statistical profile.
Each player maintains multiple ratings: an overall strength rating and separate ratings for each surface (hard, clay, grass). After each completed match, ratings are updated with adjustments proportional to the magnitude of the result. An upset at a Grand Slam moves ratings significantly more than an expected result at a Challenger.
The key insight is surface transfer. A player's effective rating on any surface isn't purely their surface-specific record. It incorporates their overall strength and performance on related surfaces, weighted by how much data exists. Hard and grass courts share more characteristics (fast pace, low bounce) than either shares with clay, so ratings transfer more between those surfaces. This prevents the model from treating every surface transition as a complete unknown.
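The surface-transfer idea can be sketched as a weighted blend. The similarity weights and the pseudo-match constant `k` below are illustrative assumptions, not the production values:

```python
# Illustrative cross-surface similarity (assumption: hard and grass
# transfer more to each other than either does to clay).
SIMILARITY = {
    ("hard", "grass"): 0.7, ("grass", "hard"): 0.7,
    ("hard", "clay"): 0.4, ("clay", "hard"): 0.4,
    ("grass", "clay"): 0.3, ("clay", "grass"): 0.3,
}

def effective_rating(overall: float,
                     surface_ratings: dict[str, float],
                     surface_matches: dict[str, int],
                     surface: str,
                     k: int = 20) -> float:
    """Blend surface-specific, related-surface, and overall ratings,
    weighted by how much data exists on each surface."""
    # Weight on the player's own record for this surface grows with sample size.
    n = surface_matches.get(surface, 0)
    own_w = n / (n + k)

    # Pool related surfaces, weighted by similarity and their sample sizes.
    pooled, pooled_w = 0.0, 0.0
    for other, rating in surface_ratings.items():
        if other == surface:
            continue
        m = surface_matches.get(other, 0)
        w = SIMILARITY.get((surface, other), 0.0) * m / (m + k)
        pooled += w * rating
        pooled_w += w
    related = pooled / pooled_w if pooled_w else overall

    # Remaining weight falls back through related surfaces to overall strength.
    rest = 1.0 - own_w
    return (own_w * surface_ratings.get(surface, overall)
            + rest * 0.5 * related
            + rest * 0.5 * overall)
```

With zero grass matches, the grass rating is driven entirely by related surfaces and overall strength; as grass matches accumulate, the surface-specific record takes over.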
Beyond ratings, each player has a comprehensive statistical profile covering serve and return performance across multiple time horizons. Recent form is weighted more heavily than historical performance, while longer track records provide stability; the exact weighting between the two was determined empirically through backtesting.
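One common way to weight recent form more heavily is exponential decay by match age. The half-life below is an illustrative assumption, not the backtested value:

```python
from datetime import date

def recency_weighted_mean(observations: list[tuple[date, float]],
                          today: date,
                          half_life_days: float = 180.0) -> float:
    """Exponentially down-weight older matches.

    A match half_life_days old counts half as much as one played today.
    """
    num = den = 0.0
    for played, value in observations:
        age = (today - played).days
        w = 0.5 ** (age / half_life_days)
        num += w * value
        den += w
    return num / den

# A serve-hold rate of 60% a year ago and 70% two weeks ago:
obs = [(date(2024, 1, 1), 0.60), (date(2024, 12, 1), 0.70)]
# The recent match dominates the weighted average.
```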
Not every player has deep data. A qualifier with a handful of matches on clay shouldn't be profiled purely from those results. We use a sample-size regression framework: every observed statistic is blended toward the tour average for that surface, with the blend ratio determined by how much data is available. As matches accumulate, individual performance increasingly dominates; with thin samples, the model conservatively falls back to population baselines.
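The sample-size regression amounts to treating the tour average as a batch of pseudo-matches. A minimal sketch, with an illustrative pseudo-match count `k`:

```python
def regressed_stat(observed: float, n: int, tour_avg: float, k: int = 30) -> float:
    """Blend an observed statistic toward the tour average.

    With n matches of evidence, the observed value gets weight n / (n + k);
    k acts like k pseudo-matches of the tour average (illustrative value).
    """
    w = n / (n + k)
    return w * observed + (1 - w) * tour_avg

# A qualifier with 3 clay matches at a 70% hold rate is pulled most of the
# way back to a 62% tour baseline; a veteran with 300 matches barely moves.
```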
The core of the system is a dedicated machine learning model that predicts serve dynamics for each player in a given match context. This is the single most important estimation in tennis modeling. From serve dynamics, the entire match structure can be derived through simulation.
The model ingests 113 engineered features spanning player serve and return metrics, rating differentials, matchup characteristics, tournament context, recency signals, and derived interaction terms. It learns the non-linear relationships between these features: the way a big server performs differently against an elite returner than a weak one, or how a clay specialist's serve changes on grass depending on altitude and court speed.
Rather than applying uniform adjustments ("clay reduces serve effectiveness by X%"), the ML engine discovers player-specific and context-specific patterns from the data. A left-handed server with a particular serve profile faces different challenges against different return styles on different surfaces. The model captures these interaction effects without needing them to be explicitly engineered.
The training process follows strict temporal separation. The model only ever trains on data from before the prediction date. There is no data leakage: every feature is computed from information available before the match begins. Rolling windows explicitly exclude the current match. Regularization and early stopping prevent overfitting to training noise.
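The temporal-separation discipline can be sketched as a walk-forward split: every training index strictly precedes the cutoff, every test index follows it. This is a generic illustration, not the production training harness:

```python
from datetime import date

def walk_forward_splits(match_dates: list[date], cutoffs: list[date]):
    """Yield (train_idx, test_idx) pairs for each cutoff date.

    Training data strictly precedes the cutoff; the test window runs
    from the cutoff to the next one. No future match ever leaks into
    the training set for its own prediction date.
    """
    for i, cutoff in enumerate(cutoffs):
        end = cutoffs[i + 1] if i + 1 < len(cutoffs) else date.max
        train = [j for j, d in enumerate(match_dates) if d < cutoff]
        test = [j for j, d in enumerate(match_dates) if cutoff <= d < end]
        yield train, test
```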
This is where everything converges. Armed with calibrated serve estimates for both players, the simulator plays out 10,000 complete matches point by point, respecting the full hierarchical structure of tennis scoring.
Each simulation follows the actual rules of tennis scoring. Points build into games with deuce rules, games build into sets with tiebreaks at 6-6, and sets build into matches (best of 3 or 5). The server alternates every game, and the tiebreak follows its own serve rotation. Every single point is resolved independently based on the server's estimated serve dynamics against the specific returner.
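The scoring hierarchy above can be sketched end to end in a few functions. This is a simplified illustration (it approximates the serve rotation into the set following a tiebreak), not the production simulator:

```python
import random

def sim_game(p_server: float, rng: random.Random) -> bool:
    """True if the server holds: first to 4 points, win by 2 (deuce implicit)."""
    s = r = 0
    while max(s, r) < 4 or abs(s - r) < 2:
        if rng.random() < p_server:
            s += 1
        else:
            r += 1
    return s > r

def sim_tiebreak(pa: float, pb: float, a_serves: bool, rng: random.Random) -> bool:
    """True if player A wins: first to 7, win by 2.
    Server changes after the 1st point, then every 2 points."""
    a = b = n = 0
    while max(a, b) < 7 or abs(a - b) < 2:
        p_a_wins = pa if a_serves else 1 - pb  # chance A wins this point
        if rng.random() < p_a_wins:
            a += 1
        else:
            b += 1
        n += 1
        if n % 2 == 1:
            a_serves = not a_serves
    return a > b

def sim_set(pa: float, pb: float, a_serves: bool, rng: random.Random):
    """Returns (a_won, total_games). The 6-6 tiebreak counts as one game."""
    a = b = 0
    while True:
        if a == 6 and b == 6:
            return sim_tiebreak(pa, pb, a_serves, rng), 13
        held = sim_game(pa if a_serves else pb, rng)
        a_won_game = held == a_serves
        a += a_won_game
        b += not a_won_game
        a_serves = not a_serves
        if max(a, b) >= 6 and abs(a - b) >= 2:
            return a > b, a + b

def sim_match(pa: float, pb: float, best_of: int = 3, rng=None) -> dict:
    """Simulate one match; pa/pb are each player's serve-point win probability."""
    rng = rng or random.Random()
    need = best_of // 2 + 1
    a_sets = b_sets = total_games = 0
    a_serves = True
    while max(a_sets, b_sets) < need:
        won, g = sim_set(pa, pb, a_serves, rng)
        a_sets += won
        b_sets += not won
        total_games += g
        if g % 2 == 1:  # rough next-set first server (sketch; real rule differs after tiebreaks)
            a_serves = not a_serves
    return {"a_won": a_sets > b_sets, "sets": (a_sets, b_sets), "total_games": total_games}
```

Running this thousands of times yields the same kind of outputs described above: win rates, total-games distributions, and set-score frequencies, all from one consistent point-level process.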
From 10,000 simulated matches, we aggregate full probability distributions for every metric the market cares about: win probabilities, total games, set scores, first-set outcomes, and tiebreak frequency.
Raw simulation outputs are rarely perfectly calibrated. The model may systematically overpredict favorites, underestimate tiebreak frequency, or project total games that drift from reality. A dedicated calibration layer corrects these biases using parameters optimized against historical results, with separate calibration for ATP and WTA given their structural differences.
The calibration parameters were determined through extensive backtesting, not chosen by intuition. Each parameter was optimized to minimize the gap between projected and observed outcomes across thousands of matches.
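To illustrate the idea (not the production calibration layer), one minimal calibration family shrinks raw probabilities toward 0.5 by a single factor fit against historical outcomes; the grid range and loss here are illustrative choices:

```python
def shrink(p: float, alpha: float) -> float:
    """Pull a raw probability toward 0.5; alpha=1 leaves it unchanged,
    alpha<1 tempers an overconfident model."""
    return 0.5 + alpha * (p - 0.5)

def fit_alpha(raw_probs: list[float], outcomes: list[int]) -> float:
    """Grid-search the shrink factor minimizing the Brier score
    (mean squared error between probability and 0/1 outcome)."""
    def brier(a: float) -> float:
        return sum((shrink(p, a) - y) ** 2
                   for p, y in zip(raw_probs, outcomes)) / len(outcomes)
    return min((a / 100 for a in range(50, 151)), key=brier)
```

In this sketch, a model that predicts 90% favorites who only win 70% of the time gets pulled back to calibrated probabilities; fitting separate factors per tour mirrors the separate ATP/WTA calibration described above.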
The model has been through an extensive, rigorous backtest. These are genuine out-of-sample results, not in-sample fits.
The model performs differently across surfaces, reflecting the varying levels of predictability inherent to each surface type. Grass courts, where serve dominates and the better server tends to win, show the strongest signal.
| Segment | Accuracy | Matches |
|---|---|---|
| ATP Tour | 64.7% | 1,266 |
| WTA Tour | 66.6% | 1,115 |
| Best-of-5 | 71.8% | 252 |
| Best-of-3 | 64.9% | 2,129 |
The WTA tour shows slightly higher accuracy than ATP (66.6% vs 64.7%), likely reflecting the WTA's wider spread of talent making favorites more reliably predictable. Best-of-5 accuracy jumps to 71.8% because the longer format reduces variance and allows the stronger player to emerge more consistently.
Beyond winner prediction, the simulation engine tracks accuracy across derivative markets such as total games, first-set winner, and straight sets.
After the model produces calibrated probabilities, they're compared against the best available live odds across multiple bookmakers to identify where the model disagrees with the market. When the model sees a player's chances as meaningfully higher than the odds imply, that's a potential edge.
The system converts posted odds to implied probabilities, then compares them against the model's calibrated output. The difference (model probability minus implied probability) is the edge. Positive edge means the model thinks the outcome is more likely than the market's price suggests.
Every detected edge is assigned a confidence tier based on the magnitude of the discrepancy. The tiered rating system helps distinguish between marginal edges that might not survive market movement and substantial edges where the model has high conviction.
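The edge calculation itself is simple arithmetic. The sketch below assumes American-style odds; the tier thresholds are illustrative placeholders, not the production cutoffs:

```python
def implied_prob(american_odds: int) -> float:
    """Convert American odds to implied probability (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def edge_tier(model_prob: float, american_odds: int) -> tuple[float, str]:
    """Edge = model probability minus the market's implied probability.
    Positive edge means the model rates the outcome as more likely
    than the price suggests. Tier thresholds are illustrative."""
    edge = model_prob - implied_prob(american_odds)
    if edge >= 0.08:
        tier = "A"
    elif edge >= 0.05:
        tier = "B"
    elif edge >= 0.03:
        tier = "C"
    else:
        tier = "none"
    return edge, tier

# A -150 favorite implies 60%; a model probability of 68% is an 8-point edge.
```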
The system evaluates multiple markets for every match on the card:
| Market | What It Means |
|---|---|
| Moneyline | Model thinks a player wins more often than odds imply |
| Total Over | Model projects more games than the book's line |
| Total Under | Model projects fewer games than the book's line |
| Straight Sets | Model expects a dominant performance |
| 1st Set Winner | Model favors a player in the opening set |
Every model has blind spots. Acknowledging them is as important as explaining what the model does well. Here are the areas where our tennis system is most likely to be wrong or imprecise.
The model treats serve dynamics as constant throughout a match. In reality, players adjust tactics mid-match, fatigue shifts serving patterns, and momentum creates streaks that a static estimate can't capture. A player who struggles in the first set but typically raises their level will be undervalued by pre-match projections.
We have no real-time injury data. A player nursing a hip injury who plays at 80% capacity will be projected at their full statistical level. The model detects declining form through recency-weighted statistics, but acute injuries that haven't yet affected results are invisible.
When a player moves to a surface they've rarely played on, the model relies heavily on regression toward the tour average and rating transfer from related surfaces. For established players, this works well. For young players with thin records on a new surface, projections carry more uncertainty.
Retired matches are excluded from scoring because they don't reflect the full competitive outcome, but the model can't predict which matches will end in retirement. A player trending toward retirement due to fitness concerns may still be projected normally.
The tennis betting market is less liquid than major US sports, which means odds can be softer, but it also means they can move quickly. The odds captured at pipeline runtime may differ from the odds available later. A detected edge at 8% may have narrowed considerably by the time you see it.