Bayesian accuracy scoring across 64 analysts and 501 geopolitical forecasts. Every prediction is falsifiable, dual-cited, and scored against what actually happened — so that when experts disagree about what comes next, the weights on their judgments are empirical rather than rhetorical.
Dataset v23 · 998 dual citations · 30 inference categories · 12 languages.
What this ledger is
This ledger records falsifiable, forward-looking predictions made by geopolitical analysts — then scores them against what actually happened. It exists to answer a practical question: when multiple experts disagree about what will happen next, whose judgment should carry more weight?
Each expert's accuracy score serves as an empirical prior in a Bayesian framework. When a new event arises, the model matches it to the relevant categories, retrieves the experts from the leaderboard, and aggregates their forecasts — weighted by their measured track record. Every entry in the predictions table carries dual citations: the original source where the prediction was made, and an independent confirmation of what occurred.
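For concreteness, the sketch below shows one way a single ledger entry could be represented in code, based on the fields described above; the class name, field names, and types are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PredictionRecord:
    """One ledger entry as implied by the description above (field names are illustrative)."""
    analyst: str                  # e.g. "Fawaz Gerges"
    claim: str                    # the falsifiable statement, recorded verbatim
    date_made: str                # ISO date the claim was made public
    resolution_condition: str     # what has to happen for the claim to resolve
    source_citation: str          # where the prediction was originally made
    confirmation_citation: str    # independent confirmation of the outcome (second of the dual citations)
    categories: list[str] = field(default_factory=list)  # inference categories, e.g. "Strait of Hormuz closure"
    resolution: str = "UNR"       # TRUE / MT / PT / MF / FALSE / UNR
```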
Each prediction is rated on a six-point scale: TRUE, MT (mostly true), PT (partially true), MF (mostly false), FALSE, and UNR (unresolved: the event has not yet fully played out or is ambiguous on the evidence). The aggregate 70% accuracy is a weighted pool over resolved predictions, with TRUE and MT counted at full weight and PT at partial weight, following the dataset's rubric.
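The rubric does not state the exact partial weight given to PT, so the sketch below assumes TRUE and MT count at 1.0, PT at 0.5, and MF and FALSE at 0, with UNR entries excluded. It is one plausible reading of the pooling rule, not the dataset's published formula.

```python
# Minimal sketch of the aggregate-accuracy pool. The 0.5 weight for PT is an
# assumption; the rubric only says "partial weight".
WEIGHTS = {"TRUE": 1.0, "MT": 1.0, "PT": 0.5, "MF": 0.0, "FALSE": 0.0}

def aggregate_accuracy(resolutions: list[str]) -> float:
    """Weighted hit rate over resolved predictions; UNR entries are excluded."""
    scored = [WEIGHTS[r] for r in resolutions if r in WEIGHTS]  # drops UNR
    return sum(scored) / len(scored) if scored else float("nan")
```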
Model evaluation
A validation set of 30 binary events with known outcomes is held out from the fitting of expert weights. For each event, the model filters to pre-resolution expert statements in the predictions table, identifies the relevant ones via category matching, weights each contribution by the expert's accuracy score on prior resolutions, and aggregates into a single probability. The pooled forecast is then scored against the observed outcome via the Brier score (the mean squared error between the pooled probability and the 0/1 outcome), which rewards forecasts that are both well calibrated and sharp.
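A minimal sketch of that scoring step, assuming each held-out event reduces to a pooled probability and a 0/1 outcome; the function and variable names are illustrative.

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between pooled probabilities and observed 0/1 outcomes.

    Lower is better; a constant, uninformative 0.5 forecast scores 0.25.
    """
    assert len(forecasts) == len(outcomes)
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# e.g. brier_score([0.8, 0.3, 0.9], [1, 0, 1]) == (0.04 + 0.09 + 0.01) / 3
```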
Why Bayesian weighting works here
Geopolitical forecasting is dominated by a small number of structural regularities (balance of power, domestic coalition cohesion, economic capacity to sustain a campaign) and a long tail of idiosyncratic analyst biases (ideology, regional attachment, career incentives). A flat average over pundits inherits the tail. Weighting by a measured Beta-posterior on each analyst's prior track record concentrates mass on those whose biases happen to align with how the world has actually moved — without having to pre-judge whose worldview is "correct." The validation set checks that this weighting generalises: if weights overfit past events, the Brier score on held-out binary resolutions will degrade.
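A minimal sketch of the per-analyst weight this paragraph describes. The Beta(3.5, 1.5) prior, with mean 0.70 (the aggregate accuracy) and the strength of roughly five pseudo-resolutions, is an assumed parameterisation, not the ledger's published one.

```python
# Sketch of a per-analyst Beta posterior under an assumed prior.
PRIOR_ALPHA, PRIOR_BETA = 3.5, 1.5  # mean 0.70, five pseudo-resolutions

def posterior(hits: float, misses: float) -> tuple[float, float]:
    """Posterior mean (the aggregation weight) and variance for one analyst.

    Hits may be fractional, since PT resolutions count at partial weight.
    """
    a = PRIOR_ALPHA + hits
    b = PRIOR_BETA + misses
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var
```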
Top analysts — by accuracy, minimum five resolved predictions
The leaderboard below shows the top 12 of 64 tracked analysts, ordered by accuracy on resolved predictions. The "N" column is the number of scored predictions in the ledger — a confidence proxy. Analysts with fewer than five resolved predictions are excluded from the minimum-bar view to prevent single-call noise from dominating the board.
| # | Analyst | Accuracy | N resolved |
|---|---|---|---|
| 01 | Dmitriy Shapiro | 95% | 5 |
| 02 | Fawaz Gerges | 93% | 7 |
| 03 | Ahmad Naghibzadeh | 90% | 5 |
| 04 | Jiang Xueqin | 88% | 13 |
| 05 | Ray Takeyh | 88% | 6 |
| 06 | Barry Posen | 86% | 7 |
| 07 | Anthony Cordesman | 86% | 7 |
| 08 | Fyodor Lukyanov | 85% | 5 |
| 09 | Danny Citrinowicz | 84% | 11 |
| 10 | Narges Bajoghli | 83% | 6 |
| 11 | Jason Brodsky | 82% | 7 |
| 12 | Sidharth Kaushal | 81% | 9 |
Accuracy alone overstates confidence in analysts with small samples. Jiang Xueqin's 88% across 13 predictions carries more weight than Dmitriy Shapiro's 95% across 5 — even though Shapiro tops the raw table. The Bayesian posterior used for aggregation folds in this sample-size penalty directly through the Beta prior, so thinly-sampled leaders are shrunk toward the aggregate 70% mean until they accumulate more resolutions.
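To make the shrinkage concrete, the arithmetic below uses the same assumed Beta(3.5, 1.5) prior as the earlier sketch; the ledger's actual prior strength is not published, so the exact numbers are illustrative only.

```python
# Worked example of the shrinkage under an assumed Beta(3.5, 1.5) prior
# (mean 0.70, strength of five pseudo-resolutions). Hits = accuracy * N.
def posterior_mean(accuracy: float, n: int, a0: float = 3.5, b0: float = 1.5) -> float:
    return (a0 + accuracy * n) / (a0 + b0 + n)

shapiro = posterior_mean(0.95, 5)    # ~0.825: 95% over 5 shrinks hard toward 0.70
jiang   = posterior_mean(0.88, 13)   # ~0.830: 88% over 13 is shrunk less
```

Under that assumption the two posterior means land within half a point of each other, with Jiang slightly ahead of Shapiro despite the lower raw percentage.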
Prediction volume over time
The underlying dataset stretches back to 1984 for long-horizon structural forecasts, with the bulk of entries concentrated in the post-2020 period when systematic logging began and geopolitical flux increased. Pre-2020 entries are bucketed by year; 2020 onwards are bucketed by quarter. Volume is not evenly distributed — the 2024 Q3 to 2026 Q1 window, covering the Iran–Israel crisis and its run-up, accounts for a disproportionate share of total predictions and nearly all of the time-sensitive tactical calls.
Across the full set of 461 resolved predictions, the mix is dominated by TRUE and MT outcomes — consistent with the aggregate 70% score — with a smaller spread of PT, MF, and FALSE calls concentrated in high-stakes, low-prior events (regime change, strait closure, nuclear thresholds). The UNR pool reflects predictions whose resolution conditions are still pending (for example, multi-year thresholds or conditional "if X, then Y" forecasts where X has not yet occurred).
How the ledger is used
- Logging. An analyst's falsifiable statement is extracted from its original source — an interview, a podcast, a thread, an op-ed, or a think-tank paper — with the date, the claim, and the resolution condition recorded verbatim. Dual citations link to the source and to the eventual confirmation.
- Categorisation. Each prediction is tagged with one or more of 30 inference categories (for example: Iran domestic stability, Strait of Hormuz closure, US troop posture, oil price regime). Categories drive retrieval when a new event arises.
- Resolution. When the resolution condition is met, the prediction is scored TRUE / MT / PT / MF / FALSE / UNR against the verbatim claim — not against a charitable reinterpretation.
- Weighting. Each expert carries a Beta-posterior on their hit-rate, updated after every resolution. The posterior mean is the weight used in aggregation; the posterior variance governs how much a new prediction shifts the weight.
- Aggregation. For a new event, the model retrieves all experts with prior predictions in the matching categories, takes their current forecasts, weights them by posterior accuracy, and pools them into a single probability; a minimal sketch of this step follows the list. This pooled forecast is what gets scored on the validation set.
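The sketch below shows one reasonable implementation of the aggregation step. The linear opinion pool (a weighted average of expert probabilities, with weights equal to Beta-posterior means), the category-matching test, and the dictionary shapes are all assumptions, not the ledger's actual code.

```python
PRIOR_A, PRIOR_B = 3.5, 1.5  # assumed prior: mean 0.70, five pseudo-resolutions

def pool_forecast(event_categories: set[str],
                  expert_forecasts: dict[str, float],
                  expert_records: dict[str, tuple[float, int]],
                  expert_categories: dict[str, set[str]]) -> float:
    """Pool expert probabilities for a new event, weighted by posterior accuracy.

    expert_forecasts:  analyst -> current probability for the event
    expert_records:    analyst -> (accuracy on prior resolutions, N resolved)
    expert_categories: analyst -> categories the analyst has previously predicted in
    """
    num = den = 0.0
    for analyst, p in expert_forecasts.items():
        if not event_categories & expert_categories.get(analyst, set()):
            continue  # no track record in a relevant category
        accuracy, n = expert_records[analyst]
        weight = (PRIOR_A + accuracy * n) / (PRIOR_A + PRIOR_B + n)  # posterior mean
        num += weight * p
        den += weight
    return num / den if den else 0.5  # no matching experts: fall back to an uninformative 0.5
```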
Dataset characteristics
| Field | Value |
|---|---|
| Version | v23 |
| Experts tracked | 64 |
| Predictions logged | 501 |
| Predictions resolved | 461 |
| Predictions unresolved | 40 |
| Aggregate accuracy | 70% |
| Dual citations | 998 |
| Inference categories | 30 |
| Languages represented | 12 |
| Validation set size | 30 binary events |
| Validation metric | Brier score |
Caveats
The ledger is not a popularity contest, a rating of intellectual quality, or a score of how well-reasoned an argument was. It only measures whether explicit, falsifiable claims came true within their stated horizons. An analyst can be consistently correct on low-stakes calls and catastrophically wrong on the one that mattered — accuracy and impact are different quantities, and this ledger indexes only the first.
Selection bias in which statements get logged is the largest residual concern. To mitigate it, the logging protocol captures every forward-looking, falsifiable claim in a source at the time of publication, not only those that turn out to be interesting in retrospect. The dual-citation requirement catches most backfills, that is, predictions logged only after their outcome was already known.
Finally: accuracy is not stationary. A forecaster who was sharp on the Trump-era Middle East may not be sharp on a post-ceasefire realignment. The Beta-posterior shrinks slowly by design, but users should weigh recent resolutions more heavily than the raw table suggests when the strategic landscape has just shifted.
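One way a user could implement that recency weighting, shown purely as an illustration: discount the Beta counts before folding in each new resolution. The decay factor and the approach itself are assumptions on the reader's side, not part of the ledger's published methodology.

```python
# Discounted Beta update: old evidence fades at rate gamma per resolution, so
# recent resolutions dominate the posterior. Illustrative only.
def decayed_update(a: float, b: float, hit: float, gamma: float = 0.95) -> tuple[float, float]:
    """Decay existing counts by gamma, then add the new resolution (hit in [0, 1])."""
    return gamma * a + hit, gamma * b + (1.0 - hit)
```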
Prepared March 24, 2026 by Unmitigated Wisdom. Dataset v23 · 501 predictions · 998 citations · 12 languages · 30 categories. Methodology: Bayesian Beta-posterior weighting with Brier-scored validation against a 30-event held-out set.