The thesis
Tamil Nadu 2026 is a contest of arithmetic. Two closely matched traditional alliances, DMK SPA and AIADMK NDA, will absorb most votes. Two standalone parties, TVK and NTK, will pull meaningful vote share without winning many seats. Two faction spoilers, Sasikala’s AIPTMMK and Ramadoss’s AJPK, will fragment specific caste communities in specific constituencies.
The model treats this as a structural problem. Each of the 234 constituencies is modeled separately. Multi-cycle vote share data establishes a baseline. Demographic features — caste composition, age structure — modulate it. Specific 2026 candidates and locked seats override it where personal vote matters more than the structural average.
Then four different model architectures — built on different theoretical premises — are run independently, and their predictions triangulated. Where they agree, the forecast is confident. Where they disagree, the seat is genuinely uncertain.
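The triangulation step can be pictured with a minimal sketch. The function name `triangulate`, the model keys, and the two-level "confident"/"uncertain" labelling are illustrative assumptions, not the project's actual code; the sample predictions are made up.

```python
from collections import Counter

def triangulate(predictions):
    """Compare the winner each model predicts for one constituency.

    predictions: dict of model name -> predicted winning alliance.
    Returns (most common winner, number of models backing it, label).
    The label rule is an assumption: unanimous means confident,
    anything else means uncertain.
    """
    counts = Counter(predictions.values())
    winner, votes = counts.most_common(1)[0]
    label = "confident" if votes == len(predictions) else "uncertain"
    return winner, votes, label

# Hypothetical predictions for a single seat (illustrative values only).
seat = {
    "v1.6": "DMK SPA",
    "v2_stage1": "DMK SPA",
    "v2_stage2": "NDA",
    "bayesian": "DMK SPA",
}
result = triangulate(seat)
print(result)
```

The agreement count is returned alongside the label so that a 3-of-4 split can be told apart from a 2-2 split when reading the per-constituency table.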
Data inventory
All inputs are public. The model uses seven cycles of constituency-level voting data spanning 15 years — the longest reliable AC-mapped window after the 2008 delimitation that redrew Tamil Nadu’s constituencies.
| Cycle | Type | Role |
|---|---|---|
| 2009 | Lok Sabha | Backtest input + bottom-up coefficient |
| 2011 | Assembly | Backtest target + coefficient |
| 2014 | Lok Sabha | Backtest input + coefficient |
| 2016 | Assembly | Backtest target + coefficient |
| 2019 | Lok Sabha | Backtest input + v1.6 component |
| 2021 | Assembly | Backtest target + coefficient |
| 2024 | Lok Sabha | Primary anchor (v1.6) + coefficient |
Earlier cycles — the 2001 and 2006 Assembly elections — were used at the district level only. Tamil Nadu’s 2008 delimitation reshuffled boundaries enough that 74 of today’s 234 constituencies cannot be cleanly mapped to a single pre-2008 predecessor. District-level aggregates from those years still inform long-term partisanship signals.
The model also uses the 2026 voter roll (after the special intensive revision), the official candidate list, per-constituency caste composition data, and age cohort distributions.
[Figure: Statewide vote share, 2009–2024]
Four-model architecture
Why four models instead of one? Because election forecasting is not a problem with a correct architecture. Each of the four below makes different theoretical assumptions about how voters behave. Their disagreements are themselves data — they tell us where uncertainty is real.
Top-down anchor (v1.6)
Starts from the most recent Lok Sabha 2024 vote shares per AC and projects forward using a transition matrix that accounts for alliance reconfigurations and structural shifts. Validated against three previous LS → Assembly cycle pairs. Captures recent voter mood; less sensitive to long-term structure.
Bottom-up structural (v2 Stage 1)
Per-AC alliance shares averaged across seven cycles, weighted toward more recent ones. Then 2026-specific factors layered on: TVK injection, AIPTMMK and AJPK spoiler effects, sitting-MLA personal vote, and locked seats. Captures the structural floor each alliance can rely on.
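A recency-weighted average of this kind might look like the sketch below. The geometric decay and its value of 0.8 are assumptions chosen for illustration; the source does not state the weighting scheme, and the sample shares are invented.

```python
CYCLES = [2009, 2011, 2014, 2016, 2019, 2021, 2024]

def recency_weights(n, decay=0.8):
    """Normalised weights for n cycles, oldest first.

    Assumption: each older cycle counts `decay` times as much as the next one.
    """
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]

def structural_baseline(shares_by_cycle, decay=0.8):
    """Weighted mean of one alliance's share (in %) in one AC, oldest cycle first."""
    weights = recency_weights(len(shares_by_cycle), decay)
    return sum(w * s for w, s in zip(weights, shares_by_cycle))

# Made-up shares for one alliance in one AC across the seven cycles.
example_shares = [38.0, 36.5, 30.0, 39.0, 48.0, 45.0, 46.0]
baseline = structural_baseline(example_shares)
print(round(baseline, 2))
```

With the weights tilted toward recent cycles, the baseline sits above the plain mean whenever the alliance has been gaining, which is the behaviour the text describes.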
Bottom-up + mood (v2 Stage 2)
Stage 1 plus a statewide overlay representing the net effect of incumbency drag, scheme delivery, and corruption rhetoric. Calibrated at −2pp on the incumbent (DMK SPA), with the share it loses redistributed proportionally among the other parties.
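As a rough sketch of the overlay arithmetic: the incumbent's share moves by the calibrated shift, and the other parties absorb the difference in proportion to their existing shares. The function name `apply_mood`, the reading of "proportionally" as proportional to current share, and the sample shares are assumptions.

```python
def apply_mood(shares, incumbent="DMK SPA", shift_pp=-2.0):
    """Shift the incumbent by shift_pp and spread the difference over the rest.

    shares: dict of alliance -> vote share in percentage points.
    Assumption: the redistributed share is split in proportion to each
    other party's current share, so the total is unchanged.
    """
    others_total = sum(v for k, v in shares.items() if k != incumbent)
    adjusted = {}
    for party, share in shares.items():
        if party == incumbent:
            adjusted[party] = share + shift_pp
        else:
            adjusted[party] = share - shift_pp * share / others_total
    return adjusted

# Made-up Stage 1 shares for one AC (illustrative only).
stage1 = {"DMK SPA": 42.0, "NDA": 38.0, "TVK": 14.0, "NTK": 6.0}
stage2 = apply_mood(stage1)
print({k: round(v, 2) for k, v in stage2.items()})
```

In this toy case DMK SPA drops from 42 to 40, and the largest rival picks up the largest slice of the 2 points.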
Bayesian hierarchical
Three-level pooling: state → district → constituency. Each AC’s 2026 share is a draw from its district’s distribution, which is itself a draw from the statewide one. Mathematically rigorous; deliberately ignorant of personal-vote candidates — the gap between this model and the others quantifies how much individual candidates matter.
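The three-level generative story can be simulated in a few lines. This sketch only draws from the hierarchy; it does not fit a posterior. The Gaussian form, the two standard deviations, and the function name are assumptions made for illustration.

```python
import random

def draw_constituency_shares(state_mean, n_districts, acs_per_district,
                             district_sd=3.0, ac_sd=4.0, seed=0):
    """Simulate state -> district -> constituency draws for one alliance.

    Assumptions: normal distributions at both levels, with hypothetical
    spreads district_sd and ac_sd (in percentage points).
    Returns one list of AC shares per district.
    """
    rng = random.Random(seed)
    districts = []
    for _ in range(n_districts):
        # Each district's mean is a draw around the statewide mean.
        district_mean = rng.gauss(state_mean, district_sd)
        # Each AC's share is a draw around its district's mean.
        acs = [rng.gauss(district_mean, ac_sd) for _ in range(acs_per_district)]
        districts.append(acs)
    return districts

draws = draw_constituency_shares(state_mean=40.0, n_districts=3, acs_per_district=5)
for district in draws:
    print([round(x, 1) for x in district])
```

Because every AC is tied to its district and every district to the state, a seat with thin or noisy history is pulled toward its neighbours, which is the effect pooling is meant to have.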
The factor system
Seven factors modulate the structural baseline. They’re scenario-tunable, which is how we generate the nine deterministic scenarios and the Monte Carlo distribution.
| Factor | What it captures | Setting |
|---|---|---|
| F1 | TVK statewide level | Three scenarios: 11% / 15% / 19% |
| F2 | TVK regional intensity | Per-AC multiplier from age and caste structure |
| F3 | DMK schemes effectiveness | Statewide ±2pp range |
| F4 | AIPTMMK (Sasikala) spoiler | 78 contested ACs, per-AC propensity |
| F5 | AJPK (Ramadoss) spoiler | 29 contested ACs, per-AC propensity |
| F6 | OPS Bodi local effect | AC 200 only, +5pp DMK SPA |
| F7 | Stronghold + personal vote + freeze | 200 ACs flagged objectively, 98 manually locked |
The freeze list
Of 234 constituencies, 98 were “frozen” at +10pp toward a specific party based on local domain knowledge that the historical model can’t capture: known strong candidates, known weak incumbents, known faction-controlled seats, family-politics seats. The freeze applies to the bottom-up models only; v1.6 and the Bayesian model run free of it, deliberately, so we can compare what local knowledge contributes.
The composition: 51 DMK SPA seats (DMK 42, Congress 2, DMDK 2, VCK 1, CPI 1, CPI(M) 3), 45 NDA seats (AIADMK 36, BJP 6, PMK 3), and 2 TVK seats (Perambur and Gobichettipalayam, the latter because Sengottaiyan’s personal vote follows him to TVK). The list itself is the human contribution to an otherwise data-driven model.
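The composition figures can be cross-checked with a short tally. The counts below are taken from the paragraph above; the dictionary layout and names are only an illustrative way of holding them.

```python
# Freeze-list composition as stated in the text.
FREEZE_LIST = {
    "DMK SPA": {"DMK": 42, "Congress": 2, "DMDK": 2, "VCK": 1, "CPI": 1, "CPI(M)": 3},
    "NDA": {"AIADMK": 36, "BJP": 6, "PMK": 3},
    "TVK": {"TVK": 2},
}
TOTAL_SEATS = 234

def bloc_totals(freeze_list):
    """Sum the frozen seats inside each bloc."""
    return {bloc: sum(parties.values()) for bloc, parties in freeze_list.items()}

totals = bloc_totals(FREEZE_LIST)
frozen = sum(totals.values())
unfrozen = TOTAL_SEATS - frozen
print(totals, frozen, unfrozen)
```

The party counts add up to the stated bloc totals and to 98 frozen seats, leaving 136 constituencies where the bottom-up models run without a manual nudge.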
Backtest validation
The architecture was tested on three Lok Sabha → Assembly cycle pairs — very different elections, on purpose:
| Cycle pair | AC match | Wave context |
|---|---|---|
| 2009 LS → 2011 AC | 87.2% | AIADMK wave (anti-DMK) |
| 2014 LS → 2016 AC | 66.4% | AIADMK incumbent re-elected |
| 2019 LS → 2021 AC | 66.2% | DMK wave (post-Jayalalithaa) |
| Average | 73.3% | — |
The architecture handles wave elections cleanly (87% in 2011) and struggles more when the pattern is mixed (~66%). The forecast consistently understates the opposition by 3–4 percentage points. The likely cause is mean reversion: when one alliance dominates a Lok Sabha cycle, the next Assembly election tends to compress that gap. We’ve calibrated this back into the Monte Carlo as a +3pp NDA correction.
Industry baseline: best academic models on Indian elections achieve 65–75% AC match. Our 73.3% sits within that range.
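For readers who want the metric spelled out, here is a minimal sketch. It assumes "AC match" means the share of constituencies where the predicted winner equals the actual winner, which the text implies but does not define. The three percentages are the reported figures from the table.

```python
def ac_match_rate(predicted, actual):
    """Percentage of constituencies whose predicted winner matches the actual one.

    predicted, actual: dicts of AC number -> winning alliance.
    Only ACs present in both dicts are scored.
    """
    common = predicted.keys() & actual.keys()
    hits = sum(predicted[ac] == actual[ac] for ac in common)
    return 100.0 * hits / len(common)

# Reported backtest results from the table above.
reported = {"2009 LS -> 2011 AC": 87.2, "2014 LS -> 2016 AC": 66.4, "2019 LS -> 2021 AC": 66.2}
average = round(sum(reported.values()) / len(reported), 1)
print(average)
```

The unweighted mean of the three pairs reproduces the 73.3% in the table's last row.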
What features matter
A separate Random Forest model was trained — not to forecast, but to rank features by predictive power. Training set: predicting the 2021 Assembly outcome from caste, age, and lagged historical shares.
| Feature category | DMK SPA | NDA | NTK |
|---|---|---|---|
| Lagged historical shares | 61.8% | 66.5% | 71.0% |
| Caste composition | 23.1% | 19.7% | 19.9% |
| Age cohorts | 15.1% | 13.8% | 9.1% |
History dominates. Lagged vote shares account for roughly two-thirds of the model’s feature importance (62–71%, depending on the alliance). Caste composition is the second strongest signal, but it is not independent of history: constituencies that lean a particular way structurally tend to have stable caste compositions, so the two signals carry overlapping information.
Monte Carlo
Each of the four models gives a single point prediction. The Monte Carlo varies all the parameters — TVK level, scheme effectiveness, anti-incumbency intensity, mean reversion, per-AC noise — according to plausible distributions, then runs the v2 Stage 2 model 1,000 times. The result is a probability distribution over seat counts instead of a point estimate. The 5th and 95th percentiles bracket the “90% confidence interval” you’ll see in the predictions section.
Three additional scenarios — DMK-optimistic, NDA-optimistic, TVK-breakthrough — run the same machinery with shifted parameter means, to test what would have to be true for the result to surprise.
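The Monte Carlo loop can be sketched as follows. Everything inside `toy_seat_model` is a hypothetical stand-in for v2 Stage 2, including the base margin and the noise scale; the sampling distributions are also assumptions, apart from the TVK levels and the ±2pp scheme range, which come from the factor table.

```python
import random
import statistics

def toy_seat_model(tvk_level, scheme_effect, mood, reversion, rng, n_seats=234):
    """Hypothetical stand-in for v2 Stage 2: counts seats won by one alliance.

    The statewide margin formula and the 8pp per-AC noise are invented
    for illustration and carry no forecasting meaning.
    """
    margin = 4.0 + scheme_effect + mood - reversion - 0.2 * (tvk_level - 15.0)
    return sum(1 for _ in range(n_seats) if margin + rng.gauss(0.0, 8.0) > 0)

def run_monte_carlo(n_runs=1000, seed=42):
    """Sample the parameters, run the model once per sample, collect seat counts."""
    rng = random.Random(seed)
    seats = []
    for _ in range(n_runs):
        tvk = rng.choice([11.0, 15.0, 19.0])  # F1 levels from the factor table
        scheme = rng.uniform(-2.0, 2.0)       # F3 statewide range
        mood = rng.gauss(-2.0, 1.0)           # assumed spread around the -2pp overlay
        reversion = rng.gauss(3.0, 1.0)       # assumed spread around the +3pp correction
        seats.append(toy_seat_model(tvk, scheme, mood, reversion, rng))
    return seats

def percentile_interval(samples, lo=5, hi=95):
    """Return the lo-th and hi-th percentiles of the samples."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return cuts[lo - 1], cuts[hi - 1]

seats = run_monte_carlo()
low, high = percentile_interval(seats)
print(low, statistics.median(seats), high)
```

A shifted-mean scenario such as the three named above would, in this sketch, amount to changing the centres of the `mood`, `scheme`, or `tvk` draws and re-running the same loop.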
Limitations
- TVK seat ceiling is likely understated. All four models predict 0–3 TVK seats. A celebrity-led party at 14–19% statewide could plausibly win more in a few concentrated pockets the model can’t see.
- The freeze list is a +10pp nudge, not an absolute lock. Some frozen seats may flip in extreme parameter combinations.
- Mood overlay is statewide, not per-AC. Anti-incumbency really varies by constituency. A future version would model it locally.
- Three backtest pairs is small. Adding bye-election validation (Vikravandi 2024, Erode East 2025) would strengthen calibration.
- Pre-2008 data was used at district level only due to delimitation, which means 74 ACs are missing 8 years of constituency-level signal.
- Random Forest measures predictive power, not causation. A feature ranking high means it correlates with outcomes — not that it drives them.
These are stated up front because they shape what should be believed and what should be doubted in the predictions.
The final ensemble seat counts, the Monte Carlo seat distribution, the per-constituency 4-model agreement table, and the predicted-versus-actual validation appear here once counting begins. The methodology you’ve just read is the contract: that is what produced the numbers you’ll see.