Literature Readings · Prediction Markets · Paper A · Pipeline
Paper A Data Pipeline — Concrete Scoping
How to actually build the SWZ-redux infrastructure in 8 weeks
Concrete build plan · May 19, 2026
🎯 Goal
Build the data infrastructure for Paper A by Week 8. Success criterion: by end of Week 8, can produce a working event-study regression for the July 13, 2024 Trump assassination attempt — minute-level PolyMarket Trump probability + intraday S&P 500 sector ETF returns + computed Spec 1 coefficients matching qualitative expectations.
This document is for someone who needs to actually go build the thing. It assumes Python / SQL / data engineering competence. All data sources are public or standard-academic-subscription.
Architecture Diagram
1. PolyMarket on-chain
Cong et al. Apr 2026 dataset (primary) · Polygon RPC + The Graph (backup) · Goldsky / Dune (intermediate)
2. Kalshi
Academic API (apply early) · Public quarterly reports (fallback)
3. Asset prices (intraday)
WRDS TAQ (US equities, 1-sec) · Bloomberg / Refinitiv (FX, rates, commodities) · CBOE (VIX) · CoinGecko (crypto)
4. News timing
Bloomberg Event DB · RavenPack · Manual 40-event compilation
5. Industry exposure
Compustat (tax) · OpenSecrets (lobbying) · Atkin-Khandelwal (trade) · USCIS H-1B · 10-K text mining
6. Master panel
PostgreSQL / DuckDB master schema · event × asset × minute resolution
1. PolyMarket Data Layer
1.1 Primary source: Cong et al. Apr 2026 dataset
The Lin William Cong dataset release (arXiv 2604.20421) is the canonical research infrastructure. Pull from:
# GitHub release for the Cong et al. dataset (check arxiv abs for canonical link) # Likely structure: HuggingFace Datasets repo + S3 mirror from datasets import load_dataset ds = load_dataset("cornell-cong/polymarket-2020-2026") markets = ds["markets"] # 770K+ markets with metadata fills = ds["fills"] # 943M trade fills oracle = ds["oracle"] # 2M oracle resolution events # Filter to 2024 US election markets election_markets = markets.filter( lambda m: "trump" in m["slug"].lower() or "harris" in m["slug"].lower() or "election" in m["slug"].lower() )
1.2 Backup: Polygon RPC direct
If Cong dataset has gaps or you need fresh data post-Mar 2026:
# Polygon mainnet RPC — free tier available via Alchemy / Infura / QuickNode # PolyMarket contracts on Polygon: # CTF (Gnosis ConditionalTokens): 0x4D97DCd97eC945f40cF65F87097ACe5EA0476045 # Polymarket Exchange: 0x4bFb41d5B3570DeFd03C39a9A4D8dE6Bd8B8982E # USDC (settlement): 0x2791Bca1f2de4661ED88A30C99A7a9449Aa84174 from web3 import Web3 w3 = Web3(Web3.HTTPProvider("https://polygon-mainnet.g.alchemy.com/v2/YOUR_KEY")) # Pull all Trade events from PolyMarket exchange contract exchange = w3.eth.contract(address="0x4bFb...82E", abi=EXCHANGE_ABI) trades = exchange.events.Trade.get_logs(fromBlock=BLOCK_2024_START, toBlock=BLOCK_2024_END)
1.3 Intermediate: Goldsky / Dune Analytics
If you want SQL access without managing the indexing infrastructure yourself:
| Goldsky | Indexed Polymarket subgraph; SQL access; ~$200/month for academic tier; fastest queries |
| Dune Analytics | Free for queries; pre-built Polymarket tables; rate-limited but workable |
| The Graph | Free decentralized indexer; GraphQL queries; slower than Goldsky but free |
-- Sample Dune query: minute-level Trump win price during 2024 SELECT date_trunc('minute', evt_block_time) AS minute, avg(price) AS trump_price, sum(size) AS minute_volume FROM polymarket_polygon.exchange_evt_OrderFilled WHERE market_slug = 'will-trump-win-the-2024-us-presidential-election' AND evt_block_time BETWEEN '2024-01-01' AND '2024-11-30' GROUP BY 1 ORDER BY 1;
1.4 What we need from PolyMarket
- Minute-level Trump probability time series, Jan 1 – Nov 30, 2024. Mid-quote / VWAP per minute.
- Trade-level fills: wallet (anonymized but consistent), market, position, price, size, timestamp.
- Identified whale wallets: Fredi9999, Theo4, Theo3, etc. (Chainalysis 11 wallets) — verify match against published list.
- Sirolly-Sethi wash-trading cluster wallets (43K wallets) — request from authors or replicate methodology.
- Top-1% volume wallets per Akey et al. — derive from trade-level data.
2. Kalshi Data Layer
2.1 Academic API access
Kalshi offers academic API access. Apply on Day 1 — typical 3-week turnaround.
- Apply:
research@kalshi.comwith institutional email, research statement, and IRB / institutional letter - Expected response: 2-4 weeks
- Access: read-only API + historical CSV dumps
- Cost: free for academic research
2.2 What we need from Kalshi
# Sample API call (once access granted) import requests resp = requests.get( "https://trading-api.kalshi.com/trade-api/v2/markets", headers={"Authorization": f"Bearer {API_KEY}"}, params={"event_ticker": "PRESPARTY-24", "status": "settled"} ) markets = resp.json()["markets"] # Returns Kalshi presidential party markets, settled prices, etc. # For trade-level data: for ticker in ["PRESPARTY-24-D", "PRESPARTY-24-R"]: trades = requests.get( f"https://trading-api.kalshi.com/trade-api/v2/markets/{ticker}/trades", headers={"Authorization": f"Bearer {API_KEY}"} ).json()
2.3 Public fallback
If academic API delays:
- Kalshi public market data via web scraping (rate-limited but workable)
- Kalshi quarterly volume reports (less granular)
- Third-party Kalshi data via Dune Analytics or similar
3. Asset Prices Layer
3.1 WRDS subscription (mandatory)
Standard academic subscription gives access to TAQ (1-second equity trades) + Compustat + CRSP. Your institution likely already has WRDS.
# WRDS Python connection import wrds db = wrds.Connection(wrds_username="your_id") # TAQ intraday data for major sector ETFs around July 13, 2024 (Trump assassination) sector_etfs = ["SPY", "XLF", "XLE", "XLV", "ITA", "XLI", "XLY"] taq_query = f""" SELECT date, time_m, sym_root, price, size FROM taqm_2024.ctm_20240713 WHERE sym_root IN ({','.join(f"'{s}'" for s in sector_etfs)}) AND time_m BETWEEN '12:00:00' AND '23:00:00' ORDER BY time_m """ trades = db.raw_sql(taq_query) # For Trump assassination at ~6:11 PM ET (10:11 PM UTC), trades around that minute should show large moves
3.2 Bloomberg Terminal (mandatory for FX / commodities)
Bloomberg Terminal ~$24K/year. Standard academic infrastructure for top-5 paper. For FX intraday:
# Bloomberg Python (BLP API) — runs inside Terminal session from xbbg import blp mxn_intraday = blp.bdib( ticker="USDMXN Curncy", dt="2024-07-13", typ="BID_ASK", interval=1 # 1-minute bars ) # Returns DataFrame: time, open, high, low, close, volume per minute
3.3 Refinitiv as alternative
If no Bloomberg access, Refinitiv DataScope Tick History via WRDS provides similar FX/rates data at ~$3K/year additional cost.
3.4 Free fallbacks
| Crypto | CoinGecko + CryptoCompare (free, 1-minute history available) |
| Treasury yields | NY Fed daily; CME for intraday Treasury futures |
| VIX | CBOE direct download (free historical) |
| SPDR ETFs | Yahoo Finance / Alpha Vantage for daily; 1-minute via paid Alpha Vantage tier (~$50/month) |
4. News Timing Layer
4.1 The 40-event canonical timeline
Compile manually + verify against Bloomberg event database + Wikipedia + WSJ archive. Each event needs:
- Event timestamp (UTC, second-resolution where possible)
- Event type (debate, poll, conviction, assassination, dropout, election, etc.)
- Direction prior (expected to favor Trump or Harris)
- Magnitude prior (major / moderate / minor)
- News-flow control (what other macro news happened that day?)
# Sample event registry (CSV format) # event_id, date, time_utc, name, direction_prior, magnitude, notes 1, 2024-01-15, 23:30:00, "Iowa caucus", "trump+", major, "Trump wins" 2, 2024-05-30, 21:00:00, "Trump conviction NY", "harris+", moderate, "hush money case verdict" 3, 2024-06-27, 01:00:00, "Biden-Trump debate", "trump+", major, "Biden poor performance" 4, 2024-07-13, 22:11:00, "Trump assassination attempt #1", "trump+", major, "Butler PA rally" 5, 2024-07-21, 17:46:00, "Biden drops out", "harris+", major, "Truth Social post" # ... and so on for ~40 events
4.2 RavenPack for controls
RavenPack provides sentiment-scored news flow. Use to control for general news intensity during event windows. ~$10K/year academic access (split with other departments).
5. Industry Exposure Layer
For each S&P 500 firm, build pre-determined exposure to each of 5 Trump-policy channels. Use 2017-2019 data only (pre-COVID, pre-2024) to avoid look-ahead.
5.1 Tax exposure (Compustat)
# Compustat: 2017-2019 avg effective tax rate per firm tax_query = """ SELECT gvkey, datadate, txt / pi AS effective_tax_rate FROM comp.funda WHERE fyear IN (2017, 2018, 2019) AND indfmt = 'INDL' AND consol = 'C' AND popsrc = 'D' """ etr = db.raw_sql(tax_query) firm_etr = etr.groupby("gvkey")["effective_tax_rate"].mean() # Trump tax exposure: higher ETR = more to gain from cuts firm_tax_exposure = firm_etr * 1.0 # Linear in ETR
5.2 Regulatory exposure (OpenSecrets + Federal Register)
# OpenSecrets bulk download — 2017-2019 lobbying spend per industry NAICS # https://www.opensecrets.org/industries/bulk-data import pandas as pd lobbying = pd.read_csv("opensecrets_lobby_2017_2019.csv") industry_lobby = lobbying.groupby("naics4")["amount"].sum() # Federal Register mentions per industry (via FR API) import requests fr_query = requests.get( "https://www.federalregister.gov/api/v1/documents.json", params={"per_page": 1000, "conditions[publication_date][gte]": "2017-01-01", "conditions[publication_date][lte]": "2019-12-31"} ).json() # Match FR documents to industries via topic tags / NAICS classification firm_reg_exposure = ...
5.3 Trade exposure (Atkin-Khandelwal)
Use the Atkin-Khandelwal trade-exposure dataset (available from authors) or replicate from Census trade data + Compustat foreign-sales segments.
5.4 Immigration exposure (USCIS + BLS)
USCIS H-1B annual filings + Bureau of Labor Statistics industry data for unauthorized-worker share. Construct per-NAICS immigration dependency.
5.5 Geopolitical exposure (Bloomberg + 10-K)
Bloomberg Government for defense contract revenue share. SEC EDGAR 10-K filings + NLP to extract "geographic risk" Item 1A text from 2017-2019 annual reports.
6. Storage Architecture
6.1 Recommended: PostgreSQL + DuckDB
- PostgreSQL for the master schema (events, firms, exposures — relational tables, moderate size)
- DuckDB for analytic queries on time-series data (TAQ + PolyMarket fills + intraday FX). Columnar, fast, embedded.
- Parquet for raw historical pulls (TAQ ~50GB, PolyMarket fills ~30GB, FX ~10GB)
# Sample DuckDB query joining PolyMarket × intraday TAQ import duckdb con = duckdb.connect("paperA.duckdb") # Register pre-computed parquet files con.execute("CREATE VIEW pm AS SELECT * FROM 'polymarket_1min.parquet'") con.execute("CREATE VIEW taq AS SELECT * FROM 'taq_2024.parquet'") # Event-window join for July 13, 2024 Trump assassination result = con.execute(""" SELECT pm.minute, pm.trump_price, taq.symbol, taq.price AS asset_price FROM pm JOIN taq ON pm.minute = taq.minute WHERE pm.minute BETWEEN '2024-07-13 21:55:00' AND '2024-07-13 22:45:00' """).df()
6.2 Total size estimate
| PolyMarket fills (parquet) | ~30 GB |
| TAQ 2024 (parquet) | ~50 GB |
| FX + commodities intraday | ~10 GB |
| FOMC text + news | ~5 GB |
| Industry exposure tables | ~1 GB |
| Total | ~100-150 GB |
Fits comfortably on a single workstation with ~500GB SSD. Cloud option: AWS S3 + Athena for ~$50/month total.
7. Build Phases (8 Weeks)
| Week | Phase | Deliverable |
|---|---|---|
| 1 | PolyMarket setup | Cong dataset downloaded + parsed; backup Polygon RPC + The Graph subgraph access; Goldsky account; first SQL query showing daily Trump-win probability time series Jan-Nov 2024 |
| 2 | PolyMarket → minute resolution | Minute-level Trump probability time series; whale wallet identification verified against Chainalysis list; trade-flow aggregates per minute |
| 3 | Kalshi + asset price connection | Kalshi academic API access (or fallback); WRDS TAQ for sector ETFs around all 9 major events; Bloomberg / Refinitiv pull for FX, rates, commodities |
| 4 | News timing + event registry | 40-event canonical timeline CSV with second-resolution timestamps verified across sources; RavenPack news intensity for control |
| 5 | Industry exposure (tax + reg) | Pre-determined TaxExposure_i (Compustat) and RegExposure_i (OpenSecrets) for all S&P 500 firms; validation on 2016 election |
| 6 | Industry exposure (trade + immig + geo) | TradeExposure_i, ImmigrationExposure_i, GeopoliticalExposure_i; composite TrumpExposure_i constructed; pre-registered at OSF |
| 7 | Master panel | PostgreSQL + DuckDB schema joining all layers; event × asset × minute master table; quality checks for missing data |
| 8 | First specification | Working Spec 1 regression for July 13 Trump assassination event. Shows expected positive β for defense (ITA), oil (XLE), regional banks (KRE); negative β for renewables (TAN), Chinese ADRs. |
8. Cost Estimate
| Item | Cost | Notes |
|---|---|---|
| WRDS subscription | $0 | Already via university (assumed) |
| Bloomberg Terminal | $24K/yr | 1 license; typical academic infrastructure |
| Refinitiv (alternative to Bloomberg) | $3K/yr | Via WRDS DataScope add-on |
| RavenPack news sentiment | $5-10K/yr | Optional; can use free Bloomberg news count instead |
| Goldsky academic tier | ~$2K/yr | Optional; The Graph + Dune are free alternatives |
| Alpha Vantage premium (crypto/equity) | $50/month | For intraday equity if no Bloomberg |
| AWS S3 + compute | $50-200/month | If using cloud rather than workstation |
| Vivvix / WMP (for Paper B) | $3-5K one-time | Paper B only; Paper A doesn't need |
| Total Paper A min | ~$5K | Refinitiv only, no Bloomberg, free fallbacks |
| Total Paper A typical | ~$30K | Bloomberg + standard add-ons; worth it for top-5 |
9. First Concrete Deliverable: July 13, 2024 Event Study
By end of Week 8, produce this working analysis as proof-of-pipeline.
🎯 The July 13, 2024 Test
The Trump assassination attempt occurred at 6:11 PM ET on Saturday, July 13, 2024 in Butler, PA. PolyMarket Trump probability moved from ~60% to ~70% within minutes. Markets opened Monday July 15.
The clean test: Did defense ETFs (ITA), oil ETFs (XLE), and regional bank ETFs (KRE) gap up at Monday open? Did renewable ETFs (TAN, ICLN) gap down? Did the Mexican peso depreciate?
# Concrete first deliverable spec # Step 1: PolyMarket Trump probability at minute resolution around July 13 pm = con.execute(""" SELECT minute, trump_price FROM pm WHERE minute BETWEEN '2024-07-13 22:00:00' AND '2024-07-15 17:00:00' """).df() # Step 2: Compute the "treatment intensity" delta_pm = pm[pm["minute"] == "2024-07-15 14:30:00"]["trump_price"].iloc[0] - \ pm[pm["minute"] == "2024-07-13 22:10:00"]["trump_price"].iloc[0] # Expected ≈ +0.10 (10 pp shift) # Step 3: Compute Monday open vs Friday close returns for sector ETFs sectors = ["ITA", "XLE", "XLF", "XLV", "KRE", "TAN", "ICLN", "SPY"] returns = compute_jump_returns(sectors, friday_close="2024-07-12", monday_open="2024-07-15") # Step 4: Implied β = ΔY / ΔPM betas = returns / delta_pm # Expected signs: # β_ITA > 0 (defense — Trump favors military spending) # β_XLE > 0 (energy — drilling deregulation) # β_KRE > 0 (regional banks — Trump deregulation) # β_TAN < 0 (solar — Trump rolls back IRA) # β_ICLN < 0 (clean energy) # β_MXN > 0 (USD/MXN — peso depreciates on tariff fears, so USDMXN UP) # β_SPY: small but positive (modest tax-cut effect) print(f"July 13 event: ΔPM = {delta_pm:.4f}") for sector, β in zip(sectors, betas): print(f" β_{sector} = {β:.4f}")
✅ Success criterion
If by end of Week 8 you can produce this output with reasonable estimates (β_ITA ≈ +0.05, β_TAN ≈ -0.10, β_MXN ≈ +0.03, etc.), the pipeline works and Paper A is realistic. If you can't get reasonable estimates, the issue is either (a) data quality somewhere in the pipeline, (b) the July 13 event is not as clean as expected, or (c) the assumption that PolyMarket shocks cause asset moves is wrong. All three are useful information.
From here: Generalize to all 40 events, all asset cross-sections, all three identification strategies. Months 3-8 of the proposal timeline.
Concrete build plan · Generated May 19, 2026