🔒

Literature Readings

Reading notes — please enter access password

✗ Incorrect password

Literature Readings

Paper A — Data Pipeline

←Prediction Markets ←Paper A Proposal

Sections

Overview & goals Architecture diagram 1. PolyMarket layer 2. Kalshi layer 3. Asset prices layer 4. News timing layer 5. Industry exposure layer 6. Storage 7. Build phases 8. Cost estimate 9. First deliverable

Related Papers

🥇Paper A 🥈Paper B 🥉Paper C

Literature Readings · Prediction Markets · Paper A · Pipeline

Paper A Data Pipeline — Concrete Scoping

How to actually build the SWZ-redux infrastructure in 8 weeks

Concrete build plan · May 19, 2026

🎯 Goal

Build the data infrastructure for Paper A by Week 8. Success criterion: by end of Week 8, can produce a working event-study regression for the July 13, 2024 Trump assassination attempt — minute-level PolyMarket Trump probability + intraday S&P 500 sector ETF returns + computed Spec 1 coefficients matching qualitative expectations.

This document is for someone who needs to actually go build the thing. It assumes Python / SQL / data engineering competence. All data sources are public or standard-academic-subscription.

Architecture Diagram

1. PolyMarket on-chain

Cong et al. Apr 2026 dataset (primary) · Polygon RPC + The Graph (backup) · Goldsky / Dune (intermediate)

↓

2. Kalshi

Academic API (apply early) · Public quarterly reports (fallback)

↓

3. Asset prices (intraday)

WRDS TAQ (US equities, 1-sec) · Bloomberg / Refinitiv (FX, rates, commodities) · CBOE (VIX) · CoinGecko (crypto)

↓

4. News timing

Bloomberg Event DB · RavenPack · Manual 40-event compilation

↓

5. Industry exposure

Compustat (tax) · OpenSecrets (lobbying) · Atkin-Khandelwal (trade) · USCIS H-1B · 10-K text mining

↓

6. Master panel

PostgreSQL / DuckDB master schema · event × asset × minute resolution

1. PolyMarket Data Layer

1.1 Primary source: Cong et al. Apr 2026 dataset

The Lin William Cong dataset release (arXiv 2604.20421) is the canonical research infrastructure. Pull from:

# GitHub release for the Cong et al. dataset (check arxiv abs for canonical link)
# Likely structure: HuggingFace Datasets repo + S3 mirror

from datasets import load_dataset

ds = load_dataset("cornell-cong/polymarket-2020-2026")
markets = ds["markets"]   # 770K+ markets with metadata
fills = ds["fills"]        # 943M trade fills
oracle = ds["oracle"]      # 2M oracle resolution events

# Filter to 2024 US election markets
election_markets = markets.filter(
    lambda m: "trump" in m["slug"].lower()
            or "harris" in m["slug"].lower()
            or "election" in m["slug"].lower()
)

1.2 Backup: Polygon RPC direct

If Cong dataset has gaps or you need fresh data post-Mar 2026:

# Polygon mainnet RPC — free tier available via Alchemy / Infura / QuickNode
# PolyMarket contracts on Polygon:
#   CTF (Gnosis ConditionalTokens):  0x4D97DCd97eC945f40cF65F87097ACe5EA0476045
#   Polymarket Exchange:             0x4bFb41d5B3570DeFd03C39a9A4D8dE6Bd8B8982E
#   USDC (settlement):               0x2791Bca1f2de4661ED88A30C99A7a9449Aa84174

from web3 import Web3
w3 = Web3(Web3.HTTPProvider("https://polygon-mainnet.g.alchemy.com/v2/YOUR_KEY"))

# Pull all Trade events from PolyMarket exchange contract
exchange = w3.eth.contract(address="0x4bFb...82E", abi=EXCHANGE_ABI)
trades = exchange.events.Trade.get_logs(fromBlock=BLOCK_2024_START, toBlock=BLOCK_2024_END)

1.3 Intermediate: Goldsky / Dune Analytics

If you want SQL access without managing the indexing infrastructure yourself:

GoldskyIndexed Polymarket subgraph; SQL access; ~$200/month for academic tier; fastest queries
Dune AnalyticsFree for queries; pre-built Polymarket tables; rate-limited but workable
The GraphFree decentralized indexer; GraphQL queries; slower than Goldsky but free
-- Sample Dune query: minute-level Trump win price during 2024
SELECT
  date_trunc('minute', evt_block_time) AS minute,
  avg(price) AS trump_price,
  sum(size) AS minute_volume
FROM polymarket_polygon.exchange_evt_OrderFilled
WHERE market_slug = 'will-trump-win-the-2024-us-presidential-election'
  AND evt_block_time BETWEEN '2024-01-01' AND '2024-11-30'
GROUP BY 1
ORDER BY 1;

1.4 What we need from PolyMarket

  • Minute-level Trump probability time series, Jan 1 – Nov 30, 2024. Mid-quote / VWAP per minute.
  • Trade-level fills: wallet (anonymized but consistent), market, position, price, size, timestamp.
  • Identified whale wallets: Fredi9999, Theo4, Theo3, etc. (Chainalysis 11 wallets) — verify match against published list.
  • Sirolly-Sethi wash-trading cluster wallets (43K wallets) — request from authors or replicate methodology.
  • Top-1% volume wallets per Akey et al. — derive from trade-level data.

2. Kalshi Data Layer

2.1 Academic API access

Kalshi offers academic API access. Apply on Day 1 — typical 3-week turnaround.

  • Apply: research@kalshi.com with institutional email, research statement, and IRB / institutional letter
  • Expected response: 2-4 weeks
  • Access: read-only API + historical CSV dumps
  • Cost: free for academic research

2.2 What we need from Kalshi

# Sample API call (once access granted)
import requests

resp = requests.get(
    "https://trading-api.kalshi.com/trade-api/v2/markets",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"event_ticker": "PRESPARTY-24", "status": "settled"}
)
markets = resp.json()["markets"]
# Returns Kalshi presidential party markets, settled prices, etc.

# For trade-level data:
for ticker in ["PRESPARTY-24-D", "PRESPARTY-24-R"]:
    trades = requests.get(
        f"https://trading-api.kalshi.com/trade-api/v2/markets/{ticker}/trades",
        headers={"Authorization": f"Bearer {API_KEY}"}
    ).json()

2.3 Public fallback

If academic API delays:

  • Kalshi public market data via web scraping (rate-limited but workable)
  • Kalshi quarterly volume reports (less granular)
  • Third-party Kalshi data via Dune Analytics or similar

3. Asset Prices Layer

3.1 WRDS subscription (mandatory)

Standard academic subscription gives access to TAQ (1-second equity trades) + Compustat + CRSP. Your institution likely already has WRDS.

# WRDS Python connection
import wrds
db = wrds.Connection(wrds_username="your_id")

# TAQ intraday data for major sector ETFs around July 13, 2024 (Trump assassination)
sector_etfs = ["SPY", "XLF", "XLE", "XLV", "ITA", "XLI", "XLY"]
taq_query = f"""
SELECT date, time_m, sym_root, price, size
FROM taqm_2024.ctm_20240713
WHERE sym_root IN ({','.join(f"'{s}'" for s in sector_etfs)})
  AND time_m BETWEEN '12:00:00' AND '23:00:00'
ORDER BY time_m
"""
trades = db.raw_sql(taq_query)
# For Trump assassination at ~6:11 PM ET (10:11 PM UTC), trades around that minute should show large moves

3.2 Bloomberg Terminal (mandatory for FX / commodities)

Bloomberg Terminal ~$24K/year. Standard academic infrastructure for top-5 paper. For FX intraday:

# Bloomberg Python (BLP API) — runs inside Terminal session
from xbbg import blp

mxn_intraday = blp.bdib(
    ticker="USDMXN Curncy",
    dt="2024-07-13",
    typ="BID_ASK",
    interval=1  # 1-minute bars
)
# Returns DataFrame: time, open, high, low, close, volume per minute

3.3 Refinitiv as alternative

If no Bloomberg access, Refinitiv DataScope Tick History via WRDS provides similar FX/rates data at ~$3K/year additional cost.

3.4 Free fallbacks

CryptoCoinGecko + CryptoCompare (free, 1-minute history available)
Treasury yieldsNY Fed daily; CME for intraday Treasury futures
VIXCBOE direct download (free historical)
SPDR ETFsYahoo Finance / Alpha Vantage for daily; 1-minute via paid Alpha Vantage tier (~$50/month)

4. News Timing Layer

4.1 The 40-event canonical timeline

Compile manually + verify against Bloomberg event database + Wikipedia + WSJ archive. Each event needs:

  • Event timestamp (UTC, second-resolution where possible)
  • Event type (debate, poll, conviction, assassination, dropout, election, etc.)
  • Direction prior (expected to favor Trump or Harris)
  • Magnitude prior (major / moderate / minor)
  • News-flow control (what other macro news happened that day?)
# Sample event registry (CSV format)
# event_id, date, time_utc, name, direction_prior, magnitude, notes
1, 2024-01-15, 23:30:00, "Iowa caucus", "trump+", major, "Trump wins"
2, 2024-05-30, 21:00:00, "Trump conviction NY", "harris+", moderate, "hush money case verdict"
3, 2024-06-27, 01:00:00, "Biden-Trump debate", "trump+", major, "Biden poor performance"
4, 2024-07-13, 22:11:00, "Trump assassination attempt #1", "trump+", major, "Butler PA rally"
5, 2024-07-21, 17:46:00, "Biden drops out", "harris+", major, "Truth Social post"
# ... and so on for ~40 events

4.2 RavenPack for controls

RavenPack provides sentiment-scored news flow. Use to control for general news intensity during event windows. ~$10K/year academic access (split with other departments).

5. Industry Exposure Layer

For each S&P 500 firm, build pre-determined exposure to each of 5 Trump-policy channels. Use 2017-2019 data only (pre-COVID, pre-2024) to avoid look-ahead.

5.1 Tax exposure (Compustat)

# Compustat: 2017-2019 avg effective tax rate per firm
tax_query = """
SELECT gvkey, datadate, txt / pi AS effective_tax_rate
FROM comp.funda
WHERE fyear IN (2017, 2018, 2019)
  AND indfmt = 'INDL'
  AND consol = 'C'
  AND popsrc = 'D'
"""
etr = db.raw_sql(tax_query)
firm_etr = etr.groupby("gvkey")["effective_tax_rate"].mean()

# Trump tax exposure: higher ETR = more to gain from cuts
firm_tax_exposure = firm_etr * 1.0  # Linear in ETR

5.2 Regulatory exposure (OpenSecrets + Federal Register)

# OpenSecrets bulk download — 2017-2019 lobbying spend per industry NAICS
# https://www.opensecrets.org/industries/bulk-data
import pandas as pd

lobbying = pd.read_csv("opensecrets_lobby_2017_2019.csv")
industry_lobby = lobbying.groupby("naics4")["amount"].sum()

# Federal Register mentions per industry (via FR API)
import requests
fr_query = requests.get(
    "https://www.federalregister.gov/api/v1/documents.json",
    params={"per_page": 1000, "conditions[publication_date][gte]": "2017-01-01",
            "conditions[publication_date][lte]": "2019-12-31"}
).json()

# Match FR documents to industries via topic tags / NAICS classification
firm_reg_exposure = ...

5.3 Trade exposure (Atkin-Khandelwal)

Use the Atkin-Khandelwal trade-exposure dataset (available from authors) or replicate from Census trade data + Compustat foreign-sales segments.

5.4 Immigration exposure (USCIS + BLS)

USCIS H-1B annual filings + Bureau of Labor Statistics industry data for unauthorized-worker share. Construct per-NAICS immigration dependency.

5.5 Geopolitical exposure (Bloomberg + 10-K)

Bloomberg Government for defense contract revenue share. SEC EDGAR 10-K filings + NLP to extract "geographic risk" Item 1A text from 2017-2019 annual reports.

6. Storage Architecture

6.1 Recommended: PostgreSQL + DuckDB

  • PostgreSQL for the master schema (events, firms, exposures — relational tables, moderate size)
  • DuckDB for analytic queries on time-series data (TAQ + PolyMarket fills + intraday FX). Columnar, fast, embedded.
  • Parquet for raw historical pulls (TAQ ~50GB, PolyMarket fills ~30GB, FX ~10GB)
# Sample DuckDB query joining PolyMarket × intraday TAQ
import duckdb

con = duckdb.connect("paperA.duckdb")

# Register pre-computed parquet files
con.execute("CREATE VIEW pm AS SELECT * FROM 'polymarket_1min.parquet'")
con.execute("CREATE VIEW taq AS SELECT * FROM 'taq_2024.parquet'")

# Event-window join for July 13, 2024 Trump assassination
result = con.execute("""
  SELECT pm.minute, pm.trump_price, taq.symbol, taq.price AS asset_price
  FROM pm
  JOIN taq ON pm.minute = taq.minute
  WHERE pm.minute BETWEEN '2024-07-13 21:55:00' AND '2024-07-13 22:45:00'
""").df()

6.2 Total size estimate

PolyMarket fills (parquet)~30 GB
TAQ 2024 (parquet)~50 GB
FX + commodities intraday~10 GB
FOMC text + news~5 GB
Industry exposure tables~1 GB
Total~100-150 GB

Fits comfortably on a single workstation with ~500GB SSD. Cloud option: AWS S3 + Athena for ~$50/month total.

7. Build Phases (8 Weeks)

WeekPhaseDeliverable
1 PolyMarket setup Cong dataset downloaded + parsed; backup Polygon RPC + The Graph subgraph access; Goldsky account; first SQL query showing daily Trump-win probability time series Jan-Nov 2024
2 PolyMarket → minute resolution Minute-level Trump probability time series; whale wallet identification verified against Chainalysis list; trade-flow aggregates per minute
3 Kalshi + asset price connection Kalshi academic API access (or fallback); WRDS TAQ for sector ETFs around all 9 major events; Bloomberg / Refinitiv pull for FX, rates, commodities
4 News timing + event registry 40-event canonical timeline CSV with second-resolution timestamps verified across sources; RavenPack news intensity for control
5 Industry exposure (tax + reg) Pre-determined TaxExposure_i (Compustat) and RegExposure_i (OpenSecrets) for all S&P 500 firms; validation on 2016 election
6 Industry exposure (trade + immig + geo) TradeExposure_i, ImmigrationExposure_i, GeopoliticalExposure_i; composite TrumpExposure_i constructed; pre-registered at OSF
7 Master panel PostgreSQL + DuckDB schema joining all layers; event × asset × minute master table; quality checks for missing data
8 First specification Working Spec 1 regression for July 13 Trump assassination event. Shows expected positive β for defense (ITA), oil (XLE), regional banks (KRE); negative β for renewables (TAN), Chinese ADRs.

8. Cost Estimate

ItemCostNotes
WRDS subscription$0Already via university (assumed)
Bloomberg Terminal$24K/yr1 license; typical academic infrastructure
Refinitiv (alternative to Bloomberg)$3K/yrVia WRDS DataScope add-on
RavenPack news sentiment$5-10K/yrOptional; can use free Bloomberg news count instead
Goldsky academic tier~$2K/yrOptional; The Graph + Dune are free alternatives
Alpha Vantage premium (crypto/equity)$50/monthFor intraday equity if no Bloomberg
AWS S3 + compute$50-200/monthIf using cloud rather than workstation
Vivvix / WMP (for Paper B)$3-5K one-timePaper B only; Paper A doesn't need
Total Paper A min~$5KRefinitiv only, no Bloomberg, free fallbacks
Total Paper A typical~$30KBloomberg + standard add-ons; worth it for top-5

9. First Concrete Deliverable: July 13, 2024 Event Study

By end of Week 8, produce this working analysis as proof-of-pipeline.

🎯 The July 13, 2024 Test

The Trump assassination attempt occurred at 6:11 PM ET on Saturday, July 13, 2024 in Butler, PA. PolyMarket Trump probability moved from ~60% to ~70% within minutes. Markets opened Monday July 15.

The clean test: Did defense ETFs (ITA), oil ETFs (XLE), and regional bank ETFs (KRE) gap up at Monday open? Did renewable ETFs (TAN, ICLN) gap down? Did the Mexican peso depreciate?

# Concrete first deliverable spec

# Step 1: PolyMarket Trump probability at minute resolution around July 13
pm = con.execute("""
  SELECT minute, trump_price FROM pm
  WHERE minute BETWEEN '2024-07-13 22:00:00' AND '2024-07-15 17:00:00'
""").df()

# Step 2: Compute the "treatment intensity"
delta_pm = pm[pm["minute"] == "2024-07-15 14:30:00"]["trump_price"].iloc[0] - \
           pm[pm["minute"] == "2024-07-13 22:10:00"]["trump_price"].iloc[0]
# Expected ≈ +0.10 (10 pp shift)

# Step 3: Compute Monday open vs Friday close returns for sector ETFs
sectors = ["ITA", "XLE", "XLF", "XLV", "KRE", "TAN", "ICLN", "SPY"]
returns = compute_jump_returns(sectors, friday_close="2024-07-12", monday_open="2024-07-15")

# Step 4: Implied β = ΔY / ΔPM
betas = returns / delta_pm
# Expected signs:
#   β_ITA > 0     (defense — Trump favors military spending)
#   β_XLE > 0     (energy — drilling deregulation)
#   β_KRE > 0     (regional banks — Trump deregulation)
#   β_TAN < 0     (solar — Trump rolls back IRA)
#   β_ICLN < 0    (clean energy)
#   β_MXN > 0     (USD/MXN — peso depreciates on tariff fears, so USDMXN UP)
#   β_SPY: small but positive (modest tax-cut effect)

print(f"July 13 event: ΔPM = {delta_pm:.4f}")
for sector, β in zip(sectors, betas):
    print(f"  β_{sector} = {β:.4f}")

✅ Success criterion

If by end of Week 8 you can produce this output with reasonable estimates (β_ITA ≈ +0.05, β_TAN ≈ -0.10, β_MXN ≈ +0.03, etc.), the pipeline works and Paper A is realistic. If you can't get reasonable estimates, the issue is either (a) data quality somewhere in the pipeline, (b) the July 13 event is not as clean as expected, or (c) the assumption that PolyMarket shocks cause asset moves is wrong. All three are useful information.

From here: Generalize to all 40 events, all asset cross-sections, all three identification strategies. Months 3-8 of the proposal timeline.

Concrete build plan · Generated May 19, 2026

← Paper A Proposal Pre-registration → ← Lit Review