Sibyl is a multi-agent pipeline that reads the astrophysics literature the way a
quantitative strategy reads a market — mining tens of thousands of papers for the
implicit connections buried across them, generating falsifiable predictions from
pre-cutoff work alone, and scoring them against results the model could not have seen.
The question underneath the engineering is a hard one: how much of scientific discovery
can be automated, and how do we know when to trust it?
Posters on OpenReview: Temporal Backtesting · Autonomous Hypothesis Generation.
Sibyl runs as a staged pipeline of language-model agents, each with a narrow job. A high-volume triage agent (Claude Haiku) scores paper relevance; full-text reasoning agents (Claude Sonnet) extract structured, quote-anchored claims, compile them into a knowledge base, and generate predictions under explicit falsification conditions. Two stages — knowledge-base compilation and prediction generation — are gated behind mandatory human review, so the system is autonomous by default but never runs unsupervised at the points where an error would compound. Every extracted claim carries a verbatim source quote and a bibcode, so a finished prediction traces end-to-end back to the printed literature rather than to the model's parametric memory.
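As a rough illustration of what "quote-anchored" and "gated" could mean in practice, the sketch below models a claim that carries its bibcode and verbatim quote, a prediction with an explicit falsification condition, and a review gate that refuses to pass output a reviewer has not approved. All names here (`Claim`, `Prediction`, `trace`, `gated`, `ReviewRejected`) are hypothetical and chosen for illustration; they are not Sibyl's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Claim:
    """Hypothetical record for one extracted claim."""
    bibcode: str    # identifier of the source paper
    quote: str      # verbatim text the claim is anchored to
    statement: str  # structured paraphrase written by the extraction agent


@dataclass
class Prediction:
    """Hypothetical record for one generated prediction."""
    text: str
    falsified_if: str  # explicit falsification condition
    support: List[Claim] = field(default_factory=list)


def trace(prediction: Prediction) -> List[Tuple[str, str]]:
    """Return the (bibcode, quote) pairs a prediction rests on."""
    return [(c.bibcode, c.quote) for c in prediction.support]


class ReviewRejected(Exception):
    """Raised when the reviewer does not approve a gated stage's output."""


def gated(stage: Callable[[], object], approve: Callable[[object], bool]) -> object:
    """Run a stage, then pass its output on only if the reviewer approves it."""
    output = stage()
    if not approve(output):
        raise ReviewRejected("stage output rejected by reviewer")
    return output
```

In this sketch the two gated stages named above (knowledge-base compilation and prediction generation) would each be wrapped in `gated`, with `approve` standing in for whatever human sign-off mechanism is actually used.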
The pipeline is evaluated the way a trading strategy is: by backtesting. The corpus is split at a fixed 2014 cutoff; predictions are generated from pre-cutoff literature alone and scored against publications from 2015 onward, counting a prediction as confirmed only when the later result matches its predicted direction and the confirming data was genuinely unavailable beforehand. The cutoff places three field-changing programs (LIGO's first observing run, NICER, and eROSITA) entirely inside the validation window. Of 60 predictions generated from pre-2015 work, 11 were confirmed by independent later results, spanning four X-ray-binary families rather than a single well-studied system; the confirmation rate holds between 12.5% and 18% across increasingly conservative provenance filters.
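The backtesting rule described above can be sketched in a few lines. This is a minimal sketch, not the evaluation code behind the posters: the `Paper` and `Outcome` records, the function names, and the choice to treat the cutoff year as inclusive (papers dated 2014 or earlier are visible to the generator) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

CUTOFF_YEAR = 2014  # assumed inclusive: 2014 and earlier count as pre-cutoff


@dataclass(frozen=True)
class Paper:
    bibcode: str
    year: int


@dataclass(frozen=True)
class Outcome:
    predicted_direction: str           # e.g. "increase" or "decrease"
    observed_direction: Optional[str]  # None if no later result addresses it
    evidence_year: Optional[int]       # year the confirming data appeared


def split_corpus(
    papers: Iterable[Paper], cutoff: int = CUTOFF_YEAR
) -> Tuple[List[Paper], List[Paper]]:
    """Split papers into (visible to the generator, held out for validation)."""
    papers = list(papers)
    visible = [p for p in papers if p.year <= cutoff]
    held_out = [p for p in papers if p.year > cutoff]
    return visible, held_out


def is_confirmed(outcome: Outcome, cutoff: int = CUTOFF_YEAR) -> bool:
    """Confirmed only if the direction matches AND the evidence postdates the cutoff."""
    if outcome.observed_direction is None or outcome.evidence_year is None:
        return False
    return (
        outcome.observed_direction == outcome.predicted_direction
        and outcome.evidence_year > cutoff
    )


def confirmation_rate(outcomes: Iterable[Outcome], cutoff: int = CUTOFF_YEAR) -> float:
    """Fraction of predictions that count as confirmed; 0.0 for an empty set."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(is_confirmed(o, cutoff) for o in outcomes) / len(outcomes)
```

With the counts quoted above, 11 confirmed out of 60 gives 11/60 ≈ 18.3%, which corresponds to the upper end of the reported range; the lower figures would come from applying stricter provenance filters before counting.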
What separates Sibyl from a pipeline that simply emits plausible text is that it audits itself. A post-hoc provenance audit checks every extracted claim against its cited paper — quote, source class, formula, and sample size — and a cross-prediction consistency check catches fabricated content that per-claim quote checks miss. That audit surfaced four failure modes, ordered by how fixable they are: schema underspecification, corpus contamination, validation-era leakage, and citation hallucination. Reporting them, rather than burying them, is the point — the confirmation rate above is quoted under the filters that remove each one.
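To make the two audit levels concrete, here is a minimal sketch assuming each claim stores a bibcode, a verbatim quote, and an optional sample size. The per-claim check tests the quote and sample size against the cited paper's text; the cross-claim check uses one simple heuristic (the same bibcode cited with conflicting sample sizes) as a stand-in, since the actual consistency check is not described here. `ExtractedClaim`, `audit_claim`, and `inconsistent_bibcodes` are hypothetical names.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ExtractedClaim:
    """Hypothetical record; fields mirror the audit dimensions named in the text."""
    bibcode: str
    quote: str
    sample_size: Optional[int] = None


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def audit_claim(claim: ExtractedClaim, source_text: str) -> List[str]:
    """Return the names of the checks this claim fails against its cited paper."""
    failures: List[str] = []
    source = _normalize(source_text)
    quote = _normalize(claim.quote)
    # An empty quote is treated as a failure rather than a trivial match.
    if not quote or quote not in source:
        failures.append("quote")
    if claim.sample_size is not None:
        # Word boundaries keep 12 from matching inside 120.
        if not re.search(rf"\b{claim.sample_size}\b", source):
            failures.append("sample_size")
    return failures


def inconsistent_bibcodes(claims: List[ExtractedClaim]) -> Set[str]:
    """Flag papers cited with conflicting sample sizes across claims (assumed heuristic)."""
    sizes: Dict[str, Set[int]] = defaultdict(set)
    for claim in claims:
        if claim.sample_size is not None:
            sizes[claim.bibcode].add(claim.sample_size)
    return {bibcode for bibcode, seen in sizes.items() if len(seen) > 1}
```

The second function illustrates why a cross-claim pass can catch what a per-claim pass misses: each claim may look plausible in isolation, while two claims about the same paper contradict each other.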
Key references: Sibyl posters at the ICML 2026 AI4Science workshop: "Temporal Backtesting for Literature-Based Scientific Discovery with LLM Agents" and "A Multi-Agent Pipeline for Autonomous Hypothesis Generation".
Leading the Space Lab at Texas State — an experimental program where students design CubeSat payloads, build balloon-borne instruments, and work with vacuum systems for space-qualified hardware.
Building automated, machine-learning classification pipelines for X-ray sources in nearby galaxies (M33, M31) using Chandra and HST. Studying pulsar bow shocks, unidentified Galactic GeV/TeV sources, and the connection between X-ray binaries and the young star clusters in which they form.