ChatPredict

Why you can trust this record

Plain answers, each backed by how the system verifiably works — including what broke on the way. The instrument is the paper's, verbatim; the honesty is enforced by the storage design.

Could this be faked?

The design makes it structurally hard. Verdicts are produced at temperature 0 — the same input returns the same answer, so any verdict can be re-checked. Every model's raw reply is stored verbatim next to the parsed answer. Dropped articles are logged with reasons instead of vanishing. Every round writes a ledger row with its real fire time, lateness included. And the whole database is committed to a git repository after every round, so history is append-only and every change carries a timestamped commit. This page itself is regenerated from that database on every update — nothing displayed here is typed by hand.

There is also a standing rule about what may exist in the pipeline at all: nothing we couldn't do for real. Every input is a public or free-tier source any retail investor could access; ticker matching is exact, with no human in the loop; models run with deterministic settings; decisions use only information available at decision time. If a step couldn't be done live with real money on the line, it isn't in the system.

What broke, and what changed

A live system earns trust by accounting for its failures, not by claiming none. This one was built and debugged in the open over Jun 9–12, 2026 — four commissioning days during which every failure below happened and was fixed. The study formally begins on Jun 15, 2026, the first day the self-hosted clock fires both rounds on time under final, correct conditions; all the commissioning data before it was cleared at the owner's direction, so the record holds only rounds run the right way. The failures themselves stay documented here, and the runner platform's own run logs for those days still exist. The record:

Jun 9, 2026 — commissioning, incident 1

A test round finished its work, but new code had been pushed while it ran; the round's attempt to save its database was rejected and the data died with the runner. Changed: the save step now rebases onto the moved history and retries; on any failure the database is additionally parked as a downloadable artifact. A failed save can no longer cost data.
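The save behaviour described above could look roughly like the sketch below. It is an illustration, not the project's code: the three callables (push, rebase, park_artifact) are hypothetical stand-ins for the real git and artifact steps, and the retry budget is an assumption.

```python
from typing import Callable


def save_database(
    push: Callable[[], bool],
    rebase: Callable[[], bool],
    park_artifact: Callable[[], None],
    max_attempts: int = 3,  # assumed retry budget, not stated in the source
) -> bool:
    """Try to push; if rejected, rebase onto the moved history and retry.

    Returns True if a push landed. On any failure the database is also
    parked as a downloadable artifact, so a failed save cannot cost data.
    """
    for _ in range(max_attempts):
        if push():
            return True
        # Push rejected: the remote history moved while the round ran.
        if not rebase():
            break
    park_artifact()
    return False
```

With this shape, the incident-1 scenario (code pushed mid-round) costs one extra rebase instead of a day's data.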

Jun 9, 2026 — commissioning, incident 2

The scheduler fired the afternoon round 1h41m late. The gate at that time accepted only runs within ±40 minutes of target, so it refused the run and the afternoon's collection was lost. Changed: the gate's policy became never early, late-OK-and-recorded, never twice, and round targets moved to ~1.5 hours ahead of the executions they serve. Late now means a recorded slip, not a lost day. Evidence: every lateness in the record is a mono figure on History.

Jun 11, 2026 — commissioning, incident 3

A backup trigger woke holding a stale copy of the repository, saw no record of the round that had already run, and ran a full duplicate, whose save then failed (two commits of the same binary database file cannot be merged). No data was lost, but hours of model quota were. Changed: every wake now syncs to the latest repository state before the gate decides. Evidence: the ledger shows exactly one row per round per day, which is the dedup at work.

Jun 11–12, 2026 — commissioning, incident 4

The triggering clock was the hardest part. Two different shared-cloud schedulers were tried and both proved unreliable: the build platform's own scheduler delivered triggers 1.5–2.6 hours late and silently dropped entire mornings, and a second cloud scheduler then skipped a morning tick outright — neither publishes a delivery guarantee, and a late trigger is useless for a decision that must precede an order deadline. Changed: the clock moved off best-effort cloud cron entirely onto a self-hosted scheduler the project controls and pays for, which fires at the exact targets. Its first fire landed within a second of its target — the cloud-cron reliability problem is gone. Evidence: every fire time on History — public arithmetic, not a promise.

Jun 12, 2026 — commissioning, incident 5

The new clock's very first real round exposed a bug one layer deeper: the trigger arrived on time, but the runner and the gate disagreed about which round it was, and the afternoon round ran against the morning round's target — effectively an hour early. The recorded lateness came out negative, which is how it was caught. Changed: the round identity is now decided in one place that both the gate and the runner read, so they can never disagree, with regression tests covering both daylight-saving seasons. The early round was wiped — which is why the clean record begins Jun 15, not the 12th. Evidence: the gate's never-early rule, and a single fire time per round on History.

Check it yourself

Three ways to audit this site from your chair:

1. Re-ask a model. Take any headline from News & Verdicts and the paper's prompt (quoted below), set temperature 0, and ask the same model — you should get the same verdict the table shows.

2. Check the arithmetic. The flowchart's funnel must reconcile on every build — collected = kept + dropped, kept = feeds + look-ups. The generator fails the build rather than display numbers that don't add up.

3. Check the timing — and the gaps. Every round's recorded fire time sits in the History ledger next to the target it was due at, including the lateness in minutes. The rounds that never happened are shown as misses, not quietly omitted — so the count you see is the whole story, not a flattering slice.
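The arithmetic check in step 2 can be sketched in a few lines. The function and parameter names here are hypothetical, and the numbers in the usage note are made up for illustration; only the two equalities come from the text above.

```python
def check_funnel(collected: int, kept: int, dropped: int,
                 from_feeds: int, from_lookups: int) -> None:
    """Raise ValueError unless both funnel equalities hold.

    collected = kept + dropped
    kept      = from_feeds + from_lookups
    """
    if collected != kept + dropped:
        raise ValueError(
            f"collected ({collected}) != kept ({kept}) + dropped ({dropped})"
        )
    if kept != from_feeds + from_lookups:
        raise ValueError(
            f"kept ({kept}) != feeds ({from_feeds}) + look-ups ({from_lookups})"
        )
```

For example, check_funnel(500, 420, 80, 390, 30) passes silently, while changing dropped to 79 raises and would stop a build that called it.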

The study

What exactly is this project?

A forward, out-of-sample test of “Can ChatGPT Forecast Stock Price Movements?” (Lopez-Lira & Tang, 2023). The paper showed that asking ChatGPT whether a headline is good or bad news for a stock predicted next-period returns — on historical data. We run the same experiment live, going forward: every trading day the machine collects fresh US-stock news, asks a panel of models the paper's exact question, and records everything. In a few months we analyze whether it worked on data that did not exist when the paper was written.

Is real money involved?

No. This is a paper ledger — we record what the strategy would have traded. No orders are ever placed. The point is to find out, honestly, whether it works.

What is sacred and what did we change?

The paper's instrument is reproduced verbatim: the exact prompt text, the YES / NO / UNKNOWN answer format, and temperature 0. That is the replication core and is never touched. Everything we add — more models, a relevance tag, recording non-traded items — is an analysis-time addition that never alters that core.

The three periods and the two rounds

What are the three news periods, precisely?

Every article belongs to exactly one period, decided by its publish time (all times Eastern):

period | published                                    | strategy buys at | sells at
P3     | after 4:00 PM yesterday (4:00 PM → midnight) | today's OPEN     | today's CLOSE
P1     | today before 6:00 AM                         | today's OPEN     | today's CLOSE
P2     | today 6:00 AM – 4:00 PM                      | today's CLOSE    | next day's CLOSE

P3 and P1 both enter at the same moment — today's open. P2 enters at the close.
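As a rough sketch, the period rule can be written as a function of the publish time alone. Assumptions: the publish time is already an Eastern wall-clock value, an article stamped exactly 4:00 PM counts as P2, and weekends and market holidays are ignored. The function name is hypothetical.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional

# Entry and exit points per period, as in the table above.
TRADES = {
    "P3": ("today's open", "today's close"),
    "P1": ("today's open", "today's close"),
    "P2": ("today's close", "next day's close"),
}


def classify(published_et: datetime, trading_day: date) -> Optional[str]:
    """Return the period an article belongs to for trading_day, else None.

    published_et is a naive datetime in Eastern time (an assumption of
    this sketch; a real pipeline would convert from the source timezone).
    """
    prev_close = datetime.combine(trading_day - timedelta(days=1), time(16, 0))
    midnight = datetime.combine(trading_day, time(0, 0))
    six_am = datetime.combine(trading_day, time(6, 0))
    close = datetime.combine(trading_day, time(16, 0))

    if prev_close < published_et < midnight:
        return "P3"
    if midnight <= published_et < six_am:
        return "P1"
    if six_am <= published_et <= close:  # exact 4:00 PM treated as P2 (assumed)
        return "P2"
    return None
```

Because the function takes only the publish time, running it ten minutes or three hours after publication gives the same answer.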

Why only TWO collection rounds for THREE periods?

Because there are only two decision moments in a day. By early morning, both P3 (it ended at midnight) and P1 (it ended at 6 AM) are completely published — nothing more can ever arrive in those periods. One morning round reads all of it and decides everything that enters at the open. P2 needs its own afternoon round because its trades enter at the close. Two entry moments → two rounds; a third would have nothing to decide.

When do the rounds run, and why those times?

The morning round at 8:00 AM ET (decisions for the 9:30 open) and the afternoon round at 2:30 PM ET (decisions for the 4:00 close) — each about 1.5 hours before the execution it serves. The buffer covers two measured costs: scheduler delivery slack, and the ~80 minutes a round of several hundred headlines takes when free-tier models are paced to their rate limits. The trade-off is honest and recorded: news published after 2:30 PM misses that day's close decision (the paper has the same kind of cutoff tail) — but it is still collected the next morning, so the record has no holes.
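A minimal sketch of how the two targets and the recorded lateness could be computed. The names are hypothetical; the 8:00 AM and 2:30 PM Eastern targets are from the text. Anchoring the targets in the America/New_York zone, rather than at a fixed UTC offset, is what keeps them right in both daylight-saving seasons.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")

# Round targets in Eastern wall-clock time.
ROUND_TARGETS = {"morning": time(8, 0), "afternoon": time(14, 30)}


def target_for(round_name: str, day: date) -> datetime:
    """Timezone-aware target instant for a round on a given day."""
    return datetime.combine(day, ROUND_TARGETS[round_name], tzinfo=ET)


def lateness_minutes(round_name: str, day: date, fired_at: datetime) -> float:
    """Minutes between target and actual fire time; negative means early."""
    return (fired_at - target_for(round_name, day)).total_seconds() / 60


# The same 2:30 PM ET target is 18:30 UTC in summer and 19:30 UTC in winter.
summer_utc = target_for("afternoon", date(2026, 6, 15)).astimezone(timezone.utc)
winter_utc = target_for("afternoon", date(2026, 1, 15)).astimezone(timezone.utc)
```

A negative result from lateness_minutes is the kind of signal that exposed incident 5.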

What happens to news published after the afternoon round?

It is never lost. The next morning's round pulls everything published since 2:30 PM yesterday, so those articles are stored and judged like all others. News published between 2:30 and 4:00 PM is P2 by definition but arrived too late to trade, and it is recorded with that fact. News published after 4:00 PM is simply P3 and trades at the next open as designed.

The clock — the most important rule

Why does everything key off the article's publish time instead of when we saw it?

Because that is what the strategy is: news published in period X → trade at X's entry point. Which period an article belongs to is a fact about the article, fixed the moment it was published. Our machine seeing it ten minutes or three hours later cannot change that fact — so a slow run can never file news into the wrong period. We additionally record when we saw it and how late each round fired, so collection lag is measured rather than hidden, and the analysis can verify that every decision would have been makeable in real time.

If a round fires late, is the data still valid?

Collection-wise, yes: every article published before the target still exists later — running at 3:10 instead of 2:30 collects a superset. The recorded lateness tells the analysis exactly which days' decisions happened after the ideal moment, and it can include or exclude them. What the system never does is run early — an early afternoon round would truncate P2 and silently miss news. The gate enforces: never early, late OK and recorded, never twice.
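The three gate rules fit in one small function. This is a sketch with hypothetical names; it assumes already_ran is read from the ledger after syncing to the latest repository state, and it does not model any upper limit on lateness because the text states none.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class GateDecision:
    run: bool
    reason: str
    lateness_minutes: Optional[float] = None


def gate(now: datetime, target: datetime, already_ran: bool) -> GateDecision:
    """Never early, late OK and recorded, never twice."""
    if already_ran:
        return GateDecision(False, "never twice")
    if now < target:
        return GateDecision(False, "never early")
    # On time or late: run, and hand the lateness back to be recorded.
    late = (now - target).total_seconds() / 60
    return GateDecision(True, "late OK and recorded", late)
```

For the example in the text, a wake at 3:10 against a 2:30 target runs with a recorded lateness of 40 minutes.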

The question we ask

What exactly is each model asked?

The paper's prompt, word for word: “Forget all your previous instructions. Pretend you are a financial expert. You are a financial expert with stock recommendation experience. Answer YES if good news, NO if bad news, or UNKNOWN if uncertain in the first line. Then elaborate with one short and concise sentence on the next line. Is this headline good or bad for the stock price of [company] in the short term?” — followed by the headline, at temperature 0.

Why show both the company name and the ticker when the paper used only the name?

The paper's licensed news feed always identified the company cleanly. Free headlines sometimes carry only a name, sometimes only a ticker — so we always show both, e.g. “Apple Inc. (AAPL)”, to make sure the model knows which company it is judging. It extends the prompt's context; the question itself is unchanged.
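A sketch of how the prompt could be assembled with both name and ticker. The prompt text is the one quoted above. The function name and the "Headline:" separator before the headline are assumptions, and the wording of the relevance question is not shown because the text does not quote it.

```python
PAPER_PROMPT = (
    "Forget all your previous instructions. Pretend you are a financial expert. "
    "You are a financial expert with stock recommendation experience. "
    "Answer YES if good news, NO if bad news, or UNKNOWN if uncertain in the first line. "
    "Then elaborate with one short and concise sentence on the next line. "
    "Is this headline good or bad for the stock price of {company} in the short term?"
)


def build_prompt(name: str, ticker: str, headline: str) -> str:
    """Fill the company slot with 'Name (TICKER)' and append the headline."""
    company = f"{name} ({ticker})"
    # The 'Headline:' label is an assumed separator, not confirmed by the source.
    return f"{PAPER_PROMPT.format(company=company)}\nHeadline: {headline}"
```

Only the company slot changes between calls; the question wording stays fixed.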

Why ask for a RELEVANT / NOT_RELEVANT tag the paper never asked for?

The paper's data vendor shipped a relevance score with every article and the authors filtered on it. Free feeds don't have that — and worse, they routinely tag articles with the wrong or too many tickers (we measured this on every source we tested). So each model also answers whether the headline is genuinely about the named company; NOT_RELEVANT collapses to UNKNOWN in analysis — no signal, no trade. One thing the tag deliberately does not catch: law-firm “investor alert” press releases about real tickers. Those are recorded raw and filtered at analysis time.
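A minimal sketch of parsing a reply and applying the relevance rule. Assumptions: the verdict is the leading word of the first non-empty line (as the prompt requests), an unparseable reply yields None instead of a guessed verdict, and the relevance tag has already been extracted, since the text does not say where it sits in the reply. Function names are hypothetical.

```python
import re
from typing import Optional

# Leading YES / NO / UNKNOWN on the first line, ignoring case and punctuation.
VERDICT_RE = re.compile(r"^\W*(YES|NO|UNKNOWN)\b", re.IGNORECASE)


def parse_verdict(raw_reply: str) -> Optional[str]:
    """Return 'YES', 'NO', 'UNKNOWN', or None if the first line has no verdict."""
    lines = raw_reply.strip().splitlines()
    if not lines:
        return None
    match = VERDICT_RE.match(lines[0])
    return match.group(1).upper() if match else None


def effective_signal(verdict: Optional[str], relevance: str) -> Optional[str]:
    """NOT_RELEVANT collapses to UNKNOWN: no signal, no trade."""
    if relevance == "NOT_RELEVANT":
        return "UNKNOWN"
    return verdict
```

The raw reply stays stored next to the parsed value, so any parsing choice made here can be revisited at analysis time.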

The judging panel

Which models judge the news, and why these?

gpt-4.1-nano (OpenAI) is the paid anchor, the ChatGPT-lineage model that carries the replication. Gemini 2.5 Flash (Google) tests whether the signal survives a different model family, and gpt-4o-mini (OpenAI) is a third judge. (gpt-4o-mini replaced a Groq Llama-3.1-8B model on Jun 15, 2026, after that model proved unable to return a “NO” verdict: it scored 0 NOs across 472 headlines on day one, even on bankruptcies, which made it useless for the short side.)

Why gpt-4.1-nano and not a newer GPT-5 model?

We tried. The GPT-5 family rejects temperature 0 — the API returns an error (verified live) — and temperature 0 is part of the paper's sacred setup. gpt-4.1-nano is the most capable inexpensive model that honors it.

How do you avoid hammering the free models' rate limits?

Each model client paces itself to its provider's free-tier budget and the three run in parallel, which also means a round's wall-clock time is set by its slowest model: roughly 80 minutes for several hundred headlines. Each (stock × headline) pair is judged once per model, ever: at temperature 0 a re-ask returns the same answer, so re-asking is waste. A model that was rate-capped or errored is retried next round; one that answered is never re-asked. Every skip, cap and failure is recorded per model per item.
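The once-per-model rule reduces to a lookup over stored statuses. In this sketch the status strings and the shape of the statuses mapping are hypothetical, and the panel entries are plain labels for the three judges.

```python
from typing import Dict, List, Tuple

PANEL = ["gpt-4.1-nano", "gemini-2.5-flash", "gpt-4o-mini"]


def models_to_ask(item_key: str,
                  statuses: Dict[Tuple[str, str], str],
                  panel: List[str] = PANEL) -> List[str]:
    """Models that still owe a verdict for this (stock x headline) item.

    statuses maps (item_key, model) to a stored status such as
    'answered', 'rate_capped' or 'error'. Only 'answered' is final.
    """
    return [m for m in panel if statuses.get((item_key, m)) != "answered"]
```

An item that one model answered and another was rate-capped on returns only the capped model plus any that have not been tried.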

Wasn't there a fourth model?

Yes — FinBERT, a finance-tuned sentiment model. We dropped it: the free tier could not sustain a months-long study, and it was one of our additions, not part of the paper. A clean three-model panel beat a partially covered fourth member.

How the machine finds news

How does a round decide which stocks to look at?

It doesn't start from a stock list — it starts from the news. Market-wide feeds deliver ticker-tagged articles for the whole US market at once. Then “what's hot” lists — top gainers and losers, the most-traded names, today's earnings calendar — name stocks that should have news; any of those the feeds didn't already cover get a targeted per-stock look-up. Everything found is judged and stored, and the stocks we checked and found nothing for are recorded too: “no news” is a stored fact, not an absence.
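The look-up step amounts to a set difference: hot-list names minus names the feeds already covered. The function name and the hot-list labels below are hypothetical.

```python
from typing import Dict, List, Set


def plan_lookups(feed_tickers: Set[str],
                 hot_lists: Dict[str, List[str]]) -> List[str]:
    """Hot-list tickers the market-wide feeds did not already cover."""
    hot: Set[str] = set()
    for tickers in hot_lists.values():
        hot.update(tickers)
    return sorted(hot - feed_tickers)


# Illustrative inputs only; these are not real round data.
example = plan_lookups(
    feed_tickers={"AAPL", "MSFT"},
    hot_lists={
        "gainers": ["AAPL", "XYZ"],
        "most_traded": ["MSFT", "ABC"],
        "earnings_today": ["XYZ"],
    },
)
```

Every ticker returned gets a targeted look-up, and the result of each look-up is stored, including the ones that find nothing.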

How are articles matched to stocks?

Exactly or not at all: the source's own ticker tag, an official SEC company number, or an explicit ticker written in the text — validated against our universe of 5,852 US common stocks. We never fuzzy-match company names; that is how news ends up filed under the wrong company. Unmatchable items are dropped and logged with the reason.
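A sketch of exact-or-nothing matching under stated assumptions: the order in which the three routes are tried, the "(EXCHANGE: TICKER)" pattern for a ticker written in the text, and all names are hypothetical. What comes from the text is that there are three exact routes, each validated against the universe, with a logged reason on failure.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Assumed form of an explicit ticker in text, e.g. "(NASDAQ: AAPL)".
TICKER_IN_TEXT = re.compile(
    r"\((?:NYSE|NASDAQ|Nasdaq|AMEX)\s*:\s*([A-Z][A-Z.]{0,5})\)"
)


def match_ticker(source_tag: Optional[str],
                 cik: Optional[str],
                 text: str,
                 universe: Set[str],
                 cik_to_ticker: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Return (ticker, how_matched), or (None, drop_reason). No fuzzy matching."""
    if source_tag and source_tag.upper() in universe:
        return source_tag.upper(), "source_tag"
    if cik is not None and cik_to_ticker.get(cik) in universe:
        return cik_to_ticker[cik], "sec_company_number"
    found = TICKER_IN_TEXT.search(text)
    if found and found.group(1) in universe:
        return found.group(1), "explicit_in_text"
    return None, "no_exact_match"
```

A headline such as "Apple unveils new phone", with no tag, number, or written ticker, comes back unmatched on purpose; the company name alone is never enough.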

Why can the flow diagram on the Overview be trusted?

Because it isn't a drawing. The steps, the sources inside each step, and the schedule are imported from the same modules the round executes, and the numbers on each step are queried from the database the rounds write. If the code changes, this site changes with it on the next build — there is no hand-maintained copy to drift.

What is stored

What exactly does the database keep?

Six core tables: every kept article with its raw source payload; every model reply verbatim with its parsed verdict, relevance and status; point-in-time snapshots of the 5,852-stock universe; everything we did not keep, with the reason; every hot-list stock we checked, including checked-and-found-nothing; and the run ledger of when each round actually fired versus its target. Prices for return calculations are collected separately and are freely backfillable. Store raw, decide later — the strategy is an analysis lens over a complete record.
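To make the six tables concrete, here is one possible SQLite layout. Every table and column name is a guess for illustration; only the six roles and the stored facts they hold come from the description above. The two UNIQUE constraints show how "judged once per model" and "one row per round per day" could be enforced by the schema itself.

```python
import sqlite3

SCHEMA = """
CREATE TABLE articles (
    article_id   INTEGER PRIMARY KEY,
    ticker       TEXT NOT NULL,
    headline     TEXT NOT NULL,
    published_at TEXT NOT NULL,   -- publish time decides the period
    seen_at      TEXT NOT NULL,   -- when a round first saw it
    raw_payload  TEXT NOT NULL    -- source payload, stored raw
);
CREATE TABLE model_replies (
    article_id INTEGER NOT NULL REFERENCES articles(article_id),
    model      TEXT NOT NULL,
    raw_reply  TEXT,              -- verbatim model output
    verdict    TEXT,              -- YES / NO / UNKNOWN
    relevance  TEXT,              -- RELEVANT / NOT_RELEVANT
    status     TEXT NOT NULL,     -- e.g. answered, rate_capped, error
    UNIQUE (article_id, model)    -- judged once per model
);
CREATE TABLE universe_snapshots (
    snapshot_date TEXT NOT NULL,
    ticker        TEXT NOT NULL,
    PRIMARY KEY (snapshot_date, ticker)
);
CREATE TABLE dropped_items (
    dropped_id  INTEGER PRIMARY KEY,
    raw_payload TEXT NOT NULL,
    reason      TEXT NOT NULL
);
CREATE TABLE hotlist_checks (
    trading_day TEXT NOT NULL,
    round       TEXT NOT NULL,
    ticker      TEXT NOT NULL,
    found_news  INTEGER NOT NULL  -- 0 means checked and found nothing
);
CREATE TABLE run_ledger (
    trading_day  TEXT NOT NULL,
    round        TEXT NOT NULL,
    target_at    TEXT NOT NULL,
    fired_at     TEXT NOT NULL,
    lateness_min REAL NOT NULL,
    UNIQUE (trading_day, round)   -- one row per round per day
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

With a layout like this, a duplicate round cannot add a second ledger row: the insert fails with sqlite3.IntegrityError.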

Where does it all live?

A single SQLite file, committed back to a git repository after every round — the repository is simultaneously the store, the backup, and the audit trail. This dashboard is a static site rebuilt from that file on every commit; it has no write access to anything.