How verifAI fact-checks a reel, step by step

People ask me what "AI fact-checking" actually means when they see a verdict on verifAInow.es. The honest answer is that it is much less magical than the marketing copy on most "AI" tools suggests — and much more interesting. This post walks through the real pipeline, in order, with the trade-offs we made at each step.

The input we accept

You can paste three kinds of links into verifAI:

An Instagram reel (instagram.com/reel/… or a short share link)
A TikTok video (tiktok.com/@user/video/… or vm.tiktok.com/…)
A news article (any HTTP URL that returns readable HTML)

The first thing the backend does is canonicalise the URL. Instagram in particular hands out half a dozen shapes for the same reel — share URLs, the iOS share-sheet variant, the instagram.com/p/… shape some embeds use, sometimes wrapped in an l.instagram.com redirect. Canonicalising up front means a reel checked twice ends up in the same row in our jobs table, so cache hits and history pages stay sane.
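A minimal sketch of that canonicalisation for Instagram, not our production code. The function name, the choice of https://www.instagram.com/reel/<shortcode>/ as the canonical shape, the /reels/ path variant, and the assumption that the redirect target sits in a u query parameter are all illustrative. Short share links need an HTTP request to resolve, so they are out of scope here.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Path shapes that carry a shortcode: /reel/<code>, /reels/<code>, /p/<code>.
_SHORTCODE = re.compile(r"^/(?:reel|reels|p)/([A-Za-z0-9_-]+)")


def canonicalise_instagram(url: str) -> Optional[str]:
    """Return one canonical URL per reel, or None if no shortcode is visible."""
    parsed = urlparse(url.strip())

    # Unwrap the l.instagram.com redirect (assumed: target in the `u` parameter).
    if parsed.hostname == "l.instagram.com":
        target = parse_qs(parsed.query).get("u", [""])[0]
        parsed = urlparse(target)

    host = (parsed.hostname or "").lower()
    if host not in {"instagram.com", "www.instagram.com"}:
        return None

    match = _SHORTCODE.match(parsed.path)
    if match is None:
        return None
    # The query string (tracking parameters and the like) is dropped on purpose.
    return f"https://www.instagram.com/reel/{match.group(1)}/"
```

Whatever the input shape, the returned string is what you would use as the lookup key for the jobs table.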

Step 1 — pull the media

For video links we go to a chain of providers, in order, until one succeeds:

Apify actors that wrap the public APIs of Instagram and TikTok
yt-dlp with a cookie jar, as a fallback for what Apify can't reach
A RapidAPI fallback for the rare reels both of the above refuse

Each one is wrapped in a timeout. We keep a fallback chain because any single provider goes through periods of being blocked or rate-limited, and a single 503 from Apify shouldn't mean a broken fact-check for the user. We do not store the video file. It lives in a per-job working directory that the lifecycle manager removes when the pipeline finishes (or, if DEBUG_KEEP_WORKDIR=1, when an operator wipes it later).
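The chain-with-timeouts idea can be sketched in a few lines. This is an illustration under assumptions: the Provider type and fetch_with_fallback are hypothetical names, and the real providers (Apify, yt-dlp, RapidAPI) are stood in for by plain callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence

# Hypothetical provider signature: takes a URL, returns the media bytes.
Provider = Callable[[str], bytes]


def fetch_with_fallback(
    url: str, providers: Sequence[Provider], timeout_s: float
) -> Optional[bytes]:
    """Try each provider in order; return the first result, or None if all fail."""
    # One worker per provider, so a hung provider cannot block the next one.
    pool = ThreadPoolExecutor(max_workers=max(1, len(providers)))
    try:
        for provider in providers:
            future = pool.submit(provider, url)
            try:
                return future.result(timeout=timeout_s)
            except Exception:
                # Timeout, block, rate limit, 503: move on to the next provider.
                continue
        return None
    finally:
        # Return without waiting for a provider that is still running.
        pool.shutdown(wait=False)
```

Note that a timed-out provider keeps running in its thread until it returns on its own; the sketch only stops waiting for it.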

For article links we skip this whole step and go straight to a Mozilla Readability extraction on the HTML.

Step 2 — extract audio

ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a libopus output.opus — that's the entire step. 16 kHz mono is the sweet spot for Whisper-class models: high enough sample rate to keep phonemes intact, low enough that an average reel transcribes in single-digit seconds.
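If you drive ffmpeg from Python, the same command looks roughly like this. The function names are mine, and the sketch assumes ffmpeg is on the PATH; the flags are exactly the ones in the command above.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_args(src: Path, dst: Path) -> List[str]:
    """Argument list for: 16 kHz, mono, Opus-encoded audio."""
    return [
        "ffmpeg",
        "-i", str(src),
        "-ar", "16000",     # resample to 16 kHz
        "-ac", "1",         # downmix to mono
        "-c:a", "libopus",  # encode the audio stream with Opus
        str(dst),
    ]


def extract_audio(src: Path, dst: Path) -> None:
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(build_ffmpeg_args(src, dst), check=True, capture_output=True)
```

Passing a list rather than a shell string means file names with spaces or quotes need no escaping.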

Step 3 — transcribe

We default to Whisper-large-v3-turbo hosted on Groq. The turbo variant trades a small bit of accuracy for a 4-5x speedup, which matters because every second the user waits is a second they might give up on the verification.

The output is a transcript plus a detected language tag. The language tag is important: it determines which dictionary template we feed downstream to the claim extractor, so a Spanish reel produces Spanish-language claims in the verdict card.
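Roughly, the language tag acts as a lookup key. The sketch below is hypothetical throughout: the template strings are placeholders, the tag is assumed to be an ISO 639-1 code (optionally with a region suffix), and falling back to English for unsupported languages is my illustration rather than a documented rule.

```python
# Placeholder template text, not the real prompts.
TEMPLATES = {
    "en": "Extract the factual claims and write them in English.",
    "es": "Extrae las afirmaciones factuales y escríbelas en español.",
}


def pick_template(language_tag: str, default: str = "en") -> str:
    """Map a tag such as 'es', 'ES' or 'es-MX' to a template, else the default."""
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return TEMPLATES.get(primary, TEMPLATES[default])
```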

Step 4 — read the text that's burned into the video

This is the step that surprised me most when we shipped it. Short-form videos almost never speak the punchline — they show it as a caption overlay. "Doctors HATE this trick" lives in the title card, not the audio.

So we sample frames (default: 20, capped) and feed them to meta-llama/llama-4-scout-17b-16e-instruct with a prompt that asks for verbatim on-screen text, deduplicated, in the original language. The model is multimodal and surprisingly good at ignoring background noise (logos, comment counts, the host's t-shirt) and picking up the actual editorial overlays.

The visual text gets concatenated into the transcript under an [On-screen text] header. From here on, the pipeline treats it as just another paragraph of source material.
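Two small pieces of this step are easy to sketch: picking which moments to sample, and merging the result into the transcript. Both functions are illustrative. Evenly spaced sampling and the second, case-insensitive deduplication pass across frames are assumptions; only the frame count of 20 and the [On-screen text] header come from the pipeline described here.

```python
from typing import List


def sample_timestamps(duration_s: float, max_frames: int = 20) -> List[float]:
    """Evenly spaced timestamps, one at the midpoint of each slice of the video."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    step = duration_s / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def merge_on_screen_text(transcript: str, frame_texts: List[str]) -> str:
    """Append unique overlay lines to the transcript under a header."""
    seen = set()
    unique = []
    for text in frame_texts:
        cleaned = " ".join(text.split())  # collapse runs of whitespace
        key = cleaned.casefold()          # treat case variants as duplicates
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    if not unique:
        return transcript
    return transcript.rstrip() + "\n\n[On-screen text]\n" + "\n".join(unique)
```

Sampling at slice midpoints avoids the very first and last frames, which are often black or a fade.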

Step 5 — split into atomic claims

A reel typically contains a handful of distinct factual claims wrapped in a single voice-over. We ask openai/gpt-oss-120b (via Groq) to extract them as a JSON array, with rules:

One claim per array element
Drop opinions, jokes, rhetorical questions
Each claim must stand alone — if it depends on context from earlier in the transcript, fold that context into the claim text
Preserve the original meaning even when the claim is translated into the user's UI language

The model is run with response_format: json_object and a tight system prompt, and we parse the output defensively — we'd rather drop a malformed claim than show garbage in the UI.
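Here is a sketch of what "parse defensively" can look like. Because json_object mode returns a top-level object, the array has to sit under a key; the key name "claims" and the function name are assumptions for illustration.

```python
import json
from typing import List


def parse_claims(raw: str) -> List[str]:
    """Return the well-formed claims; silently drop anything malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []

    # Assumed response shape: {"claims": ["...", "..."]}
    items = payload.get("claims") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []

    claims = []
    for item in items:
        # Keep only non-empty strings; numbers, objects and blanks are dropped.
        if isinstance(item, str) and item.strip():
            claims.append(item.strip())
    return claims
```

An empty list is a valid outcome: the UI can say that no checkable claims were found instead of rendering broken cards.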

Step 6 — fact-check each claim

For every claim the pipeline runs two checks in parallel:

Google Fact Check Tools API. When this returns a match, the verdict card is labelled "Authoritative" and the source links go straight to the original fact-check (Snopes, AP, AFP, EFE Verifica, Maldita.es, etc.). Whenever a professional fact-checker has already done the work, we should be honest and say so.

Live web search via Tavily (or SearXNG, if self-hosted). The search results — title, snippet, URL — are handed back to gpt-oss-120b, which is prompted to issue exactly one of true | false | partially_true | misleading | unverified and to cite the URLs it actually used. The prompt explicitly forbids opining without sources; "unverified" exists specifically so the model has an out instead of hallucinating an answer.

Both paths produce a Verdict with the same shape (rating, explanation, source URLs), so the UI doesn't care which one fired.
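A sketch of the shared shape and the parallel run, with both checkers stubbed as async callables. The field names, the authoritative flag, the rule that a published fact-check wins over the web-search verdict, and the enforcement of "no sources means unverified" in code are my assumptions for illustration; the five ratings are the ones listed above.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional

RATINGS = {"true", "false", "partially_true", "misleading", "unverified"}


@dataclass
class Verdict:
    rating: str
    explanation: str
    source_urls: List[str] = field(default_factory=list)
    authoritative: bool = False

    def __post_init__(self) -> None:
        # An unknown rating, or any rating without sources, becomes "unverified".
        if self.rating not in RATINGS or not self.source_urls:
            self.rating = "unverified"


# Hypothetical checker signature: returns a Verdict, or None when it has no match.
Checker = Callable[[str], Awaitable[Optional[Verdict]]]


async def check_claim(claim: str, fact_check: Checker, web_search: Checker) -> Verdict:
    # Run both checks concurrently and wait for both results.
    published, searched = await asyncio.gather(fact_check(claim), web_search(claim))
    if published is not None:
        published.authoritative = True
        return published
    if searched is not None:
        return searched
    return Verdict("unverified", "No relevant sources found.")
```

Because both paths return the same dataclass, the rendering code needs no branch on where a verdict came from.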

Step 7 — show progress as it happens

The whole pipeline streams progress to your browser over Server-Sent Events. Each stage emits a stage event, and each claim emits a verdict event as soon as it resolves, so you can read the early verdicts while the slow ones are still in flight.
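On the wire, a Server-Sent Event is just a few lines of text ending in a blank line. A minimal formatter, with hypothetical payload fields, might look like this:

```python
import json
from typing import Any, Dict


def format_sse(event: str, data: Dict[str, Any]) -> str:
    """Serialise one event in the text/event-stream format."""
    # json.dumps without indent emits a single line, so one data: field is enough.
    payload = json.dumps(data, ensure_ascii=False)
    # The blank line at the end tells the browser to dispatch the event.
    return f"event: {event}\ndata: {payload}\n\n"
```

In the browser, an EventSource listener registered for "stage" or "verdict" receives the JSON string as event.data.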

This matters more than it sounds. A 30-second reel might take 25-40 seconds to fact-check end to end. If we showed only a spinner, most people would abandon the page. Streaming the partial results turns waiting into reading.

What we deliberately do not do

A few things we considered and rejected:

We do not store the videos. The pipeline holds them long enough to transcribe and read frames, then deletes them. Only the resulting Verdict[] lives in the database.
We do not "rate" the reel as a whole. Each claim gets its own verdict. A reel can contain one true claim and one false claim; collapsing those into a single number would be misleading.
We do not invent sources. If neither Google Fact Check nor the web search produces relevant evidence, the verdict is unverified and the explanation says so. This makes some verdicts feel anticlimactic — that's fine. The alternative is a model hallucinating a confident-sounding answer with no support.

Limitations you should know about

I want to be clear about where this tool stops being useful:

Sarcasm and rhetorical claims get filtered out by the claim extractor most of the time, but not always. If you ever see a verdict on a joke, it's the extractor's fault.
Brand-new news (the last few hours) is the weakest case. Both Google Fact Check and the web search lag the news cycle by hours to days. For breaking events the verdict is often "unverified, no recent sources".
Highly technical or jargon-heavy claims (specialist medicine, niche economics) are hard for the synthesis model, because the evidence sits in specialist literature rather than in the kind of authoritative source a general reader would recognise. It will often correctly cite a paper but stop short of interpreting it.
Language coverage. The UI is in English and Spanish; the pipeline can handle other languages because Whisper does, but the system prompts are tuned for those two. Quality drops for everything else.

Why share the recipe?

Because misinformation tooling that hides its method is part of the problem. If you don't agree with a verdict, you should be able to look at the transcript and the cited sources, and decide for yourself. That's the entire point of every "show your work" link in the verdict card.

If you want to try this end to end, paste a reel into verifAInow.es and watch the pipeline panel as it runs. If something looks wrong, my inbox is fernandoruedaoliva@gmail.com — I read everything.