
VeriDex

spaCy NLP Streamlit Docker
View on GitHub →

The Problem with "Just Google It"

I kept running into the same situation: someone makes a claim online, I want to verify it, and the best advice anyone has is "just Google it." That is not verification. That is hoping the first page of search results happens to contain the truth. Real fact-checking requires pulling from multiple independent sources, extracting the actual claims being made, and cross-referencing them. Nobody does this manually because it takes forever. So I automated it.

VeriDex is an autonomous OSINT research agent. You give it a claim, and it goes to work. It queries DuckDuckGo (no Google API keys, no rate-limit headaches), scrapes the results with full robots.txt compliance because I am not trying to get anyone's IP banned, and then does the genuinely interesting part: extracting structured facts from unstructured web pages.
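The robots.txt check before each scrape can be done entirely with the standard library. This is a minimal sketch, not VeriDex's actual code; the function name and the `VeriDexBot` user-agent string are my own placeholders:

```python
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(robots_txt: str, url: str, agent: str = "VeriDexBot") -> bool:
    """Check a URL against a site's robots.txt rules before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# A rule set that blocks /private/ allows articles but not private pages:
rules = "User-agent: *\nDisallow: /private/"
allowed_to_fetch(rules, "https://example.com/article")    # → True
allowed_to_fetch(rules, "https://example.com/private/x")  # → False
```

In production you would fetch each site's live robots.txt (e.g. with `RobotFileParser.set_url` plus `read()`) and cache it per host rather than passing the rules in as a string.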

How the Pipeline Works

The extraction layer uses spaCy's NER to pull out named entities, then runs TF-IDF-weighted TextRank to identify the most salient claims in each source. This is not keyword matching. TextRank builds a graph of sentences weighted by term importance and extracts the ones that carry the most informational weight. It is surprisingly effective at finding the "meat" of an article even when it is buried under ads and filler.
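The core idea can be sketched without any dependencies. This is a simplified, illustrative version of the ranking step only (the real pipeline tokenizes with spaCy and runs NER alongside it); sentence splitting, the similarity formula, and the damping constants here are textbook defaults, not VeriDex's tuned values:

```python
import math
import re
from collections import Counter

def textrank_sentences(text: str, top_n: int = 2) -> list[str]:
    """Rank sentences via a TF-IDF-weighted similarity graph + PageRank."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)

    # Inverse document frequency: terms in fewer sentences carry more weight.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}

    def similarity(a: list[str], b: list[str]) -> float:
        # Edge weight = IDF-weighted word overlap, length-normalized.
        if not a or not b:
            return 0.0
        overlap = set(a) & set(b)
        return sum(idf[w] for w in overlap) / (
            math.log(len(a) + 1) + math.log(len(b) + 1)
        )

    weights = [[similarity(tokenized[i], tokenized[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]

    # PageRank by power iteration over the sentence graph.
    scores = [1.0 / n] * n
    for _ in range(30):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(weights[j])
                if weights[j][i] > 0 and out > 0:
                    rank += scores[j] * weights[j][i] / out
            new_scores.append(0.15 / n + 0.85 * rank)
        scores = new_scores

    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:top_n])]  # document order
```

Sentences that repeat the claims other sentences also make accumulate rank, which is exactly why the method surfaces an article's central claims over filler.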

Once I have structured facts from multiple sources, VeriDex scores credibility across five dimensions: source authority, cross-source consensus, claim specificity, temporal consistency, and internal coherence. Each dimension gets a score, and the final credibility rating is a weighted combination. The cross-source consensus score is probably the most useful one. If five independent sources all report the same figure, that is a lot more credible than one source making a bold claim nobody else corroborates.
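The weighted combination itself is trivial. A sketch, with one big caveat: the weights below are illustrative placeholders I picked for the example, not the values VeriDex actually uses:

```python
# Hypothetical weights (sum to 1.0); VeriDex's real values may differ.
WEIGHTS = {
    "source_authority": 0.25,
    "cross_source_consensus": 0.30,
    "claim_specificity": 0.15,
    "temporal_consistency": 0.15,
    "internal_coherence": 0.15,
}

def credibility(scores: dict[str, float]) -> float:
    """Combine the five dimension scores (each in [0, 1]) into one rating."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

Weighting consensus highest reflects the point above: agreement across independent sources moves the final rating more than any single source's authority.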

[Diagram] VeriDex pipeline: query → scrape (robots.txt-compliant) → extract (NER + TextRank) → score (5-dimension credibility)

The Dashboard

Everything surfaces through a Streamlit dashboard with Plotly visualizations. You can see the credibility breakdown per source, the consensus graph across sources, and drill into individual extracted facts. Results persist in SQLite so you can revisit previous research sessions. When you need to share findings, there is PDF and JSON export built in.

Fully Local, Zero API Keys

The whole thing runs locally. No OpenAI key, no cloud services, no third-party APIs beyond DuckDuckGo's public search. I was deliberate about this. A fact-checking tool that depends on a commercial API is a fact-checking tool that stops working when the API changes pricing or goes down. VeriDex is 1,397 lines of Python that run anywhere you have Python and a network connection. Docker support is included if you want to spin it up in a container.

The architecture is intentionally simple. There is no LLM in the loop. The NER plus TextRank pipeline is deterministic and fast. You get consistent results on the same inputs, which matters a lot when the whole point of the tool is verifying information. A system that gives you different credibility scores on Tuesday than it gave you on Monday is not a system you can trust.

Limitations and Honest Assessment

VeriDex is not a replacement for investigative journalism. It cannot assess source intent, detect sophisticated disinformation campaigns, or evaluate claims that require domain expertise to interpret. What it can do is the mechanical part of fact-checking: gather sources, extract claims, and surface agreement or disagreement across them. The "last mile" of judgment is still on you. But at least now you are making that judgment with structured evidence instead of gut feeling and a browser window with 47 open tabs.