Phishing URLs Are Getting Smarter
Modern phishing URLs are not the obvious "totallylegitbank-login.xyz" domains anymore. Attackers use homograph attacks (replacing characters with visually similar Unicode), subdomain stuffing (login.secure.bank.attacker.com), and URL shorteners to hide the real destination. Traditional blacklist approaches cannot keep up because new phishing domains appear faster than anyone can catalog them. You need a model that understands the structure and semantics of URLs, not just pattern matching against known bad ones.
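As a concrete illustration of the homograph problem, here is a minimal sketch (not ThreatX's actual preprocessing, and `looks_like_homograph` is a hypothetical helper) that flags hostnames containing non-ASCII characters, the simplest precursor check before deeper analysis:

```python
def looks_like_homograph(hostname: str) -> bool:
    """Flag hostnames with non-ASCII characters, which may be
    visually-confusable substitutes for Latin letters."""
    try:
        hostname.encode("ascii")
        return False
    except UnicodeEncodeError:
        # e.g. Cyrillic "а" (U+0430) standing in for Latin "a"
        return True
```

A check this crude misses ASCII-only tricks (like "rn" imitating "m") and falsely flags legitimate internationalized domains, which is exactly why a learned model is needed rather than rules.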
Two Models, Two Perspectives
ThreatX uses a dual-model architecture. The first model is BERT, fine-tuned on URL text. BERT sees the URL as a sequence of tokens and learns to detect semantic obfuscation: misleading subdomains, brand-name impersonation, and suspicious token combinations. It catches the attacks that look right to a human eye but are structurally wrong.
The second model is an MLP that evaluates hand-engineered structural features: URL length, count of special characters, number of subdomains, presence of IP addresses, TLD reputation, path depth, query parameter count, and about a dozen more. This catches the attacks that are statistically anomalous even if they look plausible to a language model.
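A sketch of what extracting a few of those structural features could look like, using only the standard library (the exact feature set and names here are illustrative, not the project's):

```python
import re
from urllib.parse import urlparse, parse_qs

def extract_features(url: str) -> dict:
    """Compute a handful of structural URL features of the kind an
    MLP phishing classifier might consume."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "special_chars": len(re.findall(r"[^A-Za-z0-9]", url)),
        # crude: treats everything left of a two-label domain as subdomains,
        # ignoring multi-part TLDs like .co.uk
        "subdomain_count": max(host.count(".") - 1, 0),
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "query_param_count": len(parse_qs(parsed.query)),
    }
```

For the subdomain-stuffing example above, `extract_features("http://login.secure.bank.attacker.com/account/verify?id=1")` reports three subdomain labels, which is the kind of statistical anomaly the MLP keys on.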
Three Training Iterations
The first version was just the MLP with basic features. It worked okay on obvious phishing URLs but missed the sophisticated ones. The second iteration added BERT, which dramatically improved detection of homograph attacks and brand impersonation. But it was slow. BERT inference on every URL click is not great for user experience.
The third iteration introduced a tiered approach. The MLP runs first as a fast filter. If it is confident (high or low probability), we use that result immediately. If it is uncertain (probability near the decision boundary), we escalate to BERT for a deeper analysis. This gives us BERT-level accuracy with MLP-level latency for the majority of URLs. The feature engineering across all three iterations was probably the most educational part of the project. Understanding which structural features actually carry signal for phishing detection versus which ones are just noise required a lot of experimentation.
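The tiered routing logic can be sketched in a few lines. Here `mlp_score` and `bert_score` are hypothetical stand-ins for the two models' predict calls, and the confidence band thresholds are illustrative:

```python
def classify(url, mlp_score, bert_score, low=0.2, high=0.8):
    """Return (phishing_probability, model_used). Escalate to BERT
    only when the MLP's score falls in the uncertain band."""
    p = mlp_score(url)
    if p <= low or p >= high:
        return p, "mlp"             # confident: take the fast result
    return bert_score(url), "bert"  # uncertain: deeper, slower analysis
```

Widening the (low, high) band trades latency for accuracy: more URLs get BERT's deeper analysis, but fewer take the fast path.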
The Chrome Extension
Both models are served from a Flask API, and the Chrome extension queries it on every page navigation. The extension shows a green badge for safe URLs, a yellow warning for suspicious ones, and a red alert for likely phishing. It is unobtrusive for normal browsing but loud when it detects something wrong.
The real-time requirement was a hard constraint. Nobody is going to use a security extension that adds visible latency to their browsing. The tiered inference approach keeps response times under 100ms for 85% of URLs (the ones where the MLP is confident) and under 500ms for the rest (the ones that need BERT). Good enough that you do not notice it during normal use.
1st Runner Up at CRISIL
We presented this at the CRISIL hackathon and took 1st Runner Up. The judges were particularly interested in the dual-model rationale. Most competing solutions used a single model, either a traditional ML classifier on features or a neural network on raw URLs. The argument that semantic and structural signals are complementary and should be evaluated independently before fusion resonated well. The tiered inference for latency optimization was the other piece that set us apart. Security tools that people actually use beat theoretically better tools that are too slow for real-time use.