
LabLens

spaCy · NER · BioSchemas · Docker
View on GitHub →

Scientific Papers Are a Metadata Graveyard

Every scientific paper contains a mountain of structured information buried in unstructured prose. The instruments used, the chemical concentrations, the experimental conditions, the organisms studied. All of it is in there, but it is locked inside paragraphs of natural language. If you want to find every paper that used a particular mass spectrometer at a specific temperature range, you are out of luck unless someone manually tagged those details. Nobody has time for that. So I built LabLens to do it automatically.

300 Regex Patterns and Counting

LabLens extracts metadata from scientific PDFs using a combination of spaCy NER and 300 domain-specific regex patterns. The patterns cover eight entity categories: instruments, chemicals, experimental conditions, parameters, materials, methods, organisms, and samples. Each category has its own pattern library tuned to the way scientists actually write about these things in papers.
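To make the per-category pattern libraries concrete, here is a minimal sketch of what that structure can look like. The category names come from the list above, but the specific patterns, the `PATTERNS` dict, and the `regex_extract` helper are illustrative assumptions, not LabLens's actual code.

```python
import re

# Hypothetical excerpt of a per-category pattern library; the real
# LabLens patterns are far more extensive (300 in total).
PATTERNS = {
    "chemical": [
        # Simple chemical formulas such as Na2SO4 or H2O
        re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b"),
    ],
    "instrument": [
        # Vendor + model strings such as "Bruker AVANCE III 400"
        re.compile(r"\b(?:Bruker|Shimadzu|Agilent)\s+[A-Z][\w-]*(?:\s+[A-Z0-9][\w-]*)*"),
    ],
    "condition": [
        # Temperatures such as "37 °C" or "400 K"
        re.compile(r"\b\d+(?:\.\d+)?\s*(?:°C|K)\b"),
    ],
}

def regex_extract(text: str) -> list[tuple[str, str]]:
    """Return (category, matched span) pairs for every pattern hit."""
    hits = []
    for category, patterns in PATTERNS.items():
        for pat in patterns:
            hits.extend((category, m.group(0)) for m in pat.finditer(text))
    return hits
```

Keeping patterns grouped by category makes each library independently tunable: a false positive in the chemical patterns can be fixed without touching the instrument ones.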

Why regex alongside NER? Because scientific nomenclature is weird. Chemical formulas like Na2SO4 or instrument model numbers like Bruker AVANCE III 400 do not follow the patterns that general-purpose NER models are trained on. spaCy is great at catching "Escherichia coli" as an organism, but it will miss "E. coli K-12 MG1655" without domain-specific help. The regex layer catches the structured nomenclature that NER misses, and NER catches the natural-language references that no fixed pattern can anticipate.

[Diagram: PDF input → spaCy NER + 300 regex patterns → confidence scoring (4 signals merged) → 8 entity categories → BioSchemas JSON-LD]
Dual extraction pipeline: NER and regex converge into confidence-scored entities

Multi-Signal Confidence

Raw extraction is only half the problem. The other half is knowing how confident you should be in each extracted entity. LabLens scores confidence using four signals: specificity (how precise the mention is), context (whether the surrounding text supports the entity type), frequency (how often it appears in the paper), and source agreement (whether NER and regex independently found it).
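The four signals above can be blended into a single score. This is a hedged sketch: the weights, the frequency normalization, and the agreement boost are assumptions chosen for illustration, not LabLens's actual values.

```python
# Hypothetical scoring function; weights and caps are illustrative.
def confidence(specificity: float, context: float,
               frequency: int, source_agreement: bool) -> float:
    """Blend four signals into a 0-1 confidence score."""
    # Diminishing returns on raw mention counts: 5+ mentions saturate
    freq_signal = min(frequency / 5.0, 1.0)
    base = 0.4 * specificity + 0.3 * context + 0.3 * freq_signal
    # Agreement between NER and regex acts as a multiplier, capped at 1.0
    return min(base * (1.3 if source_agreement else 1.0), 1.0)
```

The multiplicative agreement term reflects the point made below: two independent pipelines finding the same entity is stronger evidence than either signal alone.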

Source agreement is the most powerful signal. When spaCy's NER independently identifies "Shimadzu LC-20AD" as an instrument and the regex layer also matches it against the instrument pattern library, that is much higher confidence than either signal alone. Entities found by both pipelines consistently have fewer false positives than entities found by only one.
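Computing source agreement is essentially a set union over the two pipelines' outputs. A minimal sketch, assuming entities arrive as (category, text) pairs (the function name and data shapes are hypothetical):

```python
from collections import defaultdict

# Hypothetical merge step: union entities from both pipelines and
# record which sources found each one.
def merge_entities(ner_hits, regex_hits):
    """Map (category, text) -> sources that found it, plus agreement flag."""
    sources = defaultdict(set)
    for cat, text in ner_hits:
        sources[(cat, text)].add("ner")
    for cat, text in regex_hits:
        sources[(cat, text)].add("regex")
    return {
        entity: {"sources": sorted(srcs), "agreement": len(srcs) == 2}
        for entity, srcs in sources.items()
    }
```

Entities with `agreement=True` are the high-confidence tier; everything found by only one pipeline gets scored more cautiously.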

FAIR Data via BioSchemas

Extracting metadata is only useful if you can do something with it. LabLens exports everything as BioSchemas-compliant JSON-LD, which means the extracted metadata follows the FAIR data principles: Findable, Accessible, Interoperable, Reusable. You can feed the output directly into institutional repositories, knowledge graphs, or any system that understands schema.org markup.
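For a sense of what that output looks like, here is a minimal JSON-LD sketch. The `@context`/`@type` keys follow standard schema.org JSON-LD usage, but the specific properties and the `to_jsonld` helper are assumptions about the shape, not LabLens's exact export format.

```python
import json

# Hypothetical serializer producing schema.org-flavored JSON-LD.
def to_jsonld(title: str, entities: list[dict]) -> str:
    doc = {
        "@context": "https://schema.org/",
        "@type": "ScholarlyArticle",
        "name": title,
        "about": [
            {
                "@type": "DefinedTerm",
                "name": e["text"],
                "inDefinedTermSet": e["category"],
            }
            for e in entities
        ],
    }
    return json.dumps(doc, indent=2)
```

Because the vocabulary is schema.org, any JSON-LD-aware consumer can parse this without knowing anything about LabLens.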

This was not an afterthought. The whole point of LabLens is to make the implicit knowledge in scientific papers machine readable. If the output format is some proprietary JSON blob, you have just moved the problem from "locked in PDFs" to "locked in my custom format." BioSchemas solves that by giving you a standardized vocabulary that the scientific web already understands.

The Interface

Upload a PDF, get back annotated entities with confidence scores, category breakdowns, and exportable JSON-LD. The Streamlit UI makes it accessible to researchers who do not want to touch a command line. Docker support means you can run it on a shared lab server and let the whole group use it through a browser. No installation, no dependency conflicts, just point your browser at the right port.
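Deployment on a shared lab server can look something like the fragment below. The image name is hypothetical; port 8501 is Streamlit's documented default.

```shell
# Build the image from the repo checkout (image name is an assumption)
docker build -t lablens .

# Run detached, exposing Streamlit's default port 8501
docker run -d -p 8501:8501 --name lablens lablens

# The group then browses to http://<lab-server>:8501
```

Mapping the container port straight through keeps the setup zero-config for users; only the server admin ever touches Docker.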