A chord diagram of concepts and relationships extracted from 12 lectures of graduate stochastic analysis. 96 concepts, 534 relationships — no predefined ontology, no labeled training data. Structure emerges from the text.
Why this question
Can you recover the conceptual structure of a mathematical field without knowing what field you’re reading? The pipeline doesn’t know it’s processing stochastic analysis — it sees text, runs repeated extractions, and scores concepts and relationships by how consistently they appear across runs. Confidence is frequency, not a classifier output. This is early work; the current corpus is one course. The next step is a second domain to test whether the approach surfaces genuine cross-domain connections.
Methodology
Each lecture is processed with n=10 extraction runs. A concept's confidence is the fraction of those runs in which its cluster appears, an idea borrowed from semantic entropy (Farquhar et al. 2024). Synonym clustering uses tiered entailment: two small models judge first, and a larger model acts as tiebreaker when they disagree. A second-pass semantic merge then recovers concepts fragmented across naming variants: “infinitesimal generator”, “generator (diffusion)”, and “generator” collapse to a single node.
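To make the scoring concrete, here is a minimal sketch in Python (not the pipeline's Julia, and not its actual code). It assumes each run has already been reduced to a set of concept names and that the synonym clusters are available as a simple lookup table; the names concept_confidence, same_concept, runs, and canonical are hypothetical, and the "models" in the tiered vote are stand-in functions rather than real LLM calls.

```python
from collections import Counter

def concept_confidence(runs, canonical):
    """Fraction of runs in which each concept cluster appears.

    runs: list of sets of concept names, one set per extraction run.
    canonical: dict mapping a naming variant to its cluster label
               (assumed to come from the synonym-clustering step).
    """
    counts = Counter()
    for run in runs:
        # Map mentions to clusters; a set ensures one count per cluster per run.
        clusters = {canonical.get(name, name) for name in run}
        counts.update(clusters)
    return {cluster: count / len(runs) for cluster, count in counts.items()}

def same_concept(a, b, small_judges, tiebreaker):
    """Tiered vote: accept the small judges' verdict when they agree,
    otherwise defer to the larger tiebreaker. All judges are callables
    (a, b) -> bool standing in for model calls."""
    votes = [judge(a, b) for judge in small_judges]
    if all(votes) or not any(votes):
        return votes[0]
    return tiebreaker(a, b)

# Illustrative data: 10 runs over one lecture (made up for this example).
canonical = {
    "infinitesimal generator": "generator",
    "generator (diffusion)": "generator",
}
runs = (
    [{"generator", "Itô formula"}] * 4
    + [{"infinitesimal generator", "Itô formula"}] * 3
    + [{"generator (diffusion)"}] * 2
    + [{"Itô formula"}]
)
conf = concept_confidence(runs, canonical)

# Stand-in judges that disagree, so the tiebreaker decides.
merged = same_concept(
    "generator", "infinitesimal generator",
    small_judges=[lambda a, b: True, lambda a, b: False],
    tiebreaker=lambda a, b: True,
)
```

With this made-up data the three naming variants count as one cluster seen in 9 of 10 runs, while "Itô formula" is seen in 8 of 10. Relationship confidence could be scored the same way over (source, target) pairs, though that is an assumption about the pipeline rather than something stated here.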
Implementation
The pipeline is written in Julia: LLM calls go through Ollama running locally, the graph is exported as GraphML, and per-lecture reports are merged in a post-processing step. Observable is a pure downstream consumer. All the heavy work is pre-computed offline, and the notebook reads a static file.
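For readers unfamiliar with GraphML, the following Python sketch shows roughly what such an export involves, using only the standard library. It is illustrative and not the pipeline's Julia code: the function to_graphml, the attribute keys "name" and "conf", the undirected edge default, and the sample values are all assumptions.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def to_graphml(nodes, edges):
    """Serialize scored concepts and relationships as a GraphML string.

    nodes: dict mapping concept name -> confidence.
    edges: dict mapping (source name, target name) -> confidence.
    """
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the data attributes used below (hypothetical key ids).
    ET.SubElement(root, "key", {"id": "name", "for": "node",
                                "attr.name": "name", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "conf", "for": "all",
                                "attr.name": "confidence", "attr.type": "double"})
    # Treating relationships as undirected is an assumption.
    graph = ET.SubElement(root, "graph", id="G", edgedefault="undirected")

    ids = {}
    for i, (name, confidence) in enumerate(nodes.items()):
        ids[name] = f"n{i}"
        node = ET.SubElement(graph, "node", id=ids[name])
        ET.SubElement(node, "data", key="name").text = name
        ET.SubElement(node, "data", key="conf").text = str(confidence)

    for (src, dst), confidence in edges.items():
        edge = ET.SubElement(graph, "edge", source=ids[src], target=ids[dst])
        ET.SubElement(edge, "data", key="conf").text = str(confidence)

    return ET.tostring(root, encoding="unicode")

# Made-up sample graph: two concepts and one relationship.
xml_text = to_graphml(
    nodes={"generator": 0.9, "Itô formula": 0.8},
    edges={("generator", "Itô formula"): 0.7},
)
```

Because the output is a static file, a downstream consumer such as the notebook only needs to parse it; no model calls happen at view time.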