Friday, May 1, 2026

Graph with no LLMs

I converted the PDF to text, segmented it into 1,156 paragraphs, and built a canonical schema with four entity types: people, companies, countries, and technologies. Each canonical entity carries an alias list. I ran regex matching for every alias against every paragraph to produce entity mentions. From those mentions I built an undirected weighted graph where nodes are entities and each edge weight is the number of paragraphs in which the two entities co-occur, with pointwise mutual information (PMI) as a secondary score to surface surprising pairs. I ran Louvain community detection, computed the standard centrality measures (PageRank, betweenness, eigenvector, weighted degree), and then layered pattern-based typed relation extraction on top: 17 predicates such as FOUNDED, INVENTED, ACQUIRED, and SANCTIONED, filtered by semantic-type plausibility so that an implausible triple like (Apple, FOUNDED, Taiwan) gets dropped.
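Here is a minimal sketch of the alias matching, co-occurrence counting, and PMI scoring steps, using only the standard library. The schema entries, sample paragraphs, and function names are made up for illustration and are not taken from the actual pipeline. PMI here is computed over paragraph-level probabilities, which is one reasonable reading of the description above.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical schema: canonical entity -> (entity type, alias list).
SCHEMA = {
    "Apple": ("company", ["Apple", "AAPL"]),
    "TSMC": ("company", ["TSMC"]),
    "Taiwan": ("country", ["Taiwan"]),
    "Morris Chang": ("person", ["Morris Chang", "Chang"]),
}

# One compiled pattern per entity; longer aliases first, whole words only.
PATTERNS = {
    entity: re.compile(
        r"\b(?:"
        + "|".join(re.escape(a) for a in sorted(aliases, key=len, reverse=True))
        + r")\b"
    )
    for entity, (_type, aliases) in SCHEMA.items()
}


def mentions(paragraph):
    """Return the set of canonical entities with at least one alias hit."""
    return {e for e, pat in PATTERNS.items() if pat.search(paragraph)}


def cooccurrence(paragraphs):
    """Count paragraphs per entity and per unordered entity pair."""
    entity_counts, pair_counts = Counter(), Counter()
    for paragraph in paragraphs:
        found = sorted(mentions(paragraph))  # sorted so pair keys are stable
        entity_counts.update(found)
        pair_counts.update(combinations(found, 2))
    return entity_counts, pair_counts


def pmi(pair, entity_counts, pair_counts, n_paragraphs):
    """Pointwise mutual information (base 2) of a pair that co-occurs."""
    a, b = pair
    p_ab = pair_counts[pair] / n_paragraphs
    p_a = entity_counts[a] / n_paragraphs
    p_b = entity_counts[b] / n_paragraphs
    return math.log2(p_ab / (p_a * p_b))


# Made-up sample paragraphs standing in for the segmented PDF text.
paragraphs = [
    "Apple relies on TSMC for its chips.",
    "TSMC was founded in Taiwan by Morris Chang.",
    "Apple reported earnings.",
    "Chang led TSMC for decades.",
]
entity_counts, pair_counts = cooccurrence(paragraphs)
for pair, weight in pair_counts.most_common():
    score = pmi(pair, entity_counts, pair_counts, len(paragraphs))
    print(pair, weight, round(score, 3))
```

The pair counts are the edge weights of the graph. PMI helps because raw counts favor entities that appear everywhere, while PMI rewards pairs that co-occur more often than their individual frequencies would predict.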

No LLMs. Just schema, regex, NetworkX, and Louvain. The whole pipeline is ~800 lines of Python and runs in about 20 seconds.
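To make the typed relation step concrete, here is a rough sketch of pattern-based extraction with a semantic-type plausibility filter. Everything in it is an assumption for illustration: the two predicates shown, their trigger phrases, the allowed subject and object types, and the sample sentences. The real pipeline has 17 predicates, and its patterns are not reproduced here.

```python
import re

# Hypothetical entity -> type lookup (would come from the canonical schema).
ENTITY_TYPES = {
    "Apple": "company",
    "Beats": "company",
    "Steve Jobs": "person",
    "Taiwan": "country",
}

# Hypothetical predicates: trigger regex, allowed subject types, allowed object types.
PREDICATES = {
    "FOUNDED": (r"(?:co-)?founded", {"person"}, {"company"}),
    "ACQUIRED": (r"acquired|bought", {"company"}, {"company"}),
}

# Alternation of all entity names, longest first so longer names win.
NAMES = "|".join(
    re.escape(name) for name in sorted(ENTITY_TYPES, key=len, reverse=True)
)


def extract_relations(sentence):
    """Return (subject, predicate, object) triples that pass the type filter."""
    triples = []
    for predicate, (trigger, subj_types, obj_types) in PREDICATES.items():
        pattern = re.compile(rf"\b({NAMES})\b\s+(?:{trigger})\s+\b({NAMES})\b")
        for match in pattern.finditer(sentence):
            subj, obj = match.group(1), match.group(2)
            # Plausibility filter: drop triples whose types do not fit the predicate.
            if ENTITY_TYPES[subj] in subj_types and ENTITY_TYPES[obj] in obj_types:
                triples.append((subj, predicate, obj))
    return triples


print(extract_relations("Steve Jobs co-founded Apple."))
print(extract_relations("Apple founded Taiwan."))  # pattern matches, filter drops it
```

The type filter is cheap but does a lot of work: surface patterns fire on plenty of nonsense, and checking the subject and object types against what the predicate allows removes most of it without any model in the loop.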
