I only keep 219 of these 329 in the network — an entity has to be mentioned in at least two paragraphs to get a node. The long tail of single-mention entities adds clutter without signal.
For the edges, I used paragraph-level co-occurrence. Two entities that appear in the same paragraph are connected; the edge weight is the number of paragraphs in which they both appear. This is a coarse proxy — it conflates “mentioned together” with “actually related” — but on a well-edited book, it works surprisingly well. Paragraphs are typically topical. If Morris Chang and TSMC appear in 34 paragraphs together, they're related regardless of what the verbs are.
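A minimal sketch of that counting, assuming each paragraph has already been reduced to a list of entity names. The function name `build_cooccurrence`, the `min_paragraphs` parameter, and the toy paragraphs are all made up for illustration; this is not the actual pipeline code.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(paragraph_entities, min_paragraphs=2):
    """Return (mentions, nodes, edges) from per-paragraph entity lists.

    paragraph_entities: list of iterables of entity names, one per paragraph.
    """
    # Deduplicate inside each paragraph so an entity counts once per paragraph.
    para_sets = [set(ents) for ents in paragraph_entities]

    # Number of paragraphs that mention each entity.
    mentions = Counter(e for s in para_sets for e in s)

    # An entity gets a node only if it appears in enough paragraphs.
    nodes = {e for e, n in mentions.items() if n >= min_paragraphs}

    # Edge weight = number of paragraphs in which both entities appear.
    edges = Counter()
    for s in para_sets:
        kept = sorted(s & nodes)  # sorted so each pair has one canonical key
        for a, b in combinations(kept, 2):
            edges[(a, b)] += 1
    return mentions, nodes, edges


# Toy input, not taken from the book.
paragraphs = [
    ["Morris Chang", "TSMC"],
    ["Morris Chang", "TSMC", "Intel"],
    ["Intel", "United States"],
    ["United States", "TSMC", "ASML"],
]
mentions, nodes, edges = build_cooccurrence(paragraphs)
```

In this toy input, ASML appears in only one paragraph, so it never becomes a node, and the Morris Chang/TSMC edge ends up with weight 2.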
I layered PMI (pointwise mutual information) on top of raw weight to surface pairs that co-occur more often than you’d expect given their individual mention counts. PMI is how you separate United States + Intel (weight = 64, but PMI = -0.08 because both appear in half the book) from John Bardeen + Walter Brattain (weight = 7, PMI = 4.4 because they basically only ever appear in each other's company: they're the transistor co-inventors).
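For reference, here is a sketch of the textbook PMI calculation applied to paragraph counts. Two things are assumptions, because the post doesn't state them: probabilities are estimated as the fraction of paragraphs containing an entity or a pair, and the logarithm is base 2. The function name `pmi` and the counts in the example are hypothetical and are not meant to reproduce the figures above.

```python
import math


def pmi(pair_count, count_a, count_b, n_paragraphs):
    """Pointwise mutual information of two entities from paragraph counts.

    PMI = log2( P(a, b) / (P(a) * P(b)) ), with each probability estimated
    as a fraction of paragraphs (an assumption; other estimators exist).
    """
    p_ab = pair_count / n_paragraphs
    p_a = count_a / n_paragraphs
    p_b = count_b / n_paragraphs
    return math.log2(p_ab / (p_a * p_b))


# Hypothetical counts in a 1,000-paragraph corpus.
# Two very frequent entities: a high raw weight, but no more co-occurrence
# than chance would predict, so PMI sits at zero.
common = pmi(pair_count=64, count_a=500, count_b=128, n_paragraphs=1000)

# Two rare entities that almost always appear together: a low raw weight,
# but far more co-occurrence than chance, so PMI is strongly positive.
rare = pmi(pair_count=7, count_a=8, count_b=8, n_paragraphs=1000)
```

PMI is known to favor rare pairs, which is one reason to read it alongside the raw weight rather than instead of it.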