[OC] Entity Treemap from 50,000+ News Articles

Data source:
Collected from ~20 major global news outlets for 2025 (e.g. BBC, Reuters, NPR, The Guardian, Al Jazeera, France24). Articles were scraped by kosmopulse.com.

Methodology:

  • Extracted named entities (people, places, organizations) using spaCy NLP.
  • Constructed a co-occurrence matrix to detect which entities appear together across articles.
  • Applied hierarchical clustering (Ward linkage) to group related entities.
  • Labeled internal tree nodes with the most frequent entity in each cluster.
  • Exported the final structure as a tree and visualized it with Plotly Express (treemap).
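
As a rough illustration of the co-occurrence and clustering steps, here is a minimal sketch. It is not the original notebook code: the `articles` list is made-up toy data standing in for the per-article spaCy output, and using each entity's co-occurrence row as its feature vector for Ward linkage is an assumption about how the matrix was fed to the clustering.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical input: one set of entity names per article, e.g. what a
# spaCy NER pass might yield after filtering to people, places, organizations.
articles = [
    {"Donald Trump", "Ukraine", "Russia"},
    {"Ukraine", "Russia", "NATO"},
    {"Donald Trump", "Ukraine"},
    {"BBC", "London"},
    {"London", "BBC", "NATO"},
]

entities = sorted(set().union(*articles))
index = {name: i for i, name in enumerate(entities)}

# Symmetric co-occurrence matrix: cell (i, j) counts the articles
# that mention both entity i and entity j.
cooc = np.zeros((len(entities), len(entities)))
for ents in articles:
    for a, b in combinations(sorted(ents), 2):
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1

# Ward linkage, treating each entity's co-occurrence row as its feature
# vector (an assumption; the post does not say how the matrix was used).
Z = linkage(cooc, method="ward")

# Cut the tree into at most two flat clusters to inspect the grouping.
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = {}
for name, label in zip(entities, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```

On real data the matrix would be far larger and probably sparse, so some filtering of rare entities before clustering seems likely, though the post does not describe it.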

Tools:
Python, pandas, spaCy, scikit-learn, scipy, plotly, Jupyter

What it shows:
Each box represents an entity (like “Donald Trump” or “Ukraine”). Its size reflects how often that entity appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of each other, in global media coverage.
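
To make the sizing and nesting concrete, here is a small sketch of how treemap inputs could be assembled. The `sizes` and `clusters` values are invented for illustration, and reading "size" as an entity's total co-occurrence count is one interpretation of the description above, not a confirmed detail of the original chart.

```python
# Hypothetical co-occurrence totals per entity (e.g. row sums of a
# co-occurrence matrix). These numbers are made up for illustration.
sizes = {"Ukraine": 5, "Russia": 4, "Donald Trump": 3, "BBC": 3, "London": 2}

# Hypothetical flat clusters, standing in for the clustering output.
clusters = [["Ukraine", "Russia", "Donald Trump"], ["BBC", "London"]]

names, parents, values = [], [], []
for members in clusters:
    # Label the cluster with its most frequent member; that entity
    # becomes the parent box and the other members nest inside it.
    head = max(members, key=lambda name: sizes[name])
    names.append(head)
    parents.append("")  # an empty parent marks a top-level box
    values.append(sizes[head])
    for name in members:
        if name != head:
            names.append(name)
            parents.append(head)
            values.append(sizes[name])

for row in zip(names, parents, values):
    print(row)
```

Lists in this shape could then be handed to `plotly.express.treemap(names=names, parents=parents, values=values)`. This sketch only handles one level of nesting, whereas the full hierarchical tree would need the same labeling applied at every internal node.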

For the original high-resolution PDF (width=3000, height=2000), see https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match

I also created a 60-second video version of this exploration, if you're curious: https://youtu.be/3H5bcNKXihM

Posted by Serious-Parking-2625

4 comments
  1. **Data source**: News articles scraped from ~20 global news outlets (2025), including BBC, Reuters, NPR, The Guardian, Al Jazeera, and others. Extracted by [kosmopulse.com](http://kosmopulse.com).

    **Method**:

    – Named Entity Recognition (spaCy) to extract people, places, and organizations from article text

    – Co-occurrence matrix of entity pairs

    – Hierarchical clustering (Ward linkage)

    – Final visualization via Plotly Express (Treemap/Sunburst)

    **Tools**:

    – Python (pandas, spaCy, sklearn, scipy, plotly)

    – Jupyter + Colab for preprocessing and clustering

    **Visualization**:

    Each box represents an entity (like “Donald Trump” or “Ukraine”). Its size reflects how often that entity appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of each other, in global media coverage.

    For the original high-resolution PDF (width=3000, height=2000), see [https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match](https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match)

    I also created a 60-second video version of this exploration, if you’re curious: [https://youtu.be/3H5bcNKXihM](https://youtu.be/3H5bcNKXihM)

  2. Amazing visualization! Really puts the interconnectivity of global news into perspective.😉

  3. Uh, why are US, Pakistan, Trump, UK & Singapore extremely common together? Or am I reading this wrong? Nothing happened in Singapore that would warrant this.
