[OC] This chart plots the lexical diversity of Eminem’s lyrics, calculated as the ratio of unique words to total words, against the total word count of each song. Each point represents a track from his catalog (excluding skits), and the bubble size reflects Genius pageviews.

The shaded horizontal and vertical bands mark the middle 50% of values along each axis:

  • Lexical richness from 0.395 to 0.462
  • Word count from 696 to 952
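For anyone wanting to reproduce bands like these, here is a minimal sketch of how a "middle 50%" range (first to third quartile) can be computed. It assumes linear-interpolation quartiles, which I believe matches R's default `quantile()` behavior; the post does not say which method was actually used, and the word counts below are made up for illustration.

```python
from statistics import quantiles

def middle_50(values):
    """Return (Q1, Q3), the bounds of the middle 50% of the values."""
    # n=4 splits the data into quartiles; "inclusive" interpolates
    # linearly between data points (an assumption, see lead-in).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3

# Hypothetical word counts, NOT the real per-song data.
word_counts = [500, 600, 700, 800, 900, 1000, 1100]
low, high = middle_50(word_counts)
print(low, high)  # 650.0 950.0
```

The same function applied to the per-song ratios would give the band on the other axis.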

Only a subset of songs is labeled directly on the chart. For the rest, the interactive version includes tooltips with full metadata, which has been fun to explore.

The four labeled quadrants were added to provide some structure, grouping songs by whether they tend to be longer, more repetitive, or more varied in vocabulary.

Lyrics were retrieved from Genius and tokenized in R. The plot was created in Datawrapper. 341 non-skit songs are shown; 23 skits were excluded from the analysis.
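The ratio itself is simple to sketch. The version below is in Python rather than the R pipeline used for the chart, and the tokenizer is a hypothetical stand-in (lowercase, then keep runs of letters, digits, and apostrophes), so its counts will not exactly match what `{tidytext}` produces. The example sentence is invented, not a lyric.

```python
import re

def tokenize(text):
    """Rough word tokenizer: an assumption, not the original R tokenization."""
    # Lowercase and normalise curly apostrophes so "Don’t" and "don't" match.
    cleaned = text.lower().replace("’", "'")
    return re.findall(r"[a-z0-9']+", cleaned)

def type_token_ratio(text):
    """Unique words divided by total words; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# 6 words, 5 unique ("the" repeats).
print(round(type_token_ratio("The cat sat on the mat"), 3))  # 0.833
```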

Link to the interactive plot is here.

Posted by TreeFruitSpecialist

12 comments
  1. [OC]

    Source: Lyrics were scraped from Genius using the `{geniusr}` package and cleaned/tokenized in R with `{tidytext}`. Lexical diversity was calculated as the ratio of unique tokens to total tokens.

    Tool: Visualization was made in Datawrapper.

  2. Should make a version where you can compare different artists overlaid with each other. That would make it more useful, because it’s difficult to determine what the baseline is for an average rapper.

  3. My first reaction is a method error. Rap God should(*) be very diverse, but it necessarily says “I, you” a lot.

    Would a one-word song be perfectly diverse and longer songs necessarily repeat simple words?

    Unique words per song (absolute)? Commenting for updates

  4. Type-token ratio as a measure of lexical diversity is sensitive to sample size. The general trend you see of a negative correlation is not a property of Eminem’s music but a property of the bias in the lexical measure you are using. If you took the longer songs and took a subsample of them, they would have higher lexical diversity.

  5. Someone did exactly this about 15 years ago with various artists. The conclusion was “Wu-Tang Clan ain’t nothin’ to fuck wit’.” They were like 6 of the top 10 in breadth of vocab.

  6. This only really makes sense when accounting for song length and tempo, otherwise faster and longer songs would naturally have more ‘diversity’, so maybe the real metric should be the number of unique words per X bars

  7. I love seeing visualizations like this but between artists, and just seeing Aesop Rock sitting out by himself. It’s actually how I first discovered him.

  8. Would love to see this grouped or colored by album. I swear MMLP2 has some insane triple and even quadruple entendres that fly over 95% of listeners’ heads.

  9. Really nice plot, and approach.

    Now do Donna Summer’s “I Feel Love”. 🙂
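On the sample-size point raised in comment 4: one common way to take length out of the picture is to score every song on equal-sized chunks and average the results (often called mean segmental TTR). The sketch below is only an illustration of that idea, not something used for the chart; the function names and the toy token lists are hypothetical.

```python
def ttr(tokens):
    """Plain type-token ratio: unique tokens over total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_segmental_ttr(tokens, window):
    """Average TTR over consecutive, non-overlapping windows of `window` tokens.

    Leftover tokens that do not fill a whole window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    segments = [tokens[i:i + window]
                for i in range(0, len(tokens) - window + 1, window)]
    if not segments:
        # Text shorter than one window: fall back to the plain ratio.
        return ttr(tokens)
    return sum(ttr(s) for s in segments) / len(segments)

short_song = "a b c d".split()   # toy tokens, not real lyrics
long_song = short_song * 3       # same vocabulary, three times the length

print(ttr(short_song))                   # 1.0
print(round(ttr(long_song), 3))          # 0.333
print(mean_segmental_ttr(long_song, 4))  # 1.0
```

The plain ratio drops from 1.0 to 0.333 just because the song got longer, while the windowed version gives both songs the same score.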
