Word Diversity in Top Grossing American Films

Updated with final visualization and repo

I’m back in the classroom as a student for the first time in over twenty years!

I’m taking a course at the CUNY Graduate Center, Interactive Data Visualization, and I’m really enjoying the structure and accountability the classroom brings. We’ve been learning the fundamentals of D3.js to build a variety of chart types. For example, I made an interactive map of the US that charts UFO reports from 1910 to 2014. I’ve enjoyed discovering data sets and building visualizations with the tools D3 provides.

We’ve come to the end of the tutorials phase of the class and are moving on to final projects, which interestingly follows a similar pattern to the undergraduate web design and development courses I teach. So, like my students, I’m in the proposal phase of the project, and here’s what I’m planning to build…


For some time I’ve been interested in doing some text analysis of movie dialogue, using the corpus of community-generated SRT files as a primary resource. If you’re unfamiliar with an SRT (SubRip) file, it’s a widely supported subtitle format made up of timestamps and the text associated with each timestamp. Most video players support these files, allowing viewers to load subtitles in the language of their choice.

Panic Room SRT sample
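To make the format concrete, here’s a rough Python sketch of how the timestamp-and-text structure could be read. The sample dialogue is made up for illustration (it isn’t from any real subtitle file), and the `Cue` class and `parse_srt` function are hypothetical names, not part of any library.

```python
import re
from dataclasses import dataclass

# Timing line, e.g. "00:01:02,500 --> 00:01:04,000"
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


@dataclass
class Cue:
    start: float  # seconds from the start of the film
    end: float
    text: str


def _seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(srt_text):
    """Split SRT text into cues; blocks without a timing line are skipped."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                # Everything after the timing line is dialogue text.
                text = " ".join(p.strip() for p in lines[i + 1:] if p.strip())
                # Drop simple markup such as <i>...</i>.
                text = re.sub(r"<[^>]+>", "", text)
                cues.append(Cue(_seconds(*parts[:4]), _seconds(*parts[4:]), text))
                break
    return cues


# Made-up sample in SRT layout: index, timing line, then one or more text lines.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
<i>Get out of there!</i>

2
00:00:04,000 --> 00:00:06,250
We have to save the world.
Again.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(cue.start, cue.end, cue.text)
```

Real community-made files are messier than this (odd encodings, credits, sound-effect captions), so a real pipeline would need more cleanup than this sketch does.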

These files are particularly important for fans of films which haven’t had translations created for their native language, and there is quite a large following of translators generating SRTs for popular films. OpenSubtitles.org is one of the more popular repositories online, and there you can find subtitles for almost any film imaginable. For example, Avengers: Endgame has over 400 SRT files written in dozens of different languages.

My interest, though, lies with the English SRT files of popular American films, and there are plenty of these.

I’ve dabbled with movie data using the movie database API, building a six degrees of separation game and, most recently for this class, a visualization of top grossing American films of the past 50 years. Now I’m hoping to connect the movie database API to SRT data to build a few visualizations, each inspired by a different question.

Project Plans

Swarm Chart Study

Inspired by Matt Daniels’ The Largest Vocabulary in Hip-Hop study, I’m planning to build a similar swarm chart by analyzing the word diversity of popular American movies. The visualization will be filterable by genre as well as by year and decade. Hopefully there will be some interesting data points to feature as well. I’m imagining that your average action film has a lot less word diversity, but we’ll see!

There may also be opportunities to aggregate word diversity by popular screenwriters and directors. As a stretch goal, I’m also considering connecting SRT file data with film script data. A script makes it possible to connect characters, and therefore actors, to particular lines of dialogue. So if successful, there could be some word diversity studies of actor vocabularies.
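As a rough idea of what ‘word diversity’ could mean in code, here’s a minimal Python sketch that counts unique words, optionally within only the first N words so that longer films aren’t favored just for having more dialogue. The tokenizing rule, the function names and the window idea are all assumptions for illustration. A library tokenizer such as nltk’s would be a reasonable swap for the regular expression.

```python
import re

# Lowercase words, allowing one internal straight apostrophe ("don't").
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Return the lowercase words found in text."""
    return WORD.findall(text.lower())


def unique_words(text, window=None):
    """Count distinct words, optionally within only the first `window` words."""
    tokens = tokenize(text)
    if window is not None:
        tokens = tokens[:window]
    return len(set(tokens))


def type_token_ratio(text):
    """Distinct words divided by total words (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Made-up line of dialogue, not taken from any film.
line = "Don't stop. Don't look back!"
print(tokenize(line))
print(unique_words(line), type_token_ratio(line))
```

The plain ratio tends to shrink as a text gets longer, which is why a fixed window (or some other length normalization) is probably the fairer number to plot across films.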

Line Chart Study

Separately, I’m planning to do some ‘featured’ word and/or phrase searches to see how their usage has evolved over time. Some of this is inspired by tropes like ‘get out of there.’ It would be interesting to see how the usage of ‘save the world’ has grown with our culture’s fear of a variety of apocalypses. I’d also like to see when modern terms like ‘internet’ first appeared and how their usage grew.

I’m also interested in the use of profanity in popular films, which has been shaped over time by cultural taste as well as by the film business. I’ve always thought F-bomb usage in popular films peaked in the 80s. Action films like Die Hard (1988) had dialogue full of profanity, but by the end of the franchise, in an effort to garner a PG-13 rating, the signature line was the only surviving F-bomb. Recently, though, films have been pushing the limits of ‘allowed’ PG-13 profanity.
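A phrase search like this could be sketched in Python as follows. It assumes the dialogue for each film is already available as plain text alongside its release year. The `films` list here is invented sample data, and `count_phrase` and `phrase_by_year` are hypothetical helper names.

```python
import re
from collections import Counter


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    words = [re.escape(w) for w in phrase.split()]
    # Allow any run of non-word characters (spaces, line breaks, punctuation)
    # between the words, since subtitles often wrap a phrase across lines.
    pattern = r"\b" + r"\W+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def phrase_by_year(films, phrase):
    """Sum phrase counts per release year; films is a list of (year, dialogue)."""
    totals = Counter()
    for year, dialogue in films:
        totals[year] += count_phrase(dialogue, phrase)
    return dict(sorted(totals.items()))


# Invented sample data: (release year, dialogue text).
films = [
    (1998, "Save the world."),
    (1988, "Get out of there! Get out of there!"),
    (1998, "We will save the world, then save the world again."),
]

print(phrase_by_year(films, "save the world"))
print(phrase_by_year(films, "get out of there"))
```

Raw totals will favor years with more films in the data set, so for a line chart it would likely make sense to divide by the number of films (or total words) per year.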

Sam Richardson gets ‘shitty’ in The Tomorrow War

For these visualizations I’m planning to create layered line charts that track the usage of particular phrases over time. The lines will likely have tooltips that present the counts and, hopefully, feature particularly memorable films.


Building some of the visualizations will be challenging, but I’ve also put myself into a bit of a pickle with my data sources. I’ve found a couple of repositories on GitHub that will be helpful:

  • Movie Script Database – a set of Python scripts to download and parse movie scripts into structured CSV files.
  • Subtext – a Jupyter notebook accompanied by almost 1,000 CSV files built by parsing SRT files.

I successfully worked through the process of scraping a number of different websites for movie scripts and parsing them into CSVs. These files, combined with the SRT-derived CSVs from Subtext, are a great starting point for my project, but…

I don’t yet have the word diversity counts or the ability to search for particular words and phrases. So I’ve been teaching myself some Python, and in particular two libraries: pandas for working with CSVs and nltk for natural language processing. It’s been going OK so far, but I’m hoping the break next week will let me figure out the core of what I need to generate the data I’m interested in.

So there’s some real skill stretching needed to pull this off. Wish me luck!


Data to track
  • Year
  • Genre
  • Word counts, words per minute
  • Screenwriters’ and directors’ word counts
  • Counts of profanity and trope phrases
  • Counts of actors’ usage of particular words and/or phrases
Identify unique ‘events’ and spotlight them
  • average word count in a film
  • top word count screenwriter
  • top word count film non-Shakespeare
  • genre that averages the least variety of words
  • film with the fewest words
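The ‘words per minute’ item in the list above could be derived from subtitle timings. Here’s a minimal Python sketch, assuming each cue has already been reduced to a (start seconds, end seconds, text) tuple and treating the end of the last cue as the film’s length. That is only an approximation, since credits and silent stretches aren’t captured in subtitles. The function name and sample cues are made up.

```python
import re


def words_per_minute(cues):
    """Average spoken words per minute.

    cues: iterable of (start_seconds, end_seconds, text) tuples.
    The film length is approximated by the latest cue end time.
    """
    cues = list(cues)
    if not cues:
        return 0.0
    total_words = sum(len(re.findall(r"[A-Za-z']+", text)) for _, _, text in cues)
    duration_minutes = max(end for _, end, _ in cues) / 60
    return total_words / duration_minutes if duration_minutes else 0.0


# Made-up cues: 9 words spoken over a 2-minute stretch.
sample_cues = [
    (0.0, 30.0, "one two three"),
    (60.0, 120.0, "four five six seven eight nine"),
]
print(words_per_minute(sample_cues))
```

Using the runtime from a movie metadata source instead of the last cue would probably give a more honest denominator.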
Resources and tools

brief presentation on this post
