Andrew's Blog

What Words Reveal: Sentiment Analysis of the 2024 Presidential Debate Transcript

Andrew Nolan

2024-09-12

Since Kamala Harris announced Tim Walz as her running mate, I have been more hopeful and optimistic about politics than I have been in a long time. This feeling primarily comes from having an option I can be excited to vote for, but the data scientist inside me wanted to know if there was something more than that. Is there something in the words Harris says that makes me hopeful? Not just the context of the words and the policies they describe, but the actual word choice itself. Is there something about the way Trump speaks (besides the politics and rhetoric) that makes me upset and angry? Luckily, we can leverage the power of technology to visualize their words and understand the emotions behind their diction.

Let's take a look at Tuesday night's debate and see if we can unravel this mystery. The first thing we need to do is...

Prepare the data

ABC, the host of the debate, provided a transcript online. They nicely labeled every statement from former President Donald Trump, Vice President Kamala Harris, and our two moderators, David Muir and Linsey Davis. This makes it easy for us to get the text we need.

After extracting the text we can preprocess it. This step is all about getting the text ready for analysis. We do not want noisy data, so we can do a few basic things to clean it up.
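As a rough illustration, cleanup along these lines might lowercase the text, strip punctuation, and drop common stop words. Those three steps are assumptions chosen for the example, the stop-word list is a tiny hypothetical one, and the sketch is in Python rather than the MATLAB used for the actual analysis:

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it", "that", "we"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Keep letters and apostrophes so contractions like "don't" stay in one piece.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Trim stray quote marks left at the edges of tokens.
    cleaned = (t.strip("'") for t in tokens)
    return [t for t in cleaned if t and t not in STOP_WORDS]

print(preprocess("We believe in the American people!"))
```

Which steps make sense depends on the analysis. Some sentiment models use capitalization and punctuation as signals, so it can be worth keeping an untouched copy of the text for scoring.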

Now we have the data prepped and we are ready to...

Visualize the text

There are a few basic things we can see from the data right away. Trump spoke 74 times, Harris 34 times, and the moderators 108 times. While Trump spoke more often, his statements tended to be shorter, averaging only 109 words compared to Harris's 172 words per response. These counts also line up with the length of time the candidates spoke. CNN reported Trump spoke for 42 minutes and 52 seconds while Harris spoke for 37 minutes and 36 seconds. Although Trump spoke more than twice as often, in total he only spoke for about 5 extra minutes.
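Counts like these are simple to compute once each statement is tagged with its speaker. Here is a minimal Python sketch (not the MATLAB code behind this post). It assumes a "SPEAKER: statement" layout with one statement per line, which may not match the real transcript, and the sample lines are made up:

```python
from collections import defaultdict

# Made-up lines in an assumed "SPEAKER: statement" layout; not quotes from the debate.
SAMPLE = """\
MODERATOR: Welcome to the debate.
HARRIS: Thank you. I am glad to be here tonight.
TRUMP: Thank you.
MODERATOR: First question.
TRUMP: It is a great question.
"""

def speaker_stats(transcript):
    """Return {speaker: (number_of_statements, average_words_per_statement)}."""
    word_counts = defaultdict(list)
    for line in transcript.splitlines():
        speaker, sep, statement = line.partition(":")
        if not sep:
            continue  # skip lines without a speaker label
        word_counts[speaker.strip()].append(len(statement.split()))
    return {s: (len(c), sum(c) / len(c)) for s, c in word_counts.items()}

stats = speaker_stats(SAMPLE)
for speaker, (turns, avg_words) in stats.items():
    print(f"{speaker}: {turns} statements, {avg_words:.1f} words on average")
```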

This seems to line up with what I witnessed as a viewer of the debate. Trump seemed to interject more often leading to the more frequent, shorter statements. For a debate with muted mics, they seemed to turn them back on pretty often.

The numbers on the speaking are fun, but they don't tell us much about what the candidates actually said. Let's visualize the actual text! The most famous data visualization for text is the word cloud. Oftentimes, I feel the word cloud is overlooked as a data visualization tool, sometimes being treated more as a gimmick. But we can gain valuable insights from looking at a map of the most frequently spoken words from each candidate. Below, you can see Harris's cloud on the left and Trump's words on the right.

Figures: word cloud of Harris's debate speech (left); word cloud of Trump's debate speech (right).

I think the biggest thing that stands out looking at these clouds is that Harris talked about Trump a lot. Conversely, it seems like Trump didn't mention Harris by name much at all, although he did talk about Biden. Trump does not mention America, while American is one of Harris's top words. And interestingly, both candidates said "people" quite a bit.
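Under the hood, a word cloud is just a word-frequency table drawn with font sizes, so observations like these can be checked by counting. A small Python sketch of that counting step follows. The sample sentence and the stop-word list are made up for illustration, and this is not the MATLAB code used for the clouds above:

```python
from collections import Counter
import re

# Hypothetical minimal stop-word list, just enough for the sample below.
DEFAULT_STOP_WORDS = frozenset({"the", "a", "and", "to", "of", "is", "are"})

def top_words(text, n=3, stop_words=DEFAULT_STOP_WORDS):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words).most_common(n)

sample = "The people want change and the people are ready. People vote."
print(top_words(sample, n=1))
```

Running the same function on each candidate's combined statements would give the ranking that decides which words appear largest.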

Word clouds are one of those things that everyone gets something different out of. There is a lot to take in, so take some time to look through them yourself and see what you can find. Does anything in particular stand out to you?

My goal for this experiment is to see how optimistic and hopeful the content of the speeches was. Visualizing the text through word clouds can give us some helpful insights, but we can specifically visualize the positivity and negativity as well using...

Sentiment Analysis!

Sentiment analysis is the process of categorizing text data based on emotions. Specifically, we analyze the words in a piece of text and assign it a score from -1 to 1, where -1 is the most negative and 1 is the most positive. The emotional value of each word can be specified by a human or generated by a computer using machine learning.

For our example we are going to use the VADER sentiment analysis model. This is an open-source, rule-based sentiment analysis model developed by researchers at Georgia Tech for analyzing the emotions of social media posts. However, it has also been tuned for use on general-purpose text.
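To give a feel for how a lexicon-based scorer works, here is a toy Python sketch. As I understand it, VADER gives each word in its lexicon a valence on roughly a -4 to 4 scale, sums the valences in a statement, and squashes the sum into the -1 to 1 range with x / sqrt(x² + 15). The mini-lexicon below is hypothetical, and the sketch leaves out VADER's rules for negation, intensifiers, capitalization, and punctuation, so treat it as an illustration of the idea rather than the real model:

```python
import math

# Hypothetical mini-lexicon with valences on a -4..4 scale; NOT VADER's actual entries.
TOY_LEXICON = {
    "hope": 2.0, "great": 3.0, "care": 2.0,
    "bad": -2.5, "terrible": -3.0, "disaster": -3.0,
}

def compound_score(text, lexicon=TOY_LEXICON, alpha=15):
    """Sum the valence of every known word, then squash the sum into (-1, 1)."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    total = sum(lexicon.get(w, 0.0) for w in words)
    # Normalization: large sums approach -1 or 1, and a sum of 0 stays at 0.
    return total / math.sqrt(total * total + alpha)

for statement in ["We have hope and we care.", "It was a terrible disaster!", "Hello there"]:
    print(f"{compound_score(statement):+.3f}  {statement}")
```

Words missing from the lexicon contribute nothing, which is why a statement made of neutral words lands at exactly 0.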

Now that we have this model, we can use it to score Harris and Trump's debate statements.

Figures: histogram of Harris's debate sentiment per statement; histogram of Trump's debate sentiment per statement.

From these histograms we see Harris had primarily positive statements. Trump's statements covered a wide range of emotion but included a large number of very negative ones. The average sentiment of Harris's statements was 0.35634. For Trump it was -0.16691. This means Harris leaned towards the positive, while Trump leaned negative on average but stayed fairly close to neutral in his word choice.
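Both the histogram and the average come straight from the list of per-statement scores. A minimal Python sketch, using made-up scores rather than the real debate results, and not the MATLAB code behind the charts:

```python
from statistics import mean

def histogram(scores, n_bins=4, lo=-1.0, hi=1.0):
    """Count scores in n_bins equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for s in scores:
        # Clamp the index so a score of exactly hi lands in the last bin.
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

scores = [0.8, 0.5, -0.2, 0.6, 0.0, -0.9]  # made-up per-statement scores
print("bin counts:", histogram(scores))
print("average:", round(mean(scores), 3))
```

An average near zero can hide a wide spread, which is why it helps to look at the histogram and the average together.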

What did these sentiments look like over time? Did the candidates get more or less positive throughout the debate?

Figures: line chart of Harris's debate sentiment over time; line chart of Trump's debate sentiment over time.

Well... It looks like they bounced back and forth throughout the night. This may not be too surprising considering the format of the debate. Each candidate takes turns responding to a question or to the other candidate's reply. We can assume that when making a counterargument they may focus more on negative aspects of their opponent's statement.

We do see one interesting thing looking at the sentiments over time. Trump ended with a series of negative statements while Harris left the night on an optimistic, positive note.
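When a series bounces around this much, one optional way to make trends like that closing stretch easier to spot is a rolling average. This is not something the charts above use. It is a hypothetical follow-up sketched in Python, and the window size is an arbitrary choice:

```python
def rolling_mean(values, window=3):
    """Average each value with up to window - 1 of the values before it."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up scores that flip sign every turn, like a back-and-forth exchange.
print(rolling_mean([1.0, -1.0, 1.0, -1.0], window=2))
```

A larger window smooths more but also blurs short runs, so a closing streak of a few statements would only show up with a small window.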

Sentiment analysis places text on a single scale like this one, from negative to positive. But we can take a look at a wider range of feelings using...

Emotional Analysis

Like sentiment analysis, emotional analysis assigns scores to text based on the words used. However, emotional analysis focuses on categorizing text into specific emotions, while sentiment analysis focuses on putting text into general positive, negative, or neutral categories.

For emotional analysis we will leverage another existing resource, the NRC Emotion Lexicon. It is one of the top emotional analysis lexicons and was developed by the National Research Council of Canada. It covers 8 different emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
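Conceptually, an emotion lexicon maps each word to the emotions it is associated with, and scoring a statement means tallying those associations. Here is a toy Python sketch of that idea. The word entries are hypothetical and not taken from the actual NRC lexicon, and turning the tallies into shares is one assumed way to get percentages, not necessarily how the charts in this post were built:

```python
from collections import Counter

EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")

# Hypothetical word-to-emotion entries; NOT copied from the NRC Emotion Lexicon.
TOY_EMOTION_LEXICON = {
    "believe": {"trust"},
    "hope": {"anticipation", "joy"},
    "care": {"trust", "joy"},
    "destroy": {"anger", "fear"},
}

def emotion_shares(words, lexicon=TOY_EMOTION_LEXICON):
    """Return each emotion's share of all emotion hits in a list of words."""
    counts = Counter()
    for w in words:
        counts.update(lexicon.get(w, ()))  # unknown words add nothing
    total = sum(counts.values())
    return {e: (counts[e] / total if total else 0.0) for e in EMOTIONS}

shares = emotion_shares(["we", "believe", "and", "hope"])
print({e: round(v, 2) for e, v in shares.items() if v > 0})
```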

Just like the sentiment analysis, we can score the transcript with this model and then visualize the results.

Figures: emotional breakdown of Harris's statements throughout the debate; emotional breakdown of Trump's statements throughout the debate.

Surprisingly, these graphs are very similar. The peaks and valleys line up pretty well between the two candidates, with one major difference: Trump has a higher percentage of anger in his statements.

Looking back at our word clouds, a lot of the words used feel fairly neutral to me. But among the smaller words in the background of the clouds, which were still heavily used but not quite as much, there are some key words on the Harris side like "believe", "understand", and "care" that stand out as aligning with the emotions we see.

And that brings us to our...

Closing Remarks

I had a lot of fun putting together these data visualizations and trying to uncover some secret. I am not sure I uncovered any secret truth here, but it's nice to see the results lined up with my hypothesis. Harris, in the debate, used more positive language and Trump had more angry and negative rhetoric. I like to think I am rational and focused on policies, but this does also line up with my personal sentiment when listening to them talk.

I enjoyed working on this and may continue this series of sentiment analysis blogs and look at the DNC/RNC speeches as well as the upcoming vice presidential debate. It could also be fun to try this same text with different models in case there is some bias. Stay tuned.

I am sure some of my own personal bias snuck into this piece. If you made it this far, let me know your thoughts! Did you take away anything different that I missed from these graphs? Do you have any ideas for other things I could look for in the debate transcript or other types of text to analyze? Let me know, I am always open to discussion!

Code

If you would like to explore or mess around with the experiment yourself, all of the graphs and analysis were done in MATLAB, and you can find the code in this GitHub repo.
