How to use Python and the YouTube API to scrape YouTube Comments

In this tutorial, we will walk through the process of using Python and the YouTube API to extract YouTube video comments using video titles as our search parameters.

Background & Environment Setup

I started this project out of an interest in what insights I could learn from following the comments over time on a YouTube mountain biking reality show I like called “Pinkbike Academy.” This tutorial focuses only on retrieving the data, but if you’d like to read up on the analysis of those YouTube comments, you can see the results here.

I decided to execute my analysis using an .ipynb in Google Colab, which you can read up on here. It’s free and easier to use than setting up your own local environment!
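If you work in Colab, the two libraries used below come preinstalled. If you run the notebook locally instead, you may need to install them first, for example:

pip install requests pandas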

The Code

Once set up in our environment, we import the necessary libraries: requests and pandas. The requests library is used to make HTTP requests to the YouTube API, and the pandas library is used to store the data in a dataframe. We also define an api_key variable containing the API key needed to access the YouTube API; it will be used to build the API endpoint URLs later on. You can find more info on the YouTube API and how to get your own API key here.

import requests
import pandas as pd

# Set the API endpoint URL
api_key = "XXXXXXXXXXXXXXXXXXXXXXXXXXX"

Next, a list of search queries, search_queries, is defined. These search queries will be used to search for the titles of the YouTube videos. In this case, we have included every episode of “Pinkbike Academy” from its first three seasons.

**Methodology note:** I felt comfortable using these search queries to identify my videos of interest, as there don’t seem to be any conflicting video titles that I could mistakenly pull up with a search query in this context. But if you were searching for, say, “Funny Video XYZ,” it’s hard to say whether the search query approach would surface your desired YouTube video. There are some alternative methodologies in the YouTube API documentation that you could consider instead.
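If you do want a guard against ambiguous results, one lightweight option is to request part=snippet from the search endpoint and compare the returned title against your query before accepting it. The helper below is just a sketch of that idea (find_video_id is a hypothetical name, and the substring check is only a rough sanity test):

# Hypothetical helper: return the top search result's video ID,
# warning if its title doesn't appear to match the query
def find_video_id(api_key, query):
  search_url = (
      f"https://www.googleapis.com/youtube/v3/search"
      f"?key={api_key}&part=snippet&type=video&q={query}"
  )
  data = requests.get(search_url).json()
  top_result = data["items"][0]
  title = top_result["snippet"]["title"]
  if query.lower() not in title.lower():
    print(f"Warning: top result '{title}' may not match '{query}'")
  return top_result["id"]["videoId"]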

# Initialize an empty dataframe with two columns: 'search_query' and 'video_id'
df_video_ids = pd.DataFrame(columns=['search_query', 'video_id'])

search_queries = ['Pinkbike Academy Season 1 Episode 1',
'Pinkbike Academy Season 1 Episode 2',
'Pinkbike Academy Season 1 Episode 3',
'Pinkbike Academy Season 1 Episode 4',
'Pinkbike Academy Season 1 Episode 5',
'Pinkbike Academy Season 1 Episode 6',
'Pinkbike Academy Season 1 Episode 7',
'Pinkbike Academy Season 1 Episode 8',
'Pinkbike Academy Season 1 Episode 9',
'Pinkbike Academy Season 1 Episode 10',
'Pinkbike Academy Season 2 Episode 1',
'Pinkbike Academy Season 2 Episode 2',
'Pinkbike Academy Season 2 Episode 3',
'Pinkbike Academy Season 2 Episode 4',
'Pinkbike Academy Season 2 Episode 5',
'Pinkbike Academy Season 2 Episode 6',
'Pinkbike Academy Season 2 Episode 7',
'Pinkbike Academy Season 2 Episode 8',
'Pinkbike Academy Season 2 Episode 9',
'Pinkbike Academy Season 2 Episode 10',
'Pinkbike Academy Season 3 Episode 1',
'Pinkbike Academy Season 3 Episode 2',
'Pinkbike Academy Season 3 Episode 3',
'Pinkbike Academy Season 3 Episode 4',
'Pinkbike Academy Season 3 Episode 5',
'Pinkbike Academy Season 3 Episode 6',
'Pinkbike Academy Season 3 Episode 7',
'Pinkbike Academy Season 3 Episode 8',
'Pinkbike Academy Season 3 Episode 9',
'Pinkbike Academy Season 3 Episode 10']

Next, the code below sets up a for loop that iterates over the list of search queries. For each search query, it makes an HTTP request to the YouTube API using the requests.get() function, with the search query included in the URL as a query parameter. The response from the API is stored in the response variable and converted to a JSON object using the response.json() method. The video ID of the top result is then extracted and appended to the dataframe.

# Set the API endpoint URL
url = f"https://www.googleapis.com/youtube/v3/search?key={api_key}&part=id&type=video"

# Iterate over the list of search query options
for search_query in search_queries:
  # Check if the dataframe already has information for the current search query
  if search_query in df_video_ids['search_query'].values:
    print(f"Information for search query '{search_query}' already exists in the dataframe.")
    continue

  # Set the search query in the URL
  search_url = f"{url}&q={search_query}"

  # Make a GET request to the API
  response = requests.get(search_url)

  # Convert the response to a JSON object
  data = response.json()

  # Extract the video ID from the response
  video_id = data["items"][0]["id"]["videoId"]

  # Add a new row to the dataframe with the search query and video ID
  # (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
  new_row = pd.DataFrame([{'search_query': search_query, 'video_id': video_id}])
  df_video_ids = pd.concat([df_video_ids, new_row], ignore_index=True)

# Save the updated dataframe to the CSV file
df_video_ids.to_csv('df_video_ids.csv', index=False)

video_ids = df_video_ids['video_id'].unique().tolist()

df_video_ids.head(3)

Now that we have the video IDs for the desired Pinkbike Academy videos, we are ready to start extracting their comments!

The code below retrieves comments for our list of video IDs. We create an empty dictionary called “comments” to store the comments for each video. Then, we iterate over the list of video IDs and make a GET request to the API for each video. The API response is converted to a JSON object, and the comments are extracted from the response and added to a list called “video_comments”. If there is a next page of comments, the code updates the URL with the next page token and makes another GET request; this process continues until all comments have been retrieved. At the end, the code prints the total number of comments for each video for us to review.

# Set the API endpoint URL
url = f"https://www.googleapis.com/youtube/v3/commentThreads?key={api_key}&part=snippet"


# Set the initial value for the next_page_token variable
next_page_token = ""

# Reuse the list of video IDs we collected above

# Initialize an empty dictionary to store the comments for each video
comments = {}

# Iterate over the list of video IDs
for video_id in video_ids:
  # Initialize an empty list to store the comments for the current video
  video_comments = []

  # Set the URL to retrieve comments for the current video
  video_url = f"{url}&videoId={video_id}"

  while True:
    # Make a GET request to the API
    response = requests.get(video_url)

    # Convert the response to a JSON object
    data = response.json()

    # Extract the comments from the response
    video_comments += data["items"]

    # Check if there is a next page of comments
    if "nextPageToken" in data:
      # Set the value of the next_page_token variable
      next_page_token = data["nextPageToken"]
      # Rebuild the URL from the base so page tokens don't accumulate
      video_url = f"{url}&videoId={video_id}&pageToken={next_page_token}"
    else:
      # If there are no more pages of comments, break out of the loop
      break

  # Add the list of comments for the current video to the dictionary
  comments[video_id] = video_comments

# Print the total number of comments for each video
for video_id, video_comments in comments.items():
  print(f"Total comments for video {video_id}: {len(video_comments)}")

Finally, this last section of code creates a dataframe from our “comments” dictionary. It initializes an empty list called “comment_data” and iterates over the comments for each video in the dictionary. For each comment, it extracts the relevant data, such as the video ID, author name, and text, and stores it in a dictionary, which it then appends to the “comment_data” list. It then creates a pandas dataframe from the “comment_data” list and prints the first few rows.

# Initialize an empty list to store the comment data
comment_data = []

# Iterate over the comments for each video
for video_id, video_comments in comments.items():
  # Iterate over the comments for the current video
  for comment in video_comments:
    # Add the comment data to the list
    comment_data.append({
        "video_id": video_id,
        "author_name": comment["snippet"]["topLevelComment"]["snippet"]["authorDisplayName"],
        "author_channel_id": comment["snippet"]["topLevelComment"]["snippet"]["authorChannelId"]["value"],
        "text": comment["snippet"]["topLevelComment"]["snippet"]["textOriginal"],
        "like_count": comment["snippet"]["topLevelComment"]["snippet"]["likeCount"]
    })

# Create a pandas dataframe from the comment data
df = pd.DataFrame(comment_data)

# Print the first few rows of the dataframe
print(df.head())
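
If you plan to run the analysis in a separate notebook, you might also save the comments to a CSV at this point, just as we did for the video IDs earlier (the filename df_comments.csv is only a suggestion):

# Optionally persist the comments so they don't need to be re-scraped later
df.to_csv('df_comments.csv', index=False)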

Conclusion

And that’s it! You now have your desired YouTube comments in a pandas dataframe, ready for sentiment analysis. One tip: remember to join the video_ids back to the original search queries so you can more easily identify which comment belongs to which YouTube video.
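For example, assuming you still have the df_video_ids dataframe from earlier in the notebook, a simple merge attaches the originating search query to every comment:

# Join each comment back to the search query that produced its video ID
df_labeled = df.merge(df_video_ids, on='video_id', how='left')

# Every comment row now also carries its originating search query
print(df_labeled[['search_query', 'video_id', 'text']].head())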

If you’d like to read up on the sentiment analysis results for Pinkbike Academy, or just find some inspiration for what you can do with your newly scraped comments, you can find that post here.


Thank you for reading! If you have any feedback or thoughts, I’d love to continue the conversation: add a comment below. Or, you can reach me directly at @JacksonBurton11 on Twitter or email me at [email protected].

If you’d like to stay up to date on any future Off Road Analyst posts, sign up below!