Downloading Data From Twitter Using the Streaming API

James Thorn
6 min read · Jun 9, 2019

If you have been living in a Faraday cage for the last 10 years and have never heard of it, Twitter is a very popular micro-blogging service where users post short messages called tweets, which generally express opinions about different topics.

Over the last decade Twitter has become a very popular social networking application, and a lot of interest has therefore grown around how to efficiently collect data from the platform.

In this post we will cover how to use the Streaming API to get Tweets that contain certain words or hashtags, and how to efficiently handle the objects returned by the API.

How to get data from Twitter

There are two main ways to download data from Twitter:

  • Using the REST API to get historical data, followers, friends, or the timeline of a specific user.
  • Using the Streaming API to download data that is being produced in real time.

When we use the Streaming API, we gather the data that is being produced at the moment we run the collection script. We can download tweets containing specific keywords or hashtags. For example, if we are searching for tweets with the hashtag #UCL (UEFA Champions League) and somebody posts the following message:

“Football Tonight everybody, happy Tuesday #UCL #BarcelonaVsLiverpool #GoReds”

Then the message will be added to our dataset. As we will see later, this API gives us not only the text of the message but also a lot of additional information, like the user who posted it, the timestamp, and much more.

How to use Twitter’s Streaming API

To use any of the APIs provided by Twitter, we first need to collect a set of Twitter API keys, which will be used to connect to the API. The steps are the following:

  1. Go to https://apps.twitter.com/ and log in with your Twitter credentials.
  2. Click “create an app” (you might first have to apply for a Twitter developer account).
  3. Fill in the form to create the application.
  4. Go to “Keys and Tokens” tab to collect your tokens.
  5. Create an Access token and access token secret.
Screen where the creation of the Access Token and Access Token Secret takes place.

I suggest you copy these to a safe location that can be easily accessed, as they will have to be used in any application that aims to collect data from any Twitter API.

Connecting to the API and downloading data

We will be using a Python library called Tweepy to connect to the Twitter API and download the data. First, we must install Tweepy, which can be done by following the instructions from this link:
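Tweepy is published on PyPI, so in most environments installation simply comes down to:

pip install tweepy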

Now that Tweepy is installed, we can write the script to download the data. Open up a text editor and copy the following code:
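A minimal sketch of such a script, assuming Tweepy 3.x and its classic StreamListener interface, might look like this:

import tweepy

# Twitter API credentials
consumer_key = "ENTER YOUR CONSUMER KEY"
consumer_secret = "ENTER YOUR CONSUMER SECRET"
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"

class StdOutListener(tweepy.StreamListener):
    """Listener that prints the raw JSON of every incoming tweet."""
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status_code):
        # Returning False disconnects the stream when an error arrives
        print(status_code)
        return False

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = tweepy.Stream(auth, StdOutListener())
stream.filter(track=["#UCL"])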

In the code above, where it says “ENTER YOUR ACCESS TOKEN” and so on, replace each placeholder string with your respective token.

To download tweets containing certain key words, change the last line to incorporate the words or hashtags you want to look for. Specifically, change the list passed to the track argument so that it contains those words or hashtags.

In the above example, we would be downloading tweets with hashtags concerning Brexit and the USA if we replaced [“#UCL”] with [“#Brexit”, ”#USA”]. For more sophisticated filters, follow the guidelines in Twitter’s developer documentation (linked at the end of this post).

These guidelines can be used to search for hot topics, or to narrow results to certain geographies, specific times, or URLs. Usernames can also be added to this list, apart from specific words, to search for active trends or for tweets mentioning a certain user, as shown in the sketch below.
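For instance, still assuming Tweepy 3.x and reusing the stream object from the script above, keyword, user, and geography filters can be combined in a single call. The user ID and the bounding box below are illustrative placeholders:

# Track keywords, follow a user by ID, and restrict to a bounding box
# (longitude/latitude pairs: south-west corner, then north-east corner)
stream.filter(
    track=["#Brexit", "#USA"],
    follow=["123456"],                      # hypothetical user ID
    locations=[-0.51, 51.28, 0.33, 51.69],  # rough bounding box around London
)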

By running this code, we get a JSON object as output for each tweet, which looks like the following text:

{"created_at":"Fri May 10 04:26:55 +0000 2019","id":1126705151463444480,"id_str":"1126705151463444480","text":"Question Time last night the most biased BBC political programme I\u2019ve ever seen. Farage, in his 34th appearance, si\u2026 https:\/\/t.co\/6RReO5nO5C","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":true…

This is not the full JSON object, for the sake of space. All the fields, along with their descriptions, can be found in Twitter’s Tweet object documentation (see the developer page linked at the end of this post).

Some of the most important fields are:

  • text, which contains the text included in the tweet.
  • created_at, which is a timestamp of when the tweet was created.
  • user, which contains information about the user that created the tweet, like the username and user id.
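For example, parsing one line of the stream output with Python’s standard json module exposes these fields directly. The string below is a shortened stand-in for a real line, and example_user is a made-up username:

import json

# Stand-in for one line read from the stream output
raw_line = '{"created_at": "Fri May 10 04:26:55 +0000 2019", "text": "Question Time last night...", "user": {"screen_name": "example_user"}}'

tweet = json.loads(raw_line)
print(tweet["created_at"])           # timestamp of the tweet
print(tweet["text"])                 # text of the tweet
print(tweet["user"]["screen_name"])  # username of the author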

Save the document as Twitter_Downloader.py and run the following command to save the output of the script in a .txt file called twitter_data.txt.

python Twitter_Downloader.py > twitter_data.txt

There will be no output on the CLI, but the .txt file should be created in the same folder as the Python script.

If we run this, data will keep being downloaded until we manually stop the script. To avoid this, we can add some variables to our code so that only a certain number of tweets are downloaded. Modify the code under the Twitter API credentials in the following manner:
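Again, this is one possible sketch under the same Tweepy 3.x assumption; the counter lives on the listener and closes the stream once the limit is reached:

# Key words to search for and number of tweets to download
tracklist = ["#UCL"]
n_tweets = 10

class StdOutListener(tweepy.StreamListener):
    """Listener that stops after printing n_tweets tweets."""
    def __init__(self):
        super().__init__()
        self.counter = 0

    def on_data(self, data):
        print(data)
        self.counter += 1
        # Returning False closes the stream once we have enough tweets
        return self.counter < n_tweets

    def on_error(self, status_code):
        print(status_code)
        return False

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = tweepy.Stream(auth, StdOutListener())
stream.filter(track=tracklist)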

In the code above, modify the value of the variable n_tweets to the number of tweets you want to download; in this case we would be downloading 10 tweets. We have also made a new variable called tracklist to keep better track of the key words that we are searching for.

Let’s take a closer look at the data that we can download. After running the script for a while and saving its output to a .txt file, open up a Jupyter Notebook and write the following code:

import json
import pandas as pd

tweets_data_path = "twitter_data.txt"

# Read the file line by line; each line contains one tweet as a JSON object
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip empty or malformed lines
            continue

# Build a DataFrame with the fields we are interested in
tweets = pd.DataFrame()
tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
tweets['Username'] = list(map(lambda tweet: tweet['user']['screen_name'], tweets_data))
tweets['Timestamp'] = list(map(lambda tweet: tweet['created_at'], tweets_data))
tweets.head()

After doing this, the resulting DataFrame should look like:

As you can see, all the text fields have the structure “RT @user: text text text”. This is because of the way the Streaming API outputs tweets that have been retweeted by a user and were therefore not originally created by that user. The different formats of the tweets returned by the Streaming API will be discussed in another post.
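If you want to separate retweets from original tweets in the meantime, one common approach, assuming the standard payload format and continuing from the notebook code above, is to check for the retweeted_status field:

# Retweet payloads carry the original tweet under "retweeted_status"
def is_retweet(tweet):
    return "retweeted_status" in tweet

originals = [t for t in tweets_data if not is_retweet(t)]
retweets = [t for t in tweets_data if is_retweet(t)]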

All the documentation on how to use Tweepy and Twitter’s Streaming API can be found in the links at the end of this post.

Conclusion

We have looked at how to collect data that is being produced in real time using the Streaming API, and how this data can be efficiently displayed and processed. It can then be used for many purposes: from trend or fake news detection using complex Machine Learning algorithms, to Sentiment Analysis for inferring how positively a certain brand is perceived, graph building, information diffusion models, and much more.

For further research, or for clarification of the information found here, refer to the links left throughout this guide or to:

  • Twitter Developers page: https://developer.twitter.com/en/docs
  • Tweepy’s GitHub page: https://github.com/tweepy/tweepy
  • Tweepy’s official page: https://www.tweepy.org/
  • Twitter’s advanced search: https://twitter.com/search-advanced

The Python script and the Jupyter Notebook used for this post can be found in:

In the next post we will look at how to use the REST API to collect historical data, like previous tweets or the followers of a certain user. Take care until then, and thank you for reading :)
