This week I found out how easy it is to spin up a Hadoop cluster on Amazon AWS. My first app was the simple word-count app from the AWS tutorial site. The biggest thing missing from the tutorial is an interesting data set (it uses a three-line text file). After the word-count Hadoop tutorial, the next most popular tutorial is calculating Twitter sentiment, and for that I needed to become familiar with Twitter's API, specifically the Streaming API. To get started I signed up at the Twitter Developer site. After logging in I hovered over my picture and chose My Applications from the menu. On the apps page I created a new app and requested access tokens. After a few minutes the access tokens populated and I was ready to write my Python script.
Twitter recently updated its API so that all connections require OAuth, so it took some digging to find an up-to-date example. The Python script I used can be found on GitHub at https://github.com/arngarden/TwitterStream/blob/master/TwitterStream.py and works without modification. It is a nice example because it handles the error cases and backs off with the retry delays specified in the Twitter API documentation. For my purposes I modified the code to write the tweets to MongoDB rather than just print them to the terminal. Since the tweets arrive as JSON, I wrote them to MongoDB without modification; later, when I retrieve them, they will still be JSON encoded and ready to use.
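The modification is small: instead of printing each line from the stream, parse it and insert it. A minimal sketch of that handler is below; the function name store_tweet and the collection argument are my own, not from the linked script, and the blank-line check reflects the keep-alive newlines the Streaming API sends between tweets.

```python
import json

def store_tweet(raw_line, collection):
    """Parse one line from the Streaming API and store it unmodified.

    Each tweet arrives as a JSON string, so json.loads() yields a dict
    that PyMongo can insert directly; the stored document is still
    valid JSON when read back later.
    """
    raw_line = raw_line.strip()
    if not raw_line:  # the stream sends blank keep-alive lines
        return None
    tweet = json.loads(raw_line)
    collection.insert(tweet)
    return tweet
```

Because the tweet is stored as-is, every field Twitter provides (including the 'entities' block used below) is available to later scripts without any re-parsing.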
My first test looks for tweets that mention #CLOUD, Big Data, or Analytics and finds out which articles are linked to most often. The Twitter Streaming API lets me filter on those keywords, so I only receive tweets in my categories. Next I need to check whether each tweet contains a link and store the link for the final map and reduce step. The code for checking tweets for URLs was interesting to me because this was my first time writing queries for MongoDB, so I am pasting it here for anyone else who is new to Mongo.
# Python script for extracting links from tweets
from pymongo import MongoClient

mongo = MongoClient('localhost', 27017)
mongo_db = mongo['data']
mongo_coll = mongo_db['data']  # Collection holding the raw tweets
url_db = mongo['url']          # Database for the extracted URLs
url_db.url.remove()            # Clear the collection every time this script runs

cursor = mongo_coll.find()
for record in cursor:  # For every tweet in the database
    msgurl = record['entities']['urls']  # URL entities parsed out by Twitter
    for recordurl in msgurl:  # Some tweets have multiple URLs
        if recordurl is not None:
            url1 = recordurl['expanded_url']
            print(url1)  # Print to the terminal so it looks like the script is working
            url_db.url.insert({'name': [url1]})  # Store in MongoDB for further processing

The final step was to count the number of times each individual URL appeared in the database; I'd like to see what types of articles are linked to the most. For the Map and Reduce functions I followed the PyMongo tutorial and only changed the map function to look at my URL field instead of the example tag field. The output to the terminal was a list of all the URLs with the number of times they appeared.
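For anyone unfamiliar with what those Map and Reduce functions are doing, here is the same logic as a plain-Python sketch that runs without a mongod: the map step emits a (url, 1) pair for each URL in a document, and the reduce step sums the values for each key. The document shape {'name': [url]} matches the script above; the function names are mine, and in the real run the map and reduce are JavaScript functions passed to PyMongo.

```python
from collections import defaultdict

def map_urls(doc):
    # Emit one (url, 1) pair per URL stored in the document
    for url in doc['name']:
        yield (url, 1)

def reduce_counts(key, values):
    # Sum the emitted counts for a single URL
    return sum(values)

def count_urls(docs):
    # Group the emitted pairs by key, then reduce each group
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_urls(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}
```

Running count_urls over the url collection's documents gives the same URL-to-count mapping that the map_reduce call prints to the terminal, sorted however you like.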