This week I found out how easy it is to spin up a Hadoop cluster on Amazon AWS. My first app was the simple word-count app from the AWS tutorial site. The biggest thing missing from the tutorial is an interesting data set (it uses a three-line text file). After the word-count Hadoop tutorial, the next most popular tutorial is calculating Twitter sentiment, and for that I needed to become familiar with Twitter's API, specifically the Streaming API. To get started I signed up at the Twitter Developer site. After logging in I hovered over my picture and chose My Applications from the menu. On the apps page I created a new app and requested access tokens. After a few minutes the access tokens populated and I was ready to write my Python script.
Twitter recently updated its API so that all connections require OAuth, so it took some digging to find an up-to-date example. The Python script I used can be found on GitHub at https://github.com/arngarden/TwitterStream/blob/master/TwitterStream.py and works without modification. It is a nice example because it handles the error cases and backs off with the retry delays specified in the Twitter API documentation. For my purposes I modified the code to write the tweets to MongoDB rather than just print them to the terminal. Since the tweets arrive as JSON, I wrote them to MongoDB without modification; later, when I retrieve them, they will still be JSON encoded and ready to use.
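The modification is small: instead of printing each line from the stream, parse it and insert it. A minimal sketch of that handler is below; the function name store_tweet and the collection argument are my own, not from the linked script, and the blank-line check reflects the keep-alive newlines the Streaming API sends between tweets.

```python
import json

def store_tweet(raw_line, collection):
    """Parse one line from the Streaming API and store it unmodified.

    Each tweet arrives as a JSON string, so json.loads() yields a dict
    that PyMongo can insert directly; the stored document is still
    valid JSON when read back later.
    """
    raw_line = raw_line.strip()
    if not raw_line:  # the stream sends blank keep-alive lines
        return None
    tweet = json.loads(raw_line)
    collection.insert(tweet)
    return tweet
```

Because the tweet is stored as-is, every field Twitter provides (including the 'entities' block used below) is available to later scripts without any re-parsing.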
My first test looks for tweets that mention #CLOUD, Big Data, or Analytics and finds out which articles are linked to most often. The Twitter Streaming API lets me filter on those keywords, so I only receive tweets in my categories. Next I need to check whether each tweet contains a link and store the link for the final map and reduce step. The code for checking tweets for URLs was interesting to me because this was my first time writing queries for MongoDB, so I am pasting it here for anyone else who is new to Mongo.
# Python script for extracting links from tweets
from pymongo import MongoClient

mongo = MongoClient('localhost', 27017)
mongo_db = mongo['data']
mongo_coll = mongo_db['data']  # Collection holding the raw tweets
url_db = mongo['url']          # Database for the extracted URLs
url_db.url.remove()            # Clear the collection every time this script runs

cursor = mongo_coll.find()
for record in cursor:  # For every tweet in the database
    msgurl = record['entities']['urls']  # URL entities parsed out by Twitter
    for recordurl in msgurl:  # Some tweets have multiple URLs
        if recordurl is not None:
            url1 = recordurl['expanded_url']
            print(url1)  # Print to the terminal so it looks like the script is working
            url_db.url.insert({'name': [url1]})  # Store in MongoDB for further processing

The final step was to count the number of times each individual URL appeared in the database; I'd like to see what types of articles are linked to the most. For the Map and Reduce functions I followed the PyMongo tutorial and only changed the map function to look at my URL field instead of the example tag field. The output to the terminal was a list of all the URLs with the number of times they appeared.
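For anyone unfamiliar with what those Map and Reduce functions are doing, here is the same logic as a plain-Python sketch that runs without a mongod: the map step emits a (url, 1) pair for each URL in a document, and the reduce step sums the values for each key. The document shape {'name': [url]} matches the script above; the function names are mine, and in the real run the map and reduce are JavaScript functions passed to PyMongo.

```python
from collections import defaultdict

def map_urls(doc):
    # Emit one (url, 1) pair per URL stored in the document
    for url in doc['name']:
        yield (url, 1)

def reduce_counts(key, values):
    # Sum the emitted counts for a single URL
    return sum(values)

def count_urls(docs):
    # Group the emitted pairs by key, then reduce each group
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_urls(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}
```

Running count_urls over the url collection's documents gives the same URL-to-count mapping that the map_reduce call prints to the terminal, sorted however you like.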