At the University of Maryland, Baltimore County's annual hackathon, a team of five came together to build a single-page web app named Calzone. Calzone's purpose was to predict the number of upvotes a Reddit post would receive based solely on its title.

This article kicks off a 12-part series describing the data collection, feature extraction, and regression analysis (prediction) used to build Calzone.

PART 1: Data Collection

We chose to collect a year's worth of posts from the r/politics subreddit, using the PRAW package to access Reddit's API. To submit an API request, you must first request OAuth credentials from Reddit. See here for instructions on how to do so.

Now that we have our credentials, let's get going. The full code is below. We begin by initializing a Reddit instance, then select a subreddit and retrieve its submissions for the past year. Selecting posts over a given time frame requires timestamps, so the get_timestamps function converts a beginning and end date (format 01/01/2001) into Unix timestamps. The submissions variable is a generator that yields the post data. We loop through the submissions and grab the id, title, number of upvotes, url, and creation date of each post. Finally, the collected data is placed in a pandas DataFrame and written to a CSV file. The follow-up article will cover data cleanup and exploratory analysis. See you around!
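As a quick illustration of the timestamp conversion the script relies on, here is the core idea in isolation (a minimal sketch using only the standard library; the dates shown are just examples):

```python
import calendar
from datetime import datetime

# Convert 01/01/2017 into a Unix timestamp (seconds since epoch, UTC)
dt = datetime(2017, 1, 1)
ts = calendar.timegm(dt.timetuple())
print(ts)  # 1483228800
```

calendar.timegm treats the time tuple as UTC, which is what Reddit's created_utc timestamps use; the naive time.mktime would instead interpret it in your local time zone.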

from datetime import datetime
import calendar
import praw
import pandas as pd


# Specifying date ranges in PRAW requires timestamps
def get_timestamps(time1,time2):
    '''Convert a date range into Unix timestamps

    time1 (str): start date formatted as 01/01/2001
    time2 (str): end date formatted as 01/01/2001

    returns (int, int): start and end timestamps
    '''
    month1, day1, year1 = time1.split('/')
    month2, day2, year2 = time2.split('/')
    dt1 = datetime(int(year1), int(month1), int(day1))
    dt2 = datetime(int(year2), int(month2), int(day2))
    t1 = calendar.timegm(dt1.timetuple())
    t2 = calendar.timegm(dt2.timetuple())
    return t1,t2


# Step 1: Initialize reddit connection
reddit = praw.Reddit(client_id='CLIENT_ID', client_secret="CLIENT_SECRET",
                     password='PASSWORD', user_agent='USERAGENT',
                     username='USERNAME')
# Step 2: define subreddit to capture
subreddit = reddit.subreddit('politics')


# Step 3: Select time frame and acquire posts from specified subreddit
t1,t2 = get_timestamps('01/01/2017', '12/31/2017')
submissions = subreddit.submissions(t1,t2)

# Step 4: Extract pertinent information from posts
data = [[post.id, post.subreddit_name_prefixed,
         post.title, post.ups, post.url, post.created_utc] for post in submissions]

# Step 5: Place data in pandas dataframe
df = pd.DataFrame(data,
                  columns=['id', 'subreddit', 'title', 'ups', 'url', 'created_utc'])
# Step 6: Save to CSV file
df.to_csv('data_blog.csv', index=False)
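Once the CSV is written, it's worth a quick sanity check that the raw created_utc values round-trip into readable dates. A minimal sketch (using a hypothetical sample row here rather than the full file; in practice you would pd.read_csv('data_blog.csv') first):

```python
import pandas as pd

# Sample row standing in for data_blog.csv (values are illustrative)
df = pd.DataFrame({'title': ['Example post'],
                   'ups': [42],
                   'created_utc': [1483228800]})

# created_utc is seconds since the Unix epoch, so unit='s' converts it
df['created'] = pd.to_datetime(df['created_utc'], unit='s', utc=True)
print(df[['title', 'ups', 'created']])
```

The created column then carries proper datetimes, which makes the time-based exploratory analysis in the next article much easier.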
