At the University of Maryland, Baltimore County's annual hackathon, five of us came together to build a single-page web app named Calzone. Calzone's purpose was to predict the number of upvotes a Reddit post would receive based solely on its title.
This blog post kicks off a 12-part series that will describe the data collection, feature extraction, and regression analysis (prediction) used to build Calzone.
Now that we have our credentials, let's get going. The full code is below. We begin by initializing a Reddit instance, then select a subreddit and retrieve its submissions for the past year. Selecting posts over a given time frame requires timestamps, so the 'get_timestamps' function converts a start date and an end date (format 01/01/2001) into Unix timestamps. The 'submissions' variable is a generator object that yields all the posts. We loop through the submissions and obtain the id, subreddit, title, number of upvotes, url, and creation date for each post. Finally, the collected data is placed in a pandas DataFrame and written to a CSV file. The follow-up article will cover data cleanup and exploratory analysis. See you around!
```python
from datetime import datetime
import calendar

import praw
import pandas as pd


# Specifying date ranges in PRAW requires Unix timestamps
def get_timestamps(time1, time2):
    '''
    Convert a date range into Unix timestamps

    time1 (str): start date formatted as 01/01/2001
    time2 (str): end date formatted as 01/01/2001
    returns (int, int): start and end timestamps
    '''
    month1, day1, year1 = time1.split('/')
    month2, day2, year2 = time2.split('/')
    dt1 = datetime(int(year1), int(month1), int(day1))
    dt2 = datetime(int(year2), int(month2), int(day2))
    t1 = calendar.timegm(dt1.timetuple())
    t2 = calendar.timegm(dt2.timetuple())
    return t1, t2


# Step 1: Initialize the reddit connection
reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='CLIENT_SECRET',
                     password='PASSWORD',
                     user_agent='USERAGENT',
                     username='USERNAME')

# Step 2: Define the subreddit to capture
subreddit = reddit.subreddit('politics')

# Step 3: Select a time frame and acquire posts from the specified subreddit
t1, t2 = get_timestamps('01/01/2017', '12/31/2017')
submissions = subreddit.submissions(t1, t2)

# Step 4: Extract pertinent information from each post
data = [[post.id, post.subreddit_name_prefixed, post.title,
         post.ups, post.url, post.created_utc]
        for post in submissions]

# Step 5: Place the data in a pandas DataFrame
df = pd.DataFrame(data, columns=['id', 'subreddit', 'title',
                                 'ups', 'url', 'created_utc'])

# Step 6: Save to a CSV file
df.to_csv('data_blog.csv', index=False)
```
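If you want to sanity-check the date conversion without touching the Reddit API, here is a minimal standalone sketch of the same timegm-based logic. The 'to_timestamp' helper is hypothetical (it isn't part of the app); it just shows that an MM/DD/YYYY string becomes a UTC midnight Unix timestamp.

```python
import calendar
from datetime import datetime


def to_timestamp(date_str):
    """Convert an MM/DD/YYYY string to a Unix timestamp at UTC midnight."""
    month, day, year = (int(part) for part in date_str.split('/'))
    # timegm interprets the time tuple as UTC, unlike time.mktime
    return calendar.timegm(datetime(year, month, day).timetuple())


start = to_timestamp('01/01/2017')  # 1483228800, i.e. 2017-01-01 00:00:00 UTC
end = to_timestamp('12/31/2017')    # 1514678400, i.e. 2017-12-31 00:00:00 UTC
```

Using calendar.timegm rather than time.mktime keeps the timestamps independent of the machine's local time zone, which matters if you rerun the collection from a different box.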