What's in a name?

A tutorial on my machine-learning workflow for predicting whether or not this post will be popular!

The purpose of this notebook is to describe my efforts to predict whether or not a post to the /r/datascience subreddit will be a success. I define success as receiving more than the average number of votes. What's unique about my methodology is that the prediction is based solely on the title of the redditor's post, hence the blog title: What's in a name?

District Data Lab Research Lab Proposal 2

Proposal #1: Horse racing dataset.

The dataset is derived from tips that tipsters provided to bettors.
Tipsters are people who give bettors their best guess (a "tip") on how to place a bet on a horse race.
This dataset could be used to predict the outcomes of horse races.


Interesting Dataset


Classification tutorial with an unbalanced data set

District Data Lab Research Lab Proposal 1

Russian Trolls: Proposal #2

This dataset, released by NBC on Feb. 14, contains 200K tweets from Russian trolls:


Interesting Dataset; Beginner Tutorial


  1. Create a beginner tutorial on topic modeling
  2. Tutorial on predicting the number of retweets (caveat: very hard to predict)


  1. Proposal #1: Use FreqDist to show the most frequent words; use t-SNE to visualize clustering

Term Frequency - Inverse Document Frequency

Explanation of Term frequency - inverse document frequency.

For feature extraction we used scikit-learn's TfidfVectorizer. It is a count vectorizer combined with inverse document frequency (idf) weighting. The count vectorizer measures term frequency (tf), i.e., how often a word appears in a title. If we do this for the following sentences, we produce the matrix below.
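To make the tf and idf steps concrete, here is a minimal pure-Python sketch. The three titles are made up for illustration and are not the sentences from the original post. The idf formula and the L2 normalization follow scikit-learn's documented defaults for TfidfVectorizer (smoothed idf), but the whitespace tokenizer is a simplification, so treat this as an approximation rather than a drop-in replacement.

```python
import math
from collections import Counter

# Hypothetical example titles (assumed for illustration only).
titles = [
    "python for data science",
    "data science jobs",
    "learning python",
]

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Vocabulary: every distinct word, in alphabetical order.
vocab = sorted({word for title in titles for word in tokenize(title)})

# Term-frequency matrix: one row per title, one column per vocabulary word.
tf = []
for title in titles:
    counts = Counter(tokenize(title))
    tf.append([counts[word] for word in vocab])

# Document frequency: in how many titles does each word appear?
n_docs = len(titles)
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1. Rare words get a larger weight.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    # Scale the row to unit length so long and short titles are comparable.
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row

# tf-idf matrix: term counts weighted by idf, then normalized per title.
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in tf]

print(vocab)
for row in tf:
    print(row)
```

In this toy corpus, "data" appears in two of the three titles while "jobs" appears in only one, so "jobs" ends up with the larger idf weight.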

Calzone Feature Analysis

Feature Analysis

This notebook is for feature analysis and addresses the following:

  1. Broad overview of the data we are working with
  2. Evaluation of importance of each feature

Regarding feature analysis:

We needed a way to determine which features to include in our analysis. Below you will find three graphs depicting the same information in different ways. What we concluded from this analysis was that some features are highly correlated and thus provide no additional information. Because of this, we have chosen to use only the following features: wordcount, polarity, subjectivity, and noun-phrases.
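One common way to spot redundant features is to compute the Pearson correlation between every pair of feature columns and flag pairs above a threshold. The sketch below illustrates that idea and is not necessarily how the original analysis was done. The numbers, the "charcount" feature, and the 0.9 threshold are all made up for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical feature values for five titles (assumed, not real data).
features = {
    "wordcount": [5, 8, 12, 3, 9],
    "charcount": [27, 41, 66, 15, 50],  # hypothetical feature, tracks wordcount closely
    "polarity": [0.1, -0.3, 0.5, 0.0, 0.2],
}

def correlated_pairs(features, threshold=0.9):
    # Return (name_a, name_b, r) for every pair with |r| at or above the threshold.
    pairs = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

for a, b, r in correlated_pairs(features):
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

When two features are flagged like this, keeping only one of them loses little information, which is the reasoning behind trimming the feature list.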

Can Machine Learning predict how many upvotes your post will receive? Part 1.

At the University of Maryland, Baltimore County's annual hackathon, five guys came together to produce a single-page web app. The app was named Calzone. Calzone's purpose was to predict the number of upvotes a Reddit post would receive based solely on the title.

This blog article is the first in a 12-part series that will describe the data collection, feature extraction, and regression analysis (prediction) used to build Calzone.

UpWork Challenge - What server nodes are closest to our customer?

As a way to keep my skills fresh, I look for small jobs on Upwork that require Python. Here is one of the job requests I came across yesterday.

A tech company, Company X, with servers all over the world wanted to determine which two servers within a regional zone are closest to their web application users. Company X wants us to report back the names of the cities that the servers are in and their IP addresses. Company X provided a dictionary object (see partial dict below, the full dict is provided at

Explanation of Elif Statement

I often browse the learnpython subreddit for questions from Python beginners to answer. It helps me stay sharp and also lets me contribute to the community. Today, a post asked for an explanation of elif statements as if the reader were a 5-year-old. I found a great explanation on Tutorialspoint. I also expanded ...
