A tutorial on my machine-learning workflow for predicting whether or not this post will be popular!
This notebook describes my efforts to predict whether or not a post to the /r/datascience subreddit will be a success. I define success as receiving more than the average number of votes. What's unique about my methodology is that the prediction is based solely on the title of the redditor's post, hence the blog title: What's in a name?
Proposal #1: Horse racing dataset
The dataset is derived from tips that tipsters provided to bettors.
Tipsters are people who give bettors their best guess (aka a tip) on how to place a bet on a horse race.
This dataset could be used to predict horse races.
Classification tutorial with an unbalanced data set
Russian Trolls: Proposal #2
This dataset released by NBC on Feb. 14 contains 200K tweets from Russian Trolls:
Interesting Dataset; Beginner Tutorial
- Create a beginner tutorial on topic modeling
- Tutorial on predicting number of retweets (Caveat: very hard to predict)
- Proposal #1: Use FreqDist to show most frequent words; use t-SNE to visualize clustering
For feature extraction we used scikit-learn's tf-idf vectorizer, which is a count vectorizer combined with inverse document frequency (idf) weighting. The count vectorizer measures term frequency (tf), i.e. how often a word appears in a title. If we do this for the following sentences, we produce the matrix below.
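To make the tf and idf pieces concrete, here is a minimal pure-Python sketch of the idea. It is not the scikit-learn implementation: the three titles are invented for illustration, the tokenizer is a plain whitespace split, and the idf formula is the smoothed form ln((1 + n) / (1 + df)) + 1, which I am assuming matches the vectorizer's default. The per-row normalization that the vectorizer also applies is left out.

```python
import math
from collections import Counter

# Hypothetical post titles, invented for illustration only.
titles = [
    "python for data science",
    "data science career advice",
    "python tips",
]

# Tokenize by lowercasing and splitting on whitespace (a simplification).
docs = [title.lower().split() for title in titles]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc})

# Term-frequency matrix: one row per title, one column per vocabulary word.
tf_matrix = []
for doc in docs:
    counts = Counter(doc)
    tf_matrix.append([counts[word] for word in vocab])

# Document frequency: in how many titles does each word appear?
n_docs = len(docs)
df = {word: sum(word in doc for doc in docs) for word in vocab}

# Smoothed idf (assumed formula): rarer words get a larger weight.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

# tf-idf matrix: each count scaled by the idf of its word.
tfidf_matrix = [
    [tf * idf[word] for tf, word in zip(row, vocab)]
    for row in tf_matrix
]

for title, row in zip(titles, tfidf_matrix):
    print(title, [round(value, 2) for value in row])
```

A word such as "tips", which appears in only one title, ends up with a higher weight than "data", which appears in two.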
This notebook is an example of how to tune hyperparameters for a scikit-learn machine learning model.
This notebook is for feature analysis and addresses the following:
- Broad overview of the data we are working with
- Evaluation of importance of each feature
Regarding feature analysis:
We needed a way to determine which features to include in our analysis. Below you will find 3 graphs depicting the same information in different ways. What we concluded from this analysis was that some features are highly correlated and thus provide no additional information. Because of this, we have chosen to use only the following features: wordcount, polarity, subjectivity, and noun-phrases.
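The idea of dropping highly correlated features can be sketched with a plain Pearson correlation check. This is only an illustration, not the analysis from the notebook: the numbers are made-up toy values, `charcount` is a hypothetical feature I added to stand in for a redundant one, and the 0.9 cutoff is an arbitrary assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy feature values for five imaginary titles (not real data).
features = {
    "wordcount": [4, 7, 10, 5, 12],
    "charcount": [22, 40, 55, 30, 66],      # hypothetical redundant feature
    "polarity": [0.1, -0.3, 0.5, 0.0, 0.2],
}

def redundant_pairs(features, threshold=0.9):
    """Return feature pairs whose absolute correlation meets the threshold."""
    names = list(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(features[first], features[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs

print(redundant_pairs(features))
```

With these toy values only the wordcount/charcount pair is flagged, so keeping one of the two would lose almost no information.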
At the University of Maryland, Baltimore County's annual hackathon, five guys came together to produce a single-page web app. The app was named Calzone. Calzone's purpose was to predict the number of upvotes a Reddit post would receive based solely on the title.
This blog article is a 12-part series that will describe the data collection, feature extraction, and regression analysis (prediction) used to build Calzone.
As a way to keep my skills fresh, I look for small jobs on Upwork that require Python. Here is one of the job requests I came across yesterday.
A tech company, Company X, with servers all over the world wanted to determine which two servers within a regional zone are closest to their web application users. Company X wants us to report back the names of the cities that the servers are in and their IP addresses. Company X provided a dictionary object (see partial dict below; the full dict is provided at places.py).
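One way to approach a request like this is to rank servers by great-circle distance from the user. The sketch below is built on assumptions: "closest" is taken to mean geographic distance, and the `servers` dict is a hypothetical stand-in with invented cities, coordinates, and documentation-range IP addresses. It does not reflect the structure of the real dict in places.py.

```python
import math

# Hypothetical stand-in for the client's dict; layout and values are invented.
servers = {
    "Frankfurt": {"ip": "192.0.2.10", "lat": 50.11, "lon": 8.68},
    "London": {"ip": "192.0.2.20", "lat": 51.51, "lon": -0.13},
    "Madrid": {"ip": "192.0.2.30", "lat": 40.42, "lon": -3.70},
    "Stockholm": {"ip": "192.0.2.40", "lat": 59.33, "lon": 18.07},
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def two_closest(servers, user_lat, user_lon):
    """Return (city, ip) for the two servers nearest to the user."""
    ranked = sorted(
        servers.items(),
        key=lambda item: haversine_km(
            user_lat, user_lon, item[1]["lat"], item[1]["lon"]
        ),
    )
    return [(city, info["ip"]) for city, info in ranked[:2]]

# Example: a user located roughly in Paris.
print(two_closest(servers, 48.86, 2.35))
```

If the client actually cares about network latency rather than geography, the same ranking function would work with measured round-trip times as the sort key.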
I often troll the learnpython subreddit for questions from Python beginners to answer. It helps me stay sharp and also lets me contribute to the community. Today, a post asked to explain elif statements like the person was a 5-year-old. I found a great explanation on Tutorialspoint. I also expanded ...
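For readers who have not met elif before, here is a small example of my own (not the one from the linked explanation). Python checks each condition from top to bottom and runs only the first branch that is true; the function name and temperature cutoffs are made up for illustration.

```python
def describe_temperature(degrees_c):
    """Pick one label; only the first matching branch runs."""
    if degrees_c < 0:
        return "freezing"
    elif degrees_c < 15:   # checked only if the first test was false
        return "cold"
    elif degrees_c < 25:   # checked only if both tests above were false
        return "mild"
    else:                  # runs when nothing above matched
        return "hot"

for reading in (-5, 10, 20, 30):
    print(reading, describe_temperature(reading))
```

Think of it like sorting toys into boxes: you try the first box, and only if the toy does not fit do you move on to the next one.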