Spring 2017

Kaggle Subteam

Making predictive models for competitions on Kaggle

Kaggle is a platform where companies and researchers post their data. Statisticians and data miners from all over the world compete to produce the best models. This semester, we're competing in two competitions.

Using feature engineering and statistical techniques, each team member implements various ML algorithms on the dataset, including logistic regression, word2vec, and random forests. Team members then improve their models using optimization techniques. Finally, we use ensemble methods to combine all of the models into a single 'super-algorithm' that we submit to Kaggle.
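As a rough illustration of the ensembling step, the sketch below blends predicted probabilities from two models by (optionally weighted) averaging, often called soft voting. The function name `soft_vote` and the probability values are hypothetical, and this is only one of many ways to build an ensemble, not necessarily the one the team uses.

```python
def soft_vote(model_probs, weights=None):
    """Blend per-sample positive-class probabilities from several models.

    model_probs: list of lists, one inner list of probabilities per model.
    weights: optional per-model weights; defaults to a plain average.
    """
    if weights is None:
        weights = [1.0] * len(model_probs)
    total = sum(weights)
    n_samples = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_samples)
    ]


# Hypothetical outputs of two already-trained models on three samples.
logreg_probs = [0.9, 0.4, 0.2]   # e.g. from a logistic regression
forest_probs = [0.7, 0.3, 0.1]   # e.g. from a random forest

blended = soft_vote([logreg_probs, forest_probs])
# Threshold the blended probabilities to get class labels.
predictions = [1 if p >= 0.5 else 0 for p in blended]
```

Passing `weights` lets a stronger model count for more in the blend, for example weights chosen from validation scores.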

Algorithmic Trading

Finding opportunities to capitalize using data-driven approaches

Using financial data, we hope to formulate accurate predictions for various market sectors. We plan to start from well-known algorithmic trading approaches and then supplement them with machine learning techniques. We intend to use clustering algorithms, among other unsupervised learning algorithms, to find equities that we wish to trade. Additionally, we aim to use supervised learning and natural language processing techniques to create accurate predictions of future prices.
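To make the clustering idea concrete, here is a minimal k-means sketch that groups equities by two summary features. The tickers, the feature values (mean daily return, volatility), and the choice of k-means itself are all assumptions made for illustration; the project description does not say which algorithm or features are used.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Deterministic start for the sketch: the first k points are the
    # initial centroids (real code would usually randomize this).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Hypothetical features per equity: (mean daily return, volatility).
# In practice the features would be standardized before clustering.
features = {
    "AAA": (0.0010, 0.010),
    "BBB": (0.0012, 0.011),
    "CCC": (0.0040, 0.035),
    "DDD": (0.0038, 0.033),
}
tickers = list(features)
labels, centroids = kmeans([features[t] for t in tickers], k=2)
clusters = dict(zip(tickers, labels))
```

Equities that land in the same cluster behave similarly on the chosen features, which gives a starting point for picking candidates to trade.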

Yelp Dataset Challenge

Producing a research paper on bias factors affecting star-ratings in Yelp reviews

The data we're working with includes detailed information on users, businesses, and reviews, along with data collected through Yelp features like check-ins and tips. Perhaps the biggest challenge for the team is the sheer size and complexity of the data, which comprises 15 gigabytes of JSON text and requires a heavy emphasis on data engineering and parallel computing. Our team consists of students with strong machine learning and statistics backgrounds, and we are excited to explore areas like recommendation systems, time-series analysis, natural language processing, and sentiment analysis.
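One common way to handle JSON data too large to load at once is to stream it record by record. The sketch below assumes the reviews are stored as one JSON object per line and that each record has `business_id` and `stars` fields; both the layout and the field names are assumptions here, and the function name is hypothetical.

```python
import json
from collections import defaultdict


def mean_stars_by_business(lines):
    """Stream review records line by line and average stars per business.

    Only running totals are kept in memory, never the full dataset.
    """
    totals = defaultdict(lambda: [0.0, 0])  # business_id -> [star sum, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        entry = totals[review["business_id"]]
        entry[0] += review["stars"]
        entry[1] += 1
    return {biz: s / n for biz, (s, n) in totals.items()}


# Tiny in-memory stand-in for a review file (assumed record layout).
sample = [
    '{"business_id": "b1", "stars": 4}',
    '{"business_id": "b2", "stars": 2}',
    '{"business_id": "b1", "stars": 5}',
]
averages = mean_stars_by_business(sample)
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly. The same running-totals approach also splits naturally across workers: each worker processes one chunk of the file, and the partial sums and counts are merged at the end.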

Past Projects

Formula One Team Performance

Optimizing hypothetical bet returns on Formula One races using driver data

Reddit Recommendation Engine

Engine to generate a set of recommended posts for users of the social website Reddit given prior voting history, previous posts, and comments

EEG Signal Classification with SVMs

Using data science tools to describe and classify each stage in the sleep cycle. We analyze the brain waves produced during each of these stages. Using electroencephalography (EEG), we capture these signals and pass them through our machine learning algorithms for classification.

Cayuga County Health Department Data Science Project

Quantifying the change in pollution and phosphorus loading in the Lake Owasco Watershed since 2011. This analysis will help steer water management practices going forward.

Misinformation Response in Social Networks

Supporting ongoing research on social media interactions (particularly on Reddit) while giving members an opportunity to experiment with data and to practice scraping, cleaning, and processing techniques using the Reddit API.