CS 229 Machine Learning
Project Suggestions, Autumn 2012

Below is a list of suggested projects. If you see a project that interests you, contact that person directly to see if there's a chance to work together. Please do not blanket email everyone listed here.

Project Title. Contact Name (Contact Email Address).
Project Description

Prerequisites: Prerequisites

Speech Audio Denoising. Andrew Maas (
Use neural networks to reduce environmental noise and channel distortion in audio signals.


Predicting Movie and TV Preferences from Facebook Profiles. Andrew Maas (
Given the vast and ever-growing amount of content available nowadays, it is difficult to sort through the mess and find media that is highly interesting to you. It is postulated that by inspecting a Facebook user's data it is possible to provide content recommendations that suit that user's fancy. Use the Facebook Graph API to scrape a large amount of Facebook data from users with public profiles. Download the IMDb database and use it as your document space. Use "liked" movies/TV programs only as correct answers to measure recommendation quality (omit them from the data used to create the recommendations). Record the approaches you take, your initial thoughts and the rationale behind them, the measured recommendation quality for each implementation, and a postmortem overview. We have starter scripts prepared to scrape data from FB and IMDb for this task.
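
The held-out evaluation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the profile field names ("movies", "television", "music"), the example titles, and the choice of precision and recall at k as quality measures are hypothetical and are not part of the starter scripts.

```python
def split_profile(profile, target_keys=("movies", "television")):
    """Separate the liked titles (answers) from the rest of the profile.

    The key names are hypothetical; real Graph API fields may differ.
    """
    features = {k: v for k, v in profile.items() if k not in target_keys}
    held_out = set()
    for key in target_keys:
        held_out.update(profile.get(key, []))
    return features, held_out


def precision_at_k(recommended, held_out, k):
    """Fraction of the top-k recommendations that the user actually liked."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title in held_out)
    return hits / len(top_k)


def recall_at_k(recommended, held_out, k):
    """Fraction of the user's liked titles that appear in the top k."""
    if not held_out:
        return 0.0
    hits = sum(1 for title in recommended[:k] if title in held_out)
    return hits / len(held_out)


# Hypothetical profile and a hypothetical ranked list from some recommender.
profile = {
    "music": ["Radiohead"],
    "movies": ["Inception"],
    "television": ["Breaking Bad"],
}
features, held_out = split_profile(profile)
ranked = ["Inception", "Titanic", "Breaking Bad", "Avatar"]

p_at_2 = precision_at_k(ranked, held_out, 2)
r_at_3 = recall_at_k(ranked, held_out, 3)
print(p_at_2, r_at_3)
```

Only `features` would be passed to the recommender, so the liked titles never leak into the data used to build the recommendations.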


Speech Recognition. Andrew Maas (
Get a complete speech recognizer working on an established dataset, and build improvements into the recognizer. The Kaldi open source recognizer is quickly improving, and it marks the first time high-quality speech recognition is becoming available as an open source project. I will work with a team to get Kaldi running and propose improvements to implement.

Prerequisites: Good Linux hacking skills

Separating Speech from Noise Challenge. Andrew Maas (
See the CHiME challenge website below. Data is provided, and existing approaches and results are already documented.

Prerequisites: Previous experience with audio data a plus

Predicting trends using Google insight data. Hardik (
I have worked on a project predicting tourism patterns using Google insight data, and I am interested in exploring the predictive power of this data in various fields. A basic study has already shown its predictive power for automotive sales. I am willing to help a student group that takes this up with getting the data and with initial setup ideas, and to help find areas where this can be a useful tool.

Prerequisites: Basic knowledge of statistical tools, and the ability to use MATLAB or C++ to implement machine learning algorithms on large data sets.

Predicting transcription factor binding. Sofia Kyriazopoulou (
Predicting the genome-wide binding of transcription factors is very important for modeling gene regulation. We are looking for students who will help develop a method that integrates different data sources to accurately and efficiently predict the binding of thousands of transcription factors in hundreds of cell types.

Prerequisites: Some background in genetics (e.g. CS273a, CS274, CS262, or equivalent) is required.

Clustering genomic positions based on ChIP-seq signal. Sofia Kyriazopoulou (
We recently developed an algorithm for clustering genomic positions based on the magnitude and shape of the signal of chromatin marks around them ( The discovered classes of genomic regions have distinct functional properties. We would like to add new features to this method (e.g. alternative clustering techniques, simultaneous clustering on multiple signals) in order to make it applicable to a broader range of biological questions.

Prerequisites: Some background in biology will definitely help but is not a requirement.

Associating enhancers to the genes they regulate. Sofia Kyriazopoulou (
Enhancers are regions of DNA where proteins can bind in order to regulate the transcription of target genes. Enhancers can be far away from the genes they regulate, and associating enhancers with their target genes remains a hard problem. By leveraging large ChIP-seq datasets, however, we can identify patterns of interactions between enhancers and genes. We are looking for students who will experiment with alternative feature sets in order to identify the features that are most predictive of true enhancer-gene associations.

Prerequisites: Some background in genetics (e.g. CS273a, CS274, CS262, or equivalent) is required.

Number Induction with Deep Learning. Will Zou (
In this project, we attempt to model integer induction with deep learning models and neural networks. In recent machine learning literature, deep learning has been widely successful in solving a range of problems in pattern recognition. However, one of the fundamental criticisms of it is its lack of success in representing higher-level cognitive functions such as semantic compositionality, induction, or reasoning. This is a joint project with Stanford Psychology to explore the possibility of finding artificial neurons that are 'excited' about induction, or addition with integers. I have written starter MATLAB code to learn invariant number sense in deep neural networks. There are concretely defined next steps for the research project; at the same time, there are lots of options to explore, which are likely to spin off into multiple topics (induction, addition, symbolic representations, working with developmental psychology). So if you are interested in executing a successful research project, or would like to explore interesting ideas centered on learning mathematical representations with deep learning, please send me an email and let's arrange a time to talk.

Prerequisites: Linear Algebra, MATLAB, [great to have] number theory

Predicting the future with deep learning. Will Zou (
There are massive amounts of data in the form of time series, generated by electric power plants, financial markets, Yelp restaurant ratings, data centers, and cloud computing platforms. Supervised training with discriminative learning algorithms such as deep neural networks, combined with good optimization techniques, is known to succeed in applications where there is a lot of data and the training objective is well defined. This project first explores how deep learning performs against successful time-series modeling algorithms, such as Gaussian processes, and then asks which techniques could lead to successful applications of deep learning for predicting the future in time series.

Prerequisites: MATLAB, Linear Algebra, [desired not required] prior experience with time-series modeling
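
One common way to pose time-series prediction as supervised learning is to slide a fixed-length window over the series and predict the next value. The sketch below illustrates that framing together with a naive persistence baseline (predict the last observed value), which any learned model would need to beat. It is an assumption-laden illustration in Python rather than MATLAB, and the function names and toy series are hypothetical.

```python
def make_windows(series, window):
    """Turn a 1-D series into (inputs, target) pairs.

    Each input is `window` consecutive values; the target is the next value.
    """
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs


def persistence_forecast(inputs):
    """Naive baseline: predict that the next value equals the last one seen."""
    return inputs[-1]


def mean_absolute_error(pairs, predict):
    """Average absolute difference between predictions and targets."""
    errors = [abs(predict(x) - y) for x, y in pairs]
    return sum(errors) / len(errors)


# Toy series, made up for illustration only.
series = [1.0, 2.0, 3.0, 5.0, 8.0]
pairs = make_windows(series, window=2)
baseline_mae = mean_absolute_error(pairs, persistence_forecast)
print(len(pairs), baseline_mae)
```

Any candidate model, whether a neural network or a Gaussian process, can be passed in place of `persistence_forecast` as long as it maps a window of inputs to a single predicted value.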

Kaggle ML Competitions. CS229 Staff (N/A).
Find an open competition on Kaggle and compete:


Crowdsourcing huge amount of data with Amazon Mechanical Turk. Tao Wang (
Obtaining a large amount of high-quality labeled data is a challenging yet critical component of many machine learning applications. Manual labeling tends to be tedious, repetitive, and costly in terms of time and money. Crowdsourcing infrastructures such as Amazon Mechanical Turk (AMT) provide an inexpensive way of outsourcing the labeling process to the general public (AMT workers). However, due to worker variation and spammers, data obtained using AMT tends to be noisy. We aim to build a crowdsourcing annotation tool using the Amazon Mechanical Turk APIs for image data (bounding boxes and potentially pixel-level annotations). The graphical user interface (GUI) needs to be designed cleverly to ensure the quality of the data. Since the amount of data is huge, we will explore automatic quality control using machine learning techniques (e.g. spammer detection). This project can go well beyond the scope of CS229 and continue after the class.

Prerequisites: Must have web programming experience. Familiarity with AMT APIs or the field of human-computer interaction (HCI) is a plus.
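
As a very rough illustration of automatic quality control, one simple baseline is to take a majority vote per item and flag workers who rarely agree with it. The sketch below assumes categorical labels rather than bounding boxes, and the worker IDs, item IDs, and the 0.5 threshold are all hypothetical. It does not use the AMT APIs, and it is not the method planned for the project.

```python
from collections import Counter, defaultdict


def majority_labels(annotations):
    """Pick the most common label per item.

    `annotations` is a list of (worker_id, item_id, label) tuples.
    Ties are broken in favor of the label seen first.
    """
    votes = defaultdict(Counter)
    for _worker, item, label in annotations:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}


def worker_agreement(annotations, consensus):
    """Fraction of each worker's labels that match the consensus label."""
    agree = Counter()
    total = Counter()
    for worker, item, label in annotations:
        total[worker] += 1
        if label == consensus[item]:
            agree[worker] += 1
    return {worker: agree[worker] / total[worker] for worker in total}


# Made-up annotations for two images from three workers.
annotations = [
    ("w1", "img1", "cat"), ("w2", "img1", "cat"), ("w3", "img1", "dog"),
    ("w1", "img2", "dog"), ("w2", "img2", "dog"), ("w3", "img2", "cat"),
]
consensus = majority_labels(annotations)
rates = worker_agreement(annotations, consensus)

# Flag workers whose agreement falls below a hypothetical threshold.
suspects = [worker for worker, rate in rates.items() if rate < 0.5]
print(consensus, suspects)
```

Majority voting fails when spammers outnumber careful workers on an item, which is one reason learned models of worker reliability may be worth exploring.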

Machine Learning demos in Javascript. Andrej Karpathy (
It is fun to write Machine Learning algorithms entirely in Javascript. There is a Neural Networks implementation (brainjs), and over the last few weeks I've written Random Forests (forestjs) and SVMs (svmjs). Since everything runs entirely in the browser, fun demos that showcase these algorithms can be made easily. For example, see . The goal of the project would be to create fun, interactive, and hopefully educational Machine Learning demos in Javascript, similar to the Neural Networks demo I linked above, but better! An additional benefit of this approach is that all demos are exceptionally easy to share and access through a simple URL, and could even be linked to from future Machine Learning courses to build intuition for various algorithms. Alternatively, it would also be useful to create Machine Learning or Convex Optimization libraries in Javascript for others to use, as there are quite a few applications for Machine Learning that runs in the browser. For example, the brainjs Neural Networks library ( has 1300 stars and 100 forks on GitHub.

Prerequisites: very basic HTML and Javascript

Intelligent Technology for Innovation in Urban Disasters. Cat Chang & Julian Jordan (,
We are eagerly researching technological solutions to improve disaster preparedness and disaster response/recovery scenarios. We believe that technological solutions provide a ripe space in which to improve city and citizen disaster preparedness and post-disaster resiliency of physical infrastructure. The model and the project (done in collaboration) will need to be well-designed and flexible in the hopes of understanding subjects like: What are the economic consequences? How do communications systems respond after a disaster? What are the limits of existing systems to facilitate evacuation, emergency response services, distribution of crucial information, supplies, and food? We think that there is a very engaging and dynamic machine-learning opportunity to better understand, in short, how a disaster-affected area best returns to normal.


Game Design - Creating Prototypes and Architecture for New Pervasive Game. Julian Jordan (
What if you could play a game that knew enough about you to weave itself seamlessly into the fabric of your life when you wanted it to? Maybe this makes you think of a benign version of the movie "The Game" OR a project by I was part of a design team last spring that developed an idea and tested an initial prototype of a product that is a game-play platform representing a different way of thinking about games, whether single-player or multi-player. Our project was called PORTER, and it got a strong reception when presented last year. It has the potential to be a gateway to a new universe of games that utilize users’ personal profiles and online activity to enhance their real-world lives with content from their digital/online experiences. Now, we are eager to test and prototype it in a more integrated way. Our hope is to select a group of people to test game play on --> with their permission, use relevant APIs to scrape their online data including, but not limited to, social networks (e.g. public Facebook, Twitter feeds, foursquare, the town one lives in) --> map relevant data onto the adventure/challenge game narrative template we are creating to build an immersive story --> keep track of players' progress through the game narrative, which requires building in a way for the program to incorporate feedback from players (since each player may take a unique path through the game we've designed, sort of like "Choose Your Own Adventure"). Please be in touch if you are interested in talking about this in any regard.