CS 229
|
Ideas |
Neural Language Identification
David Jurgens (jurgens@stanford)
This project will try to build a large-scale language identification system for many languages using a deep learning architecture. Ideally the system would use Theano or some open source library for learning and figure out how to effectively combine the language data in a way that is scalable for learning. Experience with Python/Numpy/Theano or Java/DeepLearning4J desirable but not critical. Some natural language processing experience is also a bonus.
How bad is that for me? Calorie estimation from images
David Jurgens (jurgens@stanford)
This project aims to predict the calorie content of a meal from an image showing one or more food objects. The machine learning system needs to recognize both what kind of food is in the image and how big it is (5 french fries or 500?) to create some rough estimate of how many calories are being shown. The team would need to combine already-existing labeled images with calorie fact sheets to create noisy training data. Experience with Python/Numpy/Theano or Java/DeepLearning4J desirable but not critical. Some computer vision experience is also a bonus.
Fast Tree-structured neural networks
Sam Bowman (sbowman@stanford)
In a tree-structured recursive neural network model, each example in a data set is processed using a neural network with a connection structure specific to that example, with the model weights, but not the connection graph, shared across examples. The lack of a stable graph makes common acceleration techniques like batch matrix multiplication and GPU training difficult to implement, leading to relatively poor adoption both in academia and industry despite many prominent state-of-the-art results. The goal of this project is to find ways to efficiently use batched matrix multiplication (and possibly GPUs) in training these models, with the goal of reaching training speeds comparable to typical fixed-graph models like LSTM RNNs. Some experience with deep neural networks would be ideal. Students should also be familiar with Python, Matlab, or Lua.
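One possible way to recover batching is sketched below, under assumptions that are not part of the project description: flatten every tree in a minibatch into one shared node list, record each node's height, and then compose all nodes of the same height with a single matrix multiplication. The composition function tanh([left; right] W + b), the nested-tuple tree encoding, and the names flatten and encode_batch are illustrative choices, and NumPy stands in here for a GPU library.

```python
import numpy as np

# Assumed encoding: a leaf is an int (word id); an internal node is a (left, right) tuple.

def flatten(tree, nodes):
    """Append the nodes of `tree` to `nodes` in post-order.

    Each entry is (kind, a, b, height). For a leaf, a is the word id.
    For an internal node, a and b are the indices of its children.
    Returns (index of the root of `tree`, its height).
    """
    if isinstance(tree, int):
        nodes.append(("leaf", tree, None, 0))
        return len(nodes) - 1, 0
    left_idx, left_h = flatten(tree[0], nodes)
    right_idx, right_h = flatten(tree[1], nodes)
    height = 1 + max(left_h, right_h)
    nodes.append(("node", left_idx, right_idx, height))
    return len(nodes) - 1, height

def encode_batch(trees, embeddings, W, b):
    """Return one root vector per tree, using one matmul per height level."""
    nodes = []
    roots = [flatten(t, nodes)[0] for t in trees]
    dim = embeddings.shape[1]
    states = np.zeros((len(nodes), dim))

    # All leaves across all trees are filled with a single gather.
    leaf_ids = [i for i, n in enumerate(nodes) if n[0] == "leaf"]
    states[leaf_ids] = embeddings[[nodes[i][1] for i in leaf_ids]]

    # Children always have a smaller height than their parent, so by the
    # time level h is processed every child state is already available.
    max_height = max(n[3] for n in nodes)
    for h in range(1, max_height + 1):
        idx = [i for i, n in enumerate(nodes) if n[0] == "node" and n[3] == h]
        left = states[[nodes[i][1] for i in idx]]
        right = states[[nodes[i][2] for i in idx]]
        states[idx] = np.tanh(np.concatenate([left, right], axis=1) @ W + b)
    return states[roots]
```

With this schedule the number of matrix multiplications per minibatch equals the height of the tallest tree rather than the total number of internal nodes. Whether that is enough to approach the speed of fixed-graph models is the open question the project poses.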
Bayesian network modeling for metadata prediction
Olivier Gevaert (ogevaert@stanford)
Many biomedical databases now contain thousands of studies with molecular data. These studies, however, are often not properly annotated with metadata (data about the data). In this project, the goal is to build a Bayesian network framework to model the relationships between metadata of biomedical studies.
Data fusion for predicting cancer survival
Olivier Gevaert (ogevaert@stanford)
Due to technological innovations, the biomedical community has seen an explosion of multi-modal data. Novel approaches are needed to integrate different data sets to solve classification problems. In this project, the goal is to implement and evaluate a data fusion framework based on common machine learning frameworks such as kernel methods, Bayesian models or regularized regression. These frameworks will be evaluated in the context of pan-cancer multi-omics data to predict the survival of cancer patients.
Unsupervised learning for analyzing brain tumors
Olivier Gevaert (ogevaert@stanford)
Multimodal imaging is nowadays common practice to diagnose diseases. Approaches are needed to combine multimodal imaging data to define spatial clusters with similar characteristics. In this project, we will investigate the use of unsupervised learning techniques to discover spatial regions of brain tumors with different properties.
Deconvolution of DNA methylation data
Olivier Gevaert (ogevaert@stanford)
Most genomic data is generated from mixtures of tissue, especially in cancer research, resulting in impure genomic data. Machine learning techniques have been developed that successfully deconvolve gene expression data. However, DNA methylation provides a cleaner signal of cell type, and in this project we want to investigate how to deconvolve DNA methylation data from cancer samples.
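To make the mixture idea concrete, here is a minimal sketch of the simplest possible case, assuming only two cell types with known reference methylation profiles. It models the measured sample as p * tumor_ref + (1 - p) * normal_ref and solves for p by least squares. The function name, the two-component setup, and the reference profiles are all hypothetical; a real project would need to handle many cell types and unknown references.

```python
def estimate_tumor_fraction(mixed, tumor_ref, normal_ref):
    """Least-squares estimate of p in mixed ~ p * tumor_ref + (1 - p) * normal_ref.

    All three arguments are equal-length sequences of methylation values
    (fractions between 0 and 1), one per probe.
    """
    # Minimizing sum((m - n) - p * (t - n))^2 over p gives the closed form below.
    num = sum((m - n) * (t - n) for m, t, n in zip(mixed, tumor_ref, normal_ref))
    den = sum((t - n) ** 2 for t, n in zip(tumor_ref, normal_ref))
    if den == 0:
        raise ValueError("reference profiles are identical; p is not identifiable")
    # A mixing proportion must lie in [0, 1]; clipping gives the constrained optimum
    # because the objective is a one-dimensional convex function of p.
    return min(1.0, max(0.0, num / den))


# Usage with made-up reference profiles and a sample that is 30% tumor.
tumor_ref = [0.9, 0.1, 0.8, 0.2]
normal_ref = [0.1, 0.7, 0.3, 0.6]
mixed = [0.3 * t + 0.7 * n for t, n in zip(tumor_ref, normal_ref)]
fraction = estimate_tumor_fraction(mixed, tumor_ref, normal_ref)
```

The estimate depends entirely on probes where the two references differ, which is one reason a cleaner cell-type signal matters for this kind of analysis.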
Deconvolution of methylome data in cancer patients
Olivier Gevaert (ogevaert@stanford)
The importance of epigenomics when studying cancer cannot be overstated. Changes on top of DNA have been shown to activate oncogenes or deactivate tumor suppressor genes. Large multi-cancer data sets are now available that profile the epigenome for over 10,000 cancer cases. Mathematical approaches are needed to deconvolve the mixture of cell types that make up each cancer sample. Epigenomic data provide a key resource for this, as the epigenome marks tissue type in healthy cells.
Advanced machine learning techniques for thyroid cancer diagnosis
Olivier Gevaert (ogevaert@stanford)
High-throughput gene expression data is now readily available for many diseases and has been used extensively to develop classifiers that help physicians diagnose and treat disease. Recently, a study showed the potential for diagnosing thyroid cancer using gene expression data. Using support vector machines, the investigators reported a new diagnostic test based on 167 genes. This diagnostic test, however, still has limited sensitivity and specificity. The goal of this project is to investigate novel machine learning approaches such as deep learning to develop a classifier that outperforms this test for thyroid cancer diagnosis.
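Since the comparison hinges on sensitivity and specificity, a short sketch of how the two are computed from binary labels may help. The function name and the label convention (1 = cancer, 0 = benign) are assumptions made for illustration; the labels below are made up.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancers correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancers missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign cases correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign cases flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels: 4 cancers (3 detected) and 4 benign cases (3 cleared).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A new classifier would need to improve at least one of these two numbers on held-out patients without giving up the other.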
Machine learning algorithms for the search for supersymmetric particles at the Large Hadron Collider
Ariel Schwartzman (sch@slac.stanford)
This project consists of applying machine learning techniques to extract new features from data produced by the ATLAS experiment (http://home.web.cern.ch/about/experiments/atlas) at the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, where the Higgs boson was recently discovered (http://home.web.cern.ch/about/updates/2015/09/atlas-and-cms-experiments-shed-light-higgs-properties). Data from high-energy proton-proton collisions collected by the ATLAS experiment can be interpreted as images and input to machine learning algorithms with the goal of interpreting these data in new ways. The goal of this project is to apply supervised and unsupervised machine learning methods to identify and discover events produced by new supersymmetric particles predicted in models of new physics. No physics knowledge is necessary.
New methods for Higgs boson identification at the Large Hadron Collider
Ariel Schwartzman (sch@slac.stanford)
This project consists of applying machine learning techniques to extract new features from data produced by the ATLAS experiment (http://home.web.cern.ch/about/experiments/atlas) at the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, where the Higgs boson was recently discovered (http://home.web.cern.ch/about/updates/2015/09/atlas-and-cms-experiments-shed-light-higgs-properties). Data from high-energy proton-proton collisions collected by the ATLAS experiment can be interpreted as images and input to machine learning algorithms with the goal of interpreting these data in new ways. The goal of this project is to find new ways to identify Higgs boson particles using state-of-the-art machine learning algorithms. In particular, we would like to focus on the rich features of data from the decay of the Higgs boson into pairs of b-quarks using multi-vertex signatures and other physics observables. No physics knowledge is necessary.
Higgs boson identification in the presence of noise
Ariel Schwartzman (sch@slac.stanford)
This project consists of applying machine learning techniques to extract new features from data produced by the ATLAS experiment (http://home.web.cern.ch/about/experiments/atlas) at the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, where the Higgs boson was recently discovered (http://home.web.cern.ch/about/updates/2015/09/atlas-and-cms-experiments-shed-light-higgs-properties). Data from high-energy proton-proton collisions collected by the ATLAS experiment can be interpreted as images and input to machine learning algorithms with the goal of interpreting these data in new ways. The goal of this project is to apply machine learning algorithms to reduce the influence of noise on the pattern recognition algorithms used to find Higgs bosons at the LHC. We would like to develop learning algorithms that can use local information to predict the amount of noise, particle by particle, with the goal of reducing the impact of noise fluctuations on the accuracy of Higgs boson property measurements. No physics knowledge is necessary.
Higgs boson identification at the Large Hadron Collider
Ariel Schwartzman (sch@slac.stanford)
This project consists of applying machine learning techniques to extract new features from data produced by the ATLAS experiment (http://home.web.cern.ch/about/experiments/atlas) at the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world, where the Higgs boson was recently discovered (http://home.web.cern.ch/about/updates/2015/09/atlas-and-cms-experiments-shed-light-higgs-properties). Data from high-energy proton-proton collisions collected by the ATLAS experiment can be interpreted as images and input to machine learning algorithms with the goal of interpreting these data in new ways. The goal of this project is to extend the concept of jet images (http://arxiv.org/abs/1407.5675) to use charged particle tracks and apply it to the identification of Higgs boson particles. Charged tracks provide very accurate position information that can be exploited with machine learning methods to enhance the precision of Higgs physics measurements.
Learning the electric and quantum charge of high energy particles at the Large Hadron Collider
Ariel Schwartzman (sch@slac.stanford)
This project consists of applying machine learning techniques to extract new features from data produced by the ATLAS detector (http://home.web.cern.ch/about/experiments/atlas) at the Large Hadron Collider (LHC), the largest and most powerful particle accelerator in the world which has recently discovered the Higgs boson
Price forecast and virtual bidding of electricity contracts in the wholesale market
Junjie Qin (jqin@stanford.edu)
Wholesale electricity prices are highly volatile. With increasing penetration of renewable generation, this volatility could be magnified by the intermittent nature of renewable energy sources. This project aims to develop efficient learning schemes that predict real-time electricity market prices from day-ahead information. Features such as historical day-ahead prices and load forecasts from PJM Interconnection will be used for the prediction. The predictions would then be used to develop portfolio management strategies for virtual bidding (see e.g. https://en.wikipedia.org/wiki/Virtual_bidding).
Predict food properties from a single image
Lamberto Ballan (lballan@cs.stanford.edu)
We are interested in building regressors that will predict some properties of food images, such as tastes, calories, etc. We would like to compare the performance of a framework based on CNN activations (deep learning) against a more conventional framework (i.e. SIFT + Bag-of-Words + Fisher Vectors + SVM). Prerequisites: MATLAB (required), Python (preferable).
Deep Learning for Medical Record Understanding
Kelvin Guu (kguu@stanford.edu)
We will develop new methods in deep learning to extract structured knowledge from natural language medical reports written by physicians. A key challenge of the project is to leverage large-scale unlabeled data to train models in a setting where in-domain medical training data is scarce. We will teach you best practices for rapid experimentation and introduce you to the latest libraries for developing neural networks. You will interact with other members of Percy Liang’s group.
Inferring ancestry from genomic sequence
Volodymyr Kuleshov (kuleshov@stanford.edu)
Inferring the ancestry of an individual is a key part of many scientific/medical studies. For example, one can find the mutations that cause a given disease or track the evolution of human populations. This project explores new machine learning models for inferring an individual's ancestry and using it to improve the statistical power to detect harmful mutations in the human genome. Tools from CS229 that will be used include SVMs, Naive Bayes, and feature selection techniques. There is also an opportunity to learn and implement more advanced structured prediction techniques such as HMMs or CRFs.
Unsupervised modeling of character archetypes in fiction
Ethan Fast (ethaen@stanford.edu)
Do heroes and villains tend to look or speak differently or take different kinds of actions? What about vampires, or romantic interests, or characters destined for a tragic death? Can we identify these character archetypes automatically? In this project, we are investigating how techniques from deep learning can help us identify and model character archetypes in a 2 billion word corpus of modern, amateur fiction. Ideally, your group should have experience in Python, Theano, and Keras and a working knowledge of natural language processing.
Predicting the reception of stories in a community of amateur writers
Ethan Fast (ethaen@stanford.edu)
What kinds of stories draw stronger readership or higher ratings? Is writing quality more important to readers, or does the content of stories (themes like violence, danger, or sex) have a stronger effect? In this project, we are investigating linguistic predictors of ratings and readership for more than 600,000 stories in an online community of amateur writers. Ideally, your group should have a strong grounding in natural language processing.
Unsupervised Learning of Writing Style
Pranav Rajpurkar (pranavsr@stanford.edu)
How unique is a writer’s form? What elements of their writing are most distinctive in style? In this project, we are investigating how well we can capture writing style on a dataset of 600,000 stories. Experience with deep learning is recommended but not required.
Predicting Short-Range Displacements From Sensor Data
Pranav Rajpurkar (pranavsr@stanford.edu)
Mobile devices can capture motion events from accelerometer and gyroscope data. In this project, we are investigating the use of machine learning to outperform simple physics-based approaches for predicting short-range (≤ 10 cm) displacements from sensor data.
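As a point of reference, one simple physics-based baseline is to integrate acceleration twice. The sketch below is an assumption about what such a baseline might look like, not the project's actual method: it treats a single axis, assumes gravity has already been removed, assumes the device starts at rest, and uses the trapezoidal rule with a fixed sampling interval. The function name is hypothetical.

```python
def displacement_from_acceleration(accel, dt):
    """Integrate 1-D acceleration samples (m/s^2) twice to get displacement (m).

    accel: sequence of acceleration samples taken every `dt` seconds.
    Assumes zero initial velocity and zero initial position.
    """
    velocity = 0.0
    position = 0.0
    for a_prev, a_next in zip(accel, accel[1:]):
        # Trapezoidal rule: acceleration -> velocity over one interval.
        new_velocity = velocity + 0.5 * (a_prev + a_next) * dt
        # Trapezoidal rule again: velocity -> position over the same interval.
        position += 0.5 * (velocity + new_velocity) * dt
        velocity = new_velocity
    return position


# Constant 2 m/s^2 for 0.1 s: the closed form 0.5 * a * t^2 gives 0.01 m (1 cm).
samples = [2.0] * 11
distance = displacement_from_acceleration(samples, dt=0.01)
```

Because any bias in the accelerometer is integrated twice, the error of this baseline grows roughly quadratically with time, which is the kind of weakness a learned model could try to correct.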
Comments to cs229-qa@cs.stanford.edu. |