Investigating the Breast Cancer Proteome on Kaggle
Earlier this month, I was at a PyData London meetup, and somebody asked me if I'd ever done anything on Kaggle. I said I'd had a look at it, but hadn't found any competitions that I cared enough about to enter. He told me about a sample of protein activity from breast cancer patients, and I thought that would be an interesting and potentially worthwhile thing to work on.
Previous investigations had involved clustering the patients, so I decided to cluster the proteins instead. Using hierarchical clustering, I assigned the proteins to 8 clusters. I then projected each patient's protein activity onto the space of these clusters, and attempted to use the resulting values to predict the patients' clinical data, mainly using logistic regression.
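To give a feel for the shape of that pipeline, here is a minimal sketch on synthetic data. It is not the code from the kernel: the random `expression` matrix, the binary `labels`, the average-linkage and correlation-distance settings, and the choice to "project" by averaging each patient's activity within a cluster are all assumptions made for illustration. If the real data has missing values, they would need handling before this step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): rows are patients,
# columns are proteins, and labels is a made-up binary clinical outcome.
n_patients, n_proteins, n_clusters = 80, 200, 8
expression = rng.normal(size=(n_patients, n_proteins))
labels = rng.integers(0, 2, size=n_patients)

# Cluster the proteins, not the patients: transpose so that each protein
# is one observation, described by its activity across all patients.
# Average linkage on correlation distance is an illustrative choice.
tree = linkage(expression.T, method="average", metric="correlation")

# Cut the tree into at most n_clusters flat clusters.
protein_cluster = fcluster(tree, t=n_clusters, criterion="maxclust")
cluster_ids = np.unique(protein_cluster)

# Membership matrix (proteins x clusters), with each column scaled to sum
# to 1, so multiplying by it averages activity within each cluster.
membership = (protein_cluster[:, None] == cluster_ids[None, :]).astype(float)
membership /= membership.sum(axis=0)

# Project each patient onto the cluster space: one feature per cluster.
features = expression @ membership

# Predict the clinical label from the cluster-level features.
model = LogisticRegression(max_iter=1000).fit(features, labels)
print(features.shape, model.coef_.shape)
```

Because the labels here are random, the fitted coefficients mean nothing; the point is only that a few hundred protein columns are reduced to a handful of cluster-level features before the regression is fitted.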
My results can be seen in this Kaggle Kernel. They are as good as I could have hoped for, in that they appear to contain information that might help to treat cancer. In particular, patients with activity in a particular cluster of proteins appear to have a much better chance of survival than other patients. Unfortunately, the analysis takes too long to run on Kaggle's server, so you'll have to download it and run it on your own machine.