Finding clusters of protein activity in a sample of data from breast cancer patients and predicting clinical data
At a PyData London meetup, somebody asked me if I'd ever done anything on Kaggle. I said I'd had a look at it, but hadn't found any competitions that I cared enough about to enter. He told me about a sample of protein activity from breast cancer patients, and I thought it would be an interesting and potentially worthwhile thing to work on.
Previous investigations had involved clustering the patients, so I decided to cluster the proteins instead. Using hierarchical clustering, I assigned the proteins to 8 clusters. Then I projected each patient's protein activity onto the space of these clusters, and attempted to use the projected values to predict the patients' clinical data, mainly using logistic regression.
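To make the steps above concrete, here is a minimal sketch of that kind of pipeline in Python. It is not the code from my kernel: the data is synthetic, and several choices are assumptions made purely for illustration, namely Ward linkage, taking the mean activity of each cluster's proteins as the "projection", and a made-up binary outcome standing in for a clinical variable. All variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): rows are patients,
# columns are proteins, and each protein follows one of 8 hidden factors.
n_patients, n_proteins, n_clusters = 80, 200, 8
factors = rng.normal(size=(n_patients, n_clusters))
membership = rng.integers(0, n_clusters, size=n_proteins)
activity = factors[:, membership] + 0.3 * rng.normal(size=(n_patients, n_proteins))

# Cluster the proteins rather than the patients, so each protein (column)
# becomes one observation: transpose before building the linkage tree.
tree = linkage(activity.T, method="ward")  # Ward linkage is an assumption

# Cut the tree into at most 8 flat clusters; labels start at 1.
labels = fcluster(tree, t=n_clusters, criterion="maxclust")

# "Project" each patient onto the clusters. Here that means the mean
# activity of the proteins in each cluster (an illustrative choice).
cluster_ids = np.unique(labels)
projected = np.column_stack(
    [activity[:, labels == c].mean(axis=1) for c in cluster_ids]
)

# Hypothetical binary clinical outcome, loosely driven by one hidden factor.
outcome = (factors[:, 0] + 0.5 * rng.normal(size=n_patients) > 0).astype(int)

# Predict the outcome from the per-cluster features.
model = LogisticRegression(max_iter=1000).fit(projected, outcome)
train_accuracy = model.score(projected, outcome)
print(projected.shape, round(train_accuracy, 2))
```

The appeal of this approach is that it reduces a very large number of protein measurements to a handful of features per patient, which keeps the logistic regression small and makes its coefficients easier to interpret cluster by cluster. The accuracy printed here is measured on the training data only, so it says nothing about how well such a model would generalise.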
My results can be seen in this Kaggle Kernel. They are as good as I could have hoped for, in that they appear to contain information that might help to treat cancer. In particular, patients with activity in one particular cluster of proteins appear to have a much better chance of survival than other patients. When I originally created the kernel, it took too long to run on Kaggle's servers, but it is now possible to run it on GPUs, so the results can be seen.