Journal Club Review: “Mitigating Data Scarcity in Protein Binding Prediction using Meta-Learning”

In our recent journal club, we discussed the preprint “Mitigating Data Scarcity in Protein Binding Prediction using Meta-Learning”. In it, Luo et al. apply meta-learning, a recent trend in machine learning, to predicting phosphorylation sites for a poorly characterized kinase family. Meta-learning algorithms aim to discover models that generalize to new tasks from only a few additional training examples. This aim fits biological problems well, where data for new classes is often scarce.

The authors split their dataset by kinase family, and each subset serves as a separate task for a recently introduced model-agnostic meta-learning algorithm. The model is trained in two phases. In the meta-learning phase, the authors train on the combined set of kinase families. In the few-shot learning phase, they specialize the model using only a few examples from a held-out kinase family that the model has not seen before.
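To make the two phases concrete, here is a minimal sketch of what such a training loop could look like. It is not the authors' code: it uses a first-order approximation of model-agnostic meta-learning, a toy logistic-regression model in place of a neural network, and synthetic tasks standing in for kinase families. All function names (`make_task`, `adapt`, `meta_train`) and hyperparameters are our own assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, X, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. w."""
    p = sigmoid(X @ w)
    eps = 1e-9
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def adapt(w, X, y, inner_lr=0.5, steps=1):
    """Few-shot phase: a few gradient steps on one task, starting from w."""
    for _ in range(steps):
        _, g = loss_and_grad(w, X, y)
        w = w - inner_lr * g  # builds a new array, the shared w is untouched
    return w

def make_task(rng, shared_w, n=40, noise=0.3):
    """Synthetic stand-in for one kinase family (assumption, not real data)."""
    w_task = shared_w + noise * rng.normal(size=shared_w.shape)
    X = rng.normal(size=(n, shared_w.size))
    y = (X @ w_task > 0).astype(float)
    return X, y

def meta_train(tasks, dim, meta_lr=0.1, epochs=200, k_shot=5, seed=0):
    """Meta-learning phase: learn an initialization that adapts well."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    for _ in range(epochs):
        meta_grad = np.zeros(dim)
        for X, y in tasks:
            idx = rng.permutation(len(y))
            support, query = idx[:k_shot], idx[k_shot:]
            w_fast = adapt(w, X[support], y[support])
            # First-order approximation: use the query gradient at the
            # adapted weights and ignore second-derivative terms.
            _, g = loss_and_grad(w_fast, X[query], y[query])
            meta_grad += g
        w -= meta_lr * meta_grad / len(tasks)
    return w

# Usage: meta-train on ten synthetic "families", adapt to a held-out one.
rng = np.random.default_rng(1)
shared = np.ones(8)
train_tasks = [make_task(rng, shared) for _ in range(10)]
w_meta = meta_train(train_tasks, dim=8)
X_new, y_new = make_task(rng, shared)
w_new = adapt(w_meta, X_new[:5], y_new[:5])  # only five labeled examples
```

The key design point is that the outer loop does not optimize performance of `w` itself, but performance after a few-shot update from `w`, which is what the held-out family will experience.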

Overall, we liked how the authors introduced meta-learning to biology. We think meta-learning has the potential to challenge conventional thinking about neural networks, which are traditionally believed to work only when an enormous amount of data is available. However, the approach still requires that the problem belong to a broader category of problems for which plenty of data does exist.

We also liked how rigorously the authors set up their controls. For example, they pick negative training samples from the same sequences as the positives, so positive and negative examples share the same background. While this risks labeling experimentally undetected phosphorylation sites as negatives, such false negatives make the predictor more conservative and reduce false positives.
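As an illustration of this kind of control, the sketch below draws negative sites from the same sequence that contains the annotated positives. The details are our assumptions rather than the preprint's exact procedure: we restrict candidates to S, T and Y residues, balance negatives against positives, and use a hypothetical flank of 7 residues on each side. The function names are ours as well.

```python
import random

# Assumption: only serine, threonine and tyrosine are candidate sites.
PHOSPHO_RESIDUES = set("STY")

def extract_window(seq, pos, flank=7, pad="-"):
    """Return the residues around position pos, padded at sequence ends."""
    padded = pad * flank + seq + pad * flank
    # pos in seq corresponds to pos + flank in padded, so this is centered.
    return padded[pos : pos + 2 * flank + 1]

def build_examples(seq, positive_sites, flank=7, seed=0):
    """Label annotated sites 1 and sampled unannotated sites 0."""
    rng = random.Random(seed)
    positives = sorted(set(positive_sites))
    positive_set = set(positives)
    # Negatives come from the same sequence, so both classes share the
    # same background. Some may be undetected true sites (false negatives).
    candidates = [
        i for i, aa in enumerate(seq)
        if aa in PHOSPHO_RESIDUES and i not in positive_set
    ]
    n_neg = min(len(positives), len(candidates))
    negatives = sorted(rng.sample(candidates, n_neg))
    examples = [(extract_window(seq, i, flank), 1) for i in positives]
    examples += [(extract_window(seq, i, flank), 0) for i in negatives]
    return examples

# Usage with a made-up sequence and one annotated site at index 2.
examples = build_examples("MASTKRSPLYAT", positive_sites=[2])
```

Sampling from the same proteins means the model cannot separate the classes by protein-level composition alone and has to rely on the local sequence context of each site.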

While the outlined methodology is directly applicable to predicting other types of protein-peptide binding, an outstanding question is how to apply meta-learning to other genomic problems. The data representation and architecture used for the problem presented in this preprint are relatively straightforward. Future work will need to explore how to represent longer sequences with complex distal interactions in order to arrive at a truly general meta-learning framework for genomic data.
