📖 Where is AlphaFold? 📖
Half of the 2024 Nobel Prize in Chemistry was awarded to John Jumper and Demis Hassabis for their contributions to the AlphaFold project, which produced powerful deep learning protein folding models [1,2,3]. Their models enable protein structure predictions so accurate that they have not only revolutionized computational biology but also attracted the attention of classical biologists, who now interact more with computational methods.
AlphaFold is not just a machine learning model; it is a whole pipeline (displayed in the following image). First, the given protein sequence is queried against huge sequence databases and template databases. These searches yield a multiple sequence alignment (MSA) and templates, respectively. It is understood that AlphaFold 2 relies much more heavily on the MSA input. The template input is often neglected, either by the user or by the model itself when the MSA is strong. We wanted to find out what the effect of templates is and whether we can use them for other tasks. To direct the focus on templates, we ran experiments with minimal MSA information (i.e., only the query sequence) and curated templates.
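To make the "minimal MSA" setting concrete: one common way to hand AlphaFold 2-style pipelines a single-sequence alignment is an A3M file that contains only the query itself. A minimal sketch (the file name and sequence below are illustrative, not from our experiments):

```python
# Sketch: build a "minimal MSA" input, i.e. an A3M alignment whose only
# entry is the query sequence. Name and sequence are illustrative.

def write_single_sequence_a3m(name: str, sequence: str, path: str) -> None:
    """Write an A3M file containing only the query sequence."""
    with open(path, "w") as handle:
        handle.write(f">{name}\n{sequence}\n")

# Hypothetical query; any valid amino-acid sequence works here.
write_single_sequence_a3m("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "query.a3m")
```

Pointing the pipeline at this file instead of the database search results removes essentially all co-evolutionary signal, leaving the templates as the dominant structural input.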
Why these synthetic tasks?
We ran two types of experiments to determine the capabilities of AlphaFold 2 when it relies on templates. The first type is side-chain packing: the task of placing (packing) side-chains given only the protein backbone. Side-chains are what distinguish residue types, while the backbone is identical for every residue type. Side-chain packing matters in pipelines built on residue-type-invariant algorithms such as ProteinMPNN [4], and for completing experimental structures in which only the more easily detectable backbone has been resolved with low error. This task is considered a local one, since only the close neighbourhood of a side-chain plays a role in its placement.
The second type of experiment is structure recovery. We wanted to find out how AlphaFold 2 reacts to a perturbed structure and whether the model can be used to recover the unperturbed structure. As far as we know, this task is less well studied in the literature, so we had to invent our own synthetic perturbations. We chose synthetic perturbations because they are easy to generate and, more importantly, give us control over the structure: experimental templates vary in quality, which introduces unwanted variability in the results, whereas synthetic perturbations keep the level of difficulty uniform across the dataset and let us fine-tune the challenge presented to AlphaFold. The chosen perturbations were applying independent and identically distributed Gaussian noise to each atom, keeping only one or two principal components of the coordinates, or running RFdiffusion [5] for a few iterations. This task is considered more global, since it requires considering the protein as a whole.
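The first two perturbations can be sketched in a few lines of numpy; the function names are our own, and the RFdiffusion perturbation is omitted since it requires the full model:

```python
import numpy as np

def gaussian_perturb(coords: np.ndarray, sigma: float = 1.0, seed: int = 0) -> np.ndarray:
    """Add i.i.d. Gaussian noise (standard deviation sigma, in Å) to each atom."""
    rng = np.random.default_rng(seed)
    return coords + rng.normal(scale=sigma, size=coords.shape)

def keep_principal_components(coords: np.ndarray, k: int) -> np.ndarray:
    """Project centred (n_atoms, 3) coordinates onto their first k principal components."""
    centre = coords.mean(axis=0)
    centred = coords - centre
    # SVD of the centred coordinate matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pcs = vt[:k]                        # (k, 3) principal axes
    projected = centred @ pcs.T @ pcs   # projection onto the k-dimensional subspace
    return projected + centre
```

With k = 1 every atom collapses onto a line and with k = 2 onto a plane, which is why recovery from these inputs is such a demanding global test.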
How much structure does AlphaFold know?
Before addressing the tasks, we first note that AlphaFold 2 correctly identifies the template when given the correct template and minimal multiple sequence alignment. For the first, local task of side-chain packing, AlphaFold 2 relies heavily on the Cβ position: without a good Cβ, the side-chain packing performance is abysmal. When a reasonable Cβ is provided, however, AlphaFold performs well, comparable to the much faster FASPR [6], though not as well as specialized machine learning models like AttnPacker [7]. Fortunately, a reasonable Cβ can be placed with a simple heuristic [8]. Interestingly, AttnPacker’s packed structures score slightly better than AlphaFold 2’s predictions even when the full, correct template is provided.
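For reference, one widely used heuristic of this kind reconstructs an idealised Cβ from the backbone N, CA and C atoms with fixed coefficients. The constants below are the commonly circulated values; take this as a sketch of the idea rather than the exact procedure of [8]:

```python
import numpy as np

def place_cbeta(n: np.ndarray, ca: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Place an idealised Cβ atom from backbone N, CA and C coordinates.

    Uses the commonly circulated fixed coefficients for ideal tetrahedral
    geometry at CA; a sketch, not necessarily the exact heuristic of [8].
    """
    b = ca - n                 # N → CA direction
    cc = c - ca                # CA → C direction
    a = np.cross(b, cc)        # normal to the N-CA-C plane
    return -0.58273431 * a + 0.56802827 * b - 0.54067466 * cc + ca
```

Applied to ideal backbone geometry, this places the Cβ at roughly the expected 1.53 Å from the CA, out of the N-CA-C plane.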
For the second task of structure recovery, AlphaFold 2 recovers well from Gaussian noise with a standard deviation of 1 Å and slightly improves RFdiffusion templates for larger structural perturbations. The most surprising results come from the principal component experiments: AlphaFold 2 already recovers respectably from only one principal component, and recovers the structure quite well from two. The results from two principal components are very similar to the predictions from the full multiple sequence alignment with no template. The transition from two principal components to the AlphaFold prediction can be seen in the GIF.
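Recovery experiments like these are typically scored by the RMSD between recovered and reference coordinates after optimal superposition. A minimal numpy sketch of the standard Kabsch algorithm (the function name is our own):

```python
import numpy as np

def kabsch_rmsd(p: np.ndarray, q: np.ndarray) -> float:
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    p0 = p - p.mean(axis=0)
    q0 = q - q.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the 3x3 covariance matrix.
    u, _, vt = np.linalg.svd(p0.T @ q0)
    d = np.sign(np.linalg.det(u @ vt))
    u[:, -1] *= d              # correct a possible reflection
    r = u @ vt                 # rotation aligning p0 onto q0
    return float(np.sqrt(((p0 @ r - q0) ** 2).sum() / len(p)))
```

Because the superposition removes global rotation and translation, the score only reflects genuine internal structural differences between the two models.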
With a custom build of OpenFold [9], the open-source replication of AlphaFold 2, we found that recycling the structure itself has a negligible effect on the prediction; the benefit of recycling comes from recycling the multiple sequence alignment representation. Attempts to override the recycled previous-prediction input with a multimer template were unfortunately also unsuccessful.
What do our results mean?
On the theoretical side, these results allow for more precise reasoning about the strengths and weaknesses of AlphaFold 2. Collectively, they support the hypothesis that AlphaFold 2 has learned an accurate biophysical energy function, although this function seems most effective for local interactions. This reasoning should inform the development of better tools and pipelines around AlphaFold 2. On the practical side, we are now using this pipeline to fill in missing residues in experimental structures and to standardise structures. We have also conducted successful experiments refining back-mapped coarse-grained models into atomistic models for molecular dynamics simulations [10].
References
[1] Senior, Andrew W., et al. "Improved protein structure prediction using potentials from deep learning." Nature 577.7792 (2020): 706-710.
[2] Jumper, John, et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589.
[3] Abramson, Josh, et al. "Accurate structure prediction of biomolecular interactions with AlphaFold 3." Nature (2024): 1-3.
[4] Dauparas, Justas, et al. "Robust deep learning–based protein sequence design using ProteinMPNN." Science 378.6615 (2022): 49-56.
[5] Watson, Joseph L., et al. "De novo design of protein structure and function with RFdiffusion." Nature 620.7976 (2023): 1089-1100.
[6] Huang, Xiaoqiang, Robin Pearce, and Yang Zhang. "FASPR: an open-source tool for fast and accurate protein side-chain packing." Bioinformatics 36.12 (2020): 3758-3765.
[7] McPartlon, Matthew, and Jinbo Xu. "An end-to-end deep learning method for protein side-chain packing and inverse folding." Proceedings of the National Academy of Sciences 120.23 (2023): e2216438120.
[8] Roney, James P., and Sergey Ovchinnikov. "State-of-the-art estimation of protein model accuracy using AlphaFold." Physical Review Letters 129.23 (2022): 238101.
[9] Ahdritz, Gustaf, et al. "OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization." Nature Methods (2024): 1-11.
[10] Wassenaar, Tsjerk. “Martini Tutorials - Reverse Coarse-Graining with Backward”. (2024), https://cgmartini.nl/docs/tutorials/Martini3/Backward/.
Contact
Clone the code or ask a question on GitHub. Otherwise, you can contact me via email: jannik.gut@unibe.ch.