Proteins, the natural molecules that carry out crucial cellular functions within the body, play a part in virtually every disease. Characterizing proteins can reveal the mechanisms underlying diseases and point to strategies for slowing or potentially reversing them. The ability to design proteins from scratch, meanwhile, holds the promise of entirely new categories of drugs and therapies.
However, designing proteins in the lab today is prohibitively expensive, in both computational resources and human effort. The process starts with conceiving a protein structure that could perform a specific function within the body, then finding a protein sequence, the order of amino acids that make up the protein, that is likely to “fold” into that structure. Proteins must adopt precise three-dimensional shapes to carry out their intended functions.
But it need not be this convoluted.
Recently, Microsoft unveiled a versatile framework known as EvoDiff, which the company asserts can produce “high-fidelity” and “diverse” proteins based solely on a given protein sequence. Unlike other protein-generation frameworks, EvoDiff eliminates the need for any structural information about the target protein, thereby streamlining what is typically the most labor-intensive phase of the process.
EvoDiff is openly accessible, and it could be used to create enzymes for new therapeutics and drug delivery methods, as well as new enzymes for industrial chemical reactions, according to Kevin Yang, a senior researcher at Microsoft and one of EvoDiff’s creators.
“We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm towards programmable, sequence-first design,” Yang explained in an email interview with TechCrunch. “With EvoDiff, we’re demonstrating that we may not actually need structure, but rather that ‘protein sequence is all you need’ to controllably design new proteins.”
At the heart of the EvoDiff framework is a 640-million-parameter model trained on data spanning many species and functional classes of proteins. Parameters are the values an AI model learns from its training data, and they essentially determine how well the model performs a task, in this case generating proteins. The training data came from the OpenFold dataset for sequence alignments and from UniRef50, a subset of UniProt, the database of protein sequences and functional information maintained by the UniProt consortium.
EvoDiff is a diffusion model, similar in architecture to modern image-generating models such as Stable Diffusion and DALL-E 2. It learns to gradually remove noise from a starting protein made up almost entirely of noise, moving it step by step closer to a protein sequence.
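To make the idea of "noise" on a sequence concrete, here is a toy sketch in Python of the corruption half of a diffusion process: a clean sequence is progressively scrambled, and each scrambled copy is paired with the original so that a model could, in principle, learn to undo the damage. This is an illustration only, not EvoDiff's actual code. The function names, the uniform-replacement noise scheme, and the example sequence are all assumptions made for this sketch.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids


def corrupt(sequence, step, total_steps, rng):
    """Swap each residue for a random amino acid with probability step / total_steps.

    Assumed noise scheme for illustration: step 0 leaves the sequence intact,
    and step == total_steps resamples every position.
    """
    noise_level = step / total_steps
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < noise_level else residue
        for residue in sequence
    )


def training_pairs(sequence, total_steps, seed=0):
    """Yield (step, noisy_sequence, clean_sequence) examples a denoiser could learn from."""
    rng = random.Random(seed)
    for step in range(1, total_steps + 1):
        yield step, corrupt(sequence, step, total_steps, rng), sequence


if __name__ == "__main__":
    clean = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence, not taken from EvoDiff
    for step, noisy, _ in training_pairs(clean, total_steps=5):
        print(step, noisy)
```

Generation runs the other way: a trained model starts from a heavily corrupted sequence and repeatedly predicts a slightly cleaner one. That learned denoiser is the part the 640-million-parameter model supplies, and it is deliberately left out here.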
Diffusion models are increasingly being applied to domains beyond image generation, from designing novel proteins, as EvoDiff does, to creating music and even synthesizing speech.
Ava Amini, a senior researcher at Microsoft and another key contributor to EvoDiff, emphasized the central idea that generating proteins from sequence offers versatility, scalability, and modularity. The diffusion framework also makes it possible to control protein design so that the results meet specific functional objectives.
In addition to creating entirely new proteins, EvoDiff can fill in the “gaps” in existing protein designs. For instance, given the part of a protein that binds to another protein, the model can generate an amino acid sequence around that region that meets a set of criteria.
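The gap-filling task can be pictured as a template in which the known region is fixed and everything around it is left blank for the model to fill. The Python sketch below shows only that bookkeeping. The `#` mask symbol, the function names, the `HEAGH` motif, and especially the random sampler standing in for the trained model are hypothetical choices for this illustration, not part of EvoDiff.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids
MASK = "#"  # placeholder for positions that still need to be generated


def build_template(motif, left_length, right_length):
    """Surround a fixed motif with masked positions on both sides."""
    return MASK * left_length + motif + MASK * right_length


def fill_masked(template, sampler):
    """Replace every masked position using sampler(position); keep fixed residues unchanged."""
    return "".join(
        sampler(position) if residue == MASK else residue
        for position, residue in enumerate(template)
    )


def random_sampler(seed=0):
    """Stand-in for a trained model: picks a uniformly random amino acid for each position."""
    rng = random.Random(seed)
    return lambda position: rng.choice(AMINO_ACIDS)


if __name__ == "__main__":
    motif = "HEAGH"  # hypothetical binding region, chosen only for this example
    template = build_template(motif, left_length=6, right_length=9)
    print(template)
    print(fill_masked(template, random_sampler()))
```

A real system would replace `random_sampler` with model predictions conditioned on the fixed region, which is what would make the surrounding sequence compatible with the motif rather than arbitrary.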
Crucially, EvoDiff designs proteins in “sequence space” rather than in terms of their structure. This allows it to synthesize “disordered proteins” that do not fold into a final three-dimensional structure. Disordered proteins play vital roles in biology and disease, where they can either enhance or inhibit the activity of other proteins.
It is essential to note that the research supporting EvoDiff has not yet undergone peer review. Sarah Alamdari, a data scientist at Microsoft who contributed to the project, acknowledges that there is still much work to be done in terms of scaling up the framework for commercial use.
“This is just a 640-million-parameter model, and we may see improved generation quality if we scale up to billions of parameters,” Alamdari explained via email. “While we demonstrated some coarse-grained strategies, to achieve even more fine-grained control, we would want to condition EvoDiff on text, chemical information, or other methods for specifying the desired function.”
As the next step, the EvoDiff team plans to experiment with the proteins generated by the model in a laboratory setting to determine their viability. If successful, they will embark on developing the next iteration of the framework.