We do this at Serotiny [1], and others are working on this as well for different specific purposes. We use some very basic ML (the data is very sparse and wide, but any hit above a low baseline is very valuable).
The curious part, from our perspective, is that biology has massive surface area - and the surface area is 3D. It not only scales between species/functions, but it also scales up and down, from atoms to organs. And the expertise/abstraction layers that work at one scale become complicated if you try too hard to account for all variation at a different scale. HIV's genome is a backwards, upside-down, mirrored fugue of an engineering design that uses exotic molecules, exotic regulation, exotic proteins, and exotic physics. We're starting at a different place, just trying to write very simple scales.
In our case, we've chosen a single size scale to work with - proteins - but we look widely enough across every species & discipline to understand those proteins as common tools. We compile all of our designs for a particular function that we're interested in down to DNA, literally. Finding the niche where we do not have to deal with all of the DNA regulation, cellular regulation, tissue synthesis, etc. allows us to expand and build in complexity at the protein level while keeping other parts of biology constant. That also allows us to interact and work with others who are working at different scales.
And there are others that build in complexity at other biological levels (gene regulation, pathway flux, etc.). Companies like Asimov [2] are involved in similar work at some of those abstraction layers. The open-source design language, SBOL, is an attempt to standardize the DNA layer [3]. This contributes to the challenge: a lot of people/companies/labs have projects to build an abstraction layer that compiles down to DNA, but they might be talking past each other and doing separate projects.
We've built an entire API of 'high-level' commands at an abstraction layer above DNA, where the output compiles down to, literally, a JSON file specifying the DNA sequence to be manufactured by a 3rd party, as well as human-level citations to enable turning the new designs into intellectual property.
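To make the "compiles down to a JSON file" idea concrete, here is a minimal sketch in Python. It is not our actual API: the function name compile_design, the one-codon-per-amino-acid table, the JSON field names, and the placeholder citation strings are all assumptions made up for illustration. A real compiler would choose codons per host organism and do far more checking.

```python
import json

# Assumed lookup: one fixed codon per amino acid (illustrative only;
# real codon choice depends on the host organism and other constraints).
CODON_FOR = {
    "A": "GCC", "C": "TGC", "D": "GAC", "E": "GAG", "F": "TTC",
    "G": "GGC", "H": "CAC", "I": "ATC", "K": "AAG", "L": "CTG",
    "M": "ATG", "N": "AAC", "P": "CCC", "Q": "CAG", "R": "CGG",
    "S": "AGC", "T": "ACC", "V": "GTG", "W": "TGG", "Y": "TAC",
}
STOP_CODON = "TGA"


def compile_design(name, parts, citations):
    """Join protein parts, back-translate to DNA, and return a JSON-ready dict.

    `parts` is a list of amino-acid strings (hypothetical protein 'tools'),
    `citations` is a list of human-readable source notes for those parts.
    """
    protein = "".join(parts).upper()
    unknown = sorted(set(protein) - set(CODON_FOR))
    if unknown:
        raise ValueError(f"unsupported residues: {unknown}")
    dna = "".join(CODON_FOR[aa] for aa in protein) + STOP_CODON
    return {
        "name": name,
        "protein": protein,
        "dna": dna,
        "citations": list(citations),
    }


if __name__ == "__main__":
    design = compile_design(
        "demo_fusion",
        ["MK", "W"],  # two made-up parts
        ["placeholder citation for part 1", "placeholder citation for part 2"],
    )
    # The JSON text is what would be handed to a synthesis vendor.
    print(json.dumps(design, indent=2))
```

The point of the sketch is the shape of the pipeline: the design is expressed in protein-level parts, and everything below that level (here, just codon choice) is held constant by the compiler.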
There is still a LOT of data missing, and there's a lot of empirical work to do - and you need to keep your compiling system constant enough that when you make changes at your abstraction layer and hit a roadblock, you know it's because of a change you made, and not just a bug in the system.
Wow very cool. I’m actually a graduate student at Harvard getting my master’s in biology and CS, so this is right up my alley. I’m going to dive deeper into Serotiny and check out some of the awesome work you guys are doing. Do you offer internships for folks like me?
I’d heard about Asimov from their original work on Cello at MIT, but SBOL is news to me. I’ll check it out as well.
It sounds like this space is becoming pretty competitive, which is interesting.
I think the fun part is that it's not necessarily super competitive, yet. A lot of these tools are still complementary. But they're also not quite integrated yet either.
Always looking for good work. The field is growing rapidly right now. You've got a good combination of talents to help.
I'm not OP, but what about the realm outside of rational protein design? DNA base pairing rules are pretty well understood and we should be able to build useful tools using them. Is there any work out there using DNA only for computation?
Yep - and because DNA's base pairing rules are so well-studied, so predictable, and information-carrying, we can use DNA for its material properties in addition to or even separate from its genetic properties. In terms of software, Shawn Douglas built CADNano [1] - software to do precisely that. Used as a material, DNA can be useful in its own right - with all sorts of interesting 3D structures, 3D logic, and atomic precision built into the encoded base pairs. But these structures generally do not interact with DNA at a genetic level in an organism.
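As a small illustration of how predictable the base pairing rules are, here is a sketch in Python that computes the partner strand for a sequence and checks whether two strands pair perfectly. It only uses the standard Watson-Crick pairs (A-T, G-C); the function names are made up for this example and have nothing to do with how CADNano works internally.

```python
# Standard Watson-Crick pairs for DNA.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that pairs with `seq`, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(seq.upper()))


def fully_pairs(strand_a, strand_b):
    """True if strand_b is the exact Watson-Crick partner of strand_a.

    Both strands are given 5'->3'. This ignores mismatches, partial
    overlaps, and thermodynamics, which real design tools must handle.
    """
    return strand_b.upper() == reverse_complement(strand_a)


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(fully_pairs("ATGC", "GCAT"))
```

Because the partner of any strand is fully determined by these four rules, you can decide ahead of time which strands will stick to which, and that is the property structural DNA design builds on.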
In terms of protein design at that atomic level, the computation traditionally has relied on knowing or guessing at the structure (atomic arrangement) of the protein. And without that, there's not much to do (that's where our work picks up). A lot of that kind of protein design computation work is being done with software like Rosetta [2].
For instance, though not specific to HIV, and not exotic from a physicist's perspective, but certainly exotic in terms of taking physics into account at a different abstraction layer when compiling down to DNA:
DNA seems to be able to detect lesions and mismatches based on conductivity of electrons down the double-strands themselves [1].
[1] https://serotiny.bio
[2] https://www.asimov.io/
[3] http://sbolstandard.org/