
Promising efforts at disentangling the effects of genes and the environment on complicated traits may have been confounded by statistical problems.
Sapere Aude
Tag: dna

Fragile DNA Enables New Adaptations to Evolve Quickly
If highly repetitive gene-regulating sequences in DNA are easily lost, then that may explain why some adaptations evolve quickly and repeatedly
This goes a long way toward explaining why pharma companies are so terrible at coming up with new drugs.
There is perhaps no better example of this than protein structure prediction, a problem that is very close to these companies’ core interest (along with docking), but on which they have spent virtually no resources. The little research on these problems done at pharmas is almost never methodological in nature, instead being narrowly focused on individual drug discovery programs. While the latter is important and obviously contributes to their bottom line, much like similar research done at tech companies, the lack of broadly minded basic research may have robbed biology of decades of progress, and contributed to the ossification of these companies’ software and machine learning expertise.
2020-11-30: Nature perspective on AlphaFold 1
DeepMind has made a gargantuan leap in solving one of biology’s grandest challenges — determining a protein’s 3D shape from its amino-acid sequence. “This is a big deal. In some sense the problem is solved.”
Perspective by someone in the field
Which brings me to what I think is the most exciting opportunity of all: the prospect of building a structural systems biology. In almost all forms of systems biology practiced today, from the careful and quantitative modeling of the dynamics of a small cohort of proteins to the quasi-qualitative systems-wide models that rely on highly simplified representations, structure rarely plays a role. This is unfortunate because structure is the common currency through which everything in biology gets integrated, both in terms of macromolecular chemistries, i.e., proteins, nucleic acids, lipids, etc, but also in terms of the cell’s functional domains, i.e., its information processing circuitry, its morphology, and its motility. A structural systems biology would take this seriously, deriving the rate constants of enzymatic and metabolic reactions, protein-protein binding affinities, and protein-DNA interactions all from structural models. We don’t yet know how much easier, if at all, it will be to predict these types of quantities from structure than from sequence—we need to put the dogma of “structure determines function” to the test. Even if the dogma were to fail in some instances, which it almost certainly will, partial success will open up new avenues.
2021-07-23: AlphaFold 2
DeepMind has used its AI to predict the shapes of nearly every protein in the human body, as well as the shapes of hundreds of thousands of other proteins found in 20 of the most widely studied organisms, including yeast, fruit flies, and mice. So far the trove consists of 350k newly predicted protein structures. DeepMind says it will predict and release the structures for more than 100m more in the next few months, covering more or less all proteins known to science. In the new version of AlphaFold, predictions come with a confidence score that the tool uses to flag how close it thinks each predicted shape is to the real thing. Using this measure, DeepMind found that AlphaFold predicted shapes for 36% of human proteins with an accuracy that is correct down to the level of individual atoms. Previously, after decades of work, only 17% of the proteins in the human body had had their structures identified in the lab.
Drug discovery is all about those biological effects – what else could it be concerned with? And these are higher-order things than just the naked protein structure, as valuable as that can be. Remember, our failure rate in the clinic is around 90% overall, and none of those failures were due to lack of a good protein structure. They were caused by much harder problems: what those proteins actually do in a living cell, how those functions differ in health and disease, how they differ between different sorts of human patients and between humans in general and the animal models that were used to develop the compounds, what other protein targets the drug candidate might have hit and the downstream effects (usually undesirable) that those kicked off, and on and on. So structural biology has been greatly advanced by these new tools. But it has not been outmoded, replaced, or rendered irrelevant. It’s more relevant than ever, and now we can get down to even bigger questions with it.
2022-04-12: Protein complexes 
ColabFold later incorporated the ability to predict complexes. And in October 2021, DeepMind released an update called AlphaFold-Multimer that was specifically trained on protein complexes, unlike its predecessor. It predicted around 70% of the known protein–protein interactions.
Elofsson’s team used AlphaFold to predict the structures of 65k human protein pairs that were suspected to interact on the basis of experimental data. And a team led by Baker used AlphaFold and RoseTTAFold to model interactions between nearly every pair of proteins encoded by yeast, identifying more than 100 previously unknown complexes. Such screens are just starting points. They do a good job of predicting some protein pairings, particularly those that are stable, but struggle to identify more transient interactions. “Because it looks nice doesn’t mean it is correct. You need some experimental data that show you’re right.”
Attempts to apply AlphaFold to various mutations that disrupt a protein’s natural structure, including one linked to early breast cancer, have confirmed that the software is not equipped to predict the consequences of new mutations in proteins, since there are no evolutionarily-related sequences to examine.
The AlphaFold team is now thinking about how a neural network could be designed to deal with new mutations. This would require the network to better predict how a protein goes from its unfolded to its folded state. That would probably need software that relies only on what it has learnt about protein physics to predict structures. “One thing we are interested in is making predictions from single sequences without using evolutionary information. That’s a key problem that does remain open.”
AlphaFold-inspired tools could be used to model not just individual proteins and complexes, but entire organelles or even cells down to the level of individual protein molecules. “This is the dream we will follow for the next decades.”
2022-07-28: AlphaFold goes from 350k to 214m predictions.
Researchers have used AlphaFold to predict the structures of 214m proteins from 1m species, covering nearly every known protein on the planet. According to EMBL-EBI, around 35% of the 214m predictions are deemed highly accurate, which means they are as good as experimentally determined structures. Another 45% were deemed confident enough to rely on for many applications. DeepMind has committed to supporting the database for the long haul, and updates could occur annually.
2022-08-03: AlphaFold is open source with no commercial restrictions. What is the end game for DeepMind?
DeepMind has made policy decisions that have played a significant part in the transformation in structural biology. This includes its decision last July to make the code underlying AlphaFold open source, so that anyone can use the tool. Earlier this year, the company went further and lifted a restriction that hampered some commercial uses of the program. It has also helped to establish, and is financially supporting, the AlphaFold database maintained with EMBL-EBI. DeepMind deserves to be commended for this commitment to open science.
2022-11-02: Meta enters the fold with a large language model. The amazing generality of language models continues.
ESMFold isn’t quite as accurate as AlphaFold, but it is 60x faster at predicting structures. “What this means is that we can scale structure prediction to much larger databases.”
As a test case, they decided to wield their model on a database of bulk-sequenced ‘metagenomic’ DNA from environmental sources including soil, seawater, the human gut, skin and other microbial habitats. The vast majority of the DNA entries — which encode potential proteins — come from organisms that have never been cultured and are unknown to science. The team predicted the structures of 617m proteins. Of these 617m predictions, the model deemed 33% to be high quality. Millions of these structures are entirely novel, and unlike anything in databases of protein structures determined experimentally or in the AlphaFold database of predictions from known organisms. A good chunk of the AlphaFold database is made of structures that are nearly identical to each other, and ‘metagenomic’ databases “should cover a large part of the previously unseen protein universe”.
In terms of what % of protein space has been covered by these models, estimates vary widely. But it’s possible that life itself has explored all of protein space. If we take a median estimate of 10³⁰ proteins, and 10⁸ with structure, we have a long way to go.
To examine how much of sequence space could have been explored, it is simplest to make upper and lower limit estimates for the number of unique amino acid sequences produced since the origin of life. Considering the upper limit, it is clear that bacteria dominate the planet in terms of the product of the number of cells (10³⁰) multiplied by the number of genes in each genome (10⁴). Let us assume that every single gene in this total of 10³⁴ is unique and that evolution has been working on these genes for 4 Ga completely changing each gene to some other unique, new gene every single year. This gives an extreme upper limit of 4×10⁴³ different amino acid sequences explored since the origin of life. The contribution to this number of sequences by viral and eukaryotic genomes is difficult to estimate but it is very unlikely to be orders of magnitude greater than the 4×10⁴³ sequences from bacteria. If their contribution is similar or smaller, then it can be ignored in our rough calculation. A lower limit to the number of sequences explored is more difficult to estimate but it has been estimated that there are 10⁹ different bacterial species on Earth. If we assume that each species has a unique complement of 10³ sequences (an underestimate) and that only 1 sequence has changed per species per generation (a reasonable estimate based upon analysis of mutation rates in bacteria), and that the generation time is 1 year (a considerable underestimate for many modern bacteria, but perhaps reasonable for an ancient organism or one growing slowly in a poor environment), then we arrive at a figure of 4×10²¹ different protein sequences tested since the origin of life.
Although the oft-quoted 10¹³⁰ size of sequence space is far above these limits, the other more plausible estimates for the size of sequence space, particularly with limited amino acid diversity or reduced length, are near to or within these 2 limits. Considering the upper limit, all sequences containing 20, 8 and 3 types of amino acids have been explored if the chains are 33, 50 and 100 amino acids in length, respectively. Considering the lower limit, then virtually all chains of length 33 and 50 amino acids containing 5 or 3 types of amino acid, respectively, could have been explored. (The exploration of longer chains of 100 amino acids with only 2 types of residue is obviously much less complete but it is not a negligible fraction of the total.) Therefore it is entirely feasible that for all practical (i.e. functional and structural) purposes, protein sequence space has been fully explored during the course of evolution of life on Earth (perhaps even before the appearance of eukaryotes).
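The back-of-the-envelope numbers above are easy to reproduce. The sketch below multiplies out the quoted upper limit and compares it with the size of sequence space for the alphabet/length pairs mentioned in the excerpt. The variable and function names are mine, and the lower-limit product (species × sequences per species × years) is an assumption: it is simply the multiplication that reproduces the quoted 4×10²¹ figure.

```python
import math

YEARS = 4 * 10**9            # 4 Ga of evolution, one full turnover per year
BACTERIAL_CELLS = 10**30     # number of bacterial cells, as quoted
GENES_PER_GENOME = 10**4     # genes per genome, as quoted

# Upper limit: every gene in every cell replaced by a new one each year.
upper_limit = BACTERIAL_CELLS * GENES_PER_GENOME * YEARS

# Lower limit (assumed product that reproduces the quoted figure).
SPECIES = 10**9
SEQS_PER_SPECIES = 10**3
lower_limit = SPECIES * SEQS_PER_SPECIES * YEARS


def space_size(alphabet: int, length: int) -> int:
    """Number of distinct chains of `length` residues over `alphabet` residue types."""
    return alphabet ** length


if __name__ == "__main__":
    print(f"upper limit: 10^{math.log10(upper_limit):.1f}")
    print(f"lower limit: 10^{math.log10(lower_limit):.1f}")
    # (residue types, chain length) pairs mentioned in the excerpt
    for alphabet, length in [(20, 100), (20, 33), (8, 50), (3, 100),
                             (5, 33), (3, 50), (2, 100)]:
        size = space_size(alphabet, length)
        print(f"{alphabet:>2} types x {length:>3} residues: 10^{math.log10(size):.1f}")
    # Share of the estimated 10^30 proteins that have a structure (10^8).
    print(f"structure coverage: {10**8 / 10**30:.0e}")
```

Running it puts 20 residue types at length 100 near 10¹³⁰ and 20 types at length 33 at roughly 10⁴³, right at the upper limit, while the other combinations land somewhat above their respective limit, which fits the excerpt's wording of "near to or within".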

2022-11-26: An open source reimplementation of AlphaFold does even better.
OpenFold is trained from scratch. Compared to AlphaFold2, OpenFold runs on proteins that are 1.7x larger, runs 2x as fast on short proteins, and is slightly more accurate. As more people can help drive this technology, we’ll get more and better discoveries.
2023-07-03: Foldseek
Sequence searches are fast, like searching a hard drive for a file name. But they often miss good matches because proteins with similar shapes can have vastly different sequences. Structure-based search methods look for shapes instead of sequences, but this can take thousands of times longer, because it’s computationally difficult to compare complex 3D objects. With Foldseek, researchers got the best of both worlds: the software represents a protein’s shape as a string of letters (a ‘structural alphabet’), thereby offering the sensitivity of shape-based searches but at the speed of sequence-based ones. Foldseek outperformed 2 popular structure-based search tools, TM-align and Dali, performing 24% and 8% better, respectively, while running 35k times and 20k times faster. Compared with a structural-alphabet-based tool called CLE-SW, Foldseek was 23% better and 11x as fast.
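To make the ‘structural alphabet’ idea concrete, here is a toy sketch. It is not Foldseek's actual alphabet or algorithm: the six-letter alphabet, the 30-degree bins and all function names are hypothetical, chosen only to show how a 3D backbone can be turned into a string that does not change when the protein is rotated or moved, so that ordinary string comparison can stand in for shape comparison. The helix parameters are approximate textbook values for Cα atoms in an α-helix.

```python
import math

ALPHABET = "ABCDEF"  # hypothetical alphabet: one letter per 30-degree bin


def angle(a, b, c):
    """Angle in degrees at point b, formed by the points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos_value = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp rounding error
    return math.degrees(math.acos(cos_value))


def encode(coords):
    """Map each interior backbone point to a letter based on its local bend angle."""
    letters = []
    for i in range(1, len(coords) - 1):
        theta = angle(coords[i - 1], coords[i], coords[i + 1])
        bin_index = min(int(theta // 30), len(ALPHABET) - 1)
        letters.append(ALPHABET[bin_index])
    return "".join(letters)


def helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Idealized helical trace (approximate alpha-helix C-alpha geometry)."""
    return [(radius * math.cos(math.radians(turn_deg * i)),
             radius * math.sin(math.radians(turn_deg * i)),
             rise * i) for i in range(n)]


def strand(n, step=3.3):
    """Fully extended, straight trace."""
    return [(step * i, 0.0, 0.0) for i in range(n)]


def rotate_z_and_shift(coords, deg, shift):
    """Rigidly move a structure: rotate about the z axis, then translate."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]


if __name__ == "__main__":
    h = helix(12)
    print(encode(h))                                              # helix string
    print(encode(rotate_z_and_shift(h, 40.0, (5.0, -3.0, 7.0))))  # same string
    print(encode(strand(12)))                                     # different string
```

Because the letters depend only on local geometry, a moved copy of the helix encodes to the same string, while the strand encodes to a different one. A real structural alphabet has to capture far more than one bend angle per residue, but the payoff is the same: once shapes are strings, fast sequence-search machinery can be reused.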
2023-10-12: Create vaccines for predicted mutations
EVEscape is an impressive SARS-CoV-2 soothsayer. 50% of the mutations the model predicted in a region of the cell-invading spike protein most prone to change have already been observed in real-world SARS-CoV-2 variants, a figure that should grow as the virus continues to evolve. The team used the model to create a set of potential sequences for the SARS-CoV-2 spike protein, some containing as many as 46 mutations from the ancestral strain, with the hope of anticipating the virus’s future evolution and contributing to the development of experimental vaccines.
The model isn’t limited to SARS-CoV-2. It could also predict the evolution of HIV, influenza, Nipah and the virus that causes Lassa haemorrhagic fever. When a new virus with pandemic potential pops up, the team hopes to be ready with predictions for its evolution — and perhaps even vaccines based on those predictions.
Researchers have discovered a new kind of organism that doesn’t fit into the plant, animal, or any other kingdom of known organisms. 2 species of the microscopic organisms, called hemimastigotes, were found in dirt. Hemimastigotes were first seen and described in the 19th century, and ~10 species have been described over the past 100 years. But up to now, no one could figure out how they fit into the evolutionary tree of life. Based on the new genetic analysis, it looks like you’d have to go back 1 ga before you could find a common ancestor of hemimastigotes and any other known living thing.
The geno-economists seem confident that human genes have a measurable influence on human outcomes. But publicizing whatever predictive power does lie in our genes runs the risk of misleading the rest of us into believing that control of our genes is control of our future. They’re adamant that their motive is to forestall the dystopian implications of the work and to fight off misinformation and misguided policies. “The world in which we can predict all sorts of things about the future based on saliva samples — personality traits, cognitive abilities, life outcomes — is happening in the next 5 years. Now is the time to prepare for that.”
Genetic reading and writing may improve another 1M times
Saildrone’s investors are taking a longer view: that a global database of the oceans will benefit the company’s future more than chasing whatever business it can book today. “The most important asset is the data, and getting data that no one else can accumulate”. Still, Jenkins has been paired with Chief Operating Officer Sebastien de Halleux, who has a long track record of turning startups into big businesses. It’s de Halleux who convinced Jenkins that money could be made from understanding the weather. “Sebastien will keep it tethered, while Richard does his thing as a creative genius”. Saildrone plans to sell data to all comers, building a software platform that almost anyone can tap, and to go after commercial work, particularly with fisheries and logistics companies. Later this year, possibly by August, they plan to revive the goal of replicating Magellan’s voyage with a couple of saildrones. In order to make the circumnavigation official with the World Sailing Speed Record Council, the drones must start out in the Northern Hemisphere, cross every longitude line, and cross the equator twice. “We’ll fulfill all of our contracts first, and as soon as there is a boat available, we’ll set them off. It’s all about priorities, right?”
2023-03-11: Semi-autonomous ocean mapping
The Saildrone SD 1200 uncrewed surface vessel has successfully surveyed more than 45k km² of previously unmapped ocean floor around the Aleutian Islands in Alaska and a region off the Californian coast. An Environmental Sample Processor went along for the trip too – gathering environmental DNA.

Evidence of the octopus’s evolution shows it would have happened too quickly to have begun here on Earth. “Thus the possibility that cryopreserved Squid and/or Octopus eggs, arrived in icy bolides several 100M years ago should not be discounted as that would be a parsimonious cosmic explanation for the Octopus’ sudden emergence on Earth 270 ma BP.”
2022-01-29:
3 hearts, pumping blue-green blood because their oxygen-carrying metal is copper (versus iron in the heme of our blood). They can spend 30 minutes out of the water, to scoot between tidepools.
Alien intelligence: from a distant branch in the tree of life, the octopus is the only invertebrate to have developed a complex, clever brain. Our common evolutionary ancestor is a tubule so ancient, neither brains nor eyes yet existed. They evolved independently, on land and by sea. From the Cambrian explosion of sensing, body plans, and predation, minds evolved in response to other minds. It was an information revolution. It’s where experience begins.
The octopus brain rings around its throat. 500M neurons, similar to a dog (vs. human: 86B, fly: 100K).
The octopus has over 50 different functional brain lobes (versus 4 in humans).
And furthermore, 60% of its neurons are out in the arms, with a high degree of autonomy. A severed arm can carry on as if nothing has changed for several hours.
It is a distributed mesh of ganglia (knots of nerves) in a ladder-like nervous system. Recurrent neural loops serve as a local short-term memory latch.
“The octopus is suffused with nervousness; the body is not a separate thing that is controlled by the brain or nervous system.” Unconstrained by bone or shell, “the body itself is protean, all possibility. The octopus lives outside the usual body/brain divide.” (PGS)
Structurally, our eyes ended up strikingly similar to the octopus’s (camera-like with a focusing lens, through a transparent cornea and iris aperture to a retina backing the optic nerves). But octopus eyes have a wide-angle panoramic view, and they move independently like a chameleon’s.
Their horizontal slit pupil stays horizontal as the body moves, like a steady cam. This is made possible by special balance receptors called statocysts (a sac with internal sensory hairs and loose mineralized balls that roll around with movement and gravity).
They can see polarized light, but not color (making their color-matching camouflage skills all the more intriguing; they also see with their skin).
Their playful interactions with humans exhibit mischief and craft, a sign of mental surplus
Humans internalized language as a tool for complex thought (we can hear what we say and use language to arrange and manipulate ideas). Octopuses are on a different path.
Their entire skin is a layered screen, with about a megapixel directly controlled by the brain.
Skin color, pattern and fleshy texture can change in 0.7 seconds.
3 layers of skin cells control elastic sacks of pigments, internal iridescent reflections, even polarization (which the octopus can see), over a white underbody. They are regulated by acetylcholine, one of the earliest neurotransmitters in evolution.
The octopus can create a voluntary light show on its skin, e.g., a dark cloud passing over the local landscape, or a dramatic display to confuse a predator while fleeing.
30 ritualized displays for mating and other signaling.
Some octopuses have regions of constant kaleidoscopic restlessness, like animated eye shadow.
1600 suckers. 16 kg of lift capacity per sucker. 10k tasting chemoreceptors per sucker. Each is controlled individually.
Octopus muscles have radial + longitudinal fibers (agile like our tongues, not our biceps).
Opposing waves of activation can create temporary elbows at the region of constructive overlap, or pass food sucker-to-sucker like a conveyor belt.
The octopus’ arm muscles can pull 100x its own weight.
It can squeeze through a hole about the size of its eyeball.
Their ink squirts contain oxytocin (perhaps to soothe prey) and dopamine, the “reward hormone” (perhaps to trick predators into thinking they had caught the octopus in the billowy cloud).
2022-02-17:
Soft-bodied cephalopods such as the octopus are exceptionally intelligent invertebrates with a highly complex nervous system that evolved independently from vertebrates. Because of elevated RNA editing in their nervous tissues, we hypothesized that RNA regulation may play a major role in the cognitive success of this group. We thus profiled mRNAs and small RNAs in 18 tissues of the common octopus. We show that the major RNA innovation of soft-bodied cephalopods is a massive expansion of the miRNA gene repertoire. These novel miRNAs were primarily expressed in neuronal tissues, during development, and had conserved and thus likely functional target sites. The only comparable miRNA expansions happened, strikingly, in vertebrates. Thus, we propose that miRNAs are intimately linked to the evolution of complex animal brains.
A novel electric propulsion technology for nanorobots allows molecular machines to move 100Kx faster than with the biochemical processes used to date. This makes nanobots fast enough to do assembly line work in molecular factories.
We found the ancient Egyptian samples falling distinct from modern Egyptians, and closer towards Near Eastern and European samples (Fig. 4a, Supplementary Fig. 3, Supplementary Table 5). In contrast, modern Egyptians are shifted towards sub-Saharan African populations. Model-based clustering using ADMIXTURE [37] (Fig. 4b, Supplementary Fig. 4) further supports these results and reveals that the 3 ancient Egyptians differ from modern Egyptians by a relatively larger Near Eastern genetic component, in particular a component found in Neolithic Levantine ancient individuals [36] (Fig. 4b). In contrast, a substantially larger sub-Saharan African component, found primarily in West-African Yoruba, is seen in modern Egyptians compared to the ancient samples.
2021-11-05: Mummification is also older than previously thought:
The preserved body of a high-ranking nobleman called Khuwy, discovered in 2019, has been found to be far older than assumed and is, in fact, 1 of the oldest Egyptian mummies ever discovered. It has been dated to the Old Kingdom, proving that mummification techniques 4 ka BP were highly advanced. The sophistication of the body’s mummification process and the materials used – including its exceptionally fine linen dressing and high-quality resin – was not thought to have been achieved until 1 ka later.
