“How do you get better as a software developer?” What does expert performance look like? To facilitate continuous development of their employees’ software development skills, employers should:
- Encourage learning (e.g. training courses, conference attendance, and access to a good analog or digital library)
- Encourage experimentation (e.g. through side projects and by building a work environment that is open to new ideas and technologies)
- Improve information exchange between development teams, departments, and even companies (e.g. lunch-and-learn sessions, rotation between teams, pairing, mentoring, and code reviews)
- Grant freedom (primarily in the form of less time pressure) to allow developers to invest in learning new technologies or skills.
Tag: science
Subsurface biomass
70% of Earth’s bacteria and archaea live in the subsurface
Barely living “zombie” bacteria and other forms of life constitute an immense amount of carbon deep within Earth’s subsurface—245-385x greater than the carbon mass of all humans on the surface
2023-07-27: Dark Oxygen
In groundwater reservoirs 200 meters below the fossil fuel fields of Alberta, Canada, researchers discovered abundant microbes that produce unexpectedly large amounts of oxygen even in the absence of light. The microbes generate and release so much of what the researchers call “dark oxygen” that it’s like discovering “the scale of oxygen coming from the photosynthesis in the Amazon rainforest”. The quantity of the gas diffusing out of the cells is so great that it seems to create conditions favorable for oxygen-dependent life in the surrounding groundwater and strata. Instead of taking in oxygen from their surroundings like other aerobes, the bacteria created their own oxygen by using enzymes to break down the soluble compounds called nitrites (which contain a chemical group made of nitrogen and 2 oxygen atoms). The bacteria used the self-generated oxygen to split methane for energy. When microbes break down compounds this way, it’s called dismutation. Until now, it was thought to be rare in nature as a method for generating oxygen.
Jupiter’s moon Europa has a deep, frozen ocean; sunlight may not penetrate it, but oxygen could potentially be produced there by microbial dismutation instead of photosynthesis.
Solving protein structures
This goes a long way toward explaining why pharma companies are so terrible at coming up with new drugs.
There is perhaps no better example of this than protein structure prediction, a problem that is very close to these companies’ core interest (along with docking), but on which they have spent virtually no resources. The little research on these problems done at pharmas is almost never methodological in nature, instead being narrowly focused on individual drug discovery programs. While the latter is important and obviously contributes to their bottom line, much like similar research done at tech companies, the lack of broadly minded basic research may have robbed biology of decades of progress, and contributed to the ossification of these companies’ software and machine learning expertise.
2020-11-30: Nature perspective on AlphaFold 1
DeepMind has made a gargantuan leap in solving one of biology’s grandest challenges — determining a protein’s 3D shape from its amino-acid sequence. “This is a big deal. In some sense the problem is solved.”
Perspective by someone in the field
Which brings me to what I think is the most exciting opportunity of all: the prospect of building a structural systems biology. In almost all forms of systems biology practiced today, from the careful and quantitative modeling of the dynamics of a small cohort of proteins to the quasi-qualitative systems-wide models that rely on highly simplified representations, structure rarely plays a role. This is unfortunate because structure is the common currency through which everything in biology gets integrated, both in terms of macromolecular chemistries, i.e., proteins, nucleic acids, lipids, etc, but also in terms of the cell’s functional domains, i.e., its information processing circuitry, its morphology, and its motility. A structural systems biology would take this seriously, deriving the rate constants of enzymatic and metabolic reactions, protein-protein binding affinities, and protein-DNA interactions all from structural models. We don’t yet know how much easier, if at all, it will be to predict these types of quantities from structure than from sequence—we need to put the dogma of “structure determines function” to the test. Even if the dogma were to fail in some instances, which it almost certainly will, partial success will open up new avenues.
2021-07-23: AlphaFold 2
DeepMind has used its AI to predict the shapes of nearly every protein in the human body, as well as the shapes of hundreds of thousands of other proteins found in 20 of the most widely studied organisms, including yeast, fruit flies, and mice. So far the trove consists of 350k newly predicted protein structures. DeepMind says it will predict and release the structures for more than 100m more in the next few months: more or less all proteins known to science. In the new version of AlphaFold, predictions come with a confidence score that the tool uses to flag how close it thinks each predicted shape is to the real thing. Using this measure, DeepMind found that AlphaFold predicted shapes for 36% of human proteins with an accuracy that is correct down to the level of individual atoms. Previously, after decades of work, only 17% of the proteins in the human body had had their structures identified in the lab.
Drug discovery is all about biological effects – what else could it be concerned with? And these are higher-order things than just the naked protein structure, as valuable as that can be. Remember, our failure rate in the clinic is around 90% overall, and none of those failures were due to lack of a good protein structure. They were caused by much harder problems: what those proteins actually do in a living cell, how those functions differ in health and disease, how they differ between different sorts of human patients and between humans in general and the animal models that were used to develop the compounds, what other protein targets the drug candidate might have hit and the downstream effects (usually undesirable) that those kicked off, and on and on. So structural biology has been greatly advanced by these new tools. But it has not been outmoded, replaced, or rendered irrelevant. It’s more relevant than ever, and now we can get down to even bigger questions with it.
2022-04-12: Protein complexes 
ColabFold later incorporated the ability to predict complexes. And in October 2021, DeepMind released an update called AlphaFold-Multimer that was specifically trained on protein complexes, unlike its predecessor. It predicted around 70% of the known protein–protein interactions.
Elofsson’s team used AlphaFold to predict the structures of 65k human protein pairs that were suspected to interact on the basis of experimental data. And a team led by Baker used AlphaFold and RoseTTAFold to model interactions between nearly every pair of proteins encoded by yeast, identifying more than 100 previously unknown complexes. Such screens are just starting points. They do a good job of predicting some protein pairings, particularly those that are stable, but struggle to identify more transient interactions. “Because it looks nice doesn’t mean it is correct. You need some experimental data that show you’re right.”
Attempts to apply AlphaFold to various mutations that disrupt a protein’s natural structure, including one linked to early breast cancer, have confirmed that the software is not equipped to predict the consequences of new mutations in proteins, since there are no evolutionarily-related sequences to examine.
The AlphaFold team is now thinking about how a neural network could be designed to deal with new mutations. This would require the network to better predict how a protein goes from its unfolded to its folded state. That would probably need software that relies only on what it has learnt about protein physics to predict structures. “One thing we are interested in is making predictions from single sequences without using evolutionary information. That’s a key problem that does remain open.”
AlphaFold-inspired tools could be used to model not just individual proteins and complexes, but entire organelles or even cells down to the level of individual protein molecules. “This is the dream we will follow for the next decades.”
2022-07-28: AlphaFold goes from 350k to 214m predictions.
Researchers have used AlphaFold to predict the structures of 214m proteins from 1m species, covering nearly every known protein on the planet. According to EMBL-EBI, around 35% of the 214m predictions are deemed highly accurate, which means they are as good as experimentally determined structures. Another 45% were deemed confident enough to rely on for many applications. DeepMind has committed to supporting the database for the long haul, and updates could occur annually.
2022-08-03: AlphaFold is open source with no commercial restrictions. What is the end game for DeepMind?
DeepMind has made policy decisions that have played a significant part in the transformation in structural biology. This includes its decision last July to make the code underlying AlphaFold open source, so that anyone can use the tool. Earlier this year, the company went further and lifted a restriction that hampered some commercial uses of the program. It has also helped to establish, and is financially supporting, the AlphaFold database maintained with EMBL-EBI. DeepMind deserves to be commended for this commitment to open science.
2022-11-02: Meta enters the fold with a large language model. The amazing generality of language models continues.
ESMFold isn’t quite as accurate as AlphaFold, but it is 60x faster at predicting structures. “What this means is that we can scale structure prediction to much larger databases.”
As a test case, they decided to wield their model on a database of bulk-sequenced ‘metagenomic’ DNA from environmental sources including soil, seawater, the human gut, skin and other microbial habitats. The vast majority of the DNA entries — which encode potential proteins — come from organisms that have never been cultured and are unknown to science. The team predicted the structures of 617m proteins. Of these 617m predictions, the model deemed 33% to be high quality. Millions of these structures are entirely novel, and unlike anything in databases of protein structures determined experimentally or in the AlphaFold database of predictions from known organisms. A good chunk of the AlphaFold database is made of structures that are nearly identical to each other, and ‘metagenomic’ databases “should cover a large part of the previously unseen protein universe”.
In terms of what % of protein space has been covered by these models, estimates vary widely. But it’s possible that life itself has explored all of protein space. If we take a median estimate of 10^30 proteins, and 10^8 with structure, we have a long way to go.
To examine how much of sequence space could have been explored, it is simplest to make upper and lower limit estimates for the number of unique amino acid sequences produced since the origin of life. Considering the upper limit, it is clear that bacteria dominate the planet in terms of the product of the number of cells (10^30) multiplied by the number of genes in each genome (10^4). Let us assume that every single gene in this total of 10^34 is unique and that evolution has been working on these genes for 4 Ga completely changing each gene to some other unique, new gene every single year. This gives an extreme upper limit of 4×10^43 different amino acid sequences explored since the origin of life. The contribution to this number of sequences by viral and eukaryotic genomes is difficult to estimate but it is very unlikely to be orders of magnitude greater than the 4×10^43 sequences from bacteria. If their contribution is similar or smaller, then it can be ignored in our rough calculation. A lower limit to the number of sequences explored is more difficult to estimate but it has been estimated that there are 10^9 different bacterial species on Earth. If we assume that each species has a unique complement of 10^3 sequences (an underestimate) and that only 1 sequence has changed per species per generation (a reasonable estimate based upon analysis of mutation rates in bacteria), and that the generation time is 1 year (a considerable underestimate for many modern bacteria, but perhaps reasonable for an ancient organism or one growing slowly in a poor environment), then we arrive at a figure of 4×10^21 different protein sequences tested since the origin of life.
Although the oft-quoted 10^130 size of sequence space is far above these limits, the other more plausible estimates for the size of sequence space, particularly with limited amino acid diversity or reduced length, are near to or within these 2 limits. Considering the upper limit, all sequences containing 20, 8 and 3 types of amino acids have been explored if the chains are 33, 50 and 100 amino acids in length, respectively. Considering the lower limit, then virtually all chains of length 33 and 50 amino acids containing 5 or 3 types of amino acid, respectively, could have been explored. (The exploration of longer chains of 100 amino acids with only 2 types of residue is obviously much less complete but it is not a negligible fraction of the total.) Therefore it is entirely feasible that for all practical (i.e. functional and structural) purposes, protein sequence space has been fully explored during the course of evolution of life on Earth (perhaps even before the appearance of eukaryotes).
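The upper-limit arithmetic quoted above is easy to reproduce. The sketch below is only an illustration: it takes the quoted figures at face value (10 to the 30th bacterial cells, 10 to the 4th genes per genome, 4 billion years with one full turnover per year) and compares them with the size of sequence space for a few alphabet sizes and chain lengths. The function names are made up for this example, and the lower limit is not recomputed here.

```python
import math

# Figures quoted in the text (bacteria only, extreme upper limit).
CELLS = 10**30            # bacterial cells on Earth
GENES_PER_GENOME = 10**4  # genes per bacterial genome
YEARS = 4 * 10**9         # 4 Ga, every gene replaced by a new one each year


def upper_limit():
    """Upper bound on distinct sequences explored since the origin of life."""
    return CELLS * GENES_PER_GENOME * YEARS


def log10_space(alphabet_size, length):
    """log10 of the number of possible chains, i.e. alphabet_size ** length."""
    return length * math.log10(alphabet_size)


if __name__ == "__main__":
    print(f"upper limit: 10^{math.log10(upper_limit()):.1f}")
    for k, n in [(20, 100), (20, 33), (8, 50), (3, 100)]:
        print(f"{k} amino acid types, length {n}: 10^{log10_space(k, n):.1f}")
```

With these inputs, 20 types at length 33 lands just under the upper limit, while 8 types at length 50 and 3 types at length 100 come out around 10^45 and 10^48, so the comparison is best read as an order-of-magnitude one ("near to or within" the limits).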

2022-11-26: An open source reimplementation of AlphaFold does even better.
OpenFold is trained from scratch. Compared to AlphaFold2, OpenFold runs on proteins that are 1.7x larger, runs 2x as fast on short proteins, and is slightly more accurate. As more people can help drive this technology, we’ll get more and better discoveries.
2023-07-03: Foldseek
Sequence searches are fast, like searching a hard drive for a file name. But they often miss good matches because proteins with similar shapes can have vastly different sequences. Structure-based search methods look for shapes instead of sequences, but this can take thousands of times longer, because it’s computationally difficult to compare complex 3D objects. With Foldseek, researchers got the best of both worlds: the software represents a protein’s shape as a string of letters (a ‘structural alphabet’), thereby offering the sensitivity of shape-based searches but at the speed of sequence-based ones. Foldseek outperformed 2 popular structure-based search tools, TM-align and Dali, performing 24% and 8% better, respectively, while running 35k times and 20k times faster. Compared with a structural-alphabet-based tool called CLE-SW, Foldseek was 23% better, and 11x as fast.
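To make the ‘structural alphabet’ idea concrete, here is a toy sketch. It is not Foldseek’s actual alphabet or algorithm: the 4-letter alphabet, the use of one made-up angle per residue, and the helper names are all assumptions for illustration. The point is only that once shape is written as a string, ordinary fast string comparison can stand in for 3D comparison.

```python
from difflib import SequenceMatcher

# Hypothetical 4-letter alphabet. Each residue is labelled by the 90-degree
# bin that its (made-up) per-residue angle falls into. This is an
# illustration only, not the alphabet Foldseek uses.
ALPHABET = "ABCD"


def encode(angles_deg):
    """Turn a list of angles in degrees into a string over ALPHABET."""
    letters = []
    for angle in angles_deg:
        index = int((angle + 180) // 90) % len(ALPHABET)
        letters.append(ALPHABET[index])
    return "".join(letters)


def similarity(a, b):
    """Plain string similarity in [0, 1], standing in for a sequence search."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    query = encode([-170, -100, -10, 80, 170])
    close = encode([-160, -95, -20, 85, 100])
    far = encode([170, 170, 170, 170, 170])
    print(query, close, far)
    print(similarity(query, close), similarity(query, far))
```

Two shapes with similar per-residue angles map to similar strings even though nothing about their amino acid sequences is consulted, which is the property the paragraph above describes.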
2023-10-12: Create vaccines for predicted mutations
EVEscape is an impressive SARS-CoV-2 soothsayer. 50% of the mutations the model predicted in a region of the cell-invading spike protein most prone to change have already been observed in real-world SARS-CoV-2 variants, a figure that should grow as the virus continues to evolve. The team used the model to create a set of potential sequences for the SARS-CoV-2 spike protein, some containing as many as 46 mutations from the ancestral strain, with the hope of anticipating the virus’s future evolution and contributing to the development of experimental vaccines.
The model isn’t limited to SARS-CoV-2. It could also predict the evolution of HIV, influenza, Nipah and the virus that causes Lassa haemorrhagic fever. When a new virus with pandemic potential pops up, the team hopes to be ready with predictions for its evolution — and perhaps even vaccines based on those predictions.
Cyborg Botany
Elowan is a “plant-robot hybrid” that uses its own bio-electromechanical signaling to drive itself around toward light sources.
Inside Bruegel
New imaging technology, created by a project known as “Inside Bruegel”, offers some insight into these questions by allowing us to pull the painting’s layers apart. “It’s a huge advancement if you want to look at Bruegel. You can actually see the creative process. You can follow the artist in how he makes decisions.”
536 was the worst year
A mysterious fog plunged Europe, the Middle East, and parts of Asia into darkness, day and night, for 18 months. “For the sun gave forth its light without brightness, like the moon, during the whole year,” wrote Byzantine historian Procopius. Temperatures in the summer of 536 fell 1.5°C to 2.5°C, initiating the coldest 10 years in the past 2300 years. Snow fell that summer in China; crops failed; people starved. The Irish chronicles record “a failure of bread from the years 536–539.” Then, in 541, bubonic plague struck the Roman port of Pelusium, in Egypt. What came to be called the Plague of Justinian spread rapidly, wiping out one-third to one-half of the population of the eastern Roman Empire and hastening its collapse.
ML: Careful What You Wish
I’m not going to stand on the sidelines shouting for a fight, though. This whole episode has (I hope) been instructive, because machine learning is not going away. Nor should it. But since we’re going to use it, we all have to make sure that we’re not kidding ourselves when we do so. The larger our data sets, the better our models – but the larger our data sets, the greater the danger that we don’t understand irrelevant patterns in those numbers that we didn’t intend to be there, patterns which the ML algorithms will seize on in their relentless way and incorporate into their models. I think that the adversarial tests proposed by the UCSF group make a lot of sense, and that machine-learning results that can’t get past them need to be put back in the oven at the very least. Our biggest challenge, given the current state of the ML field, is to avoid covering the landscape with stuff that’s plausible-sounding but quite possibly irrelevant.
Hemimastigotes
Researchers have discovered a new kind of organism that doesn’t fit into the plant, animal, or any other kingdom of known organisms. 2 species of the microscopic organisms, called hemimastigotes, were found in dirt. Hemimastigotes were first seen and described in the 19th century, and ~10 species have been described over the past 100 years. But up to now, no one could figure out how they fit into the evolutionary tree of life. Based on the new genetic analysis, it looks like you’d have to go back 1 Ga before you could find a common ancestor of hemimastigotes and any other known living thing.
Geno-economics
The geno-economists seem confident that human genes have a measurable influence on human outcomes. But publicizing whatever predictive power does lie in our genes runs the risk of misleading the rest of us into believing that control of our genes is control of our future. They’re adamant that their motives are in forestalling the dystopian implications of the work, in fighting off misinformation and misguided policies. “The world in which we can predict all sorts of things about the future based on saliva samples — personality traits, cognitive abilities, life outcomes — is happening in the next 5 years. Now is the time to prepare for that.”
Biology is in charge
If you were around pre-1900s and wanted to contribute to biology, you should have been a physicist (Robert Hooke, a physicist, discovered the first cell; making a better microscope was a major driver of progress). In which field should you work to maximize progress in biology today? …But something interesting happened around the 1950s. If you look at the most important techniques in biology in the second half of the 1900s, they’re all driven by tools discovered in biology itself. Biologists aren’t just finding new things – they’re making their new tools from biological reagents. PCR (everything that drives PCR, apart from the heater/cooler which is 1600s thermodynamics, is either itself DNA or something made by DNA), DNA sequencing (sequencing by synthesis – we use cameras/electrical detection/CMOS chips as the output, but hijacking the way the cell makes DNA remains at the heart of the technique), cloning (we cut up DNA with proteins made from DNA, stick the DNA into bacteria so living organisms can make more copies of it for us), gene editing (CRISPR is obviously made from DNA, with RNA attached), ELISA (need the ability to detect fluorescence – optics – and process the signal, but antibodies lie at the heart of this principle), affinity chromatography (liquid chromatography arguably uses physical principles like steric hindrance, or charge, but those can be traced back to the 1800s – antibodies and cloning have revolutionized this technique), FACS uses the same charge principles that western blots do, but with the addition of antibodies…