Tag: opensource

Sophia

Unclear how staged this is, but impressive either way.

Little Sophia can walk, talk, sing, play games and, like her big sister, tell jokes. With Little Sophia’s software, and included tutorials through Hanson’s AI Academy, she is a unique programmable, educational companion for kids, inspiring children to learn through a safe, interactive, human-robot experience.

SpaceX telemetry

Neat: due to some nuclear weapons treaty, rocket communications are transmitted more or less in the clear, and a group of enthusiasts has decoded additional internal sensor readings and pictures from SpaceX, and also from some Chinese rockets(?). Kind of surprising that there’s not more industrial espionage going on, or if there is, that competitors don’t seem to be catching up with SpaceX suspiciously fast.

SoC Opensourcing

Beyond the NDA blocks, there is typically a deeper layer of completely unpublished documentation for disused silicon, such as peripherals that were designed-in but did not make the final cut, internal debugging facilities, and pre-boot facilities. Many of these disused features aren’t even well-known within the team that designed the chip! Thus a typical SoC mask set starts with lots of extra features, spare logic, and debug facilities that are chiseled away (disused) until the final shape of the SoC emerges. From a security standpoint, the presence of such “dark matter” in SoCs is worrisome. Forget worrying about the boot ROM or CPU microcode – the BIST (Built in Self Test) infrastructure has everything you need to do code injection, if you can just cajole it into the right mode. Furthermore, SoC integrators all buy functional blocks such as DDR, PCI, and USB from a tiny set of IP vendors. This means the same disused logic motifs are baked into 100Ms of devices, even across competing brands and dissimilar product lines. Herein lies a hazard for an unpatchable, ecosystem-shattering security break!

Trustable Hardware?

I’ve concluded that open hardware is precisely as trustworthy as closed hardware. Which is to say, I have no inherent reason to trust either at all. While open hardware has the opportunity to empower users to innovate and embody a more correct and transparent design intent than closed hardware, at the end of the day any hardware of sufficient complexity is not practical to verify, whether open or closed. Even if we published the complete mask set for a modern billion-transistor CPU, this “source code” is meaningless without a practical method to verify an equivalence between the mask set and the chip in your possession down to a near-atomic level without simultaneously destroying the CPU.

So where does this leave us? Do we throw up our hands in despair? Is there any solution to the hardware verification problem?

I’ve pondered this problem for many years, and distilled my thoughts into 3 core principles:

  1. Complexity is the enemy of verification
  2. Verify entire systems, not just components
  3. Empower end-users to verify and seal their hardware

2023-03-11: The next step, inspecting the hardware itself

The Infra-Red, In Situ (IRIS) inspection method is capable of seeing through a chip already attached to a circuit board, and non-destructively imaging the construction of a chip’s logic. Each pixel corresponds to 1.67 micron. While these images cannot precisely resolve individual logic gates, the overall brightness of a region will bear a correlation to the type and density of logic gate used. With a reasonable amount of design-level hardening, we may be able to up the logic footprint for a hardware trojan into something large enough to be detected with IRIS. Fortunately, there is an existing body of research on hardening chips against trojans, using a variety of techniques including logic locking, built in self test (BIST) scans, path delay fingerprinting, and self-authentication methods.

Solving protein structures

This goes a long way toward explaining why pharma companies are so terrible at coming up with new drugs.

There is perhaps no better example of this than protein structure prediction, a problem that is very close to these companies’ core interest (along with docking), but on which they have spent virtually no resources. The little research on these problems done at pharmas is almost never methodological in nature, instead being narrowly focused on individual drug discovery programs. While the latter is important and obviously contributes to their bottom line, much like similar research done at tech companies, the lack of broadly minded basic research may have robbed biology of decades of progress, and contributed to the ossification of these companies’ software and machine learning expertise.

2020-11-30: Nature perspective on AlphaFold 1

DeepMind has made a gargantuan leap in solving one of biology’s grandest challenges — determining a protein’s 3D shape from its amino-acid sequence. “This is a big deal. In some sense the problem is solved.”

Perspective by someone in the field

Which brings me to what I think is the most exciting opportunity of all: the prospect of building a structural systems biology. In almost all forms of systems biology practiced today, from the careful and quantitative modeling of the dynamics of a small cohort of proteins to the quasi-qualitative systems-wide models that rely on highly simplified representations, structure rarely plays a role. This is unfortunate because structure is the common currency through which everything in biology gets integrated, both in terms of macromolecular chemistries, i.e., proteins, nucleic acids, lipids, etc, but also in terms of the cell’s functional domains, i.e., its information processing circuitry, its morphology, and its motility. A structural systems biology would take this seriously, deriving the rate constants of enzymatic and metabolic reactions, protein-protein binding affinities, and protein-DNA interactions all from structural models. We don’t yet know how much easier, if at all, it will be to predict these types of quantities from structure than from sequence—we need to put the dogma of “structure determines function” to the test. Even if the dogma were to fail in some instances, which it almost certainly will, partial success will open up new avenues.

2021-07-23: AlphaFold 2

DeepMind has used its AI to predict the shapes of nearly every protein in the human body, as well as the shapes of 100Ks of other proteins found in 20 of the most widely studied organisms, including yeast, fruit flies, and mice. So far the trove consists of 350k newly predicted protein structures. DeepMind says it will predict and release the structures for more than 100m more in the next few months—more or less all proteins known to science. In the new version of AlphaFold, predictions come with a confidence score that the tool uses to flag how close it thinks each predicted shape is to the real thing. Using this measure, DeepMind found that AlphaFold predicted shapes for 36% of human proteins with an accuracy that is correct down to the level of individual atoms. Previously, after decades of work, only 17% of the proteins in the human body have had their structures identified in the lab. Drug discovery is all about those biological effects – what else could it be concerned with? And these are higher-order things than just the naked protein structure, as valuable as that can be. Remember, our failure rate in the clinic is around 90% overall, and none of those failures were due to lack of a good protein structure. They were caused by much harder problems: what those proteins actually do in a living cell, how those functions differ in health and disease, how they differ between different sorts of human patients and between humans in general and the animal models that were used to develop the compounds, what other protein targets the drug candidate might have hit and the downstream effects (usually undesirable) that those kicked off, and on and on. So structural biology has been greatly advanced by these new tools. But it has not been outmoded, replaced, or rendered irrelevant. It’s more relevant than ever, and now we can get down to even bigger questions with it.

2022-04-12: Protein complexes

ColabFold later incorporated the ability to predict complexes. And in October 2021, DeepMind released an update called AlphaFold-Multimer that was specifically trained on protein complexes, unlike its predecessor. It predicted around 70% of the known protein–protein interactions.
Elofsson’s team used AlphaFold to predict the structures of 65k human protein pairs that were suspected to interact on the basis of experimental data. And a team led by Baker used AlphaFold and RoseTTAFold to model interactions between nearly every pair of proteins encoded by yeast, identifying more than 100 previously unknown complexes. Such screens are just starting points. They do a good job of predicting some protein pairings, particularly those that are stable, but struggle to identify more transient interactions. “Because it looks nice doesn’t mean it is correct. You need some experimental data that show you’re right.”
Attempts to apply AlphaFold to various mutations that disrupt a protein’s natural structure, including one linked to early breast cancer, have confirmed that the software is not equipped to predict the consequences of new mutations in proteins, since there are no evolutionarily-related sequences to examine.
The AlphaFold team is now thinking about how a neural network could be designed to deal with new mutations. This would require the network to better predict how a protein goes from its unfolded to its folded state. That would probably need software that relies only on what it has learnt about protein physics to predict structures. “One thing we are interested in is making predictions from single sequences without using evolutionary information. That’s a key problem that does remain open.”
AlphaFold-inspired tools could be used to model not just individual proteins and complexes, but entire organelles or even cells down to the level of individual protein molecules. “This is the dream we will follow for the next decades.”

2022-07-28: AlphaFold goes from 350k to 214m predictions.

Researchers have used AlphaFold to predict the structures of 214m proteins from 1m species, covering nearly every known protein on the planet. According to EMBL-EBI, around 35% of the 214m predictions are deemed highly accurate, which means they are as good as experimentally determined structures. Another 45% were deemed confident enough to rely on for many applications. DeepMind has committed to supporting the database for the long haul, and updates could occur annually.

2022-08-03: AlphaFold is open source with no commercial restrictions. What is the end game for DeepMind?

DeepMind has made policy decisions that have played a significant part in the transformation in structural biology. This includes its decision last July to make the code underlying AlphaFold open source, so that anyone can use the tool. Earlier this year, the company went further and lifted a restriction that hampered some commercial uses of the program. It has also helped to establish, and is financially supporting, the AlphaFold database maintained with EMBL-EBI. DeepMind deserves to be commended for this commitment to open science.

2022-11-02: Meta enters the fold with a large language model. The amazing generality of language models continues.

ESMFold isn’t quite as accurate as AlphaFold, but it is 60x faster at predicting structures. “What this means is that we can scale structure prediction to much larger databases.”

As a test case, they decided to wield their model on a database of bulk-sequenced ‘metagenomic’ DNA from environmental sources including soil, seawater, the human gut, skin and other microbial habitats. The vast majority of the DNA entries — which encode potential proteins — come from organisms that have never been cultured and are unknown to science. The team predicted the structures of 617m proteins. Of these 617m predictions, the model deemed 33% to be high quality. Millions of these structures are entirely novel, and unlike anything in databases of protein structures determined experimentally or in the AlphaFold database of predictions from known organisms. A good chunk of the AlphaFold database is made of structures that are nearly identical to each other, and ‘metagenomic’ databases “should cover a large part of the previously unseen protein universe”.

In terms of what % of protein space has been covered by these models, estimates vary widely. But it’s possible that life itself has explored all of protein space. If we take a median estimate of 10^30 proteins, and 10^8 with structure, we have a long way to go.

To examine how much of sequence space could have been explored, it is simplest to make upper and lower limit estimates for the number of unique amino acid sequences produced since the origin of life. Considering the upper limit, it is clear that bacteria dominate the planet in terms of the product of the number of cells (10^30) multiplied by the number of genes in each genome (10^4). Let us assume that every single gene in this total of 10^34 is unique and that evolution has been working on these genes for 4 Ga completely changing each gene to some other unique, new gene every single year. This gives an extreme upper limit of 4×10^43 different amino acid sequences explored since the origin of life. The contribution to this number of sequences by viral and eukaryotic genomes is difficult to estimate but it is very unlikely to be orders of magnitude greater than the 4×10^43 sequences from bacteria. If their contribution is similar or smaller, then it can be ignored in our rough calculation. A lower limit to the number of sequences explored is more difficult to estimate but it has been estimated that there are 10^9 different bacterial species on Earth. If we assume that each species has a unique complement of 10^3 sequences (an underestimate) and that only 1 sequence has changed per species per generation (a reasonable estimate based upon analysis of mutation rates in bacteria), and that the generation time is 1 year (a considerable underestimate for many modern bacteria, but perhaps reasonable for an ancient organism or one growing slowly in a poor environment), then we arrive at a figure of 4×10^21 different protein sequences tested since the origin of life.

Although the oft-quoted 10^130 size of sequence space is far above these limits, the other more plausible estimates for the size of sequence space, particularly with limited amino acid diversity or reduced length, are near to or within these 2 limits. Considering the upper limit, all sequences containing 20, 8 and 3 types of amino acids have been explored if the chains are 33, 50 and 100 amino acids in length, respectively. Considering the lower limit, then virtually all chains of length 33 and 50 amino acids containing 5 or 3 types of amino acid, respectively, could have been explored. (The exploration of longer chains of 100 amino acids with only 2 types of residue is obviously much less complete but it is not a negligible fraction of the total.) Therefore it is entirely feasible that for all practical (i.e. functional and structural) purposes, protein sequence space has been fully explored during the course of evolution of life on Earth (perhaps even before the appearance of eukaryotes).
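
The orders of magnitude in the quoted estimate are easy to lose track of, so here is a rough sketch that recomputes them. The variable and function names are my own, the upper limit is rebuilt from the three factors given in the quote, and the lower limit is simply taken as quoted rather than re-derived. Treat it as a sanity check on the exponents, not as the authors’ calculation.

```python
import math

# Upper limit, rebuilt from the figures in the quoted passage.
cells = 10**30             # bacterial cells on Earth
genes_per_genome = 10**4   # genes per bacterial genome
years = 4 * 10**9          # 4 Ga, every gene fully replaced once per year

upper_limit = cells * genes_per_genome * years  # 4 * 10**43
lower_limit = 4 * 10**21                         # taken as quoted, not re-derived here


def log10_space(alphabet_size: int, length: int) -> float:
    """log10 of the number of distinct chains of `length` residues
    drawn from `alphabet_size` amino acid types."""
    return length * math.log10(alphabet_size)


# (amino acid types, chain length) pairs mentioned in the quote.
cases = [(20, 100), (20, 33), (8, 50), (3, 100), (5, 33), (3, 50), (2, 100)]

for types, length in cases:
    print(f"{types:>2} types, length {length:>3}: 10^{log10_space(types, length):.1f}")

print(f"upper limit: 10^{math.log10(upper_limit):.1f}")
print(f"lower limit: 10^{math.log10(lower_limit):.1f}")
```

Printing exponents rather than raw integers makes it easy to compare each case against the two limits by eye.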


2022-11-26: An open source reimplementation of AlphaFold does even better.

OpenFold is trained from scratch. Compared to AlphaFold2, OpenFold runs on proteins that are 1.7x larger, runs 2x as fast on short proteins, and is slightly more accurate. As more people can help drive this technology, we’ll get more and better discoveries.

2023-07-03: Foldseek

Sequence searches are fast, like searching a hard drive for a file name. But they often miss good matches because proteins with similar shapes can have vastly different sequences. Structure-based search methods look for shapes instead of sequences, but this can take thousands of times longer, because it’s computationally difficult to compare complex 3D objects. With Foldseek, researchers got the best of both worlds: the software represents a protein’s shape as a string of letters — a ‘structural alphabet’ — thereby offering the sensitivity of shape-based searches but at the speed of sequence-based ones. Foldseek outperformed 2 popular structure-based search tools, TM-align and Dali — performing 24% and 8% better, respectively — and 35k times and 20k times faster. Compared with a structural-alphabet-based tool called CLE-SW, Foldseek was 23% better, and 11x as fast

2023-10-12: Create vaccines for predicted mutations

EVEscape is an impressive SARS-CoV-2 soothsayer. 50% of the mutations the model predicted in a region of the cell-invading spike protein most prone to change have already been observed in real-world SARS-CoV-2 variants, a figure that should grow as the virus continues to evolve. The team used the model to create a set of potential sequences for the SARS-CoV-2 spike protein, some containing as many as 46 mutations from the ancestral strain, with the hope of anticipating the virus’s future evolution and contributing to the development of experimental vaccines.

The model isn’t limited to SARS-CoV-2. It could also predict the evolution of HIV, influenza, Nipah and the virus that causes Lassa haemorrhagic fever. When a new virus with pandemic potential pops up, the team hopes to be ready with predictions for its evolution — and perhaps even vaccines based on those predictions.

RISC-V

RISC-V is an open instruction set architecture originally developed at UC Berkeley for research and education that has been seeing a lot of exciting developments lately. You can buy a RISC-V based microcontroller right now. It is officially supported by GCC. The lowRISC project, founded by some of the same people responsible for Raspberry Pi, aims to provide a fully open source Linux system-on-a-chip. UC Berkeley has developed a (relatively) high performance, super-scalar, out-of-order RISC-V core.

2023-02-11: RISC-V status update. I remain skeptical because only losing players have adopted it, probably out of a position of weakness. Citing government investments as helpful is hilarious.

RISC-V is inevitable. RISC-V is going to have the best processors. And RISC-V is going to have the best ecosystem. All the technical stuff in RISC-V is amazing, but it’s really this change in the business model that makes RISC-V inevitable. And just think about this: Once you move to a high-quality open standard, you never go back to sole-source proprietary standards.

Open Hardware

A 180nm implementation of the J2 design costs around 3 cents per chip, with no royalties required. “That’s disposable computing at the ‘free toy inside’ level.”

A completely open hardware design can be produced cheaply.

2015-07-13:

Instead of running in fear of obsolescence, open-source hardware developers now have time to build communities around platforms; we can learn from each other, share blueprints and iterate prototypes before committing to a final design. The extra time also allows hardware product development to be leaner — one doesn’t have to burn money to meet a tight schedule. A team of 2 can now take 3 years, working mostly in their spare time, to build a laptop from scratch as a hobby. This is a great time to be developing hardware products, particularly open-source ones.

The only benefit of the slowdown of Moore’s law I’ve ever heard of.