Avalo.ai: The David & Goliath of Large Genomic Models

Of all the languages an AI could master, without a doubt one of the hardest and most impactful is the genetic language of life itself.

So while our world is enthralled with the ability of LLMs to understand all human languages and programming languages, these LLMs also now understand the symbolic languages of math and musical notation and chemistry, and can infer the structure of proteins. But a little away from that action is a giant push to master the grandaddy of them all, the language of the genome.

It’s important to appreciate the magnitude of difficulty in jumping from human language to genetic language. In human language, adjectives and adverbs are usually right nearby the noun or verb they modify, though in some languages they pop up at the end of the sentence, changing the meaning with the final word. Not so with genetic languages. In genetic languages, these modifying bits of code can be nowhere nearby at all, floating billions of base pairs away. And 99% of the genome does not code for proteins, so effectively there are untold billions of these modifying introns, promoters, repeats, and telomeres. If the human genome were a book, it would be 262,000 pages long, and imagine 99% of the words in that book were essentially in random place. Oh, also – just to make it harder – imagine this book got printed with all the words jammed together, unspaced, so you never knew when one word began and the other ended. No small feat to make sense of this, indeed. Yet every living thing on earth is run by this programming code.

Five years ago, we at SOSV SF formed a thesis that, as this language gradually revealed itself, one of the best ways to take advantage was in the world of agricultural crops. The reasoning was quite simple: a wheat plant doesn’t have privacy rights like all human medical data does. Also, breeding plants is fine – breeding humans not. So quite simply, if you wanted to really push fast on playing God with life, plants was a better place to start. In 2021, Avalo.ai was formed, lead by our founders, Brendan Collins and Mariano Alvarez.

Avalo is the David to the Goliath formed a year later. That was when InstaDeep, an existing AI company, got a huge investment and partnership with Google to analyze plant genomes, with the goal of improving crop sustainability, resilience, and nutrition. A year later, it was bought by BioNTech, the inventor of the Pfizer COVID vaccine, for around $500 million, and soon partnered with the crop sciences giant Syngenta. In January of 2024 they released AgnoNT, a Large Genomic Model for the plant genome.

AgroNT is the “corporate” competitor, and Avalo is the venture-backed leader; it’s been funded by a syndicate of VCs all of whom have a lot of experience in agtech – Germin8, Alexandria, AtOne, Better Ventures, and SOSV.

A public version of AgroNT is available on Hugging Face. It’s a major step forward helping researchers identify which genetic mutations will likely have the biggest impact on traits without needing extensive lab experiments first.

Meanwhile, Avalo.ai has not released its attention model Gaius to the public. They are instead using it to breed plants to maturity and prove that their model’s predictions actually work, which would mean they have their weights and biases down pat. In fact they’re doing more than validating their predictions, they’re actually growing and selling their first product, which happens to be a drought tolerant, fashion grade long-staple cotton.

With this David and Goliath story happening at full speed, I sat down with Avalo.ai’s Brendan Collins and Mariano Alvarez to discuss this race for superintelligence over plants.

Po Bronson: The promise of Large Genomic Models is the capability to make big leaps forward using Ai-guided crop evolution. Before we get to that future, let’s talk about what you’ve already done. You bred a new cotton. You planted it. You harvested it. And you just got the data on the harvest. So I’ll ask you this: does that data really suggest a massive step change?

Brendan Collins: Absolutely. The results were staggering. Gaius, our LGM, delivered gains that were unheard of. We were breeding for multiple traits at the same time. The single best way to interpret the results is a simple number: how much more valuable is the cotton? One of our varieties produced 343% the amount of cotton, in weight. But it’s more valuable than that, because the fiber length, elongation, strength, and spinnability also improved. It went up two to three grades in the cotton grading system.

Mariano Alvarez: And it was grown using 75% less fertilizer. And no irrigation water. That both lowers farmers input costs, and qualifies as a low-carbon cotton, which carries an additional premium.

Po Bronson: The farmers you worked with must love you.

Brendan Collins: Yes, yes they do.

Po Bronson: And to compare that gain … a normal breeding program improves a crop by how much?

Brendan Collins: A conventional breeding program usually delivers a genetic gain of about 1%.

Po Bronson: So, one might say you accomplished 500 years of evolution – but in a single year.

Brendan Collins: Yes, AI guided evolution will change everything.

Po Bronson: Has AgroNT released any similar data?

Mariano Alvarez: Publicly, they made predictions for the world’s researchers to go test, but I haven’t seen results out of the field yet.

Po Bronson: Can we talk about the difference between gene editing and AI guided genetic recombination? CRISPR was first applied to plants in 2013. It also promised a revolution. But it’s been slow.

Brendan Collins: Very slow. And very expensive. Pairwise is nine years old, it raised $155 million, and it pulled its only product that was on the market, a less spicy mustard green. The regulatory process is seven years. Not to mention there’s no appetite in Europe.

Mariano Alvarez: Gene editing only works on a few genes at a time. Inari, the largest editing startup, also raised $770 million to do corn and soy. In every cycle of transformation, they can only edit four genes, so it takes many years to address a complex trait.

Po Bronson: And how different is your cotton, after just one year?

Mariano Alvarez: Thousands of genes will change in a single breeding cycle. Because for every trait we want, Gaius identifies as much as 700 genes involved in that complex trait. Then we are breeding for multiple traits at once.

Po Bronson: What is most different about plant development in the age of Large Genomic Models?

Mariano Alvarez: We don’t start with the highest yielding plants and try to nudge them forward. Corteva does that, their corn is very inbred, so there’s not much genetic diversity left in it. Machine learning needs variation to learn from, it needs edge cases. So we start by planting the most genetically diverse seeds from all the seeds in all the world’s seed banks. Whether it’s high yield or poor yield, doesn’t matter. In Texas, we planted over 500 variants of cotton. And then we started looking at 5 million genetic markers. The big seed companies like Corteva only look at 30 to 40 markers in corn. It’s not that they can’t look at more, it’s that there isn’t variety in the crop to look for.

Po Bronson: Would you call Avalo’s Gaius a Foundation Model?

Brendan Collins: Absolutely. One way to define a Foundation Model is whether it’s used for all sorts of questions and purposes. So at Avalo, we can ask Gaius for a very complex breeding plan of what plants to cross, and a prediction of the results. That question is rooted in the deep layers that understand the genome and the probability of genetic gain. But let’s say our customer doesn’t have the budget to do the entire breeding program – let’s say this year they only want to spend half. Gaius will optimize a breeding plan to any budget. Gaius has already run a simulation of 10,000 years of evolution. So we can ask the LGM financial questions, staffing questions, resourcing questions – and the answer is informed, always, by genetic predictions.

Po Bronson: What do you mean, exactly?

Brendan Collins: It basically informs every company decision. We’re booking cotton ginning capacity for next October in Lakeview, Texas – and Gaius predicts just how much capacity to reserve, based on genetic gain predictions. I can ask it how many field hands we need to hire in our three Texas locations. It tells us how much reinsurance to buy on our crops for the coming year. It even tells us how many fashion brands our business development team should to be talking to, to make sure there is offtake for this year’s cotton. The answer is always rooted in its deep understanding of the genetics we are working with.

Po Bronson: Can you talk about the compute efficiency for inference for Avalo’s Gaius versus other Large Genomic Models? Is it expensive to ask Gaius variations on those questions?

Mariano Alvarez: Sure. So as a frame of reference, when the world met Chat GPT 3.5 in late 2022, it was a dense neural network, so it took a huge amount of compute to calculate 175 billion parameters. It made huge gains in efficiency with different architecture, so the electricity demand today is 1/10th of what it used to be, per query, even though the task complexity of queries has gone up 500x. They did this by shifting to a Mixture of Experts architecture. So queries are routed to a few submodels, called experts, that are best suited to answer it, and only a portion of the model is active at any given time. This is part of what’s called sparsifying. Not boiling the ocean on each query.

Avalo’s Gaius is also a Mixture of Experts model. But this is where our ability to decipher which genes matter and which ones can be ignored is so crucial. Because our predictive accuracy on that can be as high as 99%, we can sparsify in a whole new way. For any one trait, or any one query, we can ignore most of the genome. A traditional dynamic marker sequencing panel for cotton will look at 1.5 million genetic markers, or SNPs – positions on the genome. But Gaius has determined that only 44,000 of those SNPs matter for the three traits we are engineering. This is massive for compute efficiency, and also cuts sequencing costs.

Po Bronson: How much did you have to do fine tuning or LoRA (Low-Rank Adaptation) of Gaius for cotton, specifically?

Mariano Alvarez: So what you’re really asking is how much our general model applies to any one crop, and how much the weights and biases need to be adjusted for each crop. Core to our thesis is you must start with maximal genetic diversity within a species, not just across species. And labs that only look across species will fail – it dilutes their model. So when we bring that species diversity data into Gaius, it autotunes for that crop and builds a new expert for that crop. Gaius has a cotton expert, and has a separate sugar expert, et cetera.

Po Bronson: Okay on that front – the Arc Institute is training Evo-2, its frontier model for all forms of life, across all species’ genomes. Are you worried about competition from that?

Brendan Collins: We’re not. Plant genomes are fundamentally different. Other life forms – such as humans and animals and even bacteria – lose genes if they’re not critical. Making DNA is metabolically expensive for them – so if they don’t use it, they lose it. Plants aren’t like that. They’ll save genes for a rainy day on a massive scale. Where humans have 2 sets of chromosomes, cotton is an allotetraploid, it has 4 sets of chromosomes. It’s like every cotton plant has Multiple Personality Disorder, it has another version of itself fighting to get out. Every now and then, that mutant has extraordinary abilities which are untapped. But very useful for our recombinations.

Po Bronson: Is Avalo going to release Gaius to public access? I’m not saying open source necessarily, but at least at some subscription cost?

Brendan Collins: We can. We might. But this is where Avalo is so very very different from InstaDeep. Because our Large Genomic Model doesn’t just control software. It also controls hardware. Gaius runs a drone selection system, and a seed selection system. The whole process.

Po Bronson: Can you tell me more what the drones and seed sorters do, exactly?

Mariano Alvarez: So, we’re all familiar with how drones can monitor plant growth, using multi-spectral cameras. In our case, the drone imagery are fed to Gaius. And this is where our ability to infer the genotype from the visible phenotype is so important. Gaius chooses certain plants for selection because it has inferred their genome.

Po Bronson: You’re sequencing – from the sky.

Mariano Alvarez: Kind of. It’s probabilistic.

Po Bronson: And your seed sorter? They normally look for defects, right? Disease, nicks, discoloration.

Mariano Alvarez: Right. But we have the only seed sorter in the world that infers genetics of the seed. We use Near Infrared Hyperspectral Imaging to see the presence of metabolites in the seed, up to 5 millimeters deep. From which metabolites are present, Gaius infers their genetics, selects the best, and the sorter sends the seed left or right through the hopper.

Po Bronson: All at 20 seeds per second.

Mariano Alvarez: Yes.

Po Bronson: And this is why Avalo doesn’t just open Gaius to the world. They’d need the hardware, too.

Brendan Collins: Another way to say it is, without the drone selection and seed selection, they’d have to sequence all the seeds from all the plants in a field. That’s 45 million seeds! We just choose the best 2,000 seeds to sequence. The cost savings is miraculous.

Po Bronson: I’d like to talk about what the world can do with Large Genomic Models and with Gaius specifically, now that you have this workflow down. What crops around the world need this deep genomics work, and why?

Brendan Collins: First of all, global agriculture is one of the biggest markets of the economy. It’s a $13 trillion dollar business. And each crop is sufficiently giant. Sugar, which we are reinventing with Coca Cola, is a $70 billion business. Rice, which we’ve worked on, is a $300 billion business. Cotton is $50 billion. Natural rubber is a $50 billion market. We’re doing all of them.

Mariano Alvarez: And all of those are experiencing massive disruptions in their supply chains. It’s just as important to be able to grow wheat as it is to make steel. No variable better proves this than the cost of crop insurance. In the last 25 years, the cost of crop insurance has gone up 546%. That’s 8 times faster than general inflation. Insurance has gone up because the losses have gone up, driven by all manner of stresses, aquifer depletion, extreme weather, hot summer nights, and sprawling pathogen range. France just had the smallest wheat harvest in 40 years. Citrus from Florida is at 1936 production levels, when we were in the midst of The Great Depression. Rice yield drops 7% for every 1C rise in nighttime temperature.

Po Bronson: What do you think will be the impact of Avalo’s Gaius on farmers?

Brendan Collins: Farmers love us. They look forward to getting better economics from their fields. They’re hurting now. The losses are devastating. And they are being sold so many inputs to put in their soil. What’s going on is that all these inputs functionally control the environment, they homogenize it – so the same seeds can be grown everywhere. But this is diminishing the genetic pools, vastly reducing the genetic diversity we need to plumb to make crops work in every environment. Gaius enables something very different – a crop that is designed for your environment. And we don’t even have to wait for the new seed. We can look at a farmer’s environment and tell them “of all the seeds available today, grow this certain seed, its genetics match your land the best.” We have farmers every day asking to join our cotton program – way more than we can serve.

Cookie	Duration	Description
AWSELB	session	Associated with Amazon Web Services and created by Elastic Load Balancing, AWSELB cookie is used to manage sticky sessions across production servers.
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie is used to record the user consent for the cookies in the "Advertisement" category .
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Cookie	Duration	Description
__Host-airtable-session	1 year	Contains a specific ID for the current Airtable.com session. This is necessary for the integartion of Airtable lists.
__Host-airtable-session.sig	1 year	Contains a specific ID for the current Airtable.com session. This is necessary for the integartion of Airtable lists.
brw	1 year	Detects and logs potential errors on third-party provided functions of airtable.com integration.
community_token	30 days	This cookie is used to track authenticated users in SOSV's Community Portal.
login-status-p	past	This cookie is necessary for the integartion of Airtable lists.

Cookie	Duration	Description
_ga	2 years	The _ga cookie, installed by Google Analytics, calculates visitor, session and campaign data and also keeps track of site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognize unique visitors.
_gat_UA-45933216-18	1 minute	A variation of the _gat cookie set by Google Analytics and Google Tag Manager to allow website owners to track visitor behaviour and measure site performance. The pattern element in the name contains the unique identity number of the account or website it relates to.
_gid	1 day	Installed by Google Analytics, _gid cookie stores information on how visitors use a website, while also creating an analytics report of the website's performance. Some of the data that are collected include the number of visitors, their source, and the pages they visit anonymously.
CONSENT	2 years	YouTube sets this cookie via embedded youtube-videos and registers anonymous statistical data.

Cookie	Duration	Description
VISITOR_INFO1_LIVE	5 months 27 days	A cookie set by YouTube to measure bandwidth that determines whether the user gets the new or old player interface.
YSC	session	YSC cookie is set by Youtube and is used to track the views of embedded videos on Youtube pages.
yt-remote-connected-devices	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.
yt-remote-device-id	never	YouTube sets this cookie to store the video preferences of the user using embedded YouTube video.

Did you know?