Large Looting Models

From lawsuits to protests, much attention has focused on the way Generative AI (GenAI) rides roughshod over the rights of authors, artists and other creative workers by mass appropriating digitized media. However, GenAI is now also being applied to reshape the biophysical world with an even more consequential violation of collective and community rights. As artificial intelligence appropriates the living stuff that feeds, clothes, and shelters people, digital piracy of media is now evolving into an AI-driven mass biopiracy.

This shift can be seen in the case of Generative Biology (GenBio), a field that involves training Large Language Models (LLMs), not with digital text but with biological datasets to generate novel blueprints for the molecules of life (e.g. new viruses, DNA, or proteins). These digital genetic blueprints are the raw materials for the ‘synthetic biology’ (SynBio) industry which uses DNA-synthesizing machines and new genetic engineering techniques, such as gene editing, to build or change physical DNA molecules from scratch. This enlarges the reach of AI tools to re-make life.

To compare: for a familiar GenAI platform such as ChatGPT, an LLM is trained on billions of pieces of text, images, or sound. That model can then be queried (inference) to predict novel synthetic recombinations. In GenBio, the underlying training data is digital genetic code, for example the letters of DNA (G, T, C, and A) or of proteins. Bioengineers can query the model to generate novel synthetic protein codes or a novel virus sequence that can be quickly built in a lab.

In this way, AI systems may generate actual biological material. Some will be released to the wild or sold as food ingredients, drugs, or seeds. Moving from digital to living materials has been referred to as the “biodigital” realm. As with all GenAI, this process comes with hallucinations, errors, bias, and more. While an error-riddled piece of AI art can be amusing, a hallucination in a novel virus may prove deadly.

The rise of GenBio

The engines of current GenBio developments are “biological foundation models” trained on trillions of tokens of DNA or protein sequences. Google DeepMind’s AlphaFold, whose principal developers won the Nobel Prize in Chemistry in 2024, is trained on all known protein sequences and, in turn, can generate novel proteins. Evo 2, from the tech bro-funded Arc Institute and Nvidia (with a little help from OpenAI), is trained on 9.3 trillion nucleotides of DNA from over 100,000 species. Last year it produced working bacteria-killing viruses. A third platform, M-Optimus, bills itself as a “world model for biology,” integrating many different types of biological training data.

Biological foundation models, in turn, are used by drug or ingredient companies: a GenBio model may be prompted to generate a synthetic DNA or protein sequence for drugs or flavors. To do so, the AI model reaches into its training data—trillions of DNA sequences scraped from science papers and online databases. These training sequences came from biological samples originally taken from communities, farmers, peasants, or patients. They are the original holders and stewards of the underlying genetic resources that make GenBio possible. It is from them that large-scale theft and commercial appropriation of their biological resources, known as biopiracy, is underway.

The new age biopiracy: Scraping the building blocks of life

Using AI models may be a new twist on genetic engineering, but biopiracy stretches back hundreds of years to the collection of germplasm for botanical collections and, later, the collection of biological material by the biotech industry, so-called “bioprospecting”. Advocates and communities have now worked for decades to protect the collective rights of farmers, peasants, and traditional Indigenous communities as the original stewards of genetic resources. Long before the current AI wave, Global South governments fought hard against biopiracy. In the 1990s, the Convention on Biological Diversity (CBD) was set up with one of its three core aims being “the fair and equitable sharing of benefits from the use of genetic resources”, that is, preventing biopiracy. The CBD established guidelines on Access and Benefit Sharing (ABS): if a scientist or corporation wished to access genetic resources, they had to settle “mutually agreed terms” for sharing some benefits with the “provider” communities, based on free prior informed consent. This ABS regime was codified into the Nagoya Protocol on Access and Benefit Sharing.

Unfortunately, despite warnings from civil society, the Nagoya Protocol only governed the use of physical biological material, not its digital versions (digital sequences). The biotech industry shifted to sequencing more genetic codes to upload, distribute, and store. These, in turn, became training data for Evo 2, Bioptimus, AlphaFold, and others.

Cali fund: A retrograde step for accountability?

To close the loophole of digital biopiracy, the Global South pushed hard for Digital Sequence Information (DSI) to also be covered by ABS rules. In November 2024, a landmark agreement was made to set up a pooled fund called the Cali Fund. Governments could now ask large companies that use DSI to pay into the fund as compensation. The DSI agreement, made between 196 governments, specifies that if a company is big enough and uses DSI in its business, then it "should" contribute either 0.1% of its overall annual sales or 1% of its corporate profits.
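
The two indicative rates quoted above can be worked through as simple arithmetic. The sketch below is illustrative only: the function name and the company figures are hypothetical, and it assumes nothing about the size thresholds that decide whether a company is "big enough" to be covered.

```python
from decimal import Decimal

# Indicative rates as quoted in the text above.
SALES_RATE = Decimal("0.001")   # 0.1% of overall annual sales
PROFIT_RATE = Decimal("0.01")   # 1% of corporate profits


def indicative_contributions(annual_sales, annual_profits):
    """Return the (sales-based, profit-based) contributions in the input currency."""
    sales_based = Decimal(annual_sales) * SALES_RATE
    profit_based = Decimal(annual_profits) * PROFIT_RATE
    return sales_based, profit_based


# Hypothetical company: USD 50 billion in sales, USD 10 billion in profits.
sales_based, profit_based = indicative_contributions(50_000_000_000, 10_000_000_000)
print(f"Sales-based:  USD {sales_based:,.0f}")
print(f"Profit-based: USD {profit_based:,.0f}")
```

Because the two bases can differ widely for the same company, any estimate of what a firm "should" pay is better stated as a range than as a single figure.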

The Cali Fund marked a significant shift in the ABS regime. Previously, a company would be expected to make an agreement with a specific community. Now, by digitizing the genetic resource, the company can instead pay into a general fund. The direct accountability has been broken and replaced with a general catch-all compensation. In this sense, the Cali Fund is a retrograde step.

From the point of view of AI governance, however, there is an interesting twist: the agreement recognizes the role of AI companies as users of DSI and lists them among those expected to pay into the fund. By one rough calculation, six AI giants (including NVIDIA, Microsoft, Salesforce, Amazon, and Alibaba) would collectively have been on the hook in 2024 for between USD 1.5 billion and 3.74 billion in payments to the Cali Fund. None of them paid.

The Cali decisions may be the first international UN decision to acknowledge that AI companies should pay providers for training data. This may be a good precedent for other training data—including creative works—to be recognized as requiring rights and compensation. On the downside, that decision severs the specific link of accountability between those who steward genetic resources and those who exploit them. The bargaining power of a provider community to verify, say no, or determine the terms and conditions of use is being swept away.

A new twist on old theft

How does this look in practice? In the case of Evo 2, there appears to be clear and broad mass biopiracy at work. Evo 2 is trained on roughly 9 trillion tokens, basically all the publicly available DNA, RNA, and protein sequences that can be found on the internet or in public databases. Evo 2 claims to be an open source model, so it appears companies can freely access and commercially leverage that data with impunity. There does not appear to be any attempt by the Arc Institute to recognize the rights of provider communities to negotiate, to compensate them, or to give them agency over the use of their genetic heritage.

AlphaFold has a similar biopiracy story. Originally, Google DeepMind provided AlphaFold as an open source model (similar to Evo 2), but the model was later made proprietary and now has a thicket of patents around it that may let Google maintain a monopoly. The underlying data and weights are no longer made public. Instead, Google licenses the model for a fee and has set up a private company, Isomorphic Labs, which works with pharmaceutical companies to develop drugs. Google DeepMind does not appear to be seeking the consent of communities or recognizing their rights.

A third example is Basecamp Research, a UK-based private AI biotech firm that has built its own database of genetic training data as well as its own proprietary GenBio platform. Unlike other GenBio firms, Basecamp Research has paid close attention to the CBD discussions and touts its AI platform as “Nagoya-ready”. Its training data is entirely proprietary and largely collected through direct agreements with communities, national parks, and national environmental agencies. Besides putting some small funds into the Cali Fund, Basecamp Research claims to have made Access and Benefit Sharing agreements with the communities from whom genetic resources were taken. If Basecamp Research makes money commercializing products, then a small part of the profit will go back to the original community.

While this arrangement may appear fairer on the surface, closer attention to the details belies that view. Basecamp Research trumpets that it pays communities, but the payment is for collecting biological samples: essentially, it covers the labor costs of self-bioprospecting. It additionally offers training in DNA sequencing as a “non-monetary benefit.” This benefits Basecamp Research, which needs trained workers in communities to capture more data for it. Theoretically, there might be further funds if a product turns a profit, but this relies on trusting Basecamp Research to track which training data led to the creation of which novel genetic sequences through an opaque and private LLM. Essentially, there is no way for a third party to track decisions made within the AI model.

Driving incentive: Data hoard as the ultimate asset

Relying on a share of future product profits also misunderstands where value currently lies in the emerging GenBio industry. A recent study published by ETC Group shows that, despite industry hype about future products (drugs, foods, chemicals), for now it is the accumulation of genetic data that really holds monetary value. In line with Silicon Valley business models, the value of Basecamp Research lies less in speculative future products than in the investor value of the aggregated data it has collected. Its ability to attract investor capital or set up partnerships depends on the size and promise of its data hoard. Every piece of genetic data Basecamp Research extracts from communities for close to free increases its value as an investable asset. Yet none of that investment value is returned to communities.

Two centuries ago, biopiracy looked like gentleman prospectors stealing commercial plants away to set up rubber plantations and sell ornamental flowers. Two decades ago, biopiracy looked like biotech companies smuggling out DNA samples so they could genetically engineer a crop or bacteria to make a profitable product. Today, biopiracy looks like a startup paying pennies to peasant communities to send them DNA samples so they can sell their data hoard and AI model as an asset to investors on Wall Street, Silicon Valley, or the City of London.

The series brings together expert voices and was commissioned to inform the development of the issue brief by IT for Change, ‘Governing AI for the Cultural Commons: Beyond Intellectual Property’, under the AI, Culture and Intellectual Property Subgroup of the UNESCO Global Civil Society Organizations (CSO) and Academic Network on AI Ethics and Policy.
