Contemporary debates often frame disputes around the training data of large language models (LLMs) as questions of copyright law and the rights of individual creators. These discussions focus on whether companies may legally use publicly available material to train models, whether creators deserve compensation for such use, and how such compensation might be determined. Restricting the framing to intellectual property rights leaves unexamined the political economy through which knowledge, culture, and value live within AI systems. It makes LLM extractivism appear solely as a legal dispute rather than as a structural shift in how the knowledge commons is used and governed.

LLMs learn statistical patterns in language and cultural expression from their training data. LLM companies have achieved the scale necessary for training by ingesting materials from the open internet and converting them into proprietary computational capacity. As a result, peer-produced archives, journalism, creative works, and public knowledge repositories are treated as raw material for private model training. GPT-3, for example, was trained on a mixture of large text corpora (including filtered subsets of Common Crawl, WebText, books, and Wikipedia) derived from web-scale archives that initially measured in the tens of terabytes. Through multiple stages of filtering, deduplication, and weighting, portions of this material were incorporated into the model’s training dataset. Similarly, the LAION-5B dataset, used widely in image-generation models, contains more than 5 billion image–text pairs scraped from the web. These datasets aggregate cultural material produced by a large number of individuals and communities over time. Once incorporated into proprietary models, the underlying knowledge is assimilated within systems whose internal structures and processes remain opaque and privately controlled.

In terms of cultural rights, several forms of harm become visible.

  1. Appropriation: Cultural resources are incorporated into training datasets without systematic mechanisms for consent, attribution, or benefit-sharing. Authors, journalists, translators, artists, and community archivists typically have little visibility into how their work is used in model development. Several artists have filed lawsuits against AI companies such as Stability AI, Midjourney, and DeviantArt, arguing that models trained on scraped artworks can generate images in their recognizable artistic styles without the artists’ permission.
  2. Disembedding: Cultural production is embedded within social relations, contexts of authorship, and political-economic histories. During model training, however, language and other cultural artefacts are processed primarily as data. Shorn of contextual meaning, cultural artefacts become statistical tokens. Situated knowledge is converted into model parameters optimized for predictive probability. This is epistemic erasure.
  3. Commodification: The outputs of generative systems increasingly compete with the workers, creators, and communities whose labor enabled those systems in the first place. Concept artists working in the video game and entertainment industries have reported losing freelance commissions as clients experiment with image-generation tools such as Stable Diffusion and Midjourney. Cultural production is treated as an input for algorithmic systems whose outputs generate commercial value for AI companies, rather than as a living domain governed by rights, reciprocity, and social obligation.
  4. Concentration: Taken together, the above three harms contribute to the concentration of economic value and institutional power within AI firms. Cultural knowledge is produced through distributed forms of labor. However, the economic returns generated from transforming that knowledge into LLM data are captured primarily by companies that own and operate the model infrastructure. The communities whose knowledge and cultural labor enabled these systems remain largely absent from their governance.
  5. Homogenization: A feedback cycle develops in which a homogenized mode of knowledge production becomes the default. Over time, this cycle narrows the landscape of human expression itself.

Journalism, translation, editing, moderation, community documentation, and even certain artworks constitute forms of knowledge production and maintenance. In many economies, these professions are precarious and unevenly compensated. The increasing integration of AI systems reorganizes this labor ecosystem. Editors and moderators increasingly absorb responsibility and liability for verifying or correcting automated outputs while facing pressures of deskilling, layoffs, and wage stagnation. This labor dynamic intersects with existing inequalities within media and knowledge industries. In India, for example, women journalists are disproportionately represented in editorial and content moderation roles, positions that are essential to maintaining editorial quality but historically undervalued within newsroom hierarchies. As automated systems are increasingly integrated into editorial and newsroom workflows, these roles often absorb the reputational risk associated with algorithmic errors while remaining invisible within narratives of technological innovation.

LLM development produces a dilemma in the Global Majority world. On one hand, dominant training datasets systematically underrepresent many languages and knowledge traditions. According to a survey of multilingual NLP resources, over 90% of the world’s languages remain classified as ‘low-resource’ in computational datasets, with English accounting for a disproportionate share of training data. This imbalance leads to uneven model performance across linguistic and cultural contexts and produces representational injustice. On the other hand, attempts to address these gaps often reproduce extractive dynamics. When cultural materials from the Global Majority world are incorporated into training datasets, they frequently enter through ungoverned scraping rather than community-led data stewardship. Communities become sources of data rather than epistemic authorities or participants in decisions about how their knowledge is collected, used, or represented. The result is ‘extractive inclusion’—participation without decision-making power.

Public knowledge repositories illustrate some of the tensions created by these practices. Collaborative repositories such as Wikipedia and its sister projects are built through volunteer labor and donations of money and content within a public-interest governance model. Because of their scale, reliability, and open licensing, these repositories are also highly valuable inputs for LLM training. However, the economic value generated by AI systems trained on these resources rarely flows back into the maintenance and growth of the commons. The volunteer labor and expertise that sustain public knowledge infrastructures eventually go into building private model capacity without reciprocal institutional support.

These developments reflect historical patterns; intellectual property regimes have long mediated the relationship between public knowledge and private innovation. Patent and copyright protections were designed to balance incentives for innovation with the expansion of the public domain. However, contemporary AI development extends knowledge enclosure beyond these mechanisms. Major AI firms classify training datasets, model architectures, and optimization techniques as another kind of intellectual property: trade secrets. Scholars have noted that the expansion of trade secrets can inhibit cumulative scientific progress by restricting access to foundational research materials. In the context of LLMs, such secrecy limits the ability of public-interest researchers to audit datasets, evaluate representational bias, or develop alternative models. Copyright law is invoked as a mechanism to regulate training data, yet trade secrets simultaneously shield the infrastructures through which data is transformed into economic value. Intellectual property frameworks, therefore, contribute both to the cannibalization of public domain knowledge and to the enclosure of newly generated knowledge resources.

Addressing ‘extractive inclusion’ requires more than copyright reform. A governance framework centered solely on individual rights cannot adequately protect collective cultural interests or address the structural inequalities of knowledge production. Alternative approaches have begun to emerge within policy debates. One approach is the development of data commons governance frameworks that treat training data as collectively stewarded resources rather than privately appropriated inputs. Such frameworks could incorporate mechanisms for community consent (or its refusal), attribution standards, and benefit-sharing arrangements. Another approach draws from movements for indigenous data sovereignty, which emphasize community governance over how cultural knowledge is collected, stored, and reused in digital systems.

Publicly funded datasets, multilingual corpora, and open-source models can counterbalance proprietary LLM infrastructures. Initiatives such as BigScience’s multilingual BLOOM model, developed through an international research collaboration, demonstrate the feasibility of large-scale open research efforts that prioritize transparency and public accessibility.

If LLM development continues to rely primarily on extractive models of data acquisition and proprietary enclosure, the knowledge commons that enabled the digital public sphere will gradually erode. Addressing these issues requires governance that treats shared knowledge resources as infrastructure to be stewarded collectively rather than extracted for private gain. The governance challenges posed by LLMs are not only legal but also epistemic and institutional. Policy questions should not be restricted to (1) whether LLM training practices violate copyright, and (2) whether creators receive (fair) compensation. A fuller analysis must also address the political economy that shapes how knowledge is organized, governed, and ultimately monetized by LLMs. Who governs the infrastructures through which culture becomes machine-readable? Whose knowledge is preserved within those infrastructures? And will future innovations remain anchored in a shared public domain?

The series brings together expert voices and was commissioned to inform the development of the issue brief by IT for Change, ‘Governing AI for the Cultural Commons: Beyond Intellectual Property’, under the AI, Culture and Intellectual Property Subgroup of the UNESCO Global Civil Society Organizations (CSO) and Academic Network on AI Ethics and Policy.
