A significant amount of data that could theoretically exist is never collected, leaving an opening for synthetic data algorithms to step in and transform how we use data. Advancements in this field rely on algorithms that tune, and are tuned by, our expectations of what would happen. This might sound like guesswork, but today we can stress-test those expectations by running thousands, if not millions, of simulations.
Synthetic data was hitherto the stuff of sci-fi, but recent breakthroughs have driven legislators, researchers, and developers to seriously consider its implications, especially since Big Tech corporations and state arms like the US Army have begun deploying it.
Synthetic data has the potential to fill the maw of eternally hungry data-processing algorithms, no matter how large or complex. Not only can it bypass privacy concerns (since simulated people don’t have addresses or personal attributes unless we want them to), it also allows smaller companies like start-ups, which have neither competitive data collection capabilities nor the capacity to pay data brokers for data, to compete in a market where artificial intelligence (AI) training data is worth more than a billion dollars. Two years ago, MIT released the ‘Synthetic Data Vault’, designed to help those without access to data learn and compete in the AI market.
Indeed, we may be on the cusp of a level playing field vis-à-vis the tech sector. But for that to materialize, it’s key that we load the synthetic data dice in favor of such an outcome.
The Tech and Its Use Cases
Imagine a die. You could roll it as many times as you want and write down the numbers. But then, why bother rolling at all? You could just write down a number from 1 to 6 as many times as you needed to. Instead of collecting data, you’re ‘sampling’ from your brain, creating a synthesized dataset of dice rolls. Synthesizing this data is easier than rolling and recording. You also likely know that the average of all rolls should be around 3.5, which would allow you to make a more accurate (although still synthesized) dataset.
Now imagine this process being done ‘algorithmically’. Simply put, synthetic data is data created by a computer. Just as you sampled from your intuition about dice, a competent synthesis algorithm draws on huge swathes of real-life samples to learn what plausible data looks like. Most people, even AI engineers, would be hard-pressed to distinguish the result from ‘real’ data.
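The sketch below (Python is my choice; the essay names no language) does exactly what the dice paragraph describes: instead of collecting rolls, it samples them from the known distribution of a fair die, and the resulting synthetic dataset averages out near the expected 3.5.

```python
# Synthesizing dice rolls instead of collecting them.
import random

random.seed(42)  # fixed seed so the example is reproducible

# No die is rolled: each value is sampled from the known uniform outcomes 1-6.
synthetic_rolls = [random.randint(1, 6) for _ in range(10_000)]

# The synthetic dataset behaves like a recorded one: its mean sits near 3.5.
print(sum(synthetic_rolls) / len(synthetic_rolls))
```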
While the simpler sampling-and-generation method scrambles and recombines pre-existing lists of common, available data, a more advanced and concerning method is the Generative Adversarial Network (GAN). A GAN comprises two AIs trained against each other: a generator and a discriminator. The generator produces data, and the discriminator tries to tell whether that data could plausibly belong to the real dataset. By pushing against each other, both algorithms improve, until the generator’s output is hard to tell apart from the real thing.
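To make the generator–discriminator loop concrete, here is a minimal sketch of the idea; it uses PyTorch and a one-dimensional toy distribution, both my own assumptions rather than anything described in the essay.

```python
# A toy, one-dimensional GAN sketch (PyTorch assumed; the essay names no
# framework). The "real" data are samples from a normal distribution; the
# generator learns to imitate it while the discriminator learns to flag fakes.
import torch
import torch.nn as nn

torch.manual_seed(0)

def real_batch(n):
    # "Real" data: normally distributed around 4 with spread 1.25.
    return torch.randn(n, 1) * 1.25 + 4.0

# Generator: turns random noise into a candidate data point.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
# Discriminator: outputs the probability that a data point is real.
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    # Train the discriminator: real samples should score 1, generated ones 0.
    real = real_batch(64)
    fake = G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Train the generator: try to make the discriminator label fakes as real.
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

# The generator now emits synthetic samples whose statistics should drift
# toward the real data's (mean near 4).
samples = G(torch.randn(1000, 8))
print(samples.mean().item(), samples.std().item())
```

Production GANs run the same loop with far larger networks over images, tabular records, or sensor traces; it is the adversarial dynamic, not the scale, that makes synthesized output hard to distinguish from collected data.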
While both methods have their flaws, the most obvious being that they can miss, or over-confirm, relationships in the original dataset and thereby create biases, they have already seen significant uptake. The GAN technique has found huge success, especially in image generation, and is partially responsible for the boom in AI-generated art that has taken social media by storm in recent times.
But GANs’ applicability goes much further. GAN-based simulations can be fine-tuned to an incredible degree, for assessing possible configurations of car crashes or shots taken by an attack drone. These technologies are also taking root across academic ecosystems. A theoretical use case is seen in mathematics, where AI algorithms have begun generating proofs for existing theorems, and synthesis algorithms are poised to start proposing theorems. Synthetic data demonstrates great utility for research in healthcare and the social sciences, where it can allow for comparative research without violating subjects’ privacy.
At the forefront of development and marketing are synthetic data producers like TONIC, which advertises itself as the “fake data company”, selling datasets created by extremely advanced (proprietary) algorithms. Its data, according to TONIC, is genuinely useful, more secure than collected personal data, and subject to fewer regulations.
Popular online advertising corporations have also begun to use synthesized data to train ad delivery services in a bid to ensure effective oversampling. Corporations using this tech are poised to significantly challenge the Meta-Google duopoly over online advertising, especially with looming regulations on personal data in large jurisdictions like the European Union (EU), California, and India.
Too Good to be True?
The benefits of synthetic data seem quite obvious. Protecting privacy, allowing faster AI training, and breaking the monopoly of Big Tech over reliable datasets should be reason enough for policymakers and legislators to promote its usage and push for its widespread adoption.
But while synthetic data technology seems like a great solution to many problems of the modern tech sphere, it is certainly no magic bullet. Much like cryptocurrencies, data synthesis is currently being oversold and overhyped in an unregulated market to consumers and businesses who don’t fully understand the scope, issues, or biases of the technology. The tech industry is playing a reckless game in trying to use synthesized data to bypass present data governance regulations. And once again, when crisis hits, as the FTX fiasco has proven, regulators will have to step in, a moment too late, to pick up the pieces of an otherwise promising advancement.
Questions of fairness, reliability, and trust concern everyone involved in the industry, from annotators to engineers. The urgency of competent regulation before the technology hollows out value chains cannot be overstated. For countries in the Global South, a general wariness of Big Tech’s promises is to be expected. Even the recent success of AI-based text synthesis algorithms like ChatGPT is partially due to the efforts of overseas laborers paid a pittance. Biotech firms have already set up data annotation centers in developing countries, where hundreds of workers label data – such as tumors in brain scans and configurations of wrist joints, used to train AIs for healthcare – while receiving minuscule cuts of the profits. Much of the success of AI comes from unpaid, undervalued labor, and synthesis will clearly be no different without active intervention from states.
While the technology itself may seem to protect privacy, synthetic data can amplify auxiliary systems of surveillance. Synthesizing data means there is more data with which to train systems that erode personal privacy. As data synthesis advances, surveillance algorithms will be better trained on sets built from real people, and with that comes the threat of automated policing. The German Federal Constitutional Court has already had to strike down predictive policing uses of software provided by the US data analytics firm Palantir. Palantir’s Anonymization Foundry already integrates synthesized data. Such software can be adapted to whatever data is available, and it becomes easier to synthesize ‘criminals’ from citizens when criminal datasets are bulked up and obscured with generated humans. Policing systems are going to be affected by these algorithms, and legislation needs to demand transparency, competency, and interpretability before crime and punishment can be entrusted to algorithms, if ever.
Regulators must also note that synthetic data isn’t free from bias. In fact, mimicked data creates a feedback loop of biases, as the flaws in the original dataset quickly replicate during both generation and discrimination. The majority of AIs operating in current markets require very finely tuned datasets, but maximizing ‘accuracy’ this way narrows their scope, an issue already prevalent in AI research. Even in healthcare, dataset selection remains a problem: performance gains are achieved by tuning datasets rather than algorithmic parameters, resulting in very slow adoption rates for a field that claims to be industry-driven. Dataset tuning is an academic and industrial norm, and this problem will only be sharpened if data is synthesized for better results.
Bias is likely to be a huge bottleneck to the promises of simulated realities, as it is unlikely that anyone will even know what biases to look for. To take a simple example, a picture of a human silhouette is often simply read as male; extra markers have to be added to distinguish it as female. This extra processing will deter pluralistic datasets, turning present defaults (white, male, and heterosexual) into mathematical constants. Policymakers should maintain a suspicious view of tech companies and their data-processing algorithms while questions about the shape of synthetic datasets remain unanswered.
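To see why this worries regulators, consider a toy numerical sketch (my own construction, not drawn from the essay): a naive synthesizer fitted to skewed data faithfully reproduces the dominant pattern and all but erases the minority group it was supposed to represent.

```python
# A toy illustration of bias amplification through naive synthesis.
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: 95% of samples cluster near 0, a 5% minority clusters near 8.
majority = rng.normal(0.0, 1.0, 9500)
minority = rng.normal(8.0, 1.0, 500)
real = np.concatenate([majority, minority])

# A naive synthesizer: fit a single Gaussian to the data and sample from it.
mu, sigma = real.mean(), real.std()
synthetic = rng.normal(mu, sigma, 10000)

# Share of samples falling in the minority's region (values above 6).
print("minority share, real data:     ", np.mean(real > 6))       # ~0.05
print("minority share, synthetic data:", np.mean(synthetic > 6))  # ~0.003
```

Nothing in this pipeline flags the loss; a model trained on the synthetic set simply inherits it, which is exactly the feedback loop described above.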
Another issue is that of storage. With extremely invasive data collection, storage has already become a problem; much of the industry blames heavy regulations on personal data, but once data synthesis becomes the norm, a bloat in storage is unavoidable. Video codecs are already being tuned to handle terabytes, and training AIs will require vast amounts of space to be created and allocated. While this may seem like a small hassle now, demand for hardware often shapes the software being developed. With a global chip shortage tightening belts, the true scale of the storage needs of synthetic data-driven algorithms will become more evident and pressing within a few years.
Looming over all these drawbacks, the core issue is the structure within which synthetic data exists. Corporations collating datasets don’t really own the data they use to hone their algorithms, yet they often don’t allow open access to those datasets either. This sits alongside the possibility of data theft, where corporations may stake claims to datasets created by open-source data processors. Much of the current system relies on patenting forks of open-source code or on blatant copyright infringement – StabilityAI, for example, trains its image-generation models on artists’ work without their consent, churning it to create facsimiles with questionable ownership.
The current hierarchies in Big Tech sustain themselves by buying up the technology of smaller engineers. While open-source models do sometimes thrive, the bulk of today’s projects begin by crowdsourcing development with the promise of universal access, only to close the technology off at the finalization stage. Unless regulatory steps are taken to ensure otherwise, there is little reason to assume that data synthesis technologies will not be captured by Big Tech and used, along with existing datasets, to entrench its monopolies.
While synthesis algorithms are useful, they can only ever deal with symptoms of the current tech sphere, not the causes. Oil spills are still a consequence of refusing to move away from fossil fuels; inequality, biased algorithms, privacy breaches, and the monopolies of Big Tech are a consequence of surveillance capitalism and will still need to be addressed with structural solutions.
Future-proofing Synthetic Data
The monopoly on data-heavy algorithms, especially on data collection, is already a hotbed of policy debate, so the regulation of data generation, and of Big Data algorithms in general, should be waiting in the wings for when the political inertia is finally overcome. In the EU, lawmakers have acted promptly, classifying synthesized personal data as pseudonymous data and thereby granting it the same protections as real data. However, this is nowhere near enough protection for a system that doesn’t actually ensure the anonymity of the people on whom the synthesis algorithms are trained. India’s revised Data Protection Bill has failed to include a comparable provision, a loophole that companies are likely to abuse by claiming that their data, being synthetic, needs no protection. Amazon’s Alexa has already begun using synthetic data to better understand Hindi. Nations still debating privacy laws, like India, which relies on decade-old legislation today, must take serious cognizance of data synthesis technology before the window of opportunity passes. Any legislation that refuses to engage with mimicked datasets, and with what the data synthesis and processing systems behind them are trying to achieve, will be ineffective.
Policy also needs to assign responsibility conclusively for ‘bad’ AI decisions, especially when those decisions are made in part because of synthetic data. The culpability for failure needs to be legislated and regulated. A wave of platform regulation has passed over us, and is likely to come again. If a company provides AI-based services built on ‘perceived’ datasets, all responsibility lies with it. Synthesized loopholes must be identified and closed before they can be used to damage the fairness of the market and the privacy of consumers.
Drawing from this report, a set of direct and indirect regulations needs to be framed. Offering bounties for biases in datasets, real and synthesized, is a sure way to rectify them quickly. To ensure that the technology keeps its position as an equalizer, compute accounting as well as computational support, especially for academia, needs to be enshrined in law. Regulation must begin with questions of corporate power, then expand its scope to a general AI-regulatory framework, which needs to include synthetic data. We’re still stumbling on the first step, but once the ball is rolling, a well-regulated tech sphere will do wonders for transparency, knowledge production and access, and general quality of life, all of which are immediate concerns for the nations of the Global South.
If future development is contingent on the need to create systems that chew up human actions and spit out variables to train another system, maybe it’s time to take a deep look at what incentive sets we’re locked into. Synthetic data has the capacity to synthesize a much fairer future, but it can only be wrought with competent regulation.
This essay has been published as part of IT for Change’s Big Tech & Society Media Fellowship 2022.