Run your business on untested data, and AI will build the wrong answer quickly and sound certain doing it. By the time the error shows up in the numbers, it is already inside your decisions, and sometimes the business does not survive it.
Every prior generation of analytics announced its own weakness. A bad spreadsheet returned an obvious error. A broken query crashed. A flawed regression produced a confidence interval wide enough to make a careful analyst pause. Artificial intelligence removes those warning signs. It absorbs poor foundational data and returns fluent, formatted, decisive output with no native expression of doubt, which is precisely what makes the failure mode dangerous to an operating margin.
The mechanism is an inversion. Good tooling normally lowers cost and raises quality together. AI built on weak data does something stranger: it lowers the cost of producing an answer while raising the cost of discovering the answer was wrong. Solutions ship faster and present more confidently, so they propagate further into operations, pricing, hiring, and strategy before anyone audits the basis. By the time the error is visible in the numbers, it has already been capitalized into decisions that are expensive to unwind.
Three numbers tell the whole arc in miniature. Poor data already carried a heavy, well-documented price before AI. Gartner's recurring estimate puts the average cost of poor data quality at roughly $12.9 million a year per organization, though that number comes from a survey of large enterprises sophisticated enough to already be buying data-quality software, so it describes big-company scale rather than a small business. The size-neutral version is more universal: survey work summarized by MIT Sloan and data-quality researcher Thomas Redman repeatedly lands on 15 to 25 percent of revenue lost to bad data, at any size. Layer machine learning on top of that foundation and the failure does not shrink; it accelerates. MIT's 2025 GenAI Divide study found that about 95 percent of enterprise generative-AI pilots produced no measurable profit-and-loss impact, against $30 to $40 billion in spending.
The most rigorous evidence for the deep version of this problem comes from model collapse. In a 2024 Nature paper, Shumailov and colleagues showed that when generative models are trained recursively on data produced by earlier models, they progressively forget the rare, low-probability events at the tails of the distribution, then degrade toward repetitive, narrowed output. The effect held across large language models, variational autoencoders, and Gaussian mixture models, suggesting it is a general property of learned generative systems rather than a quirk of one architecture.
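The recursive loop can be illustrated with a toy simulation. This is a minimal sketch and not the paper's experimental setup: it repeatedly fits a one-dimensional Gaussian to a small sample, then draws the next generation's "training data" from that fit. The function name, sample size, generation count, and seed are all assumptions chosen to make the effect visible quickly.

```python
import random
import statistics


def simulate_recursive_fitting(generations=300, sample_size=20, seed=0):
    """Refit a Gaussian to its own samples and track the fitted spread.

    Hypothetical toy model: each generation sees only data drawn from the
    previous generation's fitted distribution.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the real data
    spreads = [std]
    for _ in range(generations):
        # Draw a finite sample from the current model.
        sample = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Refit by maximum likelihood. A small sample under-represents the
        # tails and this spread estimate is biased low, so the loss compounds.
        mean = statistics.fmean(sample)
        std = statistics.pstdev(sample)
        spreads.append(std)
    return spreads


spreads = simulate_recursive_fitting()
print(f"initial spread: {spreads[0]:.3f}")
print(f"final spread:   {spreads[-1]:.3g}")
```

In this toy setting the fitted spread drifts toward zero as generations accumulate, which is the narrowing the paragraph describes. Real generative models are far more complex, so the sketch shows the direction of the effect rather than its size.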
The alarming part is the dose-response curve. Follow-up work (Dohmatob and colleagues) found that synthetic contamination as low as roughly one percent of the training set can be enough to push a model toward collapse, and that simply making the model or the dataset bigger does not reliably rescue it. Bad foundational data is not a linear tax you can dilute by adding volume. Past a threshold, more data of the wrong kind makes things worse, not better.
The second failure is human. Decades of research on automation bias show that people over-trust automated outputs, especially under time pressure, and especially when the output looks authoritative. A 2026 Nature Scientific Reports study found that when participants received AI guidance that was correct only half the time, those with more positive attitudes toward AI actually performed worse at the underlying task, because the guidance crowded out their own judgment. The researchers note a structural trap: human communication carries hesitation, hedging, disfluency, all natural uncertainty cues. AI output arrives without them, so users mistakenly read fluency as reliability.
Put the pieces together and the margin story writes itself. A team using AI on a weak data foundation ships solutions faster (lower apparent cost), with higher confidence (less scrutiny), into more decisions (wider blast radius). The MIT GenAI Divide data shows where this lands: 80 percent of organizations explore AI tools, 60 percent evaluate enterprise solutions, 20 percent reach a pilot, and only about 5 percent reach production with measurable value. The failures rarely announce themselves; they manifest as projects that quietly stall, or worse, as deployed systems whose errors are absorbed into operations and discovered only when results diverge from reality.
Before AI enters the picture, the data it will feed on is, in most organizations, a mess. It is scattered across dozens of disconnected systems, largely unmanaged, frequently inaccurate, often insecure, and rarely owned by anyone accountable for its quality. AI does not fix this. It inherits it, then acts on it at machine speed.
The numbers describe an environment that is the opposite of "tested." A mid-sized or larger company commonly runs on the order of a hundred SaaS applications, and large enterprises run into the hundreds (counts drawn from SaaS-management vendors, which skew toward tech-forward firms), so the same customer, order, or part exists in slightly different and often conflicting forms across systems that were never designed to agree. More than half of all enterprise data is "dark," collected and stored but never used, and more than 90% of it is unstructured. Roughly two-thirds of organizations do not even maintain a unified catalog of what data they hold, and data silos are the single most-cited cause of unusable data, named by 82% of organizations.
The security picture is no better. In 2025 the average U.S. data breach reached a record $10.22 million, and about 30% of breaches involved data spread across multiple environments, which is exactly what that kind of system sprawl produces. Roughly a quarter of breaches trace to ordinary human error. The newest exposure is self-inflicted: "shadow AI," meaning employees feeding company data into unsanctioned tools, appeared in about 20% of breaches, and 97% of AI-related breaches occurred at organizations lacking basic AI access controls.
Here is the part that should worry any operator. The reason the data is bad is not primarily technical. When data and AI leaders are asked what blocks them from becoming data-driven, 92% point to people and organizational change, and only 8% to technology. Data quality is consistently named the single biggest barrier to getting value out of generative AI.
The mechanism is familiar from every other kind of work. People fill the form field with whatever passes validation. They type a placeholder in the box because the real answer takes ten minutes to find. They enter the date in the wrong format, skip the optional field, and duplicate the record rather than search for the one that already exists. This is treated as busy work, low-status data entry that "does not really matter," because nobody above them has ever signaled that it does. The shoddiness is rational behavior in an organization that rewards throughput and never measures quality, and most companies do not measure the cost of their own bad data at all.
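The habits described above leave traces that can be counted. The following is a minimal sketch, assuming records arrive as dictionaries; the field names, the placeholder list, the sample records, and the choice of ISO dates as the agreed format are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical placeholder values; a real list would come from profiling the data.
PLACEHOLDERS = {"", "n/a", "na", "none", "tbd", "test", "xxx", "unknown"}


def audit_records(records, date_field="order_date", key_field="email"):
    """Count three common data-entry defects in a list of dict records."""
    report = {"placeholder_values": 0, "bad_dates": 0, "duplicate_keys": 0}

    for record in records:
        for value in record.values():
            if str(value).strip().lower() in PLACEHOLDERS:
                report["placeholder_values"] += 1
        try:
            # Assumes the agreed format is ISO 8601 (YYYY-MM-DD).
            datetime.strptime(str(record.get(date_field, "")), "%Y-%m-%d")
        except ValueError:
            report["bad_dates"] += 1

    # Normalize the key so trivially different copies count as duplicates.
    keys = Counter(str(r.get(key_field, "")).strip().lower() for r in records)
    report["duplicate_keys"] = sum(n - 1 for n in keys.values() if n > 1)
    return report


records = [
    {"email": "ana@example.com", "order_date": "2025-03-01", "phone": "555-0100"},
    {"email": "Ana@Example.com ", "order_date": "03/01/2025", "phone": "n/a"},
    {"email": "li@example.com", "order_date": "2025-03-02", "phone": "TBD"},
]
report = audit_records(records)
print(report)
```

Counting these defects per team or per system is one way to start measuring a cost that, as noted above, most companies never measure.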
The abstract risk becomes concrete when you look at what has already happened. In each case below the data foundation was wrong, weak, or fabricated, the automated system acted on it with confidence, and the consequence was severe, in several instances existential.
Zillow built an algorithmic home-buying business, Zillow Offers, on a pricing model trained largely on data from a stable housing market. When the post-pandemic market turned volatile, the model kept confidently overvaluing homes, and Zillow kept buying them. The correction was brutal: a writedown of more than $540 million, roughly 2,000 jobs cut (about 25% of the workforce), and the entire iBuying unit shut down. The data foundation could not handle conditions it had never been tested against, and it took a business unit down with it.
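One hedged way to notice that a model is meeting conditions it was never tested against is to compare live inputs with the training data. The sketch below is a crude heuristic and not a reconstruction of anything Zillow ran; the function name, the threshold, and the price-change figures are hypothetical.

```python
import statistics


def drift_report(training_values, live_values, z_threshold=3.0):
    """Compare live inputs to the training distribution of one feature."""
    train_mean = statistics.fmean(training_values)
    train_std = statistics.stdev(training_values)
    lo, hi = min(training_values), max(training_values)

    # Share of live inputs outside anything seen in training.
    outside = sum(1 for v in live_values if v < lo or v > hi)
    # Shift of the live mean, in units of training standard deviations.
    shift = (statistics.fmean(live_values) - train_mean) / train_std
    return {
        "share_outside_training_range": outside / len(live_values),
        "mean_shift_in_std": shift,
        "drifted": abs(shift) > z_threshold,
    }


# Hypothetical monthly price changes in percent: calm history, volatile present.
training = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5]
live = [2.1, -1.8, 3.0, 0.5, 2.6]
drift = drift_report(training, live)
print(drift)
```

A check like this does not make the model right. It only flags that the model's confidence was earned under different conditions and should not be trusted without review.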
Large language models generate fluent, authoritative text whether or not it is true. In law, where citations are checkable, the failure is now well documented. Stanford researchers found general-purpose chatbots hallucinated between 58% and 82% of the time on legal queries, and even tools built specifically for lawyers still hallucinated on 17% (Lexis+ AI) and 34% (Westlaw) of queries, against 43% for GPT-4, the general-purpose model used as the comparison. The real-world tally: courts have flagged AI-generated hallucinations in more than 1,300 filings worldwide, including a $110,000 sanction against lawyers who submitted 23 fabricated citations. The liability is not limited to individuals: a tribunal held Air Canada responsible for its own chatbot confidently inventing a refund policy that did not exist.
Bad data is not always an accident. Because frontier models train on enormous scrapes of the open internet, an attacker can seed that training data with planted documents. A 2025 study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute found that injecting just 250 malicious documents was enough to backdoor models ranging from 600 million to 13 billion parameters, with success depending on the absolute number of poisoned documents rather than their share of the data. For the largest model, those 250 documents were about 0.00016% of the training set. Scale does not dilute the poison.
Sometimes the false data is a convincing fake aimed at a human or automated decision. In February 2024 the engineering firm Arup lost $25 million when an employee, following what looked like instructions from senior leadership, joined a video call populated entirely by deepfaked colleagues, including a fabricated CFO, and authorized the transfers. The decision was rational given the inputs. The inputs were synthetic.
Getting the data foundation wrong is increasingly a balance-sheet event, not just an engineering one. Under the EU AI Act, prohibited practices can draw fines of up to €35 million or 7% of global annual turnover, and high-risk non-compliance up to €15 million or 3%, calculated against worldwide revenue. Add civil liability of the Air Canada kind, and the cost of deploying a confidently wrong system is no longer hypothetical.
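To make those ceilings concrete, the sketch below works the arithmetic for a hypothetical company. One assumption needs flagging: the text above gives each ceiling as a fixed amount or a share of turnover, and the sketch reads that as whichever is higher, which is understood to be how the Act treats larger undertakings and should be verified against the regulation itself. The function name and the turnover figure are illustrative.

```python
def max_fine_eur(global_turnover_eur, fixed_cap_eur, turnover_share):
    """Upper bound of a fine: the larger of a fixed cap and a turnover share.

    Assumption: the "whichever is higher" reading for larger undertakings,
    which is not spelled out in the surrounding text.
    """
    return max(fixed_cap_eur, turnover_share * global_turnover_eur)


turnover = 2_000_000_000  # hypothetical company with EUR 2 billion global turnover
prohibited = max_fine_eur(turnover, 35_000_000, 0.07)
high_risk = max_fine_eur(turnover, 15_000_000, 0.03)
print(f"prohibited practice: up to EUR {prohibited:,.0f}")
print(f"high-risk non-compliance: up to EUR {high_risk:,.0f}")
```

Under that reading, the percentage term dominates for any company whose turnover exceeds EUR 500 million, which is why the exposure scales with the size of the business rather than with the size of the AI project.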
The instinct is to ask "how many rows do I need?" That is the wrong unit. There is no universal data quantity that confers trust, because the answer depends on the complexity of the task, the diversity of real-world conditions, and how concentrated the consequences of error are. A model can be trained on billions of records and remain untrustworthy if those records are biased, stale, or missing the tails. A narrow model can be trustworthy on a few thousand clean, representative, well-labeled examples.
The more rigorous framing, and the one that standard deviation speaks to directly, is this: trust is a statement about variance, not about mean accuracy. A system you can rely on is one whose performance is both high and stable across repeated runs, across data slices, and across the conditions it will actually meet in production. Standard deviation is the right tool because it measures exactly that stability.
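A small worked example may help show why the mean alone is not enough. The per-slice accuracy figures below are hypothetical, constructed so that both systems share the same average; the `summarize` helper is illustrative only.

```python
import statistics

# Hypothetical accuracy per data slice (for example region or customer segment).
system_a = [0.90, 0.91, 0.89, 0.90, 0.90]
system_b = [0.99, 0.98, 0.99, 0.97, 0.57]


def summarize(scores):
    """Mean, sample standard deviation, and worst slice for one system."""
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.stdev(scores),
        "worst": min(scores),
    }


summary_a = summarize(system_a)
summary_b = summarize(system_b)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(f"system {name}: mean {summary['mean']:.2f}, "
          f"std {summary['std']:.3f}, worst slice {summary['worst']:.2f}")
```

Both systems average 0.90, but system B's standard deviation is more than twenty times larger because one slice fails badly. A headline average would rate them as equals; the spread shows which one can be relied on.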
The industry's answer to "the models need to be better" has largely been "build more and bigger compute." The scale is now genuinely large. The International Energy Agency's 2025 Energy and AI report projects that global data-center electricity consumption will roughly double from about 415 terawatt-hours in 2024 to around 945 TWh by 2030, slightly more than Japan's entire electricity consumption today, rising further toward 1,200 TWh by 2035.
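The growth rate implied by those two endpoints can be worked out directly. This is a minimal sketch that assumes smooth compound growth between 2024 and 2030, which the projection itself does not claim; the function name is illustrative.

```python
def implied_annual_growth(start_value, end_value, years):
    """Compound annual growth rate implied by two endpoints."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints in terawatt-hours from the projection cited above.
growth = implied_annual_growth(415, 945, 2030 - 2024)
multiple = 945 / 415
print(f"total increase: {multiple:.2f}x")
print(f"implied annual growth: {growth:.1%}")
```

On these figures the total rise is a little more than a doubling, and the implied compound rate is a little under 15 percent a year.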
Three structural facts shape what operators will actually do:
So "improving computing depth and time to serve" will mean a combination of denser liquid-cooled hardware, model efficiency techniques (quantization, distillation, smaller task-specific models), and co-location with dedicated generation. Notably, none of these address data quality. They make a possibly-wrong answer cheaper and faster to produce, which, per Section 1, can deepen the inversion rather than resolve it.
The money is real and enormous. Aggregate AI capital expenditure by the major hyperscalers was projected to exceed $405 billion in 2025 alone, with those firms reportedly directing close to 70 percent of operating cash flow toward AI-related investment. That spending shows up in the macro data: AI-related capex contributed an estimated 1.1 percentage points to U.S. GDP growth in the first half of 2025, and Goldman Sachs has forecast that AI could lift potential U.S. GDP growth toward roughly 2.4 percent by 2027.
This is why serious analysts invoke the railroad and fiber-optic analogies. In the 1990s telecom buildout, many investors lost money, yet the fiber got laid and powered two decades of growth. The same may hold for today's GPU clusters: even if a large share of current AI spending earns no direct return, the infrastructure persists. The bear case is the dot-com parallel, that synchronized, debt-and-cash-flow-funded capex on rapidly depreciating hardware (GPUs age far faster than rail) is a classic late-cycle mania, and that the 95-percent pilot-failure rate is the early warning.
The trustworthiness fix and the privacy cost pull in the same direction, which is the trap. Reducing the "confident and wrong" problem generally requires more, fresher, more granular, and more representative data, including the rare cases at the tails. Those tails are disproportionately personal and sensitive. The drive to improve foundational data is therefore also a drive to collect more of exactly the data people most want protected, and the more an organization concentrates such data, the larger the breach surface and the re-identification risk. There is no clean technical escape; differential privacy, federated learning, and synthetic data each trade some accuracy or some collapse-risk for privacy.
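The accuracy-for-privacy trade can be seen in the simplest differential-privacy tool, the Laplace mechanism. The sketch below is a toy illustration under stated assumptions, not a production privacy system: the count of 120, the epsilon values, and the helper names are hypothetical, and the noise is generated with the standard library rather than a dedicated privacy package.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + laplace_noise(1 / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Average absolute error of the released count over repeated releases."""
    rng = random.Random(seed)
    true_count = 120  # hypothetical number of records in a sensitive category
    errors = [abs(private_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


strong_privacy_error = mean_abs_error(epsilon=0.1)
weak_privacy_error = mean_abs_error(epsilon=1.0)
print(f"epsilon 0.1: mean error {strong_privacy_error:.1f}")
print(f"epsilon 1.0: mean error {weak_privacy_error:.1f}")
```

A smaller epsilon means stronger privacy and proportionally more noise. For a rare category with only a handful of records, that noise can swamp the signal, which is exactly where the tails live.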
Context matters in both directions. Data centers are projected to reach about 3 percent of global electricity by 2030 and under 1 percent of global CO₂, smaller than air conditioning or electric vehicles as drivers of demand growth. But the load is geographically concentrated, straining specific local grids and watersheds. S&P Global projects that by the 2050s roughly 45 percent of data-center facilities will face high water-stress exposure. The per-query numbers are tiny; the aggregate, multiplied across billions of daily interactions and a doubling of total load, is not.
Physical-health effects are mostly indirect and local: emissions and heat from concentrated facilities, competition for water in stressed regions, and the public-health load that follows fossil-fueled baseload power. The mental-health evidence is more direct and more nuanced. A 2025 MIT Media Lab and OpenAI randomized controlled trial (about 1,000 participants over four weeks, paired with analysis of roughly 40 million interactions) found that higher daily chatbot use correlated with greater loneliness, greater emotional dependence, more problematic use, and less real-world socialization, across every interaction mode tested. Other work documents adolescents developing measurable AI dependencies and rare but serious cases of chatbots reinforcing delusional thinking. The same studies find genuine benefits at low-to-moderate use, so the finding is dose-dependent rather than uniformly negative, which is itself a data-quality lesson: the headline average hides the variance that matters.
The labor picture is genuinely two-sided. The World Economic Forum's Future of Jobs 2025 projects 92 million jobs displaced and 170 million created by 2030, a net gain of about 78 million. But, as the WEF itself stresses, the jobs destroyed and the jobs created are not the same jobs; they demand different skills, pay differently, and appear in different places. That mismatch is where inequality enters.
The structural concern is that AI is capital-biased in a specific way: the returns flow to whoever owns the models, the compute, and the proprietary data, while the costs (displacement, wage pressure on automatable tasks) fall on labor. Goldman Sachs estimates that expanding current AI applications could put about 2.5 percent of U.S. jobs at near-term displacement risk, a modest figure, but the IMF and others warn that without deliberate policy the benefits concentrate in advanced economies and among capital holders. AI could affect close to 60 percent of jobs in advanced economies versus roughly 26 percent in low-income ones, meaning the technology's reach itself is unequally distributed. There is a countervailing thread worth mentioning: some research finds less-experienced workers gain more from AI assistance than experts, which could compress inequality within firms even as the gap between capital and labor widens.
So the careful answer is layered. AI's labor effects could deepen the kind of regional and class divides that historically track with political polarization, but that link is inferred, not yet measured. AI's effect on the information commons, flooding it with cheap, confident, unverifiable content, is a more direct and more concerning polarization mechanism, and it is the same root failure this paper began with: foundational data you can no longer trust.
From contaminated training sets to a polarized information commons, the failures in this paper share one root: AI decouples the cost of producing an answer from the cost of producing a trustworthy one. Bad foundational data widens that decoupling, and confidence, human and machine, hides it until the bill arrives.
The defenses are unglamorous and they all run against the grain of "ship faster": measure performance with confidence intervals and across slices, not headline averages; treat data coverage and provenance as first-class, not data volume; keep a human accountable for outputs in a way that resists automation bias; and audit the foundation before, not after, the decisions compound. The failure mode that tells you this discipline has lapsed is specific and recognizable: solutions that arrive faster and more confidently than your ability to verify them. That is not a sign the system is working. It is the early symptom of the inversion.
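The first of those defenses can be sketched in a few lines. This is a minimal illustration of reporting accuracy per slice with a percentile bootstrap interval, assuming evaluation results are stored as 1 (correct) or 0 (wrong); the slice names, counts, and function name are hypothetical.

```python
import random


def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy from 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical evaluation results: 1 = correct, 0 = wrong.
outcomes_by_slice = {
    "common cases": [1] * 190 + [0] * 10,
    "rare cases": [1] * 12 + [0] * 8,
}
intervals = {}
for name, outcomes in outcomes_by_slice.items():
    accuracy = sum(outcomes) / len(outcomes)
    intervals[name] = bootstrap_accuracy_ci(outcomes)
    low, high = intervals[name]
    print(f"{name}: accuracy {accuracy:.2f}, 95% interval {low:.2f} to {high:.2f}")
```

Pooled together, these hypothetical results give a headline accuracy of about 92 percent. Split by slice, the rare cases show both a much lower accuracy and a much wider interval, which is the information a headline average hides.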