PayPal’s Historic BigQuery Migration: The Unsexy Data Move Powering Its Next GenAI Wave


Some technology stories begin with a shiny demo and end with a funding round. This one begins with the kind of sentence that makes executives reach for coffee and engineers reach for the exit: “We need to migrate hundreds of petabytes of production analytics data.”

On February 26, 2026, PayPal published a detailed account of what it calls a historically large data migration—one that consolidated sprawling, acquisition-fueled analytics infrastructure and culminated in moving its analytics foundation to Google BigQuery. The post, “PayPal’s historically large data migration is the foundation for its gen AI innovation”, is authored by Mani Iyer (SVP & Global Head of Data, AI & ML Technology, PayPal) and Vaishali Walia (Sr Director Data Analytics, PayPal). It’s a rare window into how a financial-services giant tries to modernize the part of the stack that nobody applauds—until it fails.

And here’s the key point: PayPal isn’t pitching “genAI” as fairy dust sprinkled over transaction data. The thesis is more grounded (and more useful): if you want generative AI to be something other than an expensive hallucination machine, you need centralized, governed, fresh, high-quality data—and you need to deliver it quickly enough that models and features can keep pace with the business.

Let’s break down what PayPal claims it did, why BigQuery is central to the story, what “300+ petabytes” really implies, and what other organizations—especially those with legacy data warehouses, Hadoop hangovers, and a growing AI mandate—can learn from this migration.

The problem: 25 years of success, and a dozen data silos to show for it

PayPal’s core challenge reads like a case study in “congratulations, you scaled.” Over decades of growth—plus acquisitions including Venmo and Braintree—PayPal ended up with roughly 400 petabytes of data spread across a dozen siloed systems. The company processes billions of transactions and holds long-running behavioral and commerce signals, but fragmentation made it harder and slower to use that data across products and teams.

This kind of data sprawl is particularly painful in fintech for three reasons:

  • Risk is real-time. Fraud patterns shift quickly; stale features can mean missed detections or false positives.
  • Regulation is non-negotiable. Governance, lineage, access controls, and auditability aren’t “nice-to-haves.”
  • Customer experience is increasingly personalized. Whether you’re a consumer trying to understand spending or a merchant trying to forecast cashflow, “one view” matters.

PayPal offers a concrete example: a small business owner might use PayPal for online sales and Venmo for local transactions, but providing a unified view required complex processes that were costly and slow. In plain language: the data existed, but the organization couldn’t easily see itself.

Why genAI raises the stakes for data foundations

Even before generative AI, fragmented data slowed analytics and machine learning. GenAI makes the consequences more obvious—and more expensive.

Traditional BI can tolerate some latency. A weekly dashboard arriving on Tuesday is annoying, but survivable. A genAI system that’s supposed to:

  • draft customer-support responses,
  • explain transaction anomalies,
  • recommend next-best actions for merchants,
  • or assist risk teams in investigations

…needs current context, consistent definitions, and governed access. Otherwise, it becomes the world’s most confident generator of incorrect answers.

Google Cloud, meanwhile, has been pushing a “data-to-AI” narrative where analytics data platforms and model platforms converge—particularly through BigQuery’s integration with Vertex AI, including remote models accessible via SQL and model governance features.

PayPal’s migration story lands squarely in this industry trend: the data platform is no longer just for analysts; it’s for productized intelligence.

The scope: “one of the largest data migrations in history”

PayPal doesn’t just describe this as a big migration—it describes it as a historically large one. The company says it consolidated platforms including:

  • a Teradata system it describes as “what’s believed to be the world’s largest Teradata deployment,”
  • Hadoop clusters,
  • Amazon Redshift,
  • Snowflake,
  • and various other systems processing petabytes of transaction data.

And then the headline numbers: with help from Google Cloud Consulting and partners, PayPal says it migrated more than 300 petabytes of data, decommissioned around 25% of workloads, and did so with zero downtime and no customer impact.

If you’ve ever migrated a few terabytes between warehouses and spent half your life rewriting SQL, those numbers may sound… aspirational. But scale changes the playbook. At hundreds of petabytes, migration isn’t a weekend cutover; it’s a multi-year program with governance, FinOps, and automation as first-class requirements.

What “300+ petabytes” implies (in human terms)

A petabyte is a million gigabytes. At this size, the hard parts aren’t merely copy speed and storage cost. The hard parts are:

  • Data contracts (which pipelines produce what, with what semantics)
  • Lineage and dependency graphs (what breaks if table X changes)
  • Security and access control (who can see what, under what policy)
  • Operational parity (ensuring the new platform produces the same business results)
  • Organizational change (new tooling, new workflows, new ownership models)

PayPal explicitly calls out discovery, analysis, and lineage as essential—because you can’t migrate what you can’t accurately inventory.
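To make the lineage-to-dependency-graph step concrete, here is a minimal sketch (hypothetical table names, not PayPal’s actual tooling) of how lineage edges can be turned into migration “waves”: a table moves only after everything upstream of it has already moved, and tables in the same wave can move in parallel.

```python
from collections import defaultdict

def migration_waves(edges):
    """Group tables into migration waves from lineage edges.

    edges: list of (upstream, downstream) pairs -- downstream
    depends on upstream, so upstream must migrate first.
    """
    deps = defaultdict(set)       # table -> upstream tables it needs
    consumers = defaultdict(set)  # table -> downstream tables that need it
    tables = set()
    for up, down in edges:
        deps[down].add(up)
        consumers[up].add(down)
        tables.update((up, down))

    ready = {t for t in tables if not deps[t]}  # no upstream deps: wave 1
    waves, migrated = [], set()
    while ready:
        wave = sorted(ready)
        migrated.update(wave)
        waves.append(wave)
        # Unblock consumers whose upstreams have all migrated.
        ready = {c for t in wave for c in consumers[t]
                 if c not in migrated and deps[c] <= migrated}
    if len(migrated) != len(tables):
        raise ValueError("cycle in lineage graph -- needs manual untangling")
    return waves

# Hypothetical lineage: raw feeds -> cleaned tables -> marts
edges = [
    ("raw_txn", "clean_txn"),
    ("raw_user", "clean_user"),
    ("clean_txn", "merchant_mart"),
    ("clean_user", "merchant_mart"),
]
print(migration_waves(edges))
# [['raw_txn', 'raw_user'], ['clean_txn', 'clean_user'], ['merchant_mart']]
```

The same traversal answers the “what breaks if table X changes” question in reverse: walk the consumers map instead of the deps map.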

Why BigQuery: architecture, scale, and “SQL as the lingua franca”

PayPal lists multiple reasons for choosing BigQuery, and they map to three common enterprise drivers:

  • Operational simplicity: BigQuery is fully managed and cloud-native.
  • Elasticity: disaggregated storage and compute can scale independently.
  • Developer adoption: a familiar SQL interface reduces friction.

The “separation of storage and compute” point is not just marketing. BigQuery has long emphasized this architectural principle: storage is decoupled from compute, enabling stateless compute scaling and minimizing resource contention between ingestion and analytics workloads. Google has described this design and its benefits in detail.

In practice, that separation matters when you’re running massive, multi-tenant analytics at unpredictable demand. Traditional warehouses often force trade-offs between loading data, running queries, and maintaining concurrency. A managed platform that can allocate resources dynamically changes the operational burden from “capacity planning” to “cost management,” which is a different kind of headache—but usually a more tractable one.

The AI angle: BigQuery as a “data-to-AI” hub

PayPal also highlights BigQuery’s “native integrations with AI.” That phrase can mean a lot of things, but Google’s own documentation and announcements provide more specificity:

  • BigQuery ML allows data teams to build and run ML workflows using SQL.
  • BigQuery can create remote models that invoke Vertex AI models (including Gemini) while keeping the workflow in BigQuery SQL.
  • Google has been expanding generative AI integrations so teams can apply LLMs to structured and unstructured enterprise data more directly within analytics workflows.

These capabilities are described in Google Cloud’s product documentation and related blog posts about Vertex AI integration and BigQuery ML.
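As an illustration of what “keeping the workflow in SQL” looks like, here is a sketch of the remote-model pattern from Google’s BigQuery ML documentation. All dataset, connection, and endpoint names below are hypothetical placeholders; verify the syntax and available endpoints against the current docs before copying.

```python
# Sketch of the BigQuery ML "remote model" pattern. Names are
# hypothetical; the statement shapes follow Google's documentation.

create_model_sql = """
CREATE OR REPLACE MODEL `analytics.gemini_text`
  REMOTE WITH CONNECTION `my-project.us.vertex-conn`
  OPTIONS (ENDPOINT = 'gemini-1.5-flash');  -- endpoint: check current docs
"""

# Invoking the remote model is plain SQL -- the workflow never leaves
# BigQuery, so existing access controls keep applying to the data.
generate_sql = """
SELECT ml_generate_text_llm_result
FROM ML.GENERATE_TEXT(
  MODEL `analytics.gemini_text`,
  (SELECT 'Summarize this merchant dispute: ...' AS prompt),
  STRUCT(0.2 AS temperature, TRUE AS flatten_json_output));
"""

# In a real job these strings would be submitted via a BigQuery
# client, e.g. google.cloud.bigquery.Client().query(generate_sql).
print("remote model DDL and query prepared")
```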

For fintech, the attraction is obvious: if you can keep data governance and access controls consistent while enabling new model-driven features, you reduce the “shadow AI” risk—teams exporting sensitive data into ad-hoc environments to experiment.

How they did it: alignment, discovery, strategy, execution (and a lot of dashboards)

PayPal frames its migration execution as four pillars. This reads like a transformation playbook—but in this case, it’s backed by outcomes and scale numbers that make it worth paying attention to.

1) Alignment: make it an enterprise priority

PayPal says stakeholder alignment was the first hurdle, and it elevated the migration to an enterprise-wide priority. That’s a polite way of saying: if the business doesn’t agree the migration matters, it becomes an optional project competing with quarterly roadmaps—and optional projects die quietly.

A useful lens here is that migrations of this magnitude aren’t “IT projects.” They are business continuity and innovation projects. If the company believes genAI will materially shape fraud prevention, merchant success tools, and customer experience, then the data platform becomes a revenue and risk lever.

2) Discovery and analysis: inventory everything, establish lineage

PayPal highlights detailed inventories of data, workloads, and data streams as crucial for scope, effort, and budget forecasting—and notes that establishing lineage helped build dependency graphs.

This is the “measure twice, cut once” phase—except the wood is a living organism with 12 nervous systems and a union contract.

In more practical terms, discovery at scale often includes:

  • cataloging datasets, pipelines, dashboards, and ML features,
  • identifying data owners and domain experts,
  • mapping upstream/downstream dependencies,
  • and classifying data by sensitivity and regulatory constraints.

If you skip this, you might still migrate data, but you’ll fail to migrate trust—and analytics without trust becomes expensive fiction.
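The classification step above can be sketched in a few lines. The rules below are toy heuristics invented for illustration; real programs combine automated scanners, data catalogs, and human review.

```python
# Toy discovery pass: tag each inventoried dataset with a sensitivity
# tier using column-name heuristics (hypothetical rules, for illustration).
PII_HINTS = {"ssn", "email", "phone", "card_number", "iban"}

def classify(dataset):
    cols = {c.lower() for c in dataset["columns"]}
    if cols & PII_HINTS:
        return "restricted"    # regulated personal/payment data
    if dataset.get("domain") == "finance":
        return "confidential"  # business-sensitive but not PII
    return "internal"

inventory = [
    {"name": "txn_facts",  "domain": "finance", "columns": ["txn_id", "amount"]},
    {"name": "kyc_docs",   "domain": "risk",    "columns": ["user_id", "ssn"]},
    {"name": "web_clicks", "domain": "growth",  "columns": ["page", "ts"]},
]
tiers = {d["name"]: classify(d) for d in inventory}
print(tiers)
# {'txn_facts': 'confidential', 'kyc_docs': 'restricted', 'web_clicks': 'internal'}
```

Even a crude first pass like this lets a migration program forecast which datasets need the heavyweight controls before a single byte moves.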

3) Strategy: principles, governance, security, and tracking consumption

PayPal says it set guiding principles up front: when to lift-and-shift versus modernize each workload, security baselines, governance guardrails, and how consumption would be tracked.

The “tracking consumption” piece is a subtle FinOps tell: in pay-as-you-go analytics platforms, migration success can be undermined by a surprise bill driven by inefficient queries, duplicate pipelines, or uncontrolled experimentation.

One of the underappreciated benefits of a consolidation is that it becomes easier to implement consistent cost controls (quotas, reservations, chargeback/showback) across teams. But that benefit only appears if you treat cost visibility as a product feature of the platform, not as an afterthought.
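A minimal showback sketch, assuming query-level bytes-scanned numbers are available from audit logs. The per-TiB rate below is a placeholder for illustration, not a quoted price; reservations and regional pricing change the math entirely.

```python
from collections import defaultdict

# Roll query-level bytes scanned up to teams and estimate on-demand
# cost. RATE_PER_TIB is an illustrative placeholder, NOT a real rate.
RATE_PER_TIB = 6.25

query_log = [  # (team, bytes_scanned) -- hypothetical audit-log extract
    ("risk",      42 * 1024**4),   # 42 TiB
    ("risk",       8 * 1024**4),
    ("merchant",  15 * 1024**4),
]

cost = defaultdict(float)
for team, nbytes in query_log:
    cost[team] += nbytes / 1024**4 * RATE_PER_TIB

for team, usd in sorted(cost.items()):
    print(f"{team}: ${usd:,.2f}")
# merchant: $93.75
# risk: $312.50
```

The point of even a toy version: once every query lands on one platform, this roll-up is a single aggregation instead of four incompatible billing exports.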

4) Execution: automate everything possible, monitor constantly

PayPal says it automated every possible task, built live dashboards to monitor migrations, and integrated FinOps throughout with visibility into consumption and performance.

This is where many migrations either become heroic (and brittle) or industrial (and repeatable). The “industrial” approach is the only one that works when the migration is not a single event but thousands of events: table moves, pipeline rewrites, validation runs, and cutovers for dependent applications.

At scale, dashboards are not cosmetics; they’re control surfaces. If you can’t quantify progress, you can’t manage risk, and you can’t defend timelines when the inevitable surprises arrive.
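One of those thousands of repeatable events is the validation run. A common pattern is an order-independent fingerprint computed on both source and target; the sketch below is a simplified in-memory stand-in for what would actually run as SQL aggregation on each system (for example, COUNT(*) plus BIT_XOR of FARM_FINGERPRINT in BigQuery).

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint of a table: row count plus an
    XOR of per-row hashes, so source and target can be compared
    without sorting anything. Simplified in-memory stand-in for
    SQL-side aggregation run on each platform."""
    acc = 0
    for row in rows:
        h = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(h[:8], "big")
    return len(rows), acc

# Hypothetical extract: same rows, different physical order.
source = [("t1", 10.0), ("t2", 99.5), ("t3", 0.25)]
target = [("t3", 0.25), ("t1", 10.0), ("t2", 99.5)]

assert table_fingerprint(source) == table_fingerprint(target)
print("parity check passed")
```

Because the check is cheap relative to a full diff, it can run on every cutover rather than on a sampled few.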

What PayPal says it gained: faster queries, fresher training data, fewer vendors

PayPal reports three main benefits after consolidating analytics on BigQuery.

1) Faster insights: queries 2.5x to 10x faster

PayPal says queries are now 2.5x to 10x faster, including complex queries used by data scientists. The company ties this to enabling more real-time insights and personalization across recommendations, offers, and support.

Performance claims always deserve a small “it depends,” because query speed can be influenced by schema design, partitioning, clustering, caching behavior, and query rewrites—not only the platform choice. But at this scale, even modest improvements translate into real productivity gains and real cost reductions.
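Those “it depends” factors are tangible levers. Here is a hedged sketch of the two most common ones in BigQuery DDL terms (table and column names are hypothetical):

```python
# The schema-design levers mentioned above, as a BigQuery DDL sketch
# (hypothetical names). Partition pruning plus clustering is often
# where the largest speedups come from, independent of platform choice.
ddl = """
CREATE TABLE `analytics.txn_facts`
(
  txn_id STRING,
  merchant_id STRING,
  amount NUMERIC,
  event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts)  -- date-scoped queries scan only matching partitions
CLUSTER BY merchant_id       -- co-locates a merchant's rows within each partition
OPTIONS (partition_expiration_days = 730);
"""
print("PARTITION BY" in ddl and "CLUSTER BY" in ddl)
```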

2) Better AI foundations: training data 16x fresher

PayPal claims data accessible for model training is now 16x fresher. That single metric might be the most important AI-related line in the entire post, because “freshness” often dictates how well fraud models track emerging patterns and how accurately customer-facing features reflect current reality.

It also matters for genAI in less obvious ways. If you’re building retrieval-augmented generation (RAG) systems for internal support teams or merchant tools, stale documents and stale facts can turn helpful assistants into confident liars. Freshness is governance.
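“Freshness” is measurable: the lag between when an event happens and when it becomes queryable for training. A 16x improvement means that lag shrank 16-fold. A small sketch with hypothetical timestamps (PayPal does not publish its actual before/after lags):

```python
from datetime import datetime, timedelta

def freshness_lag(event_ts, available_ts):
    """Lag between when something happened and when it became
    queryable for training -- the quantity a '16x fresher' claim
    shrinks. Timestamps here are invented for illustration."""
    return available_ts - event_ts

event = datetime(2026, 2, 1, 12, 0)
before = freshness_lag(event, event + timedelta(hours=48))  # batch-era lag
after = freshness_lag(event, event + timedelta(hours=3))    # consolidated platform

print(before / after)  # 16.0 -> a 16x freshness improvement
```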

3) Operational simplification: one vendor instead of four

Finally, PayPal says it reduced data infrastructure vendors from four to one, eliminated data duplication between platforms, and streamlined operations.

Vendor consolidation isn’t inherently good—multi-vendor strategies can reduce lock-in and improve resilience. But it can also reduce the tax paid in:

  • cross-platform data movement,
  • multiple security models,
  • inconsistent semantics,
  • and duplicated skill sets.

For an organization trying to accelerate AI development, lowering that operational friction can be as impactful as adding GPUs.

So what genAI innovation does this enable?

PayPal doesn’t announce a single killer genAI product in the blog post (that’s not the point), but it does outline the kind of AI-powered experiences it wants to pursue now that its data foundation is unified.

  • Predictive fraud prevention that identifies issues before they impact customers.
  • Personalized financial insights to help merchants optimize their businesses.
  • Seamless payment experiences that adapt to preferences and patterns.
  • Smarter risk assessment to expand access to underserved communities.
  • Exploration of agentic commerce and related future possibilities.

All of those ideas share a dependency: they require consistent, cross-product, cross-geo, cross-regulation data that can be queried and governed at scale. That’s why the migration is described as the foundation.

And from Google’s side, the strategy is clear: keep enterprise data in BigQuery, then make genAI and ML accessible directly where the data lives—through BigQuery ML and Vertex AI integrations—so teams don’t have to build a parallel AI stack just to get started.

A realistic case study: genAI for merchant support (without leaking secrets)

Let’s make this concrete with a plausible example based on what PayPal describes.

Imagine a merchant contacts support: “Why did my settlement amount change this week?” A genAI assistant could help support agents by:

  • retrieving relevant account policy changes, fee schedules, and dispute outcomes,
  • summarizing recent transaction patterns and anomalies,
  • and drafting an explanation the agent can review and approve.

To do this safely, the system needs:

  • role-based access controls (the model should only see what the agent is allowed to see),
  • fresh data (yesterday’s disputes and today’s policy are not the same),
  • traceability (why did it answer that way),
  • auditing (what was accessed, and by whom).

Centralizing data doesn’t automatically solve those problems, but it makes them solvable with fewer moving parts.
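The access-control requirement has a simple structural expression: filter the retrieval corpus by the agent’s role before anything reaches the prompt. A toy sketch with hypothetical roles and documents:

```python
# Sketch of the "model only sees what the agent may see" rule:
# the ACL filter runs server-side, BEFORE prompt assembly.
# Roles, documents, and text are invented for illustration.
DOCS = [
    {"id": "fee-2026-01", "acl": {"support", "risk"}, "text": "Fee schedule update..."},
    {"id": "dispute-881", "acl": {"risk"}, "text": "Open dispute details..."},
    {"id": "policy-pub", "acl": {"support", "risk", "sales"}, "text": "Public policy..."},
]

def retrieve(query, role):
    # Real systems add semantic ranking on top; the non-negotiable
    # part is that documents outside the role's ACL never surface.
    return [d for d in DOCS if role in d["acl"]]

def build_prompt(query, role):
    context = "\n".join(d["text"] for d in retrieve(query, role))
    return f"Context:\n{context}\n\nAgent question: {query}\nDraft a reply for review."

visible = [d["id"] for d in retrieve("settlement change", "support")]
print(visible)  # ['fee-2026-01', 'policy-pub'] -- the risk-only doc stays hidden
```

The auditing requirement falls out of the same structure: log each call to retrieve along with the role and the document ids it returned.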

Industry context: why fintech migrations are happening now

PayPal is far from alone. Across financial services, a few macro forces are colliding:

  • Legacy platforms are hitting scale and cost ceilings. Huge on-prem data warehouses and Hadoop estates are expensive to maintain and hard to modernize.
  • Cloud warehouses have matured. BigQuery, Snowflake, Redshift, Databricks, and others now compete on governance, performance, and AI integration—not just storage.
  • AI is moving from “lab” to “product.” That shift increases expectations for reliability, monitoring, and compliance.
  • Regulators and customers both want transparency. Risk decisions and automated interactions need stronger controls and explanations.

Google Cloud’s own positioning of BigQuery as an “AI-ready data platform” with built-in ML, governance capabilities, and multi-format support reflects the broader industry direction: analytics platforms are being asked to serve as both data warehouse and AI substrate.

Lessons worth stealing (without migrating 300 petabytes)

Most organizations will never migrate anything close to PayPal’s scale. But the underlying lessons travel well.

Lesson 1: genAI projects fail at the data layer, not the prompt layer

PayPal’s framing is blunt: fragmented data would “severely limit” its ability to create intelligent experiences customers expect. If your genAI initiative is struggling, the answer is often not a better prompt—it’s better data quality, better access patterns, and better governance.

Lesson 2: inventory and lineage are migration accelerators

Many teams treat lineage as documentation. PayPal treats it as a way to understand dependency graphs and scope. That’s the difference between a migration that finishes and a migration that becomes a permanent lifestyle.

Lesson 3: build FinOps into the migration, not after

“We’ll optimize costs later” is how you end up paying cloud prices with on-prem habits. PayPal’s emphasis on dashboards, consumption tracking, and FinOps integration is a signal that cloud data platforms require a new operating model.

Lesson 4: vendor count matters less than duplication and friction

Reducing vendors from four to one is noteworthy, but the bigger win is eliminating duplication and complexity. A multi-vendor stack can work if governance and semantics are unified. A single-vendor stack can still fail if teams recreate silos inside it. Consolidation is a means, not an end.

The flip side: risks and trade-offs PayPal doesn’t emphasize

No serious migration is pure upside. Even if PayPal executed cleanly, there are structural trade-offs worth keeping in mind if you’re using this story as inspiration.

Lock-in vs. acceleration

Consolidating analytics into one platform can speed innovation. It can also increase dependence on that platform’s pricing, roadmap, and availability. Some organizations mitigate this with open formats and portability strategies; others accept lock-in as the price of speed.

Google Cloud points out that BigQuery supports open table formats (like Iceberg, Delta, Hudi) and promotes a unified governance layer; those features can help reduce certain kinds of lock-in, but architectural decisions still matter.

Centralization can become a bottleneck

A “single source of truth” is wonderful until it turns into “single queue for approvals.” Governance has to scale culturally as well as technically. Otherwise, teams route around the platform and recreate silos with spreadsheets and shadow pipelines.

Performance gains require query and data model discipline

BigQuery can be extremely fast, but it also makes it easy to run extremely expensive queries. The organizations that succeed typically pair platform adoption with education, standardized modeling patterns, and guardrails.

What to watch next: from warehouse modernization to “agentic commerce”

PayPal hints at “agentic commerce”—the idea that AI agents could proactively execute commerce tasks, not just recommend them. That’s where modern data platforms become even more critical, because autonomous or semi-autonomous systems require tighter constraints, better observability, and stronger policy enforcement than traditional analytics.

If PayPal’s data is now centralized and fresher for training, two next steps are likely (even if PayPal doesn’t spell them out):

  • More real-time or near-real-time feature pipelines to feed risk and personalization systems.
  • More retrieval-based genAI systems grounded in governed internal knowledge, to reduce hallucinations and improve compliance posture.

Google’s ongoing push to bring generative AI capabilities into BigQuery workflows—such as invoking Gemini through BigQuery ML and integrating with Vertex AI—creates a plausible path for enterprises to build RAG and analytics-driven genAI without moving data into separate systems.

Bottom line

PayPal’s story is a reminder that “AI transformation” is often a polite rebranding of “data platform modernization,” and that modernization is mostly about:

  • reducing fragmentation,
  • improving governance,
  • accelerating access,
  • and making data fresh enough to be operationally useful.

The genAI era didn’t invent the need for good data foundations—it just removed the last excuse for not building them. And if PayPal can migrate 300+ petabytes with zero downtime, the rest of us can probably manage to decommission at least one unloved Hadoop cluster without drama. Probably.

Sources

PayPal / Google Cloud blog: “PayPal’s historically large data migration is the foundation for its gen AI innovation,” by Mani Iyer and Vaishali Walia.

Bas Dorland, Technology Journalist & Founder of dorland.org