Building a Crash-Proof Migration Pipeline for 35 TB


TL;DR
- Migrated 35 TB (5.6M files) from Alibaba Cloud OSS to GCP and AWS with zero downtime
- A zero-cost metadata scan found 60.3% duplication, saving 8.85 TB and ~$3,000 in year-one costs
- UUID-based path transformation makes CDN URLs mathematically unguessable (10^22 years to brute-force one file)
- 3 VMs, 150 concurrent workers, streaming multipart uploads completed the transfer in 1.5 days
- Built 15+ modular Python tools: each idempotent, resumable, non-destructive, communicating via CSV
- Total compute cost for the entire migration: $18

Our client's DAM (Digital Asset Management) system had accumulated 35 TB of media files — product images, compliance documents, brand materials, design files — across 5.6 million objects in an Alibaba Cloud OSS bucket in Singapore. All of it served globally through CDN.
The mission: evaluate migration to both GCP Cloud Storage and AWS S3, restructure folder paths for CDN security, and eliminate terabytes of duplicate data. One engineer. Three VMs. One and a half days of actual transfer.
The Problem: Three Issues Hiding in 35 TB
Predictable CDN URLs — The CDN served files using raw bucket paths like cdn.example.com/asset/Create/BrandName/ProductCode/photo.jpg. Anyone who could guess the brand name or product code could enumerate and scrape assets. The folder hierarchy was a map of our internal structure, served publicly.
Massive duplication — The DAM system stored a full copy of every file on every edit. Version history, recycle bin, cache copies, and thumbnails all lived as independent objects. Storage was growing faster than content creation, but the true scale was unknown.
No audit trail — If we restructured file paths (which we needed to for the security fix), there was no mapping between old and new paths. We would need to build this from scratch for every single file.
Phase 1: Scanning 5.6 Million Files for Duplicates
Downloading 35 TB to compute checksums would be impractical. Instead, we built a zero-egress duplicate detection system using only OSS metadata.
Size + ETag Fingerprinting
OSS computes an ETag for every object at upload time — an MD5 hash for simple uploads, a derived value for multipart ones. Either way, we used (file_size_bytes, etag) as a content fingerprint: a matching pair means identical content for all practical purposes. No file downloads needed; ETags come back in the standard listing API.
We excluded zero-byte files (all share the same ETag) to avoid false positives.
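The fingerprinting core is a few lines of pure Python, sketched here independent of any SDK (with oss2, the (path, size, etag) tuples would come from iterating the listing API; the tuple shape is our illustration, not the SDK's):

```python
from collections import defaultdict

def group_duplicates(objects):
    """Group objects by their (size, etag) content fingerprint.

    `objects` is any iterable of (path, size_bytes, etag) tuples,
    e.g. built from a bucket listing. Returns only groups with
    2+ members, i.e. actual duplicates.
    """
    groups = defaultdict(list)
    for path, size, etag in objects:
        if size == 0:
            continue  # every zero-byte file shares one ETag: skip to avoid false positives
        # Normalize the ETag: listing responses vary in quoting and casing
        fingerprint = (size, etag.strip('"').upper())
        groups[fingerprint].append(path)
    return {fp: paths for fp, paths in groups.items() if len(paths) >= 2}
```

Because the fingerprint is metadata-only, this runs over millions of listing rows without touching a single object body.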
Parallel Scanner Architecture
The bucket had 9 root-level folders. We scanned all simultaneously — one thread per folder, each with its own isolated OSS connection (shared connections cause race conditions at scale).
OSS LIST API ──> 9 parallel threads (one per folder)
        │
        ▼
In-memory hashmap: key=(size, etag), val=[paths]
        │
        ▼
Filter groups with 2+ members
Sort by last_modified → oldest = original
        │
        ▼
Output: 3.4M-row CSV
Robustness features:
- Checkpointing every 10,000 objects — our first attempt crashed 46 minutes in; the restart resumed in seconds
- Auto-retry with 10 attempts and 5-second waits for network errors
- ETag normalization — OSS returns inconsistent quoting and casing
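The auto-retry wrapper is simple; a minimal sketch of the pattern (the exception types and helper name are illustrative, not the production code):

```python
import time

def with_retries(fn, attempts=10, wait_seconds=5):
    """Run fn(), retrying transient network errors up to `attempts` times
    with a fixed wait between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)

# usage: page = with_retries(lambda: list_page(marker))
```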
Result: 89 minutes. Zero egress. Complete picture of 5.6 million files.
What We Found
| Metric | Value |
|---|---|
| Total files scanned | 5,655,523 |
| Duplicate groups | 936,886 |
| Files safe to delete | 2,476,798 |
| Reclaimable storage | 8.85 TB |
| Duplication rate | 60.3% |
The version history folder alone held 3.3 million objects. We found groups of 11+ identical copies of the same PDF across different product folders. The recycle bin held thousands of "deleted" files never purged.
Multi-Factor Originality Detection
Simply picking the oldest file as "original" was not reliable — the DAM sometimes restored files from backups, making version-history copies older than canonical ones. We built a priority system:
Priority 1: Folder rank (primary assets > version history > recycle bin)
Priority 2: Filename contains "original"
Priority 3: Shorter path depth (closer to root)
Priority 4: Oldest timestamp (tiebreaker only)
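The four priorities map naturally onto a Python sort key — lower tuple wins. A minimal sketch (folder names and ranks are illustrative):

```python
FOLDER_RANK = {"asset": 0, "version_history": 1, "recycle_bin": 2}  # illustrative

def originality_key(path, last_modified):
    """Lower tuple = more likely the canonical original."""
    root = path.split("/", 1)[0]
    return (
        FOLDER_RANK.get(root, 1),                # 1. folder rank
        0 if "original" in path.lower() else 1,  # 2. filename contains "original"
        path.count("/"),                         # 3. shorter path depth
        last_modified,                           # 4. oldest timestamp (tiebreaker only)
    )

group = [
    ("version_history/doc_v3.pdf", "2019-01-01"),
    ("asset/Brand/doc.pdf", "2021-06-01"),   # newer, but in the primary folder
    ("recycle_bin/doc.pdf", "2018-05-01"),
]
original = min(group, key=lambda m: originality_key(*m))
```

Note how the primary-folder copy wins even though it is not the oldest — exactly the backup-restore case that broke the timestamp-only heuristic.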
Phase 2: CDN Safety — Do Not Delete What Is In Use
Many duplicates were actively served through CDN. Deleting them would break production pages. We obtained CDN URL exports from three sources: brand websites, product galleries, and service portals.
The matching worked at the group level — if ANY member of a duplicate group matched a CDN URL, the ENTIRE group was marked as protected (same ETag = same content). We ran two passes: exact match (filename + file size via HTTP HEAD) and filename-only as fallback.
Every file got classified:
- KEEP — original file
- DO_NOT_DELETE — duplicate, but CDN-matched
- SAFE_TO_DELETE — duplicate, not in use
The SAFE_TO_DELETE list became a skip-file for migration. We never deleted anything from the source bucket — just excluded these files during transfer. Built-in rollback safety.
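Group-level classification can be sketched in a few lines (column and variable names are ours, not the production tool's):

```python
def classify_groups(groups, cdn_filenames):
    """Assign KEEP / DO_NOT_DELETE / SAFE_TO_DELETE per file.

    groups: {fingerprint: [paths]}, original first in each list.
    cdn_filenames: filenames extracted from the CDN URL exports.
    One CDN match protects the whole group: same ETag means same bytes,
    so any member could be the copy actually being served.
    """
    labels = {}
    for paths in groups.values():
        protected = any(p.rsplit("/", 1)[-1] in cdn_filenames for p in paths)
        labels[paths[0]] = "KEEP"
        for dup in paths[1:]:
            labels[dup] = "DO_NOT_DELETE" if protected else "SAFE_TO_DELETE"
    return labels
```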
Phase 3: UID-Based Path Transformation
This was the core architectural decision. It solved the URL-guessing problem permanently.
Before (Predictable)
cdn.example.com/asset/Create/BrandName/ProductCode/product_shot.jpg
After (Unguessable)
cdn.example.com/asset/crea/a1f9c82d3b4e5067890abcdef1234567/original/product_shot.jpg
The Transformation Rule
Source: asset/{Subfolder}/{Brand}/{ProductCode}/.../filename.ext
Destination: asset/{shortened-subfolder}/{32-char-UUID}/original/filename.ext
Each file gets a UUID v4 — a 128-bit random identifier. With 2^128 possible values, an attacker making 1 billion guesses per second would need on the order of 10^22 years to hit one specific file's UUID; even stumbling on any one of the ~3 million valid files would still take around 10^15 years.
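The arithmetic behind that figure, as a quick sanity check (this treats all 2^128 bit patterns as candidates; a v4 UUID fixes six version/variant bits, which does not change the order of magnitude):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
guesses_per_second = 10**9  # a generous attacker

# Time to enumerate the full space for one specific UUID
years_per_file = 2**128 / guesses_per_second / SECONDS_PER_YEAR
# ≈ 1.1e22 years
```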
Folder shortening keeps paths navigable: Compliance Documents becomes comp-docu, Supplier Relations becomes supp-rela. Administrators can browse by category; they just cannot guess individual file paths.
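A sketch of the transformation, assuming the shortening rule is "first four letters of each word" (inferred from the comp-docu / supp-rela / crea examples above; the exact production rule may differ):

```python
import uuid

def shorten(folder):
    """'Compliance Documents' -> 'comp-docu'; 'Create' -> 'crea'."""
    return "-".join(word[:4].lower() for word in folder.split())

def transform(source_path):
    """asset/{Subfolder}/{Brand}/.../file.ext ->
       asset/{short-subfolder}/{32-char-uuid}/original/file.ext"""
    parts = source_path.split("/")
    subfolder, filename = parts[1], parts[-1]
    uid = uuid.uuid4().hex  # 32 hex chars, fixed for the file's lifetime
    return f"asset/{shorten(subfolder)}/{uid}/original/{filename}"
```

The UUID is generated once at migration time and recorded in the mapping CSV, so the same file always resolves to the same path afterwards.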
The /original/ segment reserves room for future derivatives under the same UUID:
asset/crea/{uuid}/original/photo.jpg ← full-resolution
asset/crea/{uuid}/thumbnail/photo.jpg ← auto-generated
asset/crea/{uuid}/web/photo.webp ← WebP conversion
Every migration run produced a complete CSV mapping (source path, destination path, UUID, status, size) for CDN URL rewrites, application updates, and rollback capability.
Phase 4: Evaluating Both Cloud Destinations
We could not commit to a single destination upfront. We built migration tooling for both GCS and S3, tested both independently, and let real data inform the decision.
GCS Migration — Two Iterations
Iteration 1: Exact path preservation for validation — diff source and destination by path, confirm correctness.
Iteration 2: UID path transformation with skip-CSV support, streaming 8 MB chunks, mapping CSV output.
S3 Migration — Two Iterations
Iteration 1: Basic single-PUT upload. Simple but limited — loads files into memory, fails above 5 GB.
Iteration 2: Production-grade streaming multipart. Key improvements:
- OSS response stream piped directly into S3 upload — no memory buffer
- Files above 64 MB auto-split into parallel 64 MB parts
- Workers start uploading while listing is still in progress
- CSV rows written as uploads complete, not accumulated
- Skip-check mode eliminates hundreds of thousands of HEAD requests on fresh runs
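The part-splitting core reduces to a small generator over a file-like stream — oss2's get_object returns one, and on the S3 side boto3's upload_fileobj with a TransferConfig(multipart_chunksize=64 * 1024 * 1024) performs the equivalent internally. A minimal sketch:

```python
def iter_parts(stream, part_size=64 * 1024 * 1024):
    """Yield (part_number, bytes) chunks from a file-like stream.

    Each worker holds at most one 64 MB part in memory at a time;
    the whole file is never buffered.
    """
    part_number = 1
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        yield part_number, chunk
        part_number += 1
```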
The path transform differed slightly: GCS kept the asset/ prefix with maximum URL opacity; S3 stripped it and kept brand-level folders for easier browsing.
Both migrations ran independently on the same VMs — fair comparison, real data, no guesswork.
Phase 5: Migration Architecture
Thread-Local Clients
The most insidious bug: shared SDK clients worked fine at 5-10 workers but caused intermittent ConnectionResetError at 50. SDK clients maintain internal connection pools that are not thread-safe at high concurrency. Fix: each thread creates and caches its own client.
_thread_local = threading.local()

def _get_source_client():
    if not hasattr(_thread_local, "source_client"):
        _thread_local.source_client = create_source_client(...)
    return _thread_local.source_client
After this change: zero connection errors across millions of uploads.
Idempotent and Resumable
Every tool was designed to be killed and restarted freely:
- Existence checks before uploading — safe to re-run
- Mapping CSV as checkpoint — completed uploads written immediately, skipped on restart
- No source modifications — the source bucket stayed read-only throughout
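The mapping-CSV checkpoint pattern looks roughly like this (column names are illustrative; the real mapping CSV also records UUID and size):

```python
import csv
import os

FIELDS = ["source_path", "dest_path", "status"]  # illustrative columns

def load_completed(mapping_csv):
    """Read the checkpoint; return source paths already uploaded."""
    if not os.path.exists(mapping_csv):
        return set()
    with open(mapping_csv, newline="") as f:
        return {row["source_path"] for row in csv.DictReader(f)
                if row["status"] == "uploaded"}

def record_upload(mapping_csv, source_path, dest_path):
    """Append one row the moment an upload finishes — crash-safe by design."""
    is_new = not os.path.exists(mapping_csv)
    with open(mapping_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"source_path": source_path,
                         "dest_path": dest_path,
                         "status": "uploaded"})
```

On restart, a worker loads the completed set once and skips those keys — no HEAD requests, no re-uploads.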
Gap Detection and Fill
After bulk migration, a comparison tool diffed source vs destination by (filename, size), accounting for intentionally skipped files. A gap-fill tool migrated only missing files. Run repeatedly until missing count hit zero.
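The diff itself can be sketched as a set comparison over (filename, size) pairs, minus the intentional skips:

```python
def find_gaps(source, destination, skipped):
    """Diff source vs destination by (filename, size).

    source, destination: iterables of (path, size_bytes) pairs.
    skipped: source paths intentionally excluded (the SAFE_TO_DELETE list).
    Returns the source entries still missing at the destination.
    """
    dest_keys = {(path.rsplit("/", 1)[-1], size) for path, size in destination}
    return [(path, size) for path, size in source
            if path not in skipped
            and (path.rsplit("/", 1)[-1], size) not in dest_keys]
```

Matching on filename rather than full path is what makes the diff work across the UID transformation, since destination paths no longer resemble source paths.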
PM2 Orchestration
A script generator created per-subfolder shell scripts. PM2 managed all streams with per-stream logging, automatic restart on crash, and a unified monitoring dashboard. Combined with resume-from-CSV logic, a crashed stream picked up right where it left off.
Phase 6: Execution — 3 VMs, 36 Hours
| VM | Spec | Role |
|---|---|---|
| VM 1 | GCE e2-standard-8 (8 vCPU, 32 GB) | Primary worker |
| VM 2 | GCE e2-standard-8 (8 vCPU, 32 GB) | Secondary worker |
| VM 3 | GCE (8 GB) | Gap-fill + verification |
VMs in the same region as the destination meant free GCS ingress and ~2ms latency vs ~150ms from a local machine — a 75x improvement. Each VM ran ~50 concurrent workers (~150 total).
The workflow: duplicate scan, CDN matching, originality refinement, generate batch scripts, deploy to VMs, PM2 processes, gap detection, gap fill, final verification.
Total data transfer time: ~1.5 days.
The Toolkit
We built 15+ purpose-built Python tools, each doing one thing well and communicating via CSV:
| Category | Tools | Purpose |
|---|---|---|
| Migration | 5 tools | Stream objects to GCS/S3 with various path transforms |
| Deduplication | 3 tools | Scan, analyze, and classify duplicate files |
| CDN Safety | 3 tools | Match CDN URLs, protect in-use files, map old-to-new paths |
| Verification | 4+ tools | Diff source/destination, rebuild mappings, generate reports |
Every tool was idempotent (safe to re-run), resumable (checkpoint to disk), non-destructive (source is read-only), and composable (CSV in, CSV out).
Lessons Learned
- Deduplicate before migrating — Our 89-minute scan eliminated 8.85 TB of unnecessary transfers. Do not pay to move garbage.
- Thread-local clients are non-negotiable — Shared SDK clients cause intermittent failures at high concurrency that are nearly impossible to debug. One client per thread, always.
- Checkpoint everything — Our first scan crashed 46 minutes in. With checkpointing, zero work lost. Without it, 46 minutes gone. Build resumability from the start.
- Skip-lists over deletes — We never deleted from the source bucket. Duplicates were excluded via skip-files during migration, so the source stayed intact as a rollback safety net.
- Run compute near your data — Moving from a local machine to same-region VMs cut latency from ~150 ms to ~2 ms and saved an entire day of wall-clock time.
- UUID paths solve multiple problems — Security (unguessable), collision avoidance (no filename conflicts), future-proofing (/original/ reserves room for derivatives), and decoupling (org changes do not break URLs).
- Build for both, decide with data — Evaluating GCS and S3 simultaneously was ~30% more effort but eliminated guesswork entirely.
- Modular tools over monoliths — 15 focused tools communicating via CSV. Re-run any stage independently. Add new destinations without touching analysis tools.
What Comes Next
Migrating 35 TB across clouds is not just a file copy. It is an opportunity to audit, deduplicate, secure, and restructure. Our zero-egress scan revealed 60% duplication. The UID transformation made URLs unguessable. The parallel streaming architecture — 3 VMs, 150 workers, crash-safe checkpointing — moved it all in a day and a half.
If you are facing a multi-terabyte migration: scan first, deduplicate, protect what is live, transform paths, stream do not buffer, checkpoint everything, and stay non-destructive.
The tools are built. The data is migrated. The URLs are secure. And the storage bill is $184/month lighter.





