Building a Crash-Proof Migration Pipeline for 35 TB


TL;DR
- Migrated 35 TB (5.6M files) from Alibaba Cloud OSS to GCP and AWS with zero downtime
- A zero-cost metadata scan found 60.3% duplication, saving 8.85 TB and ~$3,000 in year-one costs
- UUID-based path transformation makes CDN URLs mathematically unguessable (10^22 years to brute-force one file)
- 3 VMs, 150 concurrent workers, streaming multipart uploads completed the transfer in 1.5 days
- Built 15+ modular Python tools: each idempotent, resumable, non-destructive, communicating via CSV
- Total compute cost for the entire migration: $18

Our client's DAM (Digital Asset Management) system had accumulated 35 TB of media files — product images, compliance documents, brand materials, design files — across 5.6 million objects in an Alibaba Cloud OSS bucket in Singapore. All of it served globally through CDN.
The mission: evaluate migration to both GCP Cloud Storage and AWS S3, restructure folder paths for CDN security, and eliminate terabytes of duplicate data. One engineer. Three VMs. One and a half days of actual transfer.
The Problem: Three Issues Hiding in 35 TB
Predictable CDN URLs — The CDN served files using raw bucket paths like cdn.example.com/asset/Create/BrandName/ProductCode/photo.jpg. Anyone who could guess the brand name or product code could enumerate and scrape assets. The folder hierarchy was a map of our internal structure, served publicly.
Massive duplication — The DAM system stored a full copy of every file on every edit. Version history, recycle bin, cache copies, and thumbnails all lived as independent objects. Storage was growing faster than content creation, but the true scale was unknown.
No audit trail — If we restructured file paths (which we needed to for the security fix), there was no mapping between old and new paths. We would need to build this from scratch for every single file.
Phase 1: Scanning 5.6 Million Files for Duplicates
Downloading 35 TB to compute checksums would be impractical. Instead, we built a zero-egress duplicate detection system using only OSS metadata.
Size + ETag Fingerprinting
OSS computes an ETag for every object at upload time — an MD5 hash for simple uploads, a derived value for multipart ones. Either way, we used (file_size_bytes, etag) as a content fingerprint: a matching pair means identical content for all practical purposes. No file downloads needed; ETags come back in the standard listing API.
We excluded zero-byte files (all share the same ETag) to avoid false positives.
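The fingerprinting core is a few lines of pure Python, sketched here independent of any SDK (with oss2, the (path, size, etag) tuples would come from iterating the listing API; the tuple shape is our illustration, not the SDK's):

```python
from collections import defaultdict

def group_duplicates(objects):
    """Group objects by their (size, etag) content fingerprint.

    `objects` is any iterable of (path, size_bytes, etag) tuples,
    e.g. built from a bucket listing. Returns only groups with
    2+ members, i.e. actual duplicates.
    """
    groups = defaultdict(list)
    for path, size, etag in objects:
        if size == 0:
            continue  # every zero-byte file shares one ETag: skip to avoid false positives
        # Normalize the ETag: listing responses vary in quoting and casing
        fingerprint = (size, etag.strip('"').upper())
        groups[fingerprint].append(path)
    return {fp: paths for fp, paths in groups.items() if len(paths) >= 2}
```

Because the fingerprint is metadata-only, this runs over millions of listing rows without touching a single object body.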
Parallel Scanner Architecture
The bucket had 9 root-level folders. We scanned all simultaneously — one thread per folder, each with its own isolated OSS connection (shared connections cause race conditions at scale).
OSS LIST API ──> 9 parallel threads (one per folder)
        │
        ▼
In-memory hashmap: key=(size, etag), val=[paths]
        │
        ▼
Filter groups with 2+ members
Sort by last_modified → oldest = original
        │
        ▼
Output: 3.4M-row CSV
Robustness features:
- Checkpointing every 10,000 objects — our first attempt crashed 46 minutes in; the restart resumed in seconds
- Auto-retry with 10 attempts and 5-second waits for network errors
- ETag normalization — OSS returns inconsistent quoting and casing
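The auto-retry wrapper is simple; a minimal sketch of the pattern (the exception types and helper name are illustrative, not the production code):

```python
import time

def with_retries(fn, attempts=10, wait_seconds=5):
    """Run fn(), retrying transient network errors up to `attempts` times
    with a fixed wait between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(wait_seconds)

# usage: page = with_retries(lambda: list_page(marker))
```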
Result: 89 minutes. Zero egress. Complete picture of 5.6 million files.
What We Found
| Metric | Value |
|---|---|
| Total files scanned | 5,655,523 |
| Duplicate groups | 936,886 |
| Files safe to delete | 2,476,798 |
| Reclaimable storage | 8.85 TB |
| Duplication rate | 60.3% |
The version history folder alone held 3.3 million objects. We found groups of 11+ identical copies of the same PDF across different product folders. The recycle bin held thousands of "deleted" files never purged.
Multi-Factor Originality Detection
Simply picking the oldest file as "original" was not reliable — the DAM sometimes restored files from backups, making version-history copies older than canonical ones. We built a priority system:
Priority 1: Folder rank (primary assets > version history > recycle bin)
Priority 2: Filename contains "original"
Priority 3: Shorter path depth (closer to root)
Priority 4: Oldest timestamp (tiebreaker only)
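The four priorities map naturally onto a Python sort key — lower tuple wins. A minimal sketch (folder names and ranks are illustrative):

```python
FOLDER_RANK = {"asset": 0, "version_history": 1, "recycle_bin": 2}  # illustrative

def originality_key(path, last_modified):
    """Lower tuple = more likely the canonical original."""
    root = path.split("/", 1)[0]
    return (
        FOLDER_RANK.get(root, 1),                # 1. folder rank
        0 if "original" in path.lower() else 1,  # 2. filename contains "original"
        path.count("/"),                         # 3. shorter path depth
        last_modified,                           # 4. oldest timestamp (tiebreaker only)
    )

group = [
    ("version_history/doc_v3.pdf", "2019-01-01"),
    ("asset/Brand/doc.pdf", "2021-06-01"),   # newer, but in the primary folder
    ("recycle_bin/doc.pdf", "2018-05-01"),
]
original = min(group, key=lambda m: originality_key(*m))
```

Note how the primary-folder copy wins even though it is not the oldest — exactly the backup-restore case that broke the timestamp-only heuristic.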
Phase 2: CDN Safety — Do Not Delete What Is In Use
Many duplicates were actively served through CDN. Deleting them would break production pages. We obtained CDN URL exports from three sources: brand websites, product galleries, and service portals.
The matching worked at the group level — if ANY member of a duplicate group matched a CDN URL, the ENTIRE group was marked as protected (same ETag = same content). We ran two passes: exact match (filename + file size via HTTP HEAD) and filename-only as fallback.
Every file got classified:
- KEEP — original file
- DO_NOT_DELETE — duplicate, but CDN-matched
- SAFE_TO_DELETE — duplicate, not in use
The SAFE_TO_DELETE list became a skip-file for migration. We never deleted anything from the source bucket — just excluded these files during transfer. Built-in rollback safety.
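Group-level classification can be sketched in a few lines (column and variable names are ours, not the production tool's):

```python
def classify_groups(groups, cdn_filenames):
    """Assign KEEP / DO_NOT_DELETE / SAFE_TO_DELETE per file.

    groups: {fingerprint: [paths]}, original first in each list.
    cdn_filenames: filenames extracted from the CDN URL exports.
    One CDN match protects the whole group: same ETag means same bytes,
    so any member could be the copy actually being served.
    """
    labels = {}
    for paths in groups.values():
        protected = any(p.rsplit("/", 1)[-1] in cdn_filenames for p in paths)
        labels[paths[0]] = "KEEP"
        for dup in paths[1:]:
            labels[dup] = "DO_NOT_DELETE" if protected else "SAFE_TO_DELETE"
    return labels
```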
Phase 3: UID-Based Path Transformation
This was the core architectural decision. It solved the URL-guessing problem permanently.
Before (Predictable)
cdn.example.com/asset/Create/BrandName/ProductCode/product_shot.jpg
After (Unguessable)
cdn.example.com/asset/crea/a1f9c82d3b4e5067890abcdef1234567/original/product_shot.jpg
The Transformation Rule
Source: asset/{Subfolder}/{Brand}/{ProductCode}/.../filename.ext
Destination: asset/{shortened-subfolder}/{32-char-UUID}/original/filename.ext
Each file gets a UUID v4 — a 128-bit random identifier. With 2^128 possible values, an attacker making 1 billion guesses per second would need on the order of 10^22 years to hit one specific file's UUID; even stumbling on any one of the ~3 million valid files would still take around 10^15 years.
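The arithmetic behind that figure, as a quick sanity check (this treats all 2^128 bit patterns as candidates; a v4 UUID fixes six version/variant bits, which does not change the order of magnitude):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
guesses_per_second = 10**9  # a generous attacker

# Time to enumerate the full space for one specific UUID
years_per_file = 2**128 / guesses_per_second / SECONDS_PER_YEAR
# ≈ 1.1e22 years
```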
Folder shortening keeps paths navigable: Compliance Documents becomes comp-docu, Supplier Relations becomes supp-rela. Administrators can browse by category; they just cannot guess individual file paths.
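A sketch of the transformation, assuming the shortening rule is "first four letters of each word" (inferred from the comp-docu / supp-rela / crea examples above; the exact production rule may differ):

```python
import uuid

def shorten(folder):
    """'Compliance Documents' -> 'comp-docu'; 'Create' -> 'crea'."""
    return "-".join(word[:4].lower() for word in folder.split())

def transform(source_path):
    """asset/{Subfolder}/{Brand}/.../file.ext ->
       asset/{short-subfolder}/{32-char-uuid}/original/file.ext"""
    parts = source_path.split("/")
    subfolder, filename = parts[1], parts[-1]
    uid = uuid.uuid4().hex  # 32 hex chars, fixed for the file's lifetime
    return f"asset/{shorten(subfolder)}/{uid}/original/{filename}"
```

The UUID is generated once at migration time and recorded in the mapping CSV, so the same file always resolves to the same path afterwards.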
The /original/ segment reserves room for future derivatives under the same UUID:
asset/crea/{uuid}/original/photo.jpg ← full-resolution
asset/crea/{uuid}/thumbnail/photo.jpg ← auto-generated
asset/crea/{uuid}/web/photo.webp ← WebP conversion
Every migration run produced a complete CSV mapping (source path, destination path, UUID, status, size) for CDN URL rewrites, application updates, and rollback capability.
Phase 4: Evaluating Both Cloud Destinations
We could not commit to a single destination upfront. We built migration tooling for both GCS and S3, tested both independently, and let real data inform the decision.
GCS Migration — Two Iterations
Iteration 1: Exact path preservation for validation — diff source and destination by path, confirm correctness.
Iteration 2: UID path transformation with skip-CSV support, streaming 8 MB chunks, mapping CSV output.
S3 Migration — Two Iterations
Iteration 1: Basic single-PUT upload. Simple but limited — loads files into memory, fails above 5 GB.
Iteration 2: Production-grade streaming multipart. Key improvements:
- OSS response stream piped directly into S3 upload — no memory buffer
- Files above 64 MB auto-split into parallel 64 MB parts
- Workers start uploading while listing is still in progress
- CSV rows written as uploads complete, not accumulated
- Skip-check mode eliminates hundreds of thousands of HEAD requests on fresh runs
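The part-splitting core reduces to a small generator over a file-like stream — oss2's get_object returns one, and on the S3 side boto3's upload_fileobj with a TransferConfig(multipart_chunksize=64 * 1024 * 1024) performs the equivalent internally. A minimal sketch:

```python
def iter_parts(stream, part_size=64 * 1024 * 1024):
    """Yield (part_number, bytes) chunks from a file-like stream.

    Each worker holds at most one 64 MB part in memory at a time;
    the whole file is never buffered.
    """
    part_number = 1
    while True:
        chunk = stream.read(part_size)
        if not chunk:
            break
        yield part_number, chunk
        part_number += 1
```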
The path transform differed slightly: GCS kept the asset/ prefix with maximum URL opacity; S3 stripped it and kept brand-level folders for easier browsing.
Both migrations ran independently on the same VMs — fair comparison, real data, no guesswork.
Phase 5: Migration Architecture
Thread-Local Clients
The most insidious bug: shared SDK clients worked fine at 5-10 workers but caused intermittent ConnectionResetError at 50. SDK clients maintain internal connection pools that are not thread-safe at high concurrency. Fix: each thread creates and caches its own client.
_thread_local = threading.local()

def _get_source_client():
    if not hasattr(_thread_local, "source_client"):
        _thread_local.source_client = create_source_client(...)
    return _thread_local.source_client
After this change: zero connection errors across millions of uploads.
Idempotent and Resumable
Every tool was designed to be killed and restarted freely:
- Existence checks before uploading — safe to re-run
- Mapping CSV as checkpoint — completed uploads written immediately, skipped on restart
- No source modifications — the source bucket stayed read-only throughout
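The mapping-CSV checkpoint pattern looks roughly like this (column names are illustrative; the real mapping CSV also records UUID and size):

```python
import csv
import os

FIELDS = ["source_path", "dest_path", "status"]  # illustrative columns

def load_completed(mapping_csv):
    """Read the checkpoint; return source paths already uploaded."""
    if not os.path.exists(mapping_csv):
        return set()
    with open(mapping_csv, newline="") as f:
        return {row["source_path"] for row in csv.DictReader(f)
                if row["status"] == "uploaded"}

def record_upload(mapping_csv, source_path, dest_path):
    """Append one row the moment an upload finishes — crash-safe by design."""
    is_new = not os.path.exists(mapping_csv)
    with open(mapping_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"source_path": source_path,
                         "dest_path": dest_path,
                         "status": "uploaded"})
```

On restart, a worker loads the completed set once and skips those keys — no HEAD requests, no re-uploads.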
Gap Detection and Fill
After bulk migration, a comparison tool diffed source vs destination by (filename, size), accounting for intentionally skipped files. A gap-fill tool migrated only missing files. Run repeatedly until missing count hit zero.
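The diff itself can be sketched as a set comparison over (filename, size) pairs, minus the intentional skips:

```python
def find_gaps(source, destination, skipped):
    """Diff source vs destination by (filename, size).

    source, destination: iterables of (path, size_bytes) pairs.
    skipped: source paths intentionally excluded (the SAFE_TO_DELETE list).
    Returns the source entries still missing at the destination.
    """
    dest_keys = {(path.rsplit("/", 1)[-1], size) for path, size in destination}
    return [(path, size) for path, size in source
            if path not in skipped
            and (path.rsplit("/", 1)[-1], size) not in dest_keys]
```

Matching on filename rather than full path is what makes the diff work across the UID transformation, since destination paths no longer resemble source paths.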
PM2 Orchestration
A script generator created per-subfolder shell scripts. PM2 managed all streams with per-stream logging, automatic restart on crash, and a unified monitoring dashboard. Combined with resume-from-CSV logic, a crashed stream picked up right where it left off.
Phase 6: Execution — 3 VMs, 36 Hours
| VM | Spec | Role |
|---|---|---|
| VM 1 | GCE e2-standard-8 (8 vCPU, 32 GB) | Primary worker |
| VM 2 | GCE e2-standard-8 (8 vCPU, 32 GB) | Secondary worker |
| VM 3 | GCE (8 GB) | Gap-fill + verification |
VMs in the same region as the destination meant free GCS ingress and ~2ms latency vs ~150ms from a local machine — a 75x improvement. Each VM ran ~50 concurrent workers (~150 total).
The workflow: duplicate scan, CDN matching, originality refinement, generate batch scripts, deploy to VMs, PM2 processes, gap detection, gap fill, final verification.
Total data transfer time: ~1.5 days.
The Toolkit
We built 15+ purpose-built Python tools, each doing one thing well and communicating via CSV:
| Category | Tools | Purpose |
|---|---|---|
| Migration | 5 tools | Stream objects to GCS/S3 with various path transforms |
| Deduplication | 3 tools | Scan, analyze, and classify duplicate files |
| CDN Safety | 3 tools | Match CDN URLs, protect in-use files, map old-to-new paths |
| Verification | 4+ tools | Diff source/destination, rebuild mappings, generate reports |
Every tool was idempotent (safe to re-run), resumable (checkpoint to disk), non-destructive (source is read-only), and composable (CSV in, CSV out).
Lessons Learned
- Deduplicate before migrating — Our 89-minute scan eliminated 8.85 TB of unnecessary transfers. Do not pay to move garbage.
- Thread-local clients are non-negotiable — Shared SDK clients cause intermittent failures at high concurrency that are nearly impossible to debug. One client per thread, always.
- Checkpoint everything — Our first scan crashed 46 minutes in. With checkpointing, zero work lost. Without it, 46 minutes gone. Build resumability from the start.
- Skip-lists over deletes — We never deleted from the source bucket. Duplicates were excluded via skip-files during migration, so the source stayed intact as a rollback safety net.
- Run compute near your data — Moving from a local machine to same-region VMs cut latency from ~150 ms to ~2 ms and saved an entire day of wall-clock time.
- UUID paths solve multiple problems — Security (unguessable), collision avoidance (no filename conflicts), future-proofing (/original/ reserves room for derivatives), and decoupling (org changes do not break URLs).
- Build for both, decide with data — Evaluating GCS and S3 simultaneously was ~30% more effort but eliminated guesswork entirely.
- Modular tools over monoliths — 15 focused tools communicating via CSV. Re-run any stage independently. Add new destinations without touching analysis tools.
What Comes Next
Migrating 35 TB across clouds is not just a file copy. It is an opportunity to audit, deduplicate, secure, and restructure. Our zero-egress scan revealed 60% duplication. The UID transformation made URLs unguessable. The parallel streaming architecture — 3 VMs, 150 workers, crash-safe checkpointing — moved it all in a day and a half.
If you are facing a multi-terabyte migration: scan first, deduplicate, protect what is live, transform paths, stream do not buffer, checkpoint everything, and stay non-destructive.
The tools are built. The data is migrated. The URLs are secure. And the storage bill is $184/month lighter.





