2 min read

Building a Distributed File System That Handles Massive Scale Without Breaking

A distributed file system at massive scale must handle petabytes of data, millions of files, and global access patterns. Systems like Google File System (GFS), Hadoop Distributed File System (HDFS), and Amazon EFS solve this through specific architectural choices.

Key sources: "The Google File System" paper (Ghemawat, Gobioff, Leung), HDFS architecture documentation.


The Limitations of NFS

Network File System (NFS) is the traditional approach to distributed file storage. It works for small clusters but breaks at scale due to:

  • Single server bottleneck: All metadata and data goes through one server
  • No data replication: Server failure means data loss
  • No geographic distribution: Performance degrades with distance
  • Limited concurrency: File locking is a bottleneck

Modern distributed file systems use fundamentally different architectures.


Architecture: Separate Metadata from Data

The key insight: metadata operations (file lookups, permissions) require strong consistency but low storage. Data operations (reading and writing file content) require high throughput but eventual consistency is often acceptable.

By separating them, each can scale independently:

Metadata Servers (small cluster, strongly consistent) │ ▼ Chunk Servers (large cluster, replicated for durability)


Key Design Decisions

Large Chunk Sizes

GFS uses 64 MB chunks. HDFS uses 128 MB. Large chunks reduce metadata overhead and enable efficient sequential reads. A 1 PB file system with 128 MB chunks needs only 8 million metadata entries rather than billions.

Immutable Files (Mostly)

Files in GFS/HDFS are append-only. Deleting or overwriting is rare. This simplifies concurrency control — readers never conflict with writers because new data is appended, not inserted.

Replication for Durability

Each chunk is replicated 3x across different servers (and ideally different racks). If a server fails, the chunk is available from replicas. The system automatically re-replicates chunks when replica count drops.


Consistency Model

GFS uses a relaxed consistency model for data:

  • Append operations are atomic for records larger than a chunk. If a record is written, it is visible in its entirety.
  • Concurrent appends may result in duplicates or padding, but records are never corrupted.
  • File metadata is strongly consistent through a single primary metadata server.

This trade-off is acceptable for batch processing workloads (MapReduce) where occasional duplicates are handled by idempotent processing.


Handling Failures

At massive scale, failures are not exceptional — they are expected. A 10,000-server cluster experiences daily hardware failures. The system must handle:

  • Server crashes
  • Disk failures
  • Network partitions
  • Slow servers (stragglers)

Stragglers are handled by redundant execution: the same task runs on multiple servers, and the first to complete wins.


Key Takeaways

  1. Separate metadata from data to allow independent scaling of each.
  2. Large chunk sizes (64-128 MB) reduce metadata overhead.
  3. Replication (3x) provides durability without expensive RAID hardware.
  4. Relaxed consistency for data with strong consistency for metadata.
  5. Design for failure — replicate aggressively and handle stragglers with redundant execution.