Google File System

Google File System (GFS or GoogleFS) is a proprietary distributed file system developed by Google to provide efficient, reliable access to data using large clusters of commodity hardware. A new version of Google File System code named Colossus was released in 2010.[1][2]

Google File System
Operating systemLinux kernel
TypeDistributed file system


Google File System is designed for system-to-system interaction, and not for user-to-system interaction. The chunk servers replicate the data automatically.

GFS is enhanced for Google's core data storage and usage needs (primarily the search engine), which can generate enormous amounts of data that must be retained; Google File System grew out of an earlier Google effort, "BigFiles", developed by Larry Page and Sergey Brin in the early days of Google, while it was still located in Stanford. Files are divided into fixed-size chunks of 64 megabytes, similar to clusters or sectors in regular file systems, which are only extremely rarely overwritten, or shrunk; files are usually appended to or read. It is also designed and optimized to run on Google's computing clusters, dense nodes which consist of cheap "commodity" computers, which means precautions must be taken against the high failure rate of individual nodes and the subsequent data loss. Other design decisions select for high data throughputs, even when it comes at the cost of latency.

A GFS cluster consists of multiple nodes. These nodes are divided into two types: one Master node and a large number of Chunkservers. Each file is divided into fixed-size chunks. Chunk servers store these chunks. Each chunk is assigned a unique 64-bit label by the master node at the time of creation, and logical mappings of files to constituent chunks are maintained. Each chunk is replicated several times throughout the network. At default, it is replicated three times, but this is configurable [3]. Files which are in high demand may have a higher replication factor, while files for which the application client uses strict storage optimizations may be replicated less than three times - in order to cope with quick garbage cleaning policies [3].

The Master server does not usually store the actual chunks, but rather all the metadata associated with the chunks, such as the tables mapping the 64-bit labels to chunk locations and the files they make up, the locations of the copies of the chunks, what processes are reading or writing to a particular chunk, or taking a "snapshot" of the chunk pursuant to replicate it (usually at the instigation of the Master server, when, due to node failures, the number of copies of a chunk has fallen beneath the set number). All this metadata is kept current by the Master server periodically receiving updates from each chunk server ("Heart-beat messages").

Permissions for modifications are handled by a system of time-limited, expiring "leases", where the Master server grants permission to a process for a finite period of time during which no other process will be granted permission by the Master server to modify the chunk. The modifying chunkserver, which is always the primary chunk holder, then propagates the changes to the chunkservers with the backup copies. The changes are not saved until all chunkservers acknowledge, thus guaranteeing the completion and atomicity of the operation.

Programs access the chunks by first querying the Master server for the locations of the desired chunks; if the chunks are not being operated on (i.e. no outstanding leases exist), the Master replies with the locations, and the program then contacts and receives the data from the chunkserver directly (similar to Kazaa and its supernodes).

Unlike most other file systems, GFS is not implemented in the kernel of an operating system, but is instead provided as a userspace library.


Deciding from benchmarking results,[3] when used with relatively small number of servers (15), the file system achieves reading performance comparable to that of a single disk (80–100 MB/s), but has a reduced write performance (30 MB/s), and is relatively slow (5 MB/s) in appending data to existing files. The authors present no results on random seek time. As the master node is not directly involved in data reading (the data are passed from the chunk server directly to the reading client), the read rate increases significantly with the number of chunk servers, achieving 583 MB/s for 342 nodes. Aggregating a large number of servers also allows big capacity, while it is somewhat reduced by storing data in three independent locations (to provide redundancy).

See also


  1. ^ "Google's Colossus Makes Search Real-Time by Dumping MapReduce", High Scalability (World Wide Web log), 2010-09-11.
  2. ^ "Colossus: Successor to the Google File System (GFS)". SysTutorials. 2012-11-29. Retrieved 2016-05-10.
  3. ^ a b c Ghemawat, Gobioff & Leung 2003.


External links

  • "GFS: Evolution on Fast-forward", Queue, ACM.
  • "Google File System Eval, Part I", Storage mojo.
Apache Hadoop

Apache Hadoop ( ) is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware—still the common use—it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules;

Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;

Hadoop YARN – introduced in 2012 is a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;

Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System.The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program. Other projects in the Hadoop ecosystem expose richer user interfaces.


Bigtable is a compressed, high performance, proprietary data storage system built on Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB) and a few other Google technologies. On May 6, 2015, a public version of Bigtable was made available as a service. Bigtable also underlies Google Cloud Datastore, which is available as a part of the Google Cloud Platform.


BitChute is a video hosting service that uses peer-to-peer WebTorrent technology in order to diffuse, redistribute, and ease bandwidth and issues of centralized streaming.


CHFS is a file system developed at the Department of Software Engineering, University of Szeged, Hungary. It was the first open source flash memory-specific file system written for the NetBSD operating system. Intended usage is over raw flash devices on embedded systems like ARM and MIPS, the filesystem is less suitable for use on consumer SSD (because consumer SSDs already make sure to not use the same physical blocks for writing modified data).


CloudStore (KFS, previously Kosmosfs) was Kosmix's C++ implementation of the Google File System. It parallels the Hadoop project, which is implemented in the Java programming language. CloudStore supports incremental scalability, replication, checksumming for data integrity, client side fail-over and access from C++, Java and Python. There is a FUSE module so that the file system can be mounted on Linux.

In September 2007 Kosmix published Kosmosfs as open source.

The last commit activity was in 2010. The Google Code page for Kosmosfs now points to the Quantcast File System on GitHub which is the successor to KFS.

A former project on SourceForge used the name CloudStore in 2008.

Cluster Exploratory

Cluster Exploratory (CluE) was a proposed 2008 U.S. National Science Foundation-funded program to use Google-IBM cluster technology to analyze massive amounts of data to search for patterns, part of the Academic Cluster Computing Initiative (ACCI). "The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce". Google and IBM announced the first pilot phase of the ACCI in October 2007. The program ended in 2011, according to Google. NSF's call for proposals has been "archived".

Clustered file system

A clustered file system is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system (only direct attached storage for each node). Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance.

Distributed file system for cloud

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations (create, delete, modify, read, write) on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system.

Users can share computing resources through the Internet thanks to cloud computing which is typically characterized by scalable and elastic resources – such as physical servers, applications and any services that are virtualized and allocated dynamically. Synchronization is required to make sure that all devices are up-to-date.

Distributed file systems enable many big, medium, and small enterprises to store and access their remote data as they do local data, facilitating the use of variable resources.

Extent (file systems)

An extent is a contiguous area of storage reserved for a file in a file system, represented as a range of block numbers. A file can consist of zero or more extents; one file fragment requires one extent. The direct benefit is in storing each range compactly as two numbers, instead of canonically storing every block number in the range. Also, extent allocation results in less file fragmentation.

Extent-based file systems can also eliminate most of the metadata overhead of large files that would traditionally be taken up by the block-allocation tree. But because the savings are small compared to the amount of stored data (for all file sizes in general) but make up a large portion of the metadata (for large files), the overall benefits in storage efficiency and performance are slight.In order to resist fragmentation, several extent-based file systems do allocate-on-flush. Many modern fault-tolerant file systems also do copy-on-write, although that increases fragmentation. As a similar design, the CP/M file system uses extents as well, but those do not correspond to the definition given above. CP/M's extents appear contiguously as a single block in the combined directory/allocation table, and they do not necessarily correspond to a contiguous data area on disk.

Flash file system

A flash file system is a file system designed for storing files on flash memory–based storage devices. While flash file systems are closely related to file systems in general, they are optimized for the nature and characteristics of flash memory (such as to avoid write amplification), and for use in particular operating systems.


HAMMER is a high-availability 64-bit file system developed by Matthew Dillon for DragonFly BSD using B+ trees. Its major features include infinite NFS-exportable snapshots, master-multislave operation, configurable history retention, fsckless-mount, and checksums to deal with data corruption. HAMMER also supports data block deduplication, meaning that identical data blocks will be stored only once on a file system. A successor, HAMMER2, was announced in 2011 and became the default in Dragonfly 5.2 (April 2018).


HAMMER2 is a successor to the HAMMER filesystem, redesigned from the ground up to support enhanced clustering. HAMMER2 supports online and batched deduplication, snapshots, directory entry indexing, multiple mountable filesystem roots, mountable snapshots, a low memory footprint, compression, encryption, zero-detection, data and metadata checksumming, and synchronization to other filesystems or nodes.

Howard Gobioff

Howard Gobioff (1971 – 2008) was a computer scientist. He graduated magna cum laude with a double major in computer science and mathematics from the University of Maryland, College Park. At Carnegie Mellon University, he worked on the network attached secure disks project, before he went on to earn his PhD in computer science. He died suddenly from lymphoma at the age of 36.

InterPlanetary File System

InterPlanetary File System (IPFS) is a protocol and network designed to create a content-addressable, peer-to-peer method of storing and sharing hypermedia in a distributed file system.IPFS is a peer-to-peer distributed file system that seeks to connect all computing devices with the same system of files. IPFS could be seen as a single BitTorrent swarm, exchanging objects within one Git repository. In other words, IPFS provides a high-throughput, content-addressed block storage model, with content-addressed hyperlinks. This forms a generalized Merkle directed acyclic graph (DAG). IPFS combines a distributed hash table, an incentivized block exchange, and a self-certifying namespace. IPFS has no single point of failure, and nodes do not need to trust each other not to tamper with data in transit.The filesystem can be accessed in a variety of ways, including via FUSE and over HTTP.


Journalling Flash File System version 2 or JFFS2 is a log-structured file system for use with flash memory devices. It is the successor to JFFS. JFFS2 has been included into the Linux kernel since September 23, 2001, when it was merged into the Linux kernel mainline as part of the kernel version 2.4.10 release. JFFS2 is also available for a few bootloaders, like Das U-Boot, Open Firmware, the eCos RTOS, the RTEMS RTOS, and the RedBoot. Most prominent usage of the JFFS2 comes from OpenWrt.At least three file systems have been developed as JFFS2 replacements: LogFS, UBIFS, and YAFFS.

Sanjay Ghemawat

Sanjay Ghemawat (born 1966 in West Lafayette, Indiana) is an Indian American computer scientist and software engineer. He is currently a Senior Fellow at Google in the Systems Infrastructure Group. Ghemawat's work at Google, much of it in close collaboration with Jeff Dean, has included big data processing model MapReduce, the Google File System, and databases Bigtable and Spanner. Wired have described him as one of the "most important software engineers of the internet age".

Sawzall (programming language)

Sawzall is a procedural domain-specific programming language, used by Google to process large numbers of individual log records. Sawzall was first described in 2003, and the szl runtime was open-sourced in August 2010. However, since the MapReduce table aggregators have not been released, the open-sourced runtime is not useful for large-scale data analysis of multiple log files off the shelf. Sawzall has been replaced by Lingo (logs in Go) for most purposes within Google.

Steganographic file system

Steganographic file systems are a kind of file system first proposed by Ross Anderson, Roger Needham, and Adi Shamir. Their paper proposed two main methods of hiding data: in a series of fixed size files originally consisting of random bits on top of which 'vectors' could be superimposed in such a way as to allow levels of security to decrypt all lower levels but not even know of the existence of any higher levels, or an entire partition is filled with random bits and files hidden in it.

In a steganographic file system using the second scheme, files are not merely stored, nor stored encrypted, but the entire partition is randomized - encrypted files strongly resemble randomized sections of the partition, and so when files are stored on the partition, there is no easy way to discern between meaningless gibberish and the actual encrypted files. Furthermore, locations of files are derived from the key for the files, and the locations are hidden and available to only programs with the passphrase. This leads to the problem that very quickly files can overwrite each other (because of the Birthday Paradox); this is compensated for by writing all files in multiple places to lessen the chance of data loss.

Write Ahead Physical Block Logging

Write Ahead Physical Block Logging (WAPBL) provides meta data journaling for file systems in conjunction with Fast File System (FFS) to accomplish rapid filesystem consistency after an unclean shutdown of the filesystem and better general use performance over regular FFS.


This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.