all things distributed: Notes: Distributed System -- MapReduce, GFS, BigTable

MapReduce
Highlights
Fault tolerance:
worker failure

Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
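The re-execution rule above can be sketched as a bit of master-side bookkeeping. This is only an illustration, not the paper's implementation: the `Task` class, the state names, and `handle_worker_failure` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

IDLE, IN_PROGRESS, COMPLETED = "idle", "in_progress", "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = IDLE
    worker: Optional[str] = None   # worker currently or last assigned


def handle_worker_failure(tasks: List[Task], failed_worker: str) -> List[Task]:
    """Reset every task that must be re-executed after `failed_worker` dies."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # In-progress work of either kind is lost with the worker.
        lost_in_progress = task.state == IN_PROGRESS
        # Completed map output lives on the failed machine's local disk,
        # so it is lost too. Completed reduce output is in the global
        # file system and survives.
        lost_map_output = task.kind == "map" and task.state == COMPLETED
        if lost_in_progress or lost_map_output:
            task.state = IDLE
            task.worker = None
            rescheduled.append(task)
    return rescheduled
```

Tasks returned by `handle_worker_failure` are back in the idle state, so a scheduler could hand them to another worker.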

Page 5

The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).
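A rough sketch of that preference order (same machine, then same switch, then anywhere). The function `pick_worker` and the `switch_of` mapping are assumptions for illustration; the paper does not give the scheduler's code.

```python
from typing import Dict, List, Optional, Set


def pick_worker(
    replica_hosts: Set[str],
    idle_workers: List[str],
    switch_of: Dict[str, str],
) -> Optional[str]:
    """Choose an idle worker for a map task, preferring data locality."""
    # 1. Best case: an idle worker that holds a replica of the input split.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Next best: an idle worker on the same switch as some replica.
    replica_switches = {switch_of[h] for h in replica_hosts if h in switch_of}
    for worker in idle_workers:
        if switch_of.get(worker) in replica_switches:
            return worker
    # 3. Otherwise fall back to any idle worker (or None if there is none).
    return idle_workers[0] if idle_workers else None
```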

Page 6, Refinements 4.1-4.4

--------------------------------
GFS
3.4, The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations.
Workflow:

Master receives the snapshot request

Master revokes any outstanding leases on the chunks of the files to be snapshotted

Master logs the operation to disk, then duplicates the metadata for the source file or directory tree

That's it!

The first time a client wants to write to a chunk C of the source after the snapshot, the master has every chunkserver holding a replica of C make a new copy of C on its local disk, and the write then goes to the new copy (copy-on-write)
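The copy-on-write idea can be sketched with reference counts on chunks, which is how the GFS paper describes the master noticing that a chunk is shared with a snapshot. Everything else here (the `Master` class, its method names, the `chunk-N` handle format) is a toy model invented for this note, and it leaves out leases, logging, and the chunkservers themselves.

```python
from typing import Dict, List


class Master:
    """Toy master holding file -> chunk-handle lists and chunk reference counts."""

    def __init__(self) -> None:
        self.files: Dict[str, List[str]] = {}
        self.refcount: Dict[str, int] = {}
        self._next_id = 0

    def _new_handle(self) -> str:
        self._next_id += 1
        return f"chunk-{self._next_id}"

    def create(self, path: str, num_chunks: int) -> None:
        handles = [self._new_handle() for _ in range(num_chunks)]
        for h in handles:
            self.refcount[h] = 1
        self.files[path] = handles

    def snapshot(self, src: str, dst: str) -> None:
        # Only metadata is duplicated; both files now point at the same chunks.
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def chunk_for_write(self, path: str, index: int) -> str:
        """Return the handle to write to, copying the chunk first if it is shared."""
        handle = self.files[path][index]
        if self.refcount[handle] > 1:
            # Shared with a snapshot: allocate a new handle. In GFS the
            # chunkservers holding the old chunk would copy it locally here.
            self.refcount[handle] -= 1
            handle = self._new_handle()
            self.refcount[handle] = 1
            self.files[path][index] = handle
        return handle
```

The snapshot itself touches no chunk data, which is why it is almost instantaneous; the copying cost is paid later, and only for chunks that are actually written.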

Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata.
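To make the "lookup table of full pathnames" idea concrete, here is a small sketch. The paths and metadata fields are made-up placeholders, and a plain dict stands in for whatever structure the master really uses.

```python
from typing import Dict, List

# Full pathname -> metadata. The metadata fields are placeholders.
namespace: Dict[str, dict] = {
    "/home": {"type": "dir"},
    "/home/user": {"type": "dir"},
    "/home/user/a.txt": {"type": "file", "chunks": ["c1"]},
    "/home/user/b.txt": {"type": "file", "chunks": ["c2", "c3"]},
    "/home/other/c.txt": {"type": "file", "chunks": ["c4"]},
}


def lookup(path: str) -> dict:
    """A lookup is a single table access on the full pathname."""
    return namespace[path]


def list_children(directory: str) -> List[str]:
    """Without a per-directory structure, listing means scanning by prefix."""
    prefix = directory.rstrip("/") + "/"
    return sorted(
        p for p in namespace
        if p.startswith(prefix) and "/" not in p[len(prefix):]
    )
```

Looking up a file is one access on its full path, while listing a directory has to scan for keys sharing the directory's prefix.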

master --> primary chunkserver --> secondary chunkservers (replicas) --> chunks

4.5, GFS uses chunk version numbers to detect stale replicas
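A minimal sketch of the version-number check, assuming the master bumps a chunk's version each time it grants a new lease and only reachable replicas record the new number. The `ChunkVersions` class and its methods are hypothetical names for this note.

```python
from typing import Dict, List


class ChunkVersions:
    """Toy master-side record of the current version of one chunk."""

    def __init__(self, replicas: List[str]) -> None:
        self.master_version = 1
        self.replica_version: Dict[str, int] = {r: 1 for r in replicas}

    def grant_lease(self, live_replicas: List[str]) -> None:
        # Bump the version and tell only the replicas that are reachable.
        self.master_version += 1
        for r in live_replicas:
            self.replica_version[r] = self.master_version

    def stale_replicas(self) -> List[str]:
        # A replica that missed a bump reports an older version number.
        return sorted(
            r for r, v in self.replica_version.items()
            if v < self.master_version
        )
```

A chunkserver that was down during a lease grant keeps its old version number, so the mismatch shows up as soon as it reports its chunks again.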

----------------------------------------
BigTable

The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations. Clients do not go through the master to find a tablet; tablet locations are stored in the METADATA table.

5.2, Bigtable uses Chubby to detect tablet server failures

Bigtable also uses Chubby to:

ensure there is only one active master
store the bootstrap location of BigTable data
discover tablet servers
store BigTable schema information
store access control lists

References:
http://blog.csdn.net/opennaive/article/details/7532589
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
http://read.seas.harvard.edu/cs261/2011/bigtable.html

** Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows.
** In most cases, applications will simply ask for a given cell's data, without specifying a timestamp. In that common case, HBase/BigTable will return the most recent version (the one with the highest timestamp), since it stores these in reverse chronological order.
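The timestamp behaviour noted above can be sketched as a cell that keeps its versions newest-first. This is a toy model, not the HBase or Bigtable API: `Cell`, `put`, `get`, and `table` are made-up names, and returning the newest version at or before a requested timestamp is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class Cell:
    """Versions of one cell, kept newest-first."""

    def __init__(self) -> None:
        self._versions: List[Tuple[int, str]] = []

    def put(self, timestamp: int, value: str) -> None:
        self._versions.append((timestamp, value))
        # Keep reverse chronological order so the newest version is first.
        self._versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, timestamp: Optional[int] = None) -> Optional[str]:
        """Return the newest value, or the newest at or before `timestamp`."""
        for ts, value in self._versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None


# (row key, column) -> Cell; rows may have arbitrary, differing columns.
table: Dict[Tuple[str, str], Cell] = defaultdict(Cell)
```

Because versions are stored newest-first, the common case of `get()` with no timestamp is just the first entry.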

** Is data replication taken care of by GFS?

Thursday, July 10, 2014
