MapReduce
Highlights
Fault tolerance:
worker failure
Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
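A minimal sketch of that rule, assuming a hypothetical `Task` record kept by the master (the names `Task`, `State`, and `handle_worker_failure` are illustrative, not from the paper). It also resets in-progress tasks of either kind, which the paper describes in the same section:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # worker currently or last assigned


def handle_worker_failure(tasks, failed_worker):
    """Reset every task that must be re-executed after `failed_worker` dies."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_in_progress = task.state is State.IN_PROGRESS
        # Completed map output lives on the failed machine's local disk.
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if lost_in_progress or lost_map_output:
            task.state = State.IDLE
            task.worker = None
            rescheduled.append(task)
        # Completed reduce tasks are left alone: their output is already
        # in the global file system.
    return rescheduled
```

Tasks returned by `handle_worker_failure` become eligible for scheduling on other workers.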
Page 5
The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule a map task near a replica of that task's input data (e.g., on a worker machine that is on the same network switch as the machine containing the data).
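The preference order can be sketched roughly as follows. The function `pick_worker` and the `switch_of` mapping are assumptions made for illustration; the paper does not describe the scheduler at this level of detail:

```python
def pick_worker(replica_hosts, idle_workers, switch_of):
    """Return the idle worker closest to the input data, or None if none is idle.

    replica_hosts: machines holding a replica of the task's input split
    idle_workers:  machines currently free to run a map task
    switch_of:     hypothetical dict mapping machine name -> network switch id
    """
    replica_hosts = set(replica_hosts)
    # 1. Prefer a worker that stores a replica locally.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Otherwise prefer a worker on the same switch as some replica.
    replica_switches = {switch_of[h] for h in replica_hosts if h in switch_of}
    for worker in idle_workers:
        if switch_of.get(worker) in replica_switches:
            return worker
    # 3. Fall back to any idle worker; the input is then read over the network.
    return idle_workers[0] if idle_workers else None
```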
Page 6, Refinements 4.1-4.4
--------------------------------
GFS
3.4, The snapshot operation makes a copy of a file or a directory tree (the “source”) almost instantaneously, while minimizing any interruptions of ongoing mutations.
Work flow:
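- Master receives snapshot request
- Master revokes any outstanding leases on the chunks of the files to be snapshotted
- Master duplicates the metadata for the source file or directory tree; the new snapshot files point to the same chunks as the source
- That's it!
- The first time a client wants to write to Chunk C (in the source file or directory) after the snapshot, the master sees that C is shared and asks each chunkserver holding a replica of C to make a new copy of C on its local disk; the write then goes to the new copy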
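The work flow above is copy-on-write driven by per-chunk reference counts. Here is a toy, single-process sketch of the idea. The `Master` class and its methods are made up for illustration, chunk contents are kept in a dict instead of on chunkservers, and lease revocation is left out:

```python
import itertools


class Master:
    """Toy master: maps file path -> chunk handles and tracks reference counts."""

    def __init__(self):
        self.files = {}        # path -> list of chunk handles
        self.refcount = {}     # chunk handle -> number of files referencing it
        self.chunk_data = {}   # chunk handle -> bytes (stands in for chunkservers)
        self._handles = itertools.count(1)

    def create(self, path, chunks):
        handles = []
        for data in chunks:
            handle = next(self._handles)
            self.chunk_data[handle] = data
            self.refcount[handle] = 1
            handles.append(handle)
        self.files[path] = handles

    def snapshot(self, src, dst):
        # Only metadata is duplicated; both files point at the same chunks.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refcount[handle] += 1

    def write(self, path, index, data):
        handle = self.files[path][index]
        if self.refcount[handle] > 1:
            # Chunk is shared with a snapshot: copy it first, then write the copy.
            new_handle = next(self._handles)
            self.chunk_data[new_handle] = self.chunk_data[handle]
            self.refcount[handle] -= 1
            self.refcount[new_handle] = 1
            self.files[path][index] = new_handle
            handle = new_handle
        self.chunk_data[handle] = data  # whole-chunk overwrite, for simplicity

    def read(self, path, index):
        return self.chunk_data[self.files[path][index]]
```

After `snapshot`, a `write` to the source leaves the snapshot's chunk untouched, and chunks that are never written stay shared.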
Unlike many traditional file systems, GFS does not have a per-directory data structure that lists all the files in that directory. Nor does it support aliases for the same file or directory (i.e., hard or symbolic links in Unix terms). GFS logically represents its namespace as a lookup table mapping full pathnames to metadata.
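To make the flat namespace concrete, here is a small sketch in which a plain dict stands in for the lookup table. The paths, metadata fields, and `list_directory` helper are all invented for illustration. With no per-directory structure, listing a directory becomes a scan over pathnames sharing a prefix:

```python
# Hypothetical namespace: full pathname -> metadata.
namespace = {
    "/": {"type": "dir"},
    "/data.bin": {"type": "file", "chunks": [4]},
    "/logs": {"type": "dir"},
    "/logs/a.log": {"type": "file", "chunks": [1, 2]},
    "/logs/old": {"type": "dir"},
    "/logs/old/b.log": {"type": "file", "chunks": [3]},
}


def list_directory(namespace, directory):
    """Return the immediate children of `directory` by scanning pathnames."""
    prefix = directory.rstrip("/") + "/"
    children = []
    for path in sorted(namespace):
        if not path.startswith(prefix) or path == prefix:
            continue
        rest = path[len(prefix):]
        if "/" not in rest:          # skip deeper descendants
            children.append(path)
    return children
```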
master --> primary chunkserver --> secondary chunkservers (replicas) --> chunks
4.5, The master keeps a version number for each chunk and uses it to detect stale replicas
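A rough sketch of the version check, assuming the master bumps the version whenever it grants a new lease and that a chunkserver that is down at that moment misses the bump. The function names and the dict of reported versions are illustrative only:

```python
def grant_lease(master_version, replica_versions, live_servers):
    """Bump the chunk version on the master and on every reachable replica."""
    new_version = master_version + 1
    for server in live_servers:
        replica_versions[server] = new_version
    return new_version


def find_stale_replicas(master_version, replica_versions):
    """Return chunkservers whose replica version is older than the master's.

    master_version:   latest version number the master recorded for the chunk
    replica_versions: dict of chunkserver name -> version number it reports
    """
    return sorted(
        server
        for server, version in replica_versions.items()
        if version < master_version
    )
```

For example, if `cs3` is unreachable while a lease is granted, it keeps the old version number and shows up in `find_stale_replicas` once it reports back.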
----------------------------------------
BigTable
The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations. Clients do not ask the master for tablet locations; that information is stored in the METADATA table.
5.2, BigTable uses Chubby to detect tablet server failures
- ensure there is only one active master
- store the bootstrap location of BigTable data
- discover tablet servers
- store BigTable schema information
- store access control lists
reference: http://blog.csdn.net/opennaive/article/details/7532589
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
http://read.seas.harvard.edu/cs261/2011/bigtable.html
** Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows.
** In most cases, applications will simply ask for a given cell's data, without specifying a timestamp. In that common case, HBase/BigTable will return the most recent version (the one with the highest timestamp), since it stores versions in reverse chronological order.
** Is data replication taken care of by GFS?
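The timestamp behaviour in the second note can be illustrated with a toy versioned cell. The `Cell` class is hypothetical and is not the HBase or BigTable API. The lookup with an explicit timestamp assumes "newest version at or before the requested time" semantics:

```python
class Cell:
    """Toy cell that keeps its versions sorted newest first."""

    def __init__(self):
        self._versions = []  # list of (timestamp, value), highest timestamp first

    def put(self, timestamp, value):
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, timestamp=None):
        """Return the most recent value, or the newest one at or before `timestamp`."""
        for ts, value in self._versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None
```

Because the versions are kept in reverse chronological order, the common case (no timestamp) is answered by the first entry.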