Performance

Sector is fast. In our 128-node (512-core) system, Sector/Sphere performs 2 - 4 times faster than Hadoop in various applications.

  • Sector does not split files into blocks. This provides better data locality (when an application needs to process a whole file) and greater flexibility in parallel data processing, provided that the users use proper file sizes in their data sets.
  • Sector's fault tolerance strategy can not only detect dead nodes, but can also detect slow nodes and remove them from the system to improve performance.
  • Sector's load balancing strategy considers not only CPU and disk resources, but also network bandwidth across racks and data centers.
  • Sector uses UDT for extremely high speed data transfer, especially over wide area networks.
  • Sector can process binary data directly and persistent index files are used to avoid run time parsing.
  • Sector's parallel data processing engine uses a PUSH model to move data to destinations. This is faster than the PULL model used in MapReduce.
  • Sector is written in C++.

Scalability

Sector is designed to scale up to a large number of nodes, even across multiple data centers. Sector uses the following strategies to ensure scalability:

  • Clients communicate directly with data storage nodes (i.e., slaves). There is no single data IO bottleneck in the system.
  • Multiple active-active masters allows Sector to support a large number of files and client requests and provides good availability.
  • Data is processed on its storage node or nearest possible node. Data transfer inside the system is minimized.
  • Sector employs a topology-aware strategy on data location and processing and employs the high speed data transport protocol UDT to allow it to scale to multiple data centers over wide area networks. Few other systems have this ability.

On the other hand, Sector is able to scale down, to a single node that serves as a high performance FTP-like server.

Reliability

Sector provides reliability without affecting performance.

  • Sector automatically replicates data files for a good reliability and availability.
  • Both Sector masters and slaves can be removed and inserted at run time.
  • Sector provides automatic failover for Map style data processing. When a slave fails during the processing of a data segment, another slave will be assigned for the same data segment to continue processing.
  • Sector does not provide failover for Reduce style processing in favor of higher performance. However, the cost of restarting a Reduce process can be alleviated by splitting the input into multiple sub-tasks.

Simplicity

Simplicity is one of the most important design goals for Sector. We have designed Sector to be easy to install, easy to manage, and easy to use/program.

  • Easy to install: Sector can be installed on commodity computers and no special expensive hardware is required. The system only needs the openssl library to compile and run; no other libraries, such as Boost, need to be installed.
  • East to manage: Only a small number of parameters need to be configured. Storage nodes can be removed or inserted without restarting the whole system.
  • Easy to use: Sector provides a file system interface that allows you to manage Sector as a directory on your local file system. Sphere is a generic MapReduce style data processing framework that supports transparent parallel data processing. In particular, Sphere's UDF model is more flexible than MapReduce. The minimum processing unit can be records/rows, blocks, files, or even directories, which allows existing applications to be wrapped within a UDF.

Limitation and Trade-off

Sector is not a generic file system. Sector is a network file system that uses file system level fault tolerance to handle data loss and corruption. In contrast, parallel file systems such as Lustre, GPFS, and PVFS rely on very expensive hardware (e.g., RAID, fiber channel, etc.) to provide fault tolerance and high performance. Due to the replication, Sector write operations are slower than read operations.

Sector also has a file size limit. Sector does not split files into blocks. In order to achieve good parallelism, Sector files should be limited to a proper size, although the actual size may be application specific. On the other hand, this greatly simplifies Sector system design and implementation and improves data processing throughput (less cross-node IO).

Home | Contact Us | © 2009 National Center for Data Mining. All rights reserved.