The link to Val Bercovici’s article is here.
Here’s the gist:
Hadoop natively uses HDFS, a file system designed for node-level redundancy. By default, data is replicated THREE times across the nodes in a Hadoop cluster. The nodes themselves, at least in the “tradition” of Hadoop, do not use RAID at all: if a node’s filesystem fails, the data already exists elsewhere and any running MapReduce jobs are simply started over.
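To put a number on what that replication costs, here is a minimal Python sketch of the capacity arithmetic. It assumes the default replication factor of three described above. The function name, node count, and disk size are made up for illustration and do not come from any real cluster.

```python
def hdfs_capacity(nodes: int, disk_tb_per_node: float, replication: int = 3) -> dict:
    """Estimate raw vs. usable capacity for a cluster that keeps
    `replication` copies of every block (hypothetical helper)."""
    if nodes < 1 or replication < 1:
        raise ValueError("nodes and replication must be at least 1")
    raw_tb = nodes * disk_tb_per_node      # total disk across all nodes
    usable_tb = raw_tb / replication       # each block is stored `replication` times
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "overhead_tb": raw_tb - usable_tb,  # disk consumed by the extra copies
    }


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes with 12 TB of disk each.
    print(hdfs_capacity(nodes=10, disk_tb_per_node=12.0))
```

With these made-up numbers, 120 TB of raw disk holds only 40 TB of unique data. That overhead is easy to absorb across thousands of nodes and much harder to justify on a small cluster.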
This is great if you have a few thousand nodes and the people you’re crunching data for are at-large consumers who aren’t paying for your service and as such cannot expect service levels of any kind.
Enterprise, however, is a different story. Once business units start depending on reduced results from Hadoop, they start depending on the timeframe in which those results are delivered as well. Simply starting jobs over is NOT going to please anyone and could interrupt business processes. Further, Enterprises don’t have the space or budget to put up Hadoop clusters at the scale the Facebooks and Yahoos do (they also don’t typically have the justifiable use cases). In fact, the Enterprises I’m working with are taking a “build it and the use cases will come” approach to Hadoop.
NetApp’s NFS connector for Hadoop significantly lowers the cost of entry for businesses that want to vet Hadoop and justify use cases. One of the traditional problems with Hadoop is that one needs to build a siloed architecture (servers, storage, and network) at a scale that proves the worth of Hadoop.
Now, businesses can throw compute (physical OR virtual) into a Hadoop cluster and connect to existing NFS datastores – whether they are on NetApp or not! NetApp has created this connector and thrown it upon the world as open source on GitHub.
This removes a huge barrier to entry for any NetApp (or NFS!) customer who is looking to perform analytics against an existing dataset without moving it or creating duplicate copies.