2.04.2015

Hadoop


Hadoop is an open-source Apache project that develops a software library for reliable, scalable, distributed computing. It is capable of handling Big Data workloads and provides the first viable platform for Big Data analytics. Hadoop is already used by most Big Data pioneers; LinkedIn, for example, currently uses Hadoop to generate over 100 billion personalized recommendations every week.

Hadoop distributes the storage and processing of large data sets across groups, or "clusters", of server computers using a simple programming model. The number of servers in a cluster can be scaled easily as requirements dictate, from perhaps 50 machines to 2,000 or more. Whereas traditional large-scale computing solutions rely on expensive server hardware with high built-in fault tolerance, Hadoop runs on clusters of inexpensive commodity machines and detects and compensates for hardware failures or other system problems at the application level.
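As a rough sketch of how that fault tolerance is configured in software rather than bought as hardware, the snippet below sets HDFS's block replication factor through Hadoop's Configuration API. The property name dfs.replication is standard; the class name and the value of 3 (Hadoop's usual default) are only for illustration.

import org.apache.hadoop.conf.Configuration;

public class ReplicationSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // HDFS keeps multiple copies of every data block (3 by default), so the loss
        // of a single machine loses no data, and tasks that were running on the
        // failed node are simply re-executed elsewhere by the framework.
        conf.setInt("dfs.replication", 3);
        System.out.println("Block replication factor: " + conf.getInt("dfs.replication", 3));
    }
}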


Technically, Hadoop consists of two key elements:

1. HDFS:
The Hadoop Distributed File System (HDFS) provides the high-bandwidth, cluster-based storage essential for Big Data computing (a short usage sketch follows this list).

2. MapReduce:
The second element of Hadoop is a data processing framework called MapReduce.
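
As promised above, here is a rough sketch of how an application might write a file to HDFS through Hadoop's Java FileSystem API. The class name and the /user/demo/hello.txt path are made up for illustration; a real cluster supplies its own settings through its configuration files.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster's settings from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // the default file system; HDFS on a configured cluster

        // Hypothetical path used only for this sketch.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // The file is now split into blocks and spread across the cluster's DataNodes.
        System.out.println("Wrote " + fs.getFileStatus(file).getLen() + " bytes to " + file);
        fs.close();
    }
}

To client code, HDFS looks like an ordinary file system; the splitting, distribution and replication of the data happen behind this API.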

Google developed MapReduce to support distributed computing on large data sets across computer clusters. Inspired by Google's MapReduce and Google File System (GFS) papers, Doug Cutting created Hadoop while he was at Yahoo.

Based on Google's search technology, MapReduce distributes or "maps" large data sets across multiple servers. Each server then processes the part of the overall data set it has been allocated and produces a summary of its results. These summaries are then aggregated in the so-called "Reduce" stage. This approach allows extremely large raw data sets to be rapidly pre-processed and distilled before more traditional data analysis tools are applied.
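
To make the Map and Reduce stages concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; it follows the well-known example from Hadoop's own documentation, with input and output paths passed in as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: each mapper sees a slice of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce stage: all counts for the same word arrive together and are summed.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Submitted with the hadoop jar command, the mappers each count the words in their own slice of the input, and the reducers sum those partial counts into the final totals.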


At present, many Big Data pioneers are deploying a Hadoop ecosystem alongside their legacy IT systems so that they can combine old and new data in new ways. In time, however, Hadoop may be destined to replace many traditional data warehouse and rigidly structured relational database technologies and to become the dominant platform for many kinds of data processing.

Many organizations are unlikely to have the resources and expertise to implement their own Hadoop solutions. Fortunately, they do not have to, as cloud-based solutions are already available. Offered by providers including Amazon, NetApp and Google, these allow organizations of all sizes to start benefiting from the potential of Big Data processing. Where public Big Data sets need to be used, running everything in the cloud also makes a lot of sense, as the data does not have to be downloaded to an organization's own systems. For example, AWS already hosts many public data sets, including US and Japanese census data and many genomic and other medical and scientific Big Data repositories.


Video link -