What is Big Data
Big Data is a very large, loosely structured data set that defies traditional storage.
"Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time." -- Wikipedia
So big that a single data set may contain anywhere from a few terabytes to many petabytes of data.
Human Generated Data and Machine Generated Data
Human Generated Data is emails, documents, photos and tweets. We are generating this data faster than ever. Just imagine the number of videos uploaded to YouTube and the tweets swirling around. This data can be Big Data too.
Machine Generated Data is a new breed of data. This category consists of sensor data and logs generated by 'machines', such as email logs and click stream logs. Machine generated data is orders of magnitude larger than Human Generated Data.
Before Hadoop came on the scene, machine generated data was mostly ignored and not captured, because dealing with the volume was either NOT possible or NOT cost effective.
Where does Big Data come from
The original big data was web data -- as in the entire Internet! Remember, Hadoop was built to index the web. These days Big Data comes from multiple sources.
- Web data : it is still big data
- Social media data : sites like Facebook, Twitter and LinkedIn generate a large amount of data
- Click stream data : when users navigate a website, the clicks are logged for further analysis (like navigation patterns). Click stream data is important in online advertising and e-commerce
- Sensor data : sensors embedded in roads to monitor traffic, along with many other sensor applications, generate a large volume of data
Examples of Big Data in the Real World
- Facebook : has 40 PB of data and captures 100 TB / day
- Yahoo : 60 PB of data
- Twitter : 8 TB / day
- eBay : 40 PB of data, captures 50 TB / day
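To get a feel for these numbers, the daily capture rates can be converted into a sustained per-second rate. This is a small sketch, assuming decimal (SI) units, since the figures above do not say which convention they use; the function names are made up for illustration.

```python
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_second(tb_per_day: float) -> float:
    """Average ingest rate in bytes/second for a given TB/day figure."""
    return tb_per_day * TB / SECONDS_PER_DAY


def days_to_accumulate(total_pb: float, tb_per_day: float) -> float:
    """Days needed to collect total_pb at a constant tb_per_day rate."""
    return total_pb * PB / (tb_per_day * TB)


if __name__ == "__main__":
    # Daily capture rates taken from the list above.
    for name, rate in [("Facebook", 100), ("eBay", 50), ("Twitter", 8)]:
        print(f"{name}: {bytes_per_second(rate) / 1e9:.2f} GB/s on average")
```

At 100 TB / day, the average works out to a little over 1 GB every second, around the clock.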
Challenges of Big Data
Size of Big Data
Big data is... well... big in size! How much data constitutes Big Data is not very clear cut, so let's not get bogged down in that debate. For a small company that is used to dealing with data in gigabytes, 10 TB of data would be BIG. However, for companies like Facebook and Yahoo, petabytes is big.
The sheer size of big data makes it impossible (or at least cost prohibitive) to store in traditional storage like databases or conventional filers.
What matters here is the cost per gigabyte stored. At Big Data volumes, traditional storage filers can cost a lot of money.
Big Data is unstructured or semi structured
A lot of Big Data is unstructured or only loosely structured. For example, a click stream log record might look like: timestamp, user_id, page, referrer_page
This lack of structure makes relational databases poorly suited to storing Big Data.
Plus, not many databases can cope with storing billions of rows of data.
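As an illustration of what "semi structured" means in practice, here is a minimal sketch that reads click stream records with the four fields named above. The comma separator, the `Click` type and the sample values are assumptions made for the example; real logs vary, which is exactly why malformed lines are skipped instead of rejected by a schema.

```python
from typing import NamedTuple, Optional


class Click(NamedTuple):
    timestamp: str
    user_id: str
    page: str
    referrer_page: str


def parse_click(line: str) -> Optional[Click]:
    """Parse one comma-separated record; return None if it does not fit."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        return None  # unexpected shape: skip it instead of failing
    return Click(*parts)


if __name__ == "__main__":
    # Hypothetical sample lines, including one that is malformed.
    lines = [
        "2012-01-01T10:00:00, u42, /cart, /home",
        "corrupted line",
    ]
    for line in lines:
        print(parse_click(line))
```

Structure is applied only when the data is read, so the raw log can be stored untouched.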
No point in just storing big data, if we can't process it
Storing Big Data is only part of the game. We have to process it to mine intelligence out of it. Traditional storage systems are pretty 'dumb', as in they just store bits -- they don't offer any processing power.
The traditional data processing model has data stored in a 'storage cluster', which is copied over to a 'compute cluster' for processing, and the results are written back to the storage cluster.
This model however doesn't quite work for Big Data because copying so much data out to a compute cluster might be too time consuming or impossible. So what is the answer?
One solution is to process Big Data 'in place' -- as in a storage cluster doubling as a compute cluster.
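A rough back-of-the-envelope calculation shows why copying the data out is the bottleneck. The sketch below assumes a single hypothetical 10 Gbit/s link running at full speed with no protocol overhead, and decimal units; both numbers are illustrative, not measurements.

```python
PB = 10**15                 # assumption: decimal petabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60


def copy_time_days(data_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to push data_bytes through one link at full line rate."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    ten_gbit = 10 * 10**9   # hypothetical 10 Gbit/s link
    print(f"{copy_time_days(1 * PB, ten_gbit):.1f} days to copy 1 PB")
```

Under these assumptions, moving 1 PB takes more than nine days before any processing even starts.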
How Hadoop solves the Big Data problem
Hadoop clusters scale horizontally
More storage and compute power can be achieved by adding more nodes to a Hadoop cluster. This eliminates the need to buy more and more powerful and expensive hardware.
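Horizontal scaling makes capacity planning a matter of simple arithmetic. The sketch below estimates how many nodes a data set needs; the per-node capacity is a hypothetical figure, and the replication factor of 3 is assumed here because it is the HDFS default (it is not discussed above).

```python
import math


def nodes_needed(data_tb: float, usable_tb_per_node: float, replication: int = 3) -> int:
    """Smallest node count whose combined disks hold every replica."""
    if usable_tb_per_node <= 0:
        raise ValueError("usable_tb_per_node must be positive")
    return math.ceil(data_tb * replication / usable_tb_per_node)


if __name__ == "__main__":
    # Hypothetical node with 24 TB of usable disk.
    for data_tb in (100, 200):
        print(data_tb, "TB ->", nodes_needed(data_tb, 24), "nodes")
```

When the data doubles, the answer is roughly twice as many of the same machines, not a bigger machine.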
Hadoop can handle unstructured / semi-structured data
Hadoop doesn't enforce a 'schema' on the data it stores. It can handle arbitrary text and binary data. So Hadoop can 'digest' any unstructured data easily.
Hadoop clusters provide storage and computing
We saw how having separate storage and processing clusters is not the best fit for Big Data. Hadoop clusters provide storage and distributed computing all in one.
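The idea of processing data where it is stored can be sketched with a toy, single-process simulation. This is not the Hadoop API; the node names and records are made up, and the "cluster" is just a dictionary. Each node counts page views over its own records, and only the small partial counts are combined.

```python
from collections import Counter

# Hypothetical cluster: each node stores one slice of "user_id,page" records.
NODES = {
    "node1": ["u1,/home", "u2,/cart", "u1,/cart"],
    "node2": ["u3,/home", "u1,/home"],
}


def map_local(records):
    """Runs next to the data: count page views in one node's records."""
    return Counter(record.split(",")[1] for record in records)


def reduce_counts(partials):
    """Combine the small per-node counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def page_views(cluster):
    return reduce_counts(map_local(records) for records in cluster.values())


if __name__ == "__main__":
    print(dict(page_views(NODES)))
```

The raw records never leave their node; only a handful of counters travel across the network.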