How is Facebook storing & managing thousands of terabytes of data …?

Nithish Kumar
4 min read · Sep 17, 2020

Every day we upload photos and send messages to our friends … Have you ever thought about where all this data gets stored, or how Facebook manages (saves and retrieves) it within seconds …?
What is big data? Why does the Big Data problem arise? How is Facebook managing thousands of terabytes of data? Which technologies is Facebook using to manage all that data? Let’s look at the answers to all these questions …

🤔 Why does the Big Data problem arise ❓

There has been a huge explosion in the data available. Look back a few years and compare it with today, and you will see an exponential increase in the data that enterprises can access. This data exceeds what can comfortably be stored, computed over, and retrieved. The challenge is not so much the availability of data as its management. With statistics claiming that by 2020 the world’s data, if stacked up, would span 6.6 times the distance between the Earth and the Moon, this is definitely a challenge.

Along with the rise in unstructured data, there has also been a rise in the number of data formats: video, audio, social media and smart-device data are just a few to name.

📈 📉 Facebook statistics as of 2020:

Facebook processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.

Facebook generates 4 petabytes of data per day, that is, 4 million gigabytes. All that data is stored in what is known as the Hive, which contains about 300 petabytes of data.

Every 60 seconds, 510,000 comments are posted, 293,000 statuses are updated, 4 million posts are liked, and 136,000 photos are uploaded.

✅ Now let’s see how Facebook manages all this data:

👉 Hadoop
“Facebook runs the world’s largest Hadoop cluster,” says Jay Parikh, Vice President of Infrastructure Engineering at Facebook.

Basically, Facebook runs the biggest Hadoop cluster, spanning more than 4,000 machines and storing hundreds of millions of gigabytes (hundreds of petabytes) of data.

Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From search, log processing, recommendation systems and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook built its first user-facing Hadoop application, Facebook Messenger, on top of the Hadoop database Apache HBase.
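Facebook’s own jobs aren’t public, so as a hedged illustration here is a minimal Hadoop Streaming sketch in Python that gives a feel for the kind of log-processing work such a cluster runs. It is a generic word count, not Facebook’s code; the script names and input are invented for the example.

```python
#!/usr/bin/env python3
"""Minimal Hadoop Streaming word count (illustrative only).

Hadoop Streaming pipes each input split to the mapper on stdin and feeds the
sorted mapper output to the reducer, so one script can serve as both roles
depending on how it is invoked.
"""
import sys


def mapper():
    # Emit "<word>\t1" for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    # Mapper output arrives sorted by key, so equal words are adjacent.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "reduce":
        reducer()
    else:
        mapper()
```

The same script can be dry-run locally without a cluster, e.g. `cat access.log | python wordcount.py | sort | python wordcount.py reduce` (the file names here are hypothetical).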

👉 Scuba
With a huge amount of unstructured data coming in every day, Facebook realized it needed a platform to speed up the analysis itself. That’s when it developed Scuba, which helps Hadoop developers dive into those massive data sets and run ad-hoc analyses in real time.
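Scuba itself is an internal Facebook system, so its interface isn’t public; the toy Python snippet below only illustrates the kind of slice-and-dice, ad-hoc aggregation over recent events that such a tool answers. All field names and numbers are made up.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up in-memory events standing in for a tiny slice of a real-time table.
events = [
    {"ts": datetime(2020, 9, 17, 10, 0), "endpoint": "/feed",   "latency_ms": 120},
    {"ts": datetime(2020, 9, 17, 10, 1), "endpoint": "/photos", "latency_ms": 340},
    {"ts": datetime(2020, 9, 17, 10, 2), "endpoint": "/feed",   "latency_ms": 95},
]

# Ad-hoc question: "in the last 30 minutes, count requests and average latency per endpoint".
now = datetime(2020, 9, 17, 10, 30)
recent = [e for e in events if e["ts"] >= now - timedelta(minutes=30)]

counts = Counter(e["endpoint"] for e in recent)
for endpoint, n in counts.items():
    avg = sum(e["latency_ms"] for e in recent if e["endpoint"] == endpoint) / n
    print(endpoint, n, round(avg, 1))
```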

👉 Hive
After Yahoo implemented Hadoop for its search engine, Facebook, whose data had outgrown its Oracle data warehouse, looked for a way to let its data scientists work with far larger volumes of data on Hadoop. Hence, Hive came into existence. This tool improved the query capability of Hadoop by offering a subset of SQL and soon gained popularity in the unstructured world. Today, thousands of jobs run on this system every day to process a range of applications quickly.
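As a rough sketch of what that SQL subset (HiveQL) looks like in practice, here is how an analyst might submit a query from Python using the open-source PyHive client. The host, port, table and column names are hypothetical, not Facebook’s.

```python
# Sketch of querying Hive from Python with PyHive (pip install "pyhive[hive]").
# The connection details and the `page_views` table are hypothetical examples.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000, username="analyst")
cursor = conn.cursor()

# HiveQL is compiled down to jobs that run on the Hadoop cluster.
cursor.execute(
    """
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE dt = '2020-09-17'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
    """
)

for country, views in cursor.fetchall():
    print(country, views)

cursor.close()
conn.close()
```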

👉 Prism
Hadoop wasn’t designed to run across multiple facilities: because it requires such heavy communication between servers, a cluster is typically limited to a single data center.

When Facebook’s Hadoop deployment began pressing against that limit, the team felt the need to develop Prism. Prism is a platform that introduces multiple namespaces in place of the single namespace a Hadoop cluster normally provides, which in turn makes it possible to create many logical clusters.

The system can now expand to as many servers as needed, spread across data centers, without worrying about that single-data-center limit.
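Prism is internal to Facebook and its details aren’t public, so the toy sketch below only illustrates the core idea of routing paths to separate logical clusters by namespace. Every name and address in it is invented for the example.

```python
# Toy illustration of the namespace idea: instead of one global namespace,
# paths are routed to separate logical clusters by namespace prefix.
# The namespace names and cluster addresses are invented for this example.

NAMESPACE_TO_CLUSTER = {
    "/ns-ads":       "hdfs://cluster-a.example.com:8020",
    "/ns-messaging": "hdfs://cluster-b.example.com:8020",
    "/ns-analytics": "hdfs://cluster-c.example.com:8020",
}


def resolve(path: str) -> str:
    """Return the full cluster URI serving a given namespaced path."""
    for prefix, cluster in NAMESPACE_TO_CLUSTER.items():
        if path.startswith(prefix + "/"):
            return cluster + path
    raise ValueError(f"No logical cluster registered for path: {path}")


print(resolve("/ns-analytics/daily/2020-09-17/events.parquet"))
```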

🔰 For all these technologies, the core concept is distributed storage … Let’s see how distributed storage works …

👉 A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes.

👉 Distributed storage writes data in parallel by striping / splitting gigabytes upon gigabytes of data into pieces, so that it can store the data within seconds … The striping / splitting is done by the master node / name node, which transfers the pieces to the respective data nodes / slave nodes within seconds …
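Here is a tiny, self-contained Python sketch of that idea: a pretend name node splits a blob into fixed-size chunks, spreads them round-robin across pretend data nodes, and keeps the map needed to read the file back. The chunk size and node names are arbitrary values chosen for the example, not what any real system uses.

```python
# Minimal simulation of striping: a "name node" splits data into chunks,
# distributes them round-robin across "data nodes", and records where each
# chunk went so the file can be reassembled later.

CHUNK_SIZE = 4  # bytes here; real systems use blocks of tens or hundreds of MB

data_nodes = {"node-1": {}, "node-2": {}, "node-3": {}}  # node -> {chunk_id: bytes}
name_node = {}  # file name -> ordered list of (node, chunk_id)


def write_file(name: str, blob: bytes) -> None:
    chunks = [blob[i:i + CHUNK_SIZE] for i in range(0, len(blob), CHUNK_SIZE)]
    placement = []
    for i, chunk in enumerate(chunks):
        node = list(data_nodes)[i % len(data_nodes)]  # round-robin placement
        chunk_id = f"{name}#{i}"
        data_nodes[node][chunk_id] = chunk
        placement.append((node, chunk_id))
    name_node[name] = placement


def read_file(name: str) -> bytes:
    # The name node only knows *where* chunks live; the data nodes hold the bytes.
    return b"".join(data_nodes[node][cid] for node, cid in name_node[name])


write_file("hello.txt", b"hello distributed storage!")
assert read_file("hello.txt") == b"hello distributed storage!"
print(name_node["hello.txt"])
```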

That’s all !!!
🤝Thanks for reading , See you in the next blog 👋.
…. Signing Off ….
