Big data plays a big role at Pinterest. With more than 30 billion Pins in the system, we’re building the most comprehensive collection of interests online. One of the challenges associated with building a personalized discovery engine is scaling our data infrastructure to traverse the interest graph to extract context and intent for each Pin.
We currently log 20 terabytes of new data each day, and have around 10 petabytes of data in S3. We use Hadoop to process this data, which enables us to put the most relevant and recent content in front of Pinners through features such as Related Pins, Guided Search, and image processing. It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis.
Building a self-serve platform for Hadoop
Though Hadoop is a powerful processing and storage system, it’s not a plug and play technology. Because it doesn’t have cloud or elastic computing, or non-technical users in mind, its original design falls short as a self-serve platform. Fortunately there are many Hadoop libraries/applications and service providers that offer solutions to these limitations. Before choosing from these solutions, we mapped out our Hadoop setup requirements.
1. Isolated multitenancy
3. Multi-cluster support
4. Support for ephemeral clusters
5. Easy software package deployment
6. Shared data store
7. Access control layer
Read all 7 items more on engineering.pinterest.com