[2015-02-18] How Pinterest Measures Real-Time User Engagement with Spark

Setting the Stage for Spark

With Spark on track to replace MapReduce, enterprises are flocking to the open source framework in an effort to take advantage of its distributed data processing power.

Pinterest’s Spark Streaming Setup

How It Works

  • Pinterest pushes event data, such as pins and repins, to Apache Kafka.
  • Spark Streaming ingests event data from Apache Kafka, then filters by event type and enriches each event with full pin and geo-location data.
  • Using the MemSQL Spark Connector, data is then written to MemSQL with each event type flowing into a separate table. MemSQL handles record deduplication (Kafka’s “at least once” semantics guarantee fault tolerance but not uniqueness).
  • As data is streaming in, Pinterest is able to run queries in MemSQL to generate engagement metrics and report on various event data like pins, repins, comments and logins.
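
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pinterest's code: an in-memory list stands in for Kafka, Python's built-in sqlite3 stands in for MemSQL, and the event fields, lookup tables and table names are all hypothetical. It shows the filter and enrich step, one table per event type, deduplication of redelivered events through a primary key, and an engagement query over the result.

```python
import sqlite3

# Hypothetical events as they might arrive from Kafka. The repeated "e2"
# mimics a redelivery under at-least-once semantics.
RAW_EVENTS = [
    {"event_id": "e1", "type": "pin", "pin_id": 10, "ip": "1.1.1.1"},
    {"event_id": "e2", "type": "repin", "pin_id": 10, "ip": "2.2.2.2"},
    {"event_id": "e2", "type": "repin", "pin_id": 10, "ip": "2.2.2.2"},
    {"event_id": "e3", "type": "repin", "pin_id": 11, "ip": "1.1.1.1"},
    {"event_id": "e4", "type": "impression", "pin_id": 11, "ip": "1.1.1.1"},
]

# Hypothetical lookup tables standing in for the pin and geo enrichment sources.
PINS = {10: "Recipes", 11: "Travel"}
GEO = {"1.1.1.1": "US", "2.2.2.2": "JP"}

# Event types we keep; everything else is filtered out.
TRACKED_TYPES = ("pin", "repin")


def enrich(event):
    """Attach a pin category and a country to one event."""
    return {
        **event,
        "category": PINS.get(event["pin_id"], "unknown"),
        "country": GEO.get(event["ip"], "unknown"),
    }


def run_pipeline(raw_events):
    """Filter, enrich and store events; return the database connection."""
    db = sqlite3.connect(":memory:")
    for event_type in TRACKED_TYPES:
        # One table per event type. The primary key on event_id is what
        # turns a redelivered event into a no-op on insert.
        db.execute(
            f"CREATE TABLE {event_type}s (event_id TEXT PRIMARY KEY, "
            "pin_id INTEGER, category TEXT, country TEXT)"
        )
    for event in raw_events:
        if event["type"] not in TRACKED_TYPES:
            continue
        e = enrich(event)
        db.execute(
            f"INSERT OR IGNORE INTO {e['type']}s VALUES (?, ?, ?, ?)",
            (e["event_id"], e["pin_id"], e["category"], e["country"]),
        )
    db.commit()
    return db


db = run_pipeline(RAW_EVENTS)

# An engagement metric: repins per country, computed with plain SQL.
repins_by_country = db.execute(
    "SELECT country, COUNT(*) FROM repins GROUP BY country ORDER BY country"
).fetchall()
print(repins_by_country)  # [('JP', 1), ('US', 1)]
```

The duplicate "e2" event is stored only once, so the counts reflect unique events even though the input contained a redelivery. In the real setup the same idea would be expressed with Spark Streaming transformations and a key on the MemSQL table; the exact schema is not described in this post.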

Visualizing the Data

We built a demo with Pinterest to showcase the locations of repins as they happen. When an image is repinned, circles on the globe expand, providing a visual representation of the concentration of repins by location.
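
One way to think about the expanding circles is as a count of repins per map cell. The sketch below is only an illustration of that idea, not the demo's code: the coordinates, the one-degree grid and the square-root scaling of the radius are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical repin locations as (latitude, longitude) pairs.
REPIN_LOCATIONS = [
    (37.77, -122.42),
    (37.80, -122.27),
    (40.71, -74.01),
    (37.34, -121.89),
]


def cell(lat, lon, size_deg=1.0):
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (
        math.floor(lat / size_deg) * size_deg,
        math.floor(lon / size_deg) * size_deg,
    )


def circle_radii(locations, base_radius=2.0):
    """Return {cell: radius}, with the radius growing as sqrt(count)."""
    counts = Counter(cell(lat, lon) for lat, lon in locations)
    return {c: base_radius * math.sqrt(n) for c, n in counts.items()}


radii = circle_radii(REPIN_LOCATIONS)
for grid_cell, radius in sorted(radii.items()):
    print(grid_cell, round(radius, 2))
```

Scaling the radius with the square root of the count keeps the circle's area proportional to the number of repins, so a cell with twice the activity looks twice as large rather than four times.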

The Point

This infrastructure gives Pinterest the ability to identify (and react to) developing trends as they happen. In turn, Pinterest and their partners can get a better understanding of user behavior and provide more value to the Pinner community. Because everything is SQL-based, access to data is more widespread. Engineers and analysts can work with familiar tools to run queries and track high-value user activity such as repins.

Initial Results

After integrating Spark Streaming and MemSQL, running on AWS, into their data stack, Pinterest now has a source of record for sharing relevant user engagement data and metrics with their data analysts and with key brands.



Neil Dahlke


Engineer. @hashicorp , formerly @memsql , @UChiResearch . @depaulu alum.