I’m of course delighted that one of my colleagues at LinkedIn was able to participate in the CIKM 2011 Industry Event. Principal software engineer Chavdar Botev delivered a presentation on “Databus: A System for Timeline-Consistent Low-Latency Change Capture“.
LinkedIn processes a massive amount of member data and activity. It has over 135M members and is growing faster than two new members per second. Based on recent measurements, those members are on track to perform more than four billion searches on the LinkedIn platform in 2011. All of this activity requires a data change capture mechanism that allows external systems, such as its graph index and real-time full-text search index Zoie, to act as subscribers in user space and stay up to date with constantly changing data in the primary stores.
LinkedIn has built the Databus system to meet these needs. Databus meets four key requirements: timeline consistency, guaranteed delivery, low latency, and user-space visibility. For example, edits to member profile fields, such as companies and job titles, need to be standardized. Also, in order to give recruiters act quickly on feedback to their job postings, we need to be able to propagate the changes to the job description in near-real-time.
Databus propagates data changes throughout LinkedIn’s architecture. When there is a change in a primary store (e.g., member profiles or connections), the changes are buffered in the Databus Relay through a push or pull interface. The relay can also capture the transactional semantics of updates. Clients poll for changes in the relay. If a client falls behind the stream of change events in the relay, it is redirected to a Bootstrap database that delivers a compressed delta of the changes since the last event seen by the client.
In contrast to generic message systems (including the Kafka system that LinkedIn has open-sourced through Apache), Databus has moreinsight in the structure of the messages and can thus do better than just guaranteeing message-level integrity andtransactional semantics for communication sessions.
I tend to live a few levels above core infrastructure, but I’m grateful that Chavdar and his colleagues build the core platform that makes all of our large-scale data collection possible. After all, without data we have no data science.