
CIKM 2011 Industry Event: Chavdar Botev on Databus: A System for Timeline-Consistent Low-Latency Change Capture

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

I’m of course delighted that one of my colleagues at LinkedIn was able to participate in the CIKM 2011 Industry Event. Principal software engineer Chavdar Botev delivered a presentation on “Databus: A System for Timeline-Consistent Low-Latency Change Capture“.

LinkedIn processes a massive amount of member data and activity. It has over 135 million members and is adding more than two new members per second. Based on recent measurements, those members are on track to perform more than four billion searches on the LinkedIn platform in 2011. All of this activity requires a data change capture mechanism that allows external systems, such as its graph index and its real-time full-text search index Zoie, to act as subscribers in user space and stay up to date with constantly changing data in the primary stores.

LinkedIn has built the Databus system to meet these needs. Databus meets four key requirements: timeline consistency, guaranteed delivery, low latency, and user-space visibility. For example, edits to member profile fields, such as companies and job titles, need to be standardized. Also, to let recruiters act quickly on feedback to their job postings, we need to be able to propagate changes to job descriptions in near real time.

Databus propagates data changes throughout LinkedIn’s architecture. When there is a change in a primary store (e.g., member profiles or connections), the change is buffered in the Databus Relay through a push or pull interface. The relay can also capture the transactional semantics of updates. Clients poll the relay for changes. If a client falls behind the stream of change events in the relay, it is redirected to a Bootstrap database that delivers a compressed delta of the changes since the last event seen by the client.
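
To make the relay/bootstrap flow concrete, here is a minimal sketch of a consumer poll loop. The class and method names are hypothetical illustrations of the flow described above, not the actual Databus API.

```python
# A minimal sketch (my own illustration, not the real Databus API) of the
# consumer flow described above: poll the relay for new change events, and
# fall back to the bootstrap service for a compressed delta when the consumer
# has fallen too far behind the relay's in-memory buffer.

import time


class RelayFellBehindError(Exception):
    """Raised when the requested sequence number has aged out of the relay buffer."""


def consume_changes(relay, bootstrap, apply_event, last_scn=0, poll_interval=0.1):
    scn = last_scn  # sequence number of the last change event applied
    while True:
        try:
            events = relay.poll(since_scn=scn)  # pull events newer than scn
        except RelayFellBehindError:
            # The relay no longer holds events this old; catch up from the
            # bootstrap database, which serves a compressed delta of changes.
            events = bootstrap.fetch_delta(since_scn=scn)
        for event in events:    # events arrive in commit order, preserving
            apply_event(event)  # timeline consistency for the subscriber
            scn = event.scn
        time.sleep(poll_interval)
```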

In contrast to generic messaging systems (including the Kafka system that LinkedIn has open-sourced through Apache), Databus has more insight into the structure of the messages and can thus do better than just guaranteeing message-level integrity and transactional semantics for communication sessions.

I tend to live a few levels above core infrastructure, but I’m grateful that Chavdar and his colleagues build the core platform that makes all of our large-scale data collection possible. After all, without data we have no data science.

 


CIKM 2011 Industry Event: Khalid Al-Kofahi on Combining Advanced Search Technology and Human Expertise in Legal Research

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The original program for the CIKM 2011 Industry Event featured Peter Jackson, who was chief scientist at Thomson Reuters and the author of numerous books and papers on natural language processing. Sadly, Peter died on August 3, 2011. Khalid Al-Kofahi, VP of Research at Thomson Reuters R&D, graciously agreed to speak in his place, delivering a presentation on “Combining Advanced Search Technology and Human Expertise in Legal Research”.

Khalid began by giving an “83-second” overview of the US legal system, laying out the roles of the law, the courts, and the legislature. He did so to provide context for the domain that Thomson Reuters serves — namely, legal information. Legal information providers curate legal information, enhance it editorially and algorithmically, and work to make it findable and explainable in particular task contexts. He then worked through an example of how a case law document (specifically, Burger King v. Rudzewicz) appears in WestlawNext, with annotations that include headnotes, topic codes, citation data, and historical context.

Channeling William Goffman, Khalid asserted that a document’s content (words, phrases, metadata) is not sufficient to determine its aboutness and importance. Rather, we also have to consider what other people say about the document and how they interact with it. This is especially true in the legal domain because of the precedential nature of law. He then framed legal search in terms of information retrieval metrics, stating the requirements as completeness (recall), accuracy (precision), and authority. Not surprisingly, Khalid agreed with Stephen Robertson’s emphasis on the importance of recall.

Speaking more generally, Khalid noted that vertical search is not just about search. Rather, it’s about findability, which includes navigation, recommendations, clustering, faceted classification, collaboration, etc. Most importantly, it’s about satisfying a set of well-understood tasks. And, particularly in the legal domain, customers demand explainable models. Beyond this demand, explainability serves an additional purpose: it enables the human searcher to add value to the process (cf. human-computer information retrieval).

It is sad to lose a great researcher like Peter Jackson from our ranks, but I am grateful that Khalid was able to honor his memory by presenting their joint work at CIKM. If you’d like to learn more, I encourage you to read the publications on the Thomson Reuters Labs page.


CIKM 2011 Industry Event: Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

[Embedded SlideShare player: http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=20111027cikm-111116104111-phpapp02&rel=0&stripped_title=jeff-10188294&userName=dtunkelang]

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The third speaker in the program was Cloudera co-founder and Chief Scientist Jeff Hammerbacher. Jeff, recently hailed by Tim O’Reilly as one of the world’s most powerful data scientists, built the Facebook Data Team, which is best known for open-source contributions that include Hive and Cassandra. Jeff’s talk was entitled “Experiences Evolving a New Analytical Platform: What Works and What’s Missing”. I am thankful to Jeff Dalton for live-blogging a summary.

Jeff’s talk was a whirlwind tour through the philosophy and technology for delivering large-scale analytics (aka “big data”) to the world:

1) Philosophy

The true challenges in the task of data mining are creating a data set with the relevant and accurate information and determining the appropriate analysis techniques. While in the past it made sense to plan data storage and structure around the intended use of the data, the economics of storage and the availability of open-source analytics platforms argue for the reverse: data first, ask questions later; store first, establish structure later. The goal is to enable everyone — developers, analysts, business users — to “party on the data”, providing infrastructure that keeps them from clobbering one another or starving each other of resources.

2) Defining the Platform

No one just uses a relational database anymore. For example, consider Microsoft SQL Server. It is actually part of a unified suite that includes SharePoint for collaboration, PowerPivot for OLAP, StreamInsight for complex event processing (CEP), etc. As with the LAMP stack, there is a coherent framework for analytical data management, which we can call an analytical data platform.

3) Cloudera’s Platform

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data-intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), Pig Latin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.
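
To ground the processing layer in something concrete, here is a word count written for Hadoop Streaming in Python. Hadoop Streaming is standard Hadoop rather than anything Jeff called out specifically; in practice the mapper and reducer would live in separate scripts passed to the streaming jar via -mapper and -reducer, each invoking its function from a __main__ block.

```python
# mapper.py -- emit (word, 1) for every word read from stdin
import sys

def map_stdin():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

# reducer.py -- Hadoop sorts mapper output by key, so equal words arrive together
def reduce_stdin():
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print("%s\t%d" % (current_word, count))  # flush the previous word
            count = 0
        current_word = word
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))
```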

4) What’s Next?

For the substrate, we can expect support for fat servers with fat pipes, operating system support for isolation, and improved local filesystems (e.g., btrfs). Storage improvements will give us a unified file format, compression, better performance and availability, richer metadata, distributed snapshots, replication across data centers, native client access, and separation of namespace and block management. We will see stabilization of the existing compute tools and greater variety, as well as improved fault tolerance, isolation and workload management, low-latency job scheduling, and a unified execution backend for workflow. And we will see better integration through REST API access to all platform components, better document ingest, maintenance of source catalog and provenance information, and integration with analytics tools beyond ODBC. We will also see tools that facilitate the transition from unstructured to structured data (e.g., RecordBreaker).

Jeff’s talk was as information-dense as this post suggests, and I hope the mostly academic CIKM audience was not too shell-shocked. It’s fantastic to see practitioners not only building essential tools for research in information and knowledge management, but also reaching out to the research community to build bridges. I saw lots of intense conversation after his talk, and I hope the results realize the two-fold mission of the Industry Event: to give researchers an opportunity to learn about the problems most relevant to industry practitioners, and to offer practitioners an opportunity to deepen their understanding of the field in which they are working.


CIKM 2011 Industry Event: John Giannandrea on Freebase – A Rosetta Stone for Entities

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The second speaker in the program was Metaweb co-founder John Giannandrea. Google acquired Metaweb last year and has kept its promise to maintain Freebase as a free and open database for the world (including for rival search engine Bing — though I’m not sure if Bing is still using Freebase). John’s talk was entitled “Freebase – A Rosetta Stone for Entities”. I am thankful to Jeff Dalton for live-blogging a summary.

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, the Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).
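
To give a flavor of MQL (my own example, not one from the talk): a query is a JSON template in which null and [] mark the slots Freebase should fill in. A query for all of Arnold Schwarzenegger’s types might look like this:

```python
# A sketch of an MQL query (my illustration, not an example from the talk).
# An MQL query is a JSON template: null and [] mark the slots Freebase should
# fill in. This one asks for every type attached to the entity named
# "Arnold Schwarzenegger", illustrating that one entity can carry many types.
import json

query = [{
    "name": "Arnold Schwarzenegger",
    "id": None,   # return the entity's Freebase id
    "type": []    # return all of its types (politician, actor, ...)
}]

# The {"query": ...} envelope matches the mqlread convention as I recall it;
# treat the exact envelope and endpoint as assumptions and check the docs.
print(json.dumps({"query": query}, indent=2))
```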

Freebase does have its challenges. The requirement to keep out duplicates is an onerous one, as they discovered when importing a portion of the Open Library catalog. Maintaining quality calls for significant manual curation, and quality varies across the knowledge base. John asserted that Freebase provides 99% accuracy at the 95th percentile, though it’s not clear to me what that means (update: see Bill’s comment below).

While I still have concerns about Freebase’s robustness as a structured knowledge base (see my post on “In Search Of Structure”), I’m excited to see Google investing in structured representations of knowledge. To hear more about Google’s efforts in this space, check out the Strata New York panel I moderated on Entities, Relationships, and Semantics — the panelists included Andrew Hogue, who leads Google’s structured data and information extraction group and managed me during my year at Google New York.


CIKM 2011 Industry Event: Stephen Robertson on Why Recall Matters

On October 27th, I had the pleasure to chair the CIKM 2011 Industry Event with former Endeca colleague Tony Russell-Rose. It is my pleasure to report that the program, held in parallel with the main conference sessions, was a resounding success. Since not everyone was able to make it to Glasgow for this event, I’ll use this and subsequent posts to summarize the presentations and offer commentary. I’ll also share any slides that presenters made available to me.

Microsoft researcher Stephen Robertson, who may well be the world’s preeminent living researcher in the area of information retrieval, opened the program with a talk on “Why Recall Matters”. For the record, I didn’t put him up to this, despite my strong opinions on the subject.

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into a single decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of the precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.

He went on to remind us that there is information retrieval beyond web search. He hauled out the usual examples of recall-oriented tasks: e-discovery, prior art search, and evidence-based medicine. But he then made the case that not only is the web not the only problem in information retrieval, but that “it’s the web that’s strange” relative to the rest of the information retrieval landscape in so strongly favoring precision over recall. He enumerated some of the peculiarities of the web, including its size (there’s only one web!), the extreme variation in authorship and quality, the lack of any content standardization (efforts like schema.org notwithstanding), and the advertising-based monetization model that creates an unusual and sometimes adversarial relationship between content owners and search engines. In particular, he cited enterprise search as an information retrieval domain that violates the assumptions of web search and calls for more emphasis on recall.

Stephen suggested that, rather than thinking in terms of the precision-recall curve, we consider the recall-fallout curve. Fallout is a relatively unknown measure that represents the probability that a non-relevant document is retrieved by the query. He noted that fallout offered little practical use in IR, given that the corpus is populated almost entirely by non-relevant documents. Still, he made the case that the recall-fallout trade-off might be more conceptually appropriate than the precision-recall curve in order to understand the value of recall.
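
For reference, here are the standard definitions in set notation (my summary, not a slide from the talk), writing A for the retrieved set and R for the relevant documents, with the bar denoting the non-relevant complement:

```latex
% Standard definitions (my notation, not taken from the slides):
% A = retrieved set, R = relevant documents, \bar{R} = non-relevant documents.
\[
\text{recall} = \frac{|A \cap R|}{|R|}, \qquad
\text{precision} = \frac{|A \cap R|}{|A|}, \qquad
\text{fallout} = \frac{|A \cap \bar{R}|}{|\bar{R}|}.
\]
% Because |\bar{R}| is huge for any realistic corpus, fallout is tiny for any
% reasonable query, which is why it has seen so little direct practical use.
```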

In particular, we can generalize the traditional inverse precision-recall relationship to the hypothesis that the recall-fallout curve is convex (details in “On score distributions and relevance”). We can then calculate instantaneous precision at any point in the result list as the gradient of the recall-fallout curve. Going back to the notion of devices, we can now replace precision devices with fallout devices.
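
Here is one way to see the connection between the convexity hypothesis and instantaneous precision (my reconstruction of the argument, not a transcription of the talk). Moving one step further down the ranking retrieves some additional relevant and non-relevant documents, and the slope of the recall-fallout curve at that point determines the precision of that step:

```latex
% My reconstruction of the argument (not a transcription of the talk).
% One further step down the ranking retrieves \Delta r relevant and
% \Delta n non-relevant documents. The local slope of the recall-fallout curve is
\[
g \;=\; \frac{d\,\text{recall}}{d\,\text{fallout}}
  \;\approx\; \frac{\Delta r / |R|}{\Delta n / |\bar{R}|}
  \;=\; \frac{\Delta r\,|\bar{R}|}{\Delta n\,|R|},
\]
% and the instantaneous precision of that step is
\[
p \;=\; \frac{\Delta r}{\Delta r + \Delta n}
  \;=\; \frac{g\,|R|}{g\,|R| + |\bar{R}|}
  \;=\; \frac{gG}{gG + (1 - G)},
  \qquad G = \frac{|R|}{N} \ (\text{the query's generality}).
\]
% If the recall-fallout curve is convex, g decreases as we descend the ranking,
% so p decreases too, recovering the familiar precision-recall trade-off.
```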

Stephen wrapped up his talk by emphasizing the user of information retrieval systems — an aspect of IR that is too often neglected outside HCIR circles. He advocated that systems provide users with evidence of recall, guidance on how far to go down the ranked results, and a prediction of the recall at any given stopping point.

It was an extraordinary privilege to have Stephen Robertson present at the CIKM Industry Event, and even better to have him make a full-throated argument in favor of recall. I can only hope that researchers and practitioners take him up on it.


Entities, Relationships, and Semantics: Strata NY Panel on the State of Structured Search

Earlier this year, I had the privilege to moderate a panel at Strata New York 2011 on Entities, Relationships, and Semantics: the State of Structured Search. The four panelists are people I’ve had the pleasure to work with over the years: Andrew Hogue (Google), Breck Baldwin (Alias-i), Evan Sandhaus (New York Times), and Wlodek Zadrozny (IBM Research). They work on some of the world’s largest structured search problems — from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding.

O’Reilly has compiled the nearly 50 hours of video from the conference and made the collection available for purchase. I was lucky to attend all of the keynotes and many of the breakout sessions, and I highly recommend them. In the meantime, you can see a recording of the panel I moderated.


Interview in Forbes: What is a Data Scientist?

Dan Woods has been interviewing a variety of folks to answer the question: “What is a data scientist?”, and I had the honor to participate in his series.

Here is a teaser of my interview:

Above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions.

Read the rest on Forbes.com. And thanks to Drew Conway for the awesome data science Venn diagram above.


RecSys 2011 Tutorial: Recommendations as a Conversation with the User


 

Last week, I had the privilege to present a tutorial at the 5th ACM International Conference on Recommender Systems (RecSys 2011). Given my passion for HCIR and my advocacy for transparency in recommender systems, it shouldn’t surprise regular readers that I focused on both. Unfortunately the tutorial was not recorded, but I hope the slides above prove useful. I also encourage you to take a look at the other tutorials, whose slides are posted on the conference site.


HCIR 2011: We Have Arrived!

If you followed the #hcir2011 tweet stream, then you already know what I have to say: the Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011) was an extraordinary success. We had about 100 people attending, 14 paper presentations, 28 posters, and 4 challenge entries, all packed into one intense day at Google’s beautiful Mountain View headquarters.

Wednesday evening before the workshop, we were treated to a welcome reception, the first of a few meals provided by Google’s excellent chefs. It was a great opportunity to reconnect with old friends and meet many first-time HCIR attendees.

Thursday started with a scrumptious breakfast that included chilaquiles, coconut fritters, and bacon. Last year’s keynote speaker and this year’s local host, Dan Russell, pulled out all the stops — apparently BigTable is the only Google cafe that serves bacon for breakfast! We then proceeded to a poster boaster session in which each poster presenter had a minute to pitch his or her poster. This session set the tone for the rest of the workshop: concentrated ideas and intense audience engagement.

Then came this year’s keynote speaker, Gary Marchionini. It was a particular treat to have Gary as a keynote speaker, since his lecture on “Toward Human-Computer Information Retrieval” inspired me to conceive the HCIR workshop back in 2007. And Gary delivered the goods. He started with a review of the history of HCIR, including some lesser-known figures like Don Hawkins (who was in the audience), Pauline Cochrane, Richard Marcus, and Charles Meadow. He brought a few chuckles by citing Nick Belkin (who was present) and Sue Dumais (who was not) as the father and mother of HCIR. Naturally he described some of his own work at the University of North Carolina, including the Open Video, Relation Browser, and ResultsSpace projects. But the highlight of his talk was a graph he presented showing two paths to the same user end-state, one of the paths being a smooth progression and the other a roller-coaster of ups and downs. The question of which one was better drew a wide variety of responses, my favorite being Gene Golovchinsky observing that learning is the friction of the information-seeking process.

We broke for coffee and then came back to the first session of paper presentations. Sofia Athenikos presented a semantic search engine that outperformed IMDB in a user study. Chang Liu explored the effect of task difficulty and domain knowledge on dwell times, finding counterintuitive results (at least for me) regarding the correlation of expertise to dwell time. Jingjing Liu presented research on knowledge examination in multi-session tasks. Then came the lightning talks: Mark Smucker on how users examine and process ranked document lists; Jin Kim on simulating associative browsing; Bill Kules on visualizing the stages of exploratory search; and Michael Cole on user domain knowledge and eye movement patterns during search. Way too much goodness to summarize here — I suggest you read the full papers on the workshop site.

Then came lunch — again in BigTable, but this time with outdoor seating — and the poster session. As always, this is the most interactive part of the day: two hours of non-stop discussion that starts over food and ends with prying people away from the posters. I was especially proud of LinkedIn’s contributions to the poster session, which covered faceted search log analysis, social navigation, and whether it is time to abandon abandonment.

Then back to the second session of paper presentations. Luanne Freund talked about document usefulness and genre, finding that genre, besides being hard for users to reliably identify, only matters for tasks that involve doing, deciding, or learning, but not for those that involve fact finding or problem solving. Gene Golovchinsky presented work on designing for collaboration in information seeking, previewing the system he used for his challenge entry. Alyona Medelyan used the Pingar search engine to evaluate how search interface features affect performance on biosciences tasks. Then more lightning talks: Rob Capra analyzing faceted search on mobile devices; Keith Bagley on conceptual mile markers for exploratory search; Xiaojun Yuan on how cognitive styles affect user performance; and Mike Zarro on using social tags and controlled vocabularies as search filters.

Last but not least came the HCIR Challenge:

The HCIR 2011 Challenge focuses on the case where recall is everything – namely, the problem of information availability. The information availability problem arises when the seeker faces uncertainty as to whether the information of interest is available at all. Instances of this problem include some of the highest-value information tasks, such as those facing national security and legal/patent professionals, who might spend hours or days searching to determine whether the desired information exists.

The corpus we will use for the HCIR 2011 Challenge is the CiteSeer digital library of scientific literature. The CiteSeer corpus contains over 750,000 documents and provides rich meta-data about documents, authors, and citations.

There were four entries: Faceted DBLP, GisterPro, the Query Analytics Workbench, and Querium.

The competition was fierce. Claudiu showed off the Faceted DBLP interface, which is well suited to the information availability task on CiteSeer data. Ed showed how GisterPro uses visualization to support the information seeking process. But it came down to a close call between the Query Analytics Workbench and Querium. Despite the Elsevier team’s impressive functionality and animated presentation, Gene’s simpler interface and application of ranked fusion won the day. Congratulations to Gene and Abdigani, this year’s HCIR Challenge winners!

We wrapped up the evening at the Tied House, a local microbrewery. And of course the discussion turned to where, when, and how we will hold next year’s workshop. Watch this space. In the meantime, my heartfelt thanks to everyone who made this year’s workshop such a success — and especially to our sponsors. Thank you Endeca, Kent State, Microsoft, and Google!


Oracle Acquires Endeca!

   

Today is a wonderful day for Endeca and Oracle! Oracle has announced that it has entered into an agreement to acquire Endeca, bringing together two of the powerhouses of information access. Quoting from the announcement: “The combination of Oracle and Endeca is expected to create a comprehensive technology platform to process, store, manage, search and analyze structured and unstructured information together.”

As part of Endeca’s founding team, I am very proud to see this day. My ten years at Endeca were a formative experience that established my professional identity and inspired my passion to pursue the vision of human-computer information retrieval (by happy coincidence, the 5th annual HCIR workshop takes place on Thursday). Reading Oracle’s presentation about the acquisition, I’m excited to see how Endeca’s technology will play a key role in unifying structured and unstructured data management and analysis for Oracle’s customers.

I take pride in my contributions to Endeca — I still slip sometimes and refer to Endeca as “we”. But the real heroes here are the folks — and especially the leadership — who have seen this journey through from start to finish. In particular, I am grateful to Steve Papa, Pete Bell, Adam Ferrari, Jack Walter, Keith Johnson, Nik Bates-Haus, and Jason Purcell for everything they have done to bring about this extraordinary outcome.

Finally, excited as I am about this event, it is only the beginning. I am excited to see Endeca’s people and technology powering one of the world’s largest enterprise software companies. Looking forward to the next play!