CIKM 2011 Industry Event: Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The third speaker in the program was Cloudera co-founder and Chief Scientist Jeff Hammerbacher. Jeff, recently hailed by Tim O’Reilly as one of the world’s most powerful data scientists, built the Facebook Data Team, which is best known for open-source contributions that include Hive and Cassandra. Jeff’s talk was entitled “Experiences Evolving a New Analytical Platform: What Works and What’s Missing”. I am thankful to Jeff Dalton for live-blogging a summary.

Jeff’s talk was a whirlwind tour through the philosophy and technology for delivering large-scale analytics (aka “big data”) to the world:

1) Philosophy

The true challenges in the task of data mining are creating a data set with the relevant and accurate information and determining the appropriate analysis techniques. While in the past it made sense to plan data storage and structure around the intended use of the data, the economics of storage and the availability of open-source analytics platforms argue for the reverse: data first, ask questions later; store first, establish structure later. The goal is to enable everyone — developers, analysts, business users — to “party on the data”, providing infrastructure that keeps them from clobbering one another or starving each other of resources.
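The "store first, establish structure later" philosophy is often called schema-on-read. As a minimal sketch of the idea (the event fields here are hypothetical, not from Jeff's talk), you land raw records as-is and impose structure only when someone asks a question:

```python
import json

# Land raw events first; impose structure only when a question is asked.
raw_log = [
    '{"user": "alice", "action": "view", "item": 42}',
    '{"user": "bob", "action": "purchase", "item": 42, "price": 9.99}',
    'a corrupt line that a rigid load-time schema would have rejected',
]

def parse(line):
    """Schema-on-read: interpret a record at query time, tolerating noise."""
    try:
        return json.loads(line)
    except ValueError:
        return None  # the raw bytes are still stored; skip them for this query

purchases = [r for r in map(parse, raw_log) if r and r.get("action") == "purchase"]
print(purchases)  # only bob's purchase; the corrupt line is skipped, not lost
```

The point is that the corrupt line costs nothing at ingest time and can still be recovered later if a better parser comes along.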

2) Defining the Platform

No one just uses a relational database anymore. For example, consider Microsoft SQL Server. It is actually part of a unified suite that includes SharePoint for collaboration, PowerPivot for OLAP, StreamInsight for complex event processing (CEP), etc. As with the LAMP stack, there is a coherent framework for analytical data management, which we can call an analytical data platform.

3) Cloudera’s Platform

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data-intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), Pig Latin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.
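For readers unfamiliar with the bottom of that compute stack, the MapReduce programming model that those frameworks build on can be sketched in a few lines of pure Python — a toy, single-machine caricature, not Hadoop's actual API:

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """Toy in-memory MapReduce: map each record to (key, value) pairs,
    group ("shuffle") by key, then reduce each group independently."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # the shuffle phase
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example.
docs = ["big data", "big analytics", "data data"]
counts = mapreduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 2, 'data': 3, 'analytics': 1}
```

High-level interfaces like Pig Latin and HiveQL exist precisely so that analysts can express queries like this without writing mappers and reducers by hand.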

4) What’s Next?

For the substrate, we can expect support for fat servers with fat pipes, operating system support for isolation, and improved local filesystems (e.g., btrfs). Storage improvements will give us a unified file format, compression, better performance and availability, richer metadata, distributed snapshots, replication across data centers, native client access, and separation of namespace and block management. We will see stabilization of our existing compute tools and better variety, as well as improved fault tolerance, isolation and workload management, low-latency job scheduling, and a unified execution backend for workflow. And we will see better integration through REST API access to all platform components, better document ingest, maintenance of source catalog and provenance information, and integration with analytics tools beyond ODBC. We will also see tools that facilitate the transition from unstructured to structured data (e.g., RecordBreaker).

Jeff’s talk was as information-dense as this post suggests, and I hope the mostly-academic CIKM audience was not too shell-shocked. It’s fantastic to see practitioners not only building essential tools for research in information and knowledge management, but reaching out to the research community to build bridges. I saw lots of intense conversation after his talk, and I hope the results realize the two-fold mission of the Industry Event, which is to give researchers an opportunity to learn about the problems most relevant to industry practitioners, and to offer practitioners an opportunity to deepen their understanding of the field in which they are working.

CIKM 2011 Industry Event: John Giannandrea on Freebase – A Rosetta Stone for Entities

This post is part of a series summarizing the presentations at the CIKM 2011 Industry Event, which I chaired with former Endeca colleague Tony Russell-Rose.

The second speaker in the program was Metaweb co-founder John Giannandrea. Google acquired Metaweb last year and has kept its promise to maintain Freebase as a free and open database for the world (including for rival search engine Bing — though I’m not sure if Bing is still using Freebase). John’s talk was entitled “Freebase – A Rosetta Stone for Entities“. I am thankful to Jeff Dalton for live-blogging a summary.

John started by introducing Freebase as a representation of structured objects corresponding to real-world entities and connected by a directed graph of relationships. In other words, a semantic web. While it isn’t quite web-scale, Freebase is a large and growing knowledge base consisting of 25 million entities and 500 million connections — and doubling annually. The core concept in Freebase is a type, and an entity can have many types. For example, Arnold Schwarzenegger is a politician and an actor. John emphasized the messiness of the real world. For example, most actors are people, but what about the dog who played Lassie? It’s important to support exceptions.

The main technical challenge for Freebase is reconciliation — that is, determining how similar a set of data is to existing Freebase topics. John pointed out how critical it is for Freebase to avoid duplication of content, since the utility of Freebase depends on unique nodes in its graph corresponding to unique objects in the world. Freebase obtains many of its entities by reconciling large, open-source knowledge bases — including Wikipedia, WordNet, the Library of Congress Authorities, and metadata from the Stanford Library. Freebase uses a variety of tools to implement reconciliation, including Google Refine (formerly known as Freebase Gridworks) and Matchmaker, a tool for gathering human judgments. While reconciliation is a hard technical problem, it is made possible by making inferences across the web of relationships that link entities to one another.
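Reconciliation is essentially a record-linkage problem. A deliberately simplified sketch (my own toy scoring, nothing like Freebase's production pipeline) matches an incoming record against existing topics by overlap of name tokens and shared keys, and refuses to match below a threshold rather than risk creating a duplicate node:

```python
def jaccard(a, b):
    """Set overlap in [0, 1]: 1.0 means identical sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def reconcile(candidate, topics, threshold=0.5):
    """Return the best-matching existing topic, or None — meaning:
    create a new node rather than risk merging two distinct entities."""
    def score(topic):
        name_sim = jaccard(candidate["name"].lower().split(),
                           topic["name"].lower().split())
        key_sim = jaccard(candidate.get("keys", []), topic.get("keys", []))
        return 0.7 * name_sim + 0.3 * key_sim   # weights are arbitrary here
    best = max(topics, key=score)
    return best if score(best) >= threshold else None

topics = [
    {"name": "Arnold Schwarzenegger", "keys": ["/wikipedia/Arnold_Schwarzenegger"]},
    {"name": "Arnold Palmer", "keys": ["/wikipedia/Arnold_Palmer"]},
]
match = reconcile({"name": "arnold schwarzenegger", "keys": []}, topics)
print(match["name"])  # Arnold Schwarzenegger
```

The real system can do far better than string similarity precisely because, as John noted, it can make inferences across the graph of relationships surrounding each candidate.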

John then presented Freebase as a Rosetta Stone for entities on the web. Since an entity is simply a collection of keys (one of which is its name), Freebase’s job is to reverse engineer the key-value store that is distributed among the entity’s web references, e.g., the structured databases backing web sites and encoding keys in URL parameters. He noted that Freebase itself is schema-less (it is a graph database), and that even the concept of a type is itself an entity (“Type type is the only type that is an instance of itself”). Google makes Freebase available through an API and the Metaweb Query Language (MQL).
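MQL is query-by-example over JSON: you write the shape of the answer you want, with `null` (or `[]`) marking the slots Freebase should fill in. A query in the spirit of the Schwarzenegger example might look like this (an illustrative sketch of the syntax, not a query I have run against the live API):

```json
[{
  "name": "Arnold Schwarzenegger",
  "type": [],
  "id": null
}]
```

This asks for all the types of the entity with that name (politician, actor, ...) along with its canonical id.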

Freebase does have its challenges. The requirement to keep out duplicates is an onerous one, as they discovered when importing a portion of the Open Library catalog. Maintaining quality calls for significant manual curation, and quality varies across the knowledge base. John asserted that Freebase provides 99% accuracy at the 95th percentile, though it’s not clear to me what that means (update: see Bill’s comment below).

While I still have concerns about Freebase’s robustness as a structured knowledge base (see my post on “In Search Of Structure“), I’m excited to see Google investing in structured representations of knowledge. To hear more about Google’s efforts in this space, check out the Strata New York panel I moderated on Entities, Relationships, and Semantics — the panelists included Andrew Hogue, who leads Google’s structured data and information extraction group and managed me during my year at Google New York.

CIKM 2011 Industry Event: Stephen Robertson on Why Recall Matters

On October 27th, I had the pleasure to chair the CIKM 2011 Industry Event with former Endeca colleague Tony Russell-Rose. It is my pleasure to report that the program, held in parallel with the main conference sessions, was a resounding success. Since not everyone was able to make it to Glasgow for this event, I’ll use this and subsequent posts to summarize the presentations and offer commentary. I’ll also share any slides that presenters made available to me.

Microsoft researcher Stephen Robertson, who may well be the world’s preeminent living researcher in the area of information retrieval, opened the program with a talk on “Why Recall Matters“. For the record, I didn’t put him up to this, despite my strong opinions on the subject.

Stephen started by reminding us of ancient times (i.e., before the web), when at least some IR researchers thought in terms of set retrieval rather than ranked retrieval. He reminded us of the precision and recall “devices” that he’d described in his Salton Award Lecture — an idea he attributed to the late Cranfield pioneer Cyril Cleverdon. He noted that, while set retrieval uses distinct precision and recall devices, ranking conflates both into a single decision of where to truncate a ranked result list. He also pointed out an interesting asymmetry in the conventional notion of the precision-recall tradeoff: while returning more results can only increase recall, there is no certainty that the additional results will decrease precision. Rather, this decrease is a hypothesis that we associate with systems designed to implement the probability ranking principle, returning results in decreasing order of probability of relevance.

He went on to remind us that there is information retrieval beyond web search. He hauled out the usual examples of recall-oriented tasks: e-discovery, prior art search, and evidence-based medicine. But he then made the case not only that the web is not the only problem in information retrieval, but that “it’s the web that’s strange” relative to the rest of the information retrieval landscape in so strongly favoring precision over recall. He enumerated some of the peculiarities of the web, including its size (there’s only one web!), the extreme variation in authorship and quality, the lack of any content standardization (efforts like schema.org notwithstanding), and the advertising-based monetization model that creates an unusual and sometimes adversarial relationship between content owners and search engines. In particular, he cited enterprise search as an information retrieval domain that violates the assumptions of web search and calls for more emphasis on recall.

Stephen suggested that, rather than thinking in terms of the precision-recall curve, we consider the recall-fallout curve. Fallout is a relatively unknown measure that represents the probability that a non-relevant document is retrieved by the query. He noted that fallout offered little practical use in IR, given that the corpus is populated almost entirely by non-relevant documents. Still, he made the case that the recall-fallout trade-off might be more conceptually appropriate than the precision-recall curve in order to understand the value of recall.
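Concretely: recall is the fraction of relevant documents retrieved, while fallout is the fraction of non-relevant documents retrieved. A quick sketch of the bookkeeping (the numbers are made up for illustration) also shows why fallout is nearly useless as a practical measure — the denominator is almost the whole corpus:

```python
def recall_fallout(retrieved, relevant, corpus_size):
    """Recall  = |retrieved ∩ relevant| / |relevant|
    Fallout = |retrieved - relevant| / |non-relevant docs in corpus|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant)
    fallout = (len(retrieved) - hits) / (corpus_size - len(relevant))
    return recall, fallout

# A 1,000-document corpus with 10 relevant documents; the query
# retrieves 20 documents, 8 of which are relevant.
relevant = set(range(10))
retrieved = set(range(8)) | set(range(100, 112))
r, f = recall_fallout(retrieved, relevant, corpus_size=1000)
print(r, f)  # recall 0.8, fallout ≈ 0.0121
```

Even a mediocre result set barely moves fallout off zero, which is exactly Stephen's observation — and why its value here is conceptual rather than operational.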

In particular, we can generalize the traditional inverse precision-recall relationship to the hypothesis that the recall-fallout curve is convex (details in “On score distributions and relevance“). We can then calculate instantaneous precision at any point in the result list as the gradient of the recall-fallout curve. Going back to the notion of devices, we can now replace precision devices with fallout devices.
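To make the gradient claim concrete, here is a short derivation in my own notation (not Stephen's slides). Write $g$ for the generality of the query — the fraction of the $N$-document corpus that is relevant:

```latex
% At a given cutoff, retrieved relevant = gNR and
% retrieved non-relevant = (1-g)NF, so precision is
P = \frac{gR}{gR + (1-g)F}.
% Moving the cutoff slightly adds gN\,dR relevant and (1-g)N\,dF
% non-relevant documents, so the precision of the marginal results is
P_{\mathrm{inst}} = \frac{g\,dR}{g\,dR + (1-g)\,dF}
  = \frac{g\,\tfrac{dR}{dF}}{g\,\tfrac{dR}{dF} + (1-g)},
% a monotone function of the slope dR/dF.
```

So instantaneous precision falls exactly where the recall-fallout curve flattens, which is why convexity of that curve recovers the familiar precision-recall trade-off.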

Stephen wrapped up his talk by emphasizing the user of information retrieval systems — an aspect of IR that is too often neglected outside HCIR circles. He advocated that systems provide users with evidence of recall, guidance on how far to go down the ranked results, and a prediction of the recall at any given stopping point.

It was an extraordinary privilege to have Stephen Robertson present at the CIKM Industry Event, and even better to have him make a full-throated argument in favor of recall. I can only hope that researchers and practitioners take him up on it.

Entities, Relationships, and Semantics: Strata NY Panel on the State of Structured Search

Earlier this year, I had the privilege to moderate a panel at Strata New York 2011 on Entities, Relationships, and Semantics: the State of Structured Search. The four panelists are people I’ve had the pleasure to work with over the years: Andrew Hogue (Google), Breck Baldwin (Alias-i), Evan Sandhaus (New York Times), Wlodek Zadrozny (IBM Research). They work on some of the world’s largest structured search problems — from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding.

O’Reilly has compiled the nearly 50 hours of video from the conference and made the collection available for purchase. I was lucky to attend all of the keynotes and many of the breakout sessions, and I highly recommend them. In the meantime, you can see a recording of the panel I moderated.

Interview in Forbes: What is a Data Scientist?

Dan Woods has been interviewing a variety of folks to answer the question: “What is a data scientist?“, and I had the honor to participate in his series.

Here is a teaser of my interview:

Above all, a data scientist needs to be able to derive robust conclusions from data. But a data scientist also needs to possess creativity and strong communication skills. Creativity drives the process of hypothesis generation, i.e., picking the right problems to solve that will create value for users and drive business decisions.

Read the rest on Forbes.com. And thanks to Drew Conway for the awesome data science Venn diagram above.

RecSys 2011 Tutorial: Recommendations as a Conversation with the User

Last week, I had the privilege to present a tutorial at the 5th ACM International Conference on Recommender Systems (RecSys 2011). Given my passion for HCIR and my advocacy for transparency in recommender systems, it shouldn’t surprise regular readers that I focused on both. Unfortunately the tutorial was not recorded, but I hope the slides above prove useful. I also encourage you to take a look at the other tutorials, whose slides are posted on the conference site.

HCIR 2011: We Have Arrived!

If you followed the #hcir2011 tweet stream, then you already know what I have to say: the Fifth Workshop on Human-Computer Interaction and Information Retrieval (HCIR 2011) was an extraordinary success. We had about 100 people attending, 14 paper presentations, 28 posters, and 4 challenge entries, all packed into one intense day at Google’s beautiful Mountain View headquarters.

Wednesday evening before the workshop, we were treated to a welcome reception, the first of a few meals provided by Google’s excellent chefs. It was a great opportunity to reconnect with old friends and meet many first-time HCIR attendees.

Thursday started with a scrumptious breakfast that included chilaquiles, coconut fritters, and bacon. Last year’s keynote speaker and this year’s local host Dan Russell pulled out all the stops — apparently BigTable is the only Google cafe that serves bacon for breakfast! We then proceeded to a poster boaster session in which each poster presenter had a minute to pitch his or her poster. This session set the tone for the rest of the workshop: concentrated ideas and intense audience engagement.

Then came this year’s keynote, Gary Marchionini. It was a particular treat to have Gary as a keynote, since his lecture on “Toward Human-Computer Information Retrieval” inspired me to conceive the HCIR workshop back in 2007. And Gary delivered the goods. He started with a review of the history of HCIR, including some lesser-known figures like Don Hawkins (who was in the audience), Pauline Cochrane, Richard Marcus, and Charles Meadow. He brought a few chuckles by citing Nick Belkin (who was present) and Sue Dumais (who was not) as the father and mother of HCIR. Naturally he described some of his own work at the University of North Carolina, including the Open Video, Relation Browser, and ResultsSpace projects. But the highlight of his talk was a graph he presented showing two paths to the same user end-state, one of the paths being a smooth progression and the other a roller-coaster of ups and downs. The question of which one was better drew a wide variety of responses, my favorite being Gene Golovchinsky’s observation that learning is the friction of the information-seeking process.

We broke for coffee and then came back to the first session of paper presentations. Sofia Athenikos presented a semantic search engine that outperformed IMDB in a user study. Chang Liu explored the effect of task difficulty and domain knowledge on dwell times, finding counterintuitive results (at least for me) regarding the correlation of expertise to dwell time. Jingjing Liu presented research on knowledge examination in multi-session tasks. Then came the lightning talks: Mark Smucker on how users examine and process ranked document lists; Jin Kim on simulating associative browsing; Bill Kules on visualizing the stages of exploratory search; and Michael Cole on user domain knowledge and eye movement patterns during search. Way too much goodness to summarize here — I suggest you read the full papers on the workshop site.

Then came lunch — again in BigTable, but this time with outdoor seating — and the poster session. As always, this is the most interactive part of the day: two hours of non-stop discussion that starts over food and ends with prying people away from their posters. I was especially proud of LinkedIn’s contributions to the poster session, which covered faceted search log analysis, social navigation, and whether it is time to abandon abandonment.

Then it was back to the second session of paper presentations. Luanne Freund talked about document usefulness and genre, finding that genre, besides being hard for users to reliably identify, only matters for tasks that involve doing, deciding, or learning, but not for those that involve fact finding or problem solving. Gene Golovchinsky presented work on designing for collaboration in information seeking, previewing the system he used for his challenge entry. Alyona Medelyan used the Pingar search engine to evaluate how search interface features affect performance on biosciences tasks. Then more lightning talks: Rob Capra analyzing faceted search on mobile devices; Keith Bagley on conceptual mile markers for exploratory search; Xiaojun Yuan on how cognitive styles affect user performance; and Mike Zarro on using social tags and controlled vocabularies as search filters.

Last but not least came the HCIR Challenge:

The HCIR 2011 Challenge focuses on the case where recall is everything – namely, the problem of information availability. The information availability problem arises when the seeker faces uncertainty as to whether the information of interest is available at all. Instances of this problem include some of the highest-value information tasks, such as those facing national security and legal/patent professionals, who might spend hours or days searching to determine whether the desired information exists.

The corpus we will use for the HCIR 2011 Challenge is the CiteSeer digital library of scientific literature. The CiteSeer corpus contains over 750,000 documents and provides rich meta-data about documents, authors, and citations.

There were four entries.

The competition was fierce. Claudiu showed off the Faceted DBLP interface, which is well suited to the information availability task on CiteSeer data. Ed showed how GisterPro uses visualization to support the information seeking process. But it came down to a close call between the Query Analytics Workbench and Querium. Despite the Elsevier team’s impressive functionality and animated presentation, Gene’s simpler interface and application of ranked fusion won the day. Congratulations to Gene and Abdigani, this year’s HCIR Challenge winners!

We wrapped up the evening at the Tied House, a local microbrewery. And of course the discussion turned to where, when, and how we will hold next year’s workshop. Watch this space. In the meantime, my heartfelt thanks to everyone who made this year’s workshop such a success — and especially to our sponsors. Thank you Endeca, Kent State, Microsoft, and Google!

Oracle Acquires Endeca!

Today is a wonderful day for Endeca and Oracle! Oracle has announced that it has entered into an agreement to acquire Endeca, bringing together two of the powerhouses of information access. Quoting from the announcement: “The combination of Oracle and Endeca is expected to create a comprehensive technology platform to process, store, manage, search and analyze structured and unstructured information together.”

As part of Endeca’s founding team, I am very proud to see this day. My ten years at Endeca were a formative experience that established my professional identity and inspired my passion to pursue the vision of human-computer information retrieval (by happy coincidence, the 5th annual HCIR workshop takes place on Thursday). Reading Oracle’s presentation about the acquisition, I’m excited to see how Endeca’s technology will play a key role in unifying structured and unstructured data management and analysis for Oracle’s customers.

I take pride in my contributions to Endeca — I still slip sometimes and refer to Endeca as “we”. But the real heroes here are the folks — and especially the leadership —  who have seen this journey through from start to finish. In particular, I am grateful to Steve Papa, Pete Bell, Adam Ferrari, Jack Walter, Keith Johnson, Nik Bates-Haus, and Jason Purcell for everything they have done to bring about this extraordinary outcome.

Finally, excited as I am about this event, it is only the beginning. I am excited to see Endeca’s people and technology powering one of the world’s largest enterprise software companies. Looking forward to the next play!

Keeping It Professional: Relevance, Recommendations, and Reputation at LinkedIn

Last week, I delivered the following presentation at the CMU Intelligence Seminar:

I had a great audience, including the department head! Of course that meant fielding tough questions, but that’s what makes it fun to present at my alma mater. Now that it’s been over a decade since my defense, I can handle the tough questions. 🙂

Unfortunately there is no video, but hopefully the slides are reasonably self-explanatory. If you have questions, please ask them in the comments.

Visiting the East Coast: CMU and Strata New York

Tonight I’m taking a red-eye to Pittsburgh so that I can spend three days at my (doctoral) alma mater, CMU. In addition to spending time with lots of great students and faculty, my goal is to communicate a taste of the hard computer science problems we are solving (or trying to solve!) at LinkedIn. I’m giving a tech talk Tuesday afternoon, joining my colleagues for an info session Tuesday evening, and participating in the Technical Opportunities Conference (TOC) Wednesday.

Here’s a teaser for my tech talk:

You can find more details about LinkedIn’s visits to CMU and other campuses at http://studentcareers.linkedin.com/.

Hopefully some of you are attending the O’Reilly Strata Conference in New York this Thursday and Friday. If so, I encourage you to attend my panel session on “Entities, Relationships, and Semantics: the State of Structured Search“:

Structured search improves the search experience through the identification of entities and their relationships in documents and queries. This panel will explore the current state of structured and semi-structured search, as well as exploring the open problems in an area that promises to revolutionize information seeking.

The four panelists work on some of the world’s largest structured search problems, from offering users structured search on Google’s web corpus to building a computing system that defeated Jeopardy! champions in an extreme test of natural language understanding. They work on the data, tools, and research that are driving this field. They are all excellent researchers and presenters, promising an informative and engaging panel discussion, for which I will act as moderator.

Panelists:

  • Andrew Hogue is a Senior Staff Engineer and Engineering Manager in the Search Quality group at Google New York. He has worked on a wide array of projects including question answering, Google Squared, sentiment analysis, local and product search, and Google Goggles. He is interested in the areas of structured data, information extraction, and machine learning, and their applications to search and search interfaces. Prior to Google, he earned an M.Eng. and B.S. in Computer Science from MIT.
  • Breck Baldwin is the President of Alias-i, creators of the popular LingPipe computational linguistics toolkit. He received his Ph.D. in computer science in 1995 from the University of Pennsylvania. In the time between his thesis on coreference resolution and evaluation and founding Alias-i in 1999, Breck worked on DARPA-funded projects through the University of Pennsylvania.
  • Evan Sandhaus works as the Semantic Technologist in The New York Times Research and Development Labs. He is spearheading The New York Times Linked Open Data Strategy and overseeing the release of 1.8 million documents to the computer science research community. Previously, Evan helped to put The New York Times on Google Earth, collaborated with New York University to explore new directions in News Search, and worked to bring The New York Times to Facebook.
  • Wlodek Zadrozny is an IBM researcher working on natural language applications. Most recently he worked on text sources for Watson (IBM’s Jeopardy! champion) and on applying related DeepQA technology to business problems. His previous work ranged from language processing research to product development and technical planning; in particular, he led the development of interaction systems that used speech, natural language, and focused search. Wlodek Zadrozny received a Ph.D. in Mathematics from the Polish Academy of Sciences.

And one more thing. Karaoke at Second on Second in the East Village on Friday night. It’s an unofficial Strata after-party, so come join us Big Data folks for some Big Fun.