Categories
General

Symposium on Semantic Knowledge Discovery, Organization and Use, Day 1

Today was the first day of the two-day NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU.

Here are some highlights:

  • Marti Hearst started us off with a discussion of tricks for statistical semantic knowledge discovery–namely, using “lots o’ text”, unambiguous cues, and “rewrite and verify”.
     
  • Dekang Lin showed off the power of “lots o’ text” by showing how the Google n-gram data could be used to perform various semantic discovery tasks.
     
  • Peter Turney argued that we need to combine symbolic representations for episodic information (i.e., what we obtain from information extraction) with spatial (i.e., vector space model) representations for semantic information.

There were a bunch of other talks that focused on the details of building and using semantic knowledge bases, but I’ll freely admit that I’m a bit of an outsider in this world. Nonetheless, I find the participation impressive in both quality and quantity.

I’ll post more notes tomorrow. And, if you’re in New York, I encourage you to attend tomorrow. They are letting people walk in, even if they haven’t registered in advance.

Categories
Uncategorized

Blogs I Read: Geeking with Greg

One of the great things about blogging as a social medium is that you quickly discover the most reputable people in your field. When I started blogging and participating in other blogs that dealt with topics related to information retrieval, I quickly discovered Geeking with Greg.

Greg Linden was the primary developer and designer of the Amazon.com recommendations engine, which is one of the most widely used–and perhaps the most famous–recommendation systems on the web. He tried his hand at a news and blog aggregator, Findory, that was based on collaborative filtering but ultimately gave up on that to join Microsoft Live Labs.

Greg has been blogging since 2004. His posts combine strong research interests with practical grounding. For example, he recently wrote about a paper at CIKM 2008 (by the way, he’s been great at blogging about the conference) on learning to rank using click data. Given my day job, I am sometimes involved in “bake offs” between our engine and those of competitors, and I’d be delighted to see prospects run evaluation experiments based on this paper.

Greg’s posts range from the theoretical (“Does the entropy of search logs indicate that search should be easy?”) to the practical (“Advertising, search, and drive-by malware”). He talks about great work coming out of Microsoft’s labs, but gives equal time to work from competing labs, such as Yahoo and Google.

In short, Geeking with Greg is a must-read for anyone serious about real-world information retrieval.

Categories
Uncategorized

Under Attack by Spammers

I’ve recently seen a sharp increase in spam comments that are getting past the Akismet filter. If it keeps up, I may need to take defensive action, such as moderating comments or using some kind of CAPTCHA. Please let me know if you have any suggestions. I don’t like anything that impedes the flow of legitimate comments, but I imagine that spam comments are just as annoying to you as they are to me.

Categories
Uncategorized

NSF Symposium on Semantic Knowledge Discovery, Organization and Use

This Friday and Saturday, I’ll be attending the NSF Sponsored Symposium on Semantic Knowledge Discovery, Organization and Use at NYU. Evidently it’s so popular that they have a waiting list for registration! If you’re attending, please come by on Saturday afternoon to see my demo, or find me any time to chat!

Categories
General

Big Google can be Benign

An article in today’s New York Times reports on Google Flu Trends, which aspires to detect regional outbreaks of the flu before they are reported by the Centers for Disease Control and Prevention. As reported in the article:

Google Flu Trends is based on the simple idea that people who are feeling sick will probably turn to the Web for information, typing things like “flu symptoms” or “muscle aches” into Google. The service tracks such queries and charts their ebb and flow, broken down by regions and states.

It’s a clever idea, though obviously it raises privacy concerns. Google mitigates those concerns by “relying only on aggregated data that cannot be used to identify individual searchers.”
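
To make the idea concrete, here is a minimal sketch of aggregate query tracking. It is not Google’s method, which the article does not describe in detail. The function names, the spike threshold, and the sample log are all made up for illustration, and the only cue phrases are the two quoted in the article. The log rows carry no user identifiers, so only per-region, per-week counts are ever kept.

```python
from collections import defaultdict

# Cue phrases quoted in the article; a real system would need many more (assumption).
FLU_TERMS = ("flu symptoms", "muscle aches")


def flu_query_share(log):
    """Return {(region, week): share of queries containing a flu cue phrase}.

    `log` is an iterable of (region, week, query) tuples with no user
    identifiers, so only aggregate counts are retained.
    """
    totals = defaultdict(int)
    flu = defaultdict(int)
    for region, week, query in log:
        key = (region, week)
        totals[key] += 1
        if any(term in query.lower() for term in FLU_TERMS):
            flu[key] += 1
    return {key: flu[key] / totals[key] for key in totals}


def flag_spikes(shares, factor=2.0):
    """Flag (region, week) pairs whose share is at least `factor` times
    the previous week's nonzero share in the same region.

    The week-over-week rule and the default factor are hypothetical.
    """
    flagged = []
    for (region, week), share in sorted(shares.items()):
        prev = shares.get((region, week - 1))
        if prev and share >= factor * prev:
            flagged.append((region, week))
    return flagged


# Made-up sample log: (region, week number, query text).
sample_log = [
    ("NY", 1, "flu symptoms"),
    ("NY", 1, "pizza near me"),
    ("NY", 1, "weather"),
    ("NY", 1, "movie times"),
    ("NY", 2, "flu symptoms"),
    ("NY", 2, "muscle aches remedy"),
    ("NY", 2, "weather"),
    ("NY", 2, "Flu Symptoms in kids"),
    ("CA", 1, "surf report"),
    ("CA", 2, "surf report"),
]

shares = flu_query_share(sample_log)
print(flag_spikes(shares))
```

Working with shares rather than raw counts keeps a region from looking sick just because its overall search volume grew.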

It will be interesting to see popular reaction to this offering in the United States and in more privacy-conscious Western Europe. On one hand, health-related search logs are the bête noire of privacy activists–and with good reason, since people are terrified of losing their health insurance. On the other hand, Google seems to have only the best intentions here, and the service they provide may do a lot of good.

I personally hope we can see efforts like these succeed. Of course, it’s essential that Google and anyone else who pursues such efforts be transparent about what data they collect and how they protect individuals from inadvertent disclosure. Ideally, they don’t collect more data than is needed–especially when that data is dangerous in the wrong hands.

Though I have to wonder, might anyone try to game such a system? Maybe I have an over-active imagination, but systems like these seem to be ripe targets for denial-of-insight attacks. Whit, another one for your files?

Categories
General

Should We Build Task-Centric Search Engines?

Greg blogged today about a video of a DEMOfall08 panel on “Where the Web is Going” where Peter Norvig from Google and Prabhakar Raghavan from Yahoo both advocated that, rather than supporting only one search at a time, search engines should focus on helping people accomplish larger tasks, such as booking a vacation or finding a job. I won’t be so vain as to assume they read my blog, as these are canonical examples of tasks that current search engines don’t address very well.

I do see value in building task-specific applications that encapsulate the process of accomplishing particular classes of tasks–including any information seeking necessary towards that end. But I’m not convinced that such applications should live inside of search engines. Rather, I think that a search engine (if that is still the right word for it) should be adaptive enough for task-centric applications to leverage it as a tool.

Perhaps it’s natural that leading researchers from Google and Yahoo have a search-centric view of the world. Given my daily work, I sometimes lapse into that view myself. But it’s important for us to realize that search–or, more broadly speaking, information seeking–is a means to an end. At least in the future I envision, information seeking support tools will be so well embedded in task-centric applications that we will almost never be conscious of information seeking as a distinct activity.

Categories
Uncategorized

Spamalytics

Nice piece in the BBC today about a study led by UCSD computer scientist Stefan Savage on how spammers cash in. You can read the full CCS ’08 paper here. It’s an illuminating study, and a nice example of overcoming the challenges of ethically investigating spamming while still obtaining real-world data.

Categories
Uncategorized

LinkedIn Pushing Ads into Inbox

I’d noticed a number of reports about LinkedIn monetizing its audience through advertising, but this morning is the first time I’ve seen an ad in my inbox:

I absolutely understand their need to generate revenue. But I’m curious how users will feel about this approach–particularly if the ads are not even targeted. I am a delighted (though non-paying) LinkedIn user, but my delight would quickly fade if my inbox became overrun by spam.

Categories
General

The Word of the Day is…Ambient

No, not Ambien or ambiance, but ambient as in ambient findability.

Two items caught my attention this weekend. The first was a post by Oscar Berg at The Content Economy about ambient awareness and findability. The second was a presentation by Marianne Sweeny, posted at Ambient Insight, about SEO for Web 2.0.

An excerpt from Oscar’s post:

I am however more fond of the term “ambient awareness” and I am especially interested in how ambient awareness relates to findability which has traditionally been focused mainly on active methods of finding information such as searching and browsing.

I dare to say that humans are lazy by nature and that we are likely to use the method that requires the least effort when we look for information. We even tend to use less reliable information if it’s just easy to find and use. Instead of actively looking for information we prefer to passively monitor the flow of information in our environment. In fact, some say that actively looking for information is a relatively new phenomenon in human history. So, just being in an environment and becoming passively aware about things that happen in it is something we find very natural and convenient.

It’s an interesting point. Most of the systems we build for finding information presume an active information-seeking motive, but perhaps such systems are not optimizing for the way people are used to obtaining information. Still, I think that, until systems can passively surmise what information people need, we are stuck with requiring at least some active expression of that intent.

That leads us to the Sweeny presentation. It traces the history of search from an SEO point of view:

  • Human-Mediated
  • Human-Mediated plus Catalogs
  • Machine-Mediated
  • Human-Directed / Machine-Mediated
  • Human-Like Machine Mediation (aspirational)

It’s a nice presentation, and I recommend you give it a look. I’m delighted to see someone in the SEO community express a version of history and vision that is largely in line with that of the information seeking support folks.

Categories
Uncategorized

Happy Birthday to the ACM Digital Library!

This month’s issue of the Communications of the ACM includes a letter from ACM CEO John White celebrating the 10th anniversary of ACM’s Digital Library. As some of you may know, my colleagues and I at Endeca have been working with the ACM to improve the search and navigation functionality that the Digital Library provides.

In particular, ACM recently deployed a terminology extraction feature that we presented at HCIR ’08. While it’s still a work in progress (their version isn’t quite as current as what we demonstrated at the workshop), it represents a strong step in the direction of supporting exploratory search as part of the online library experience.

Please check it out and provide them with feedback, especially regarding the user interface that they designed using their own consultants.