Author: Daniel Tunkelang

High-Class Consultant.

De-anonymizing Social Networks

Post author By Daniel Tunkelang
Post date March 27, 2009

Just saw via this article on Techmeme that my friend Vitaly Shmatikov co-authored a paper on “De-anonymizing Social Networks”.

Here’s the abstract as a teaser:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.

Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy “sybil” nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary’s auxiliary information is small.
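To give a feel for what "based purely on the network topology" can mean, here is a toy sketch. It is not the algorithm from the paper, which this post does not describe. It assumes a simplified setup: an attacker holds a labeled auxiliary graph, already knows a couple of "seed" matches, and extends the mapping by counting already-matched neighbors. The function name `propagate`, the threshold of two shared neighbors, and the sample graphs are all made up for illustration.

```javascript
// Toy illustration only: NOT the method from the paper.
// Graphs are adjacency lists: node name -> array of neighbor names (undirected).
function propagate(anonGraph, auxGraph, seeds) {
  const mapping = new Map(Object.entries(seeds)); // anonymous node -> auxiliary node
  const used = new Set(mapping.values());
  let changed = true;
  while (changed) {
    changed = false;
    for (const anonNode of Object.keys(anonGraph)) {
      if (mapping.has(anonNode)) continue;
      // Score each unused auxiliary node by how many already-matched
      // neighbors of anonNode it is adjacent to.
      const scores = new Map();
      for (const nbr of anonGraph[anonNode]) {
        const mappedNbr = mapping.get(nbr);
        if (mappedNbr === undefined) continue;
        for (const cand of auxGraph[mappedNbr] || []) {
          if (used.has(cand)) continue;
          scores.set(cand, (scores.get(cand) || 0) + 1);
        }
      }
      // Accept only a unique best candidate with at least 2 matched neighbors
      // (an arbitrary threshold chosen for this sketch).
      let best = null;
      let bestScore = 0;
      let tie = false;
      for (const [cand, score] of scores) {
        if (score > bestScore) {
          best = cand;
          bestScore = score;
          tie = false;
        } else if (score === bestScore) {
          tie = true;
        }
      }
      if (best !== null && bestScore >= 2 && !tie) {
        mapping.set(anonNode, best);
        used.add(best);
        changed = true;
      }
    }
  }
  return mapping;
}

// Hypothetical data: the same five-person network, once with names
// and once with anonymous labels.
const aux = {
  alice: ["bob", "carol"],
  bob: ["alice", "carol", "dave"],
  carol: ["alice", "bob", "dave"],
  dave: ["bob", "carol", "erin"],
  erin: ["dave"],
};
const anon = {
  n1: ["n2", "n3"],
  n2: ["n1", "n3", "n4"],
  n3: ["n1", "n2", "n4"],
  n4: ["n2", "n3", "n5"],
  n5: ["n4"],
};

// The attacker starts out knowing only two matches.
const reidentified = propagate(anon, aux, { n1: "alice", n2: "bob" });
console.log([...reidentified]);
```

In this toy run the two seeds are enough to pin down n3 and n4, while n5 stays unmatched because it has only one matched neighbor. Names were never needed, only who is connected to whom.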

General

Does Metadata Matter?

I’m thrilled at the discussion that my call for devil’s advocacy has incited. Keep bringing it–and let me know if you’d like to contribute a guest post.

But it’s also nice to find strong views elsewhere in the blogosphere. This morning I saw a post at The Findability Project entitled “Metadata Schmetadata, Relevance and Reality”, in which the authors argue that they don’t need metadata.

Specifically, they say:

Working on this project, we have evaluated what we need from metadata as part of enterprise search implementation. Our conclusion? We don’t need metadata.

Or better said, we don’t need to add metadata for a Google Search Appliance (GSA) to accomplish what we want to accomplish with enterprise search.

I posted the following comment, which is currently pending moderation:

An interesting article. But perhaps I missed an explanation of how you performed your evaluation. Did you assign tasks to your users and compare their effectiveness on the two systems? Did you ask them to express their subjective satisfaction with the system? Did you have some productivity measure external to the system, such as efficiency at completing projects?

It may be that a simple out-of-the-box ranked search approach, with no annotation, manual or automatic, of your documents, is exactly what your organization needs. But it’s very hard to generalize from your experience without understanding better what exactly you were evaluating.

I am on board with the argument that 100% manual annotation offers poor return on investment. But that’s a straw-man argument. I would think that the real question is whether you want fully automated metadata generation, a semi-automated approach, or none at all.

And, as per my comment, it seems hard to justify any design decisions about enterprise search without success metrics, even imperfect ones.

In any case, look forward to more such posts, as I strive to increase the diversity of views at The Noisy Channel, even if I have to import them!

Uncategorized

One on One with Content Management’s Movers and Shakers

Post author By Daniel Tunkelang
Post date March 26, 2009

Ron Miller has just published his first volume of interviews with content management experts in a digital book, which he’s made available for free online. Since I’m featured in this book alongside Google Search Appliance product manager Nitin Mangtani as the experts on enterprise search (see the history of our sparring match here), I thought I’d do the self-serving thing and promote the book here. Enjoy!

Uncategorized

Looking for a Devil’s Advocate

Post author By Daniel Tunkelang
Post date March 26, 2009

I blog a lot about the virtues of exploratory search and the narrow-mindedness that Google and others exhibit in their focus on ranked search result lists, as well as about my skepticism about the value and longevity of the ad-supported model. I think it’s safe to say that I’m largely preaching to the converted–if anything the comments often amplify rather than challenge my premises, or even go further, arguing that I’m understating my arguments.

On one hand, it’s reassuring to hear validation for deeply held but hardly uncontroversial views. On the other hand, I’d love to find passionate advocates for the other sides of these issues who are interested in debating them. I’d like to believe I’m open-minded, even about the beliefs I hold most deeply, and in any case that I’ll offer an opposing argument a fair hearing and serious consideration. I trust that readers here would be just as fair-minded.

The problem is that those advocates don’t seem to show up, even in the comment threads. Perhaps that is because of my incredible powers of persuasion, but I imagine that no one wants to be a lightning rod for criticism, so those with opposing views lurk quietly or simply take their eyeballs elsewhere.

What to do? Is there any chance that someone who strongly disagrees with me on at least one of the aforementioned points would be interested in writing a guest post? I’d even be willing to post it anonymously. I just want to make sure we’re not getting into the intellectual rut of a mutual admiration society.

Please contact me if you are interested; this is a serious offer, and I’m quite open to suggestions on how to make it work.

General

Taking the Google Wonder Wheel for a Spin

Post author By Daniel Tunkelang
Post date March 25, 2009

I tried out the Google Wonder Wheel today–it’s being rolled out as an experiment, but you can enable the cookie yourself by entering the following into the address bar after a Google query:

javascript:void(document.cookie="PREF=ID=4a609673baf685b5:TB=2:LD=en:CR=2:TM=1227543998:LM=1233568652:DV=AA:GM=1:IG=3:S=yFGqYec2D7L0wgxW;path=/; domain=.google.com");
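For anyone curious about what that one-liner actually sets, here is a small sketch that splits the PREF value into its fields. It assumes the value is a colon-separated list of key=value pairs, which is simply how the string looks; what the individual fields mean is not something this post documents, so the sketch does not try to interpret them. The helper name `parsePref` is made up for illustration.

```javascript
// The PREF value copied from the one-liner above.
const pref =
  "ID=4a609673baf685b5:TB=2:LD=en:CR=2:TM=1227543998:LM=1233568652:DV=AA:GM=1:IG=3:S=yFGqYec2D7L0wgxW";

// Assumption: fields are separated by ":" and each field is "key=value".
function parsePref(value) {
  const fields = {};
  for (const part of value.split(":")) {
    const eq = part.indexOf("=");
    fields[part.slice(0, eq)] = part.slice(eq + 1);
  }
  return fields;
}

const prefFields = parsePref(pref);
console.log(prefFields);
```

The remaining pieces of the cookie string, `path=/` and `domain=.google.com`, are standard cookie attributes that scope the cookie to every path on Google's domain.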

As I blogged yesterday, I’m glad that Google is giving exploratory / HCIR approaches a shot. But I’m shocked at how far behind they seem to be, especially given the incredible talent of Google employees.

In fact, the Wonder Wheel reminds me of work AltaVista did over a decade ago, close to my heart because of related work I did at IBM Research. I’d be more impressed if the related topics were of higher quality, but they seem to be lagging far behind the state of the art. Which makes it all the more understandable that they don’t promote such features to users.

I’d like to get more excited about such efforts, but I fear that they are being set up to fail and will be used as evidence against HCIR in general. I hope at least that more informed technologists recognize these efforts as less than representative of what can be done using richer interface metaphors, and will keep working on improving the tools to support information seeking.

General

HCIR 2009: A Pre-CFP

Post author By Daniel Tunkelang
Post date March 25, 2009

We’re still a week or two away from officially announcing the 3rd Annual Workshop on Human Computer Information Retrieval (HCIR ’09), but there are some details I wanted to share now, as well as a favor I’d like to ask of the community.

First, the details.

The workshop will be held on October 23, 2009 at the Catholic University of America, in Washington, DC. We have a great keynote speaker lined up, but I’ll save that for the official announcement. As in the past couple of years, attendees will be expected to engage as active participants, and the format of the workshop will increase the emphasis on participation.

Second, the request for a favor.

Since we will be in the nation’s capital (please forgive the U.S.-centrism), we see this year’s workshop as an opportunity for the HCIR community to engage with the federal public sector. Seeing the success that Vivek Kundra, now our national CIO, has had with the Apps for Democracy program in DC should inspire anyone who believes in the value of an informed citizenry. The struggles of intelligence agencies to make sense of enormous amounts of data that is quite literally a matter of life and death call for approaches that best combine the skills of people and machines. And we all want economists and public policy experts to have the best access to any information that could help them help us.

If you are in the federal public sector or know people who are, I would greatly appreciate the opportunity to talk with you about HCIR ’09. Ideally, we are hoping for a U.S. government agency to join Catholic, Endeca, and Microsoft Research in sponsoring the event. But this appeal isn’t about money–even before the recession, we ran the workshop with fiscal restraint. Rather, we’d like to make sure that we make effective use of the workshop’s location to educate the public sector about HCIR–and to educate the HCIR community about the needs of the public sector.

If you would like to get involved, please reach out to me, either publicly here or privately by email.

Uncategorized

Are You Part Of The Noisy Community?

Post author By Daniel Tunkelang
Post date March 24, 2009

Just a reminder that readers here can, at no extra charge, ask to be part of The Noisy Community. It’s a great way to reinforce that so much of the value of this blog comes from you, the readers, who collectively contribute far more content to it than I do!

General

Google Offers “More And Better Search Refinements”

Post author By Daniel Tunkelang
Post date March 24, 2009

Fresh news, hot from the Official Google Blog:

Starting today, we’re deploying a new technology that can better understand associations and concepts related to your search, and one of its first applications lets us offer you even more useful related searches (the terms found at the bottom, and sometimes at the top, of the search results page).

For example, if you search for [principles of physics], our algorithms understand that “angular momentum,” “special relativity,” “big bang” and “quantum mechanic” are related terms that could help you find what you need.

A couple of reactions. First, Google has offered related searches for a while, so I’d love to know what makes these “more and better”. I can’t tell from playing with it, and the suggestions I see aren’t as good as, say, Kosmix. Second, if they believe that this feature can improve user experience, why are they putting the results at the bottom of the page (at least on all of my queries)? Surely they know from their own logs that only a minority of users look to the end of the results list.

While I see this enhancement as a step in the right direction for Google, I wonder if they have their hearts in it. Google used to promote refinements–actually faceted search refinements–on their product search site, but pushed those to the bottom too. It seems very hard for them to get away from the primacy of those ten blue links.

I’d like to get excited about Google embracing HCIR, especially after they were so kind as to let me lecture them about it. And perhaps I’m being too harsh a critic. Their post concludes:

Even if you don’t notice all of our changes, rest assured we’re hard at work making sure you have the highest quality

It seems to me that they go out of their way to make sure that changes aren’t noticeable to users. I suppose their conservative attitude might cost them the occasional designer, but hasn’t hurt their pocketbooks.

Come on, guys, you’re the market leaders! Don’t be so timid.

Uncategorized

How Do Your Tweets Measure Up? (NSFW)

Post author By Daniel Tunkelang
Post date March 24, 2009

Well, maybe NSFW is a bit of an exaggeration, but I don’t want to get anyone in trouble with a prudish employer. For everyone else, you can find out the size of your Twitter e-Penis here. Feel free to brag about the size of your Tweet-hood in the comments.

General

Text Analytics Summit: Early-Bird Discount

Post author By Daniel Tunkelang
Post date March 23, 2009

I’m giving a presentation on “Enabling Exploration through Text Analytics” at the 5th Annual Text Analytics Summit, which will take place June 1st-2nd in Boston. Check out the agenda and line-up of speakers; you’ll find some impressive names.

June is far away, so why am I posting about this now? Well, as is usual for conferences like these, there are early-bird registration discounts. To get the cheapest option, you should register by Thursday, March 27th. Better yet, enter my name in the promotional offer box and you get another $100 off. And no, I don’t get any kickbacks. 😦

I am fully aware that conferences like these are expensive, and that budgets are tight. But if you are in a business that could benefit from better text analytics and can afford to attend (and hopefully the discounts make that more of a possibility), then I encourage you to do so. Education is the best investment you can make in order to maximize the value of your investments in information technology, whether you buy or build.

Also, to the best of my knowledge, all of the speakers are invited and their participation is not predicated on any kind of corporate sponsorship. This is a big deal when it comes to industry conferences; the last thing you want to do is pay a lot of money for glorified sales pitches. These guys are putting in the effort to deliver quality content. I hope to see some of you there–let me know if you’re attending!

Of course, if you’re an Endeca customer, I urge you even more strongly to attend Discover ’09, Endeca’s annual user conference, where I’ll be talking about “Money for Nothing and Your Tags for Free”. It’s also in Boston in June: June 8th to June 10th.