On the Google News Blog today: you can now search by author. Actually, I think you could always search by author from their advanced search page, but now the links are available directly from the search results, facilitating exploration.
It’s a nice enhancement, though there are still some kinks to work out. For example, when I did a Google News search for tim burton’s alice in wonderland, the first two hits have no author link, and the third article has an author link for Kofi Outlaw that leads to a dead end, i.e., no results found (you can find his articles here).
Interestingly, this is a case where Bing, even though it doesn’t offer author search, does do a lot better on related searches. I have to give them the win on this one, at least as far as supporting exploration goes. Unfortunately, their crawler could use a bit more muscle–they don’t even find the Kofi Outlaw article, and they return only 37 results compared to over 300 from Google. Yes, size isn’t everything, and Google’s returning more results is undermined by their not offering as much ability to sift through them. But, conversely, Bing’s exploratory search capabilities are undermined by their having far less content to explore. Is it that hard to do both?
Those of you who know Marti Hearst or follow her work may have heard that she’s been writing a book on Search User Interfaces to follow up on her chapter in Ricardo Baeza-Yates and Berthier Ribeiro-Neto’s textbook on Modern Information Retrieval. Well, the wait is over: her book will be available later this week! Moreover, it will be available (and searchable!) for free online.
In the meantime, I’ve had a chance to preview the text, and I’m impressed. She introduces the book by saying:
Many books on information retrieval describe the algorithms behind search engines and information retrieval systems. By contrast, this book focuses on the human users of search systems and the tool they use to interact with them: the search user interface. Because of their global reach, search user interfaces must be understandable by and appealing to a wide variety of people of all ages, cultures and backgrounds, and for an enormous variety of information needs.
She then proceeds to elaborate on the design and evaluation of search interfaces. Not surprisingly, she reserves whole chapters for query reformulation and for integrating navigation and search–she is, after all, one of the pioneers of faceted search and one of the leading HCIR researchers. She also includes a chapter on theoretical models of the information seeking process–a nice review that includes the highlights from the decades of library and information science work on this topic.
Of course, the wide scope of the book requires some trade-offs. Each chapter is surely worthy of a book in its own right. But where breadth has to take precedence over depth, she makes up for it by citing hundreds of references so that readers can follow up to their hearts’ content. Also, the focus is academic, so most of the references are to academic rather than commercial work–though she does sneak in a reference to WebMD as an example of faceted search. That said, it’s great to see so much of the academic work on search interfaces brought together in one place. Some may find her thorough bibliography to be almost as useful as the book itself!
All in all, this is an excellent book, and I’m sure it will find its way into many course syllabi. The book is aimed primarily at academic audiences–in fact, she points out that, while there are some books for practitioners (e.g., Peter Morville and Lou Rosenfeld’s Information Architecture for the World Wide Web), there have been no academic books that focus primarily on search user interfaces (the closest, in her view, have been books about theoretical models of information seeking behavior). Hopefully this new book will incite more academic interest in this area. For those of us who would like to advance beyond the status quo of search interfaces, this book is a welcome contribution.
Yes, folks, it’s really, really, real-time! Of course Twitter and Facebook have their own real-time search offerings. And apparently Google, Yahoo, and Microsoft are looking hard at real-time too.
I concede that there’s something in this real-time mania. I’ve live-tweeted events, and I’ve followed others who were doing so. I certainly read current news and blogs–as they say, today’s newspaper wraps tomorrow’s fish (someone will have to translate the expression for folks who’ve never read an analog newspaper). But yes, recency/freshness is certainly a concern in information seeking.
But it’s not the only one, and I doubt it’s the dominant one. Moreover, the dismissal of web search engines as if their index contents are ancient history is preposterous. Search for iran election on Google, Yahoo, or Bing, and you see a lot of current news. I suppose Twitter offers more recently generated bits, but the main virtue there is not the immediacy–rather, it’s the social nature of the content. For example, a number of people are following @persiankiwi for a personal perspective. I’ll let you decide for yourselves if Collecta or Crowdeye offer something new or valuable–I’m still waiting for the former to show me anything at all!
I know that the technology press likes new buzzwords, and “real-time” search is surely the buzzword du jour, even giving “semantic” search a run for its money. And I understand how many in the blogosphere feel it is their moral duty to cheer on any start-up that makes a go at disrupting the current regime. But I wish these folks would evaluate the new entrants on their merits, rather than simply on the drama of the David vs. Goliath story.
I understand what it’s like on the startup side–it wasn’t that long ago that few people outside the Boston-area technology scene had heard of Endeca. For a long time, I was jealous of people whose companies had generated more buzz. But, in retrospect, I’m at least glad that my colleagues and I had a chance to build a robust product before the press noticed us. Overenthusiastic press isn’t necessarily a good thing, as I’m sure a line-up of prematurely crowned Google killers can attest.
In that spirit, I hope that CrowdEye and Collecta bring something interesting to the market. But I doubt that “real-time” search will cut it, especially if it’s not ready for prime time.
I’ve noted in the past that “real-time” alerting systems, in contrast to search engines that place less emphasis on immediacy, are particularly vulnerable to spamming. It’s a lot like telemarketing–you could avoid it entirely by routing any questionable calls to voicemail, but then you would, at the very least, no longer be reachable in real time.
At first glance, Twitter seems immune from this sort of spamming, since you only see tweets from the users you follow. Yes, Barack Obama and Guy Kawasaki must spend a lot of time on Twitter! But, regardless of how many users you follow, you are the one in control.
At least that’s the theory. Of course, things tend to work a bit differently in practice. Like many Twitter users, I use Twitter Search to maintain a running vanity query for mentions of my user name, employer, blog, etc. As a result, a user I don’t follow can nonetheless get my attention by tweeting an “at reply” to me. Twitter has struggled to figure out whether that is a good thing or a bad thing, but I suspect that my erring on the side of vanity is a common behavior.
But I do recognize that I’m opening myself up to alert spamming–perhaps not just in theory, but in practice. Today I read on All Things Digital that:
Pontiflex, a lead generation startup that hoovers up names and other info from users that visit its network of publishers, then sells the data to marketers. The Brooklyn-based company is rolling out a Twitter product that lets marketers compile a list of interested Twitter users.
…
Since the users aren’t actually signing up to “follow” any of the marketers, said marketers can’t send them direct messages. The marketers could try to “at reply” their leads — the equivalent of shouting out the name of someone you think might be at a loud cocktail party, but who you can’t actually see. But that’s about it.
That’s about enough, if enough users are like me. Fortunately, I’m not enough of a celebrity to be particularly concerned about being singled out–at this stage. But I think the writing is on the wall, and spammers will innovate to embrace social media. I’ve already experienced a few examples of such innovation, and I’m sure that they are child’s play compared to what’s in store.
Personally, I look forward to this spamageddon. Why? Because I think we already have a problem managing attention scarcity in social media, but haven’t found sufficient motivation to confront the problem head on. A spam epidemic will certainly cause us to revisit our priorities, and I’m optimistic that we’ll innovate beyond the existing approaches used for email spam.
Now this is the sort of publicity that even $100M can’t buy: the New York Post is reporting that, in response to Microsoft’s recent Bing launch, “FEAR GRIPS GOOGLE” (all caps in the original):
Sergey Brin is so rattled by the launch of Microsoft’s rival search engine that he has assembled a team of top engineers to work on urgent upgrades to his Web service.
I never imagined that anyone would get their technology news from the New York Post, but evidently it’s well read in the blogosphere. Techmeme lists a number of articles citing the New York Post story.
I know that the press loves a good fight, and in technology it’s hard to ask for a better pairing than Google and Microsoft. Moreover, I do think that Google should be paying attention to Microsoft’s positioning of Bing, regardless of how well Microsoft has delivered on that positioning. In any case, it makes sense for Google to keep close tabs on its competitors. After all, even a fraction of a percent of web search market share translates into millions–more than enough revenue to justify a few full-time employees.
Still, to assert that Google is gripped with fear strains credulity, even for a tabloid. I don’t mean to suggest that Google is so self-confident as to be fearless. Google may well have reacted with fear when it looked like Microsoft would acquire Yahoo–in fact, some have suggested that Google’s proposed (but ultimately abandoned) advertising deal with Yahoo was a Machiavellian maneuver to scuttle the acquisition.
But, unless I’m missing something, Bing simply isn’t a threat to Google’s market dominance. If anyone should be concerned, it’s folks like Kayak who might lose some market share to Bing’s travel search–which seems to be generally acknowledged as Bing’s strongest vertical.
Personally, after being underwhelmed by Bing, I decided to try it for 2 weeks. I made it for about a week and a half, and you can see some of my commentary on Twitter. I stand by my initial impression: it’s not bad, but it’s noticeably inferior to Google, and even parity is not enough to reverse the tide. Perhaps the tiny gain–or the slowdown in loss–that they will make in market share will justify their investment. But this is no revolution, and the Gevil Empire is not running scared.
The case for federated search is straightforward: no single organization has all of the answers, and therefore no single index can ever hope to completely satisfy its users’ needs. Federation allows the developer of a search application to hedge his or her bets by bringing in knowledge from outside resources.
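In its simplest form, federation just fans a query out to several backends and interleaves what comes back. Here’s a minimal sketch of that general idea (my own illustration, assuming each backend is simply a function from a query to a ranked list; real federated search engines do far more sophisticated result merging and score normalization):

```python
def federated_search(query, backends, k=10):
    """Fan the query out to each backend and round-robin the results.

    backends: list of callables, each mapping a query string to a
    ranked list of results. Duplicates across backends are dropped.
    """
    result_lists = [backend(query) for backend in backends]
    merged, seen = [], set()
    # Interleave: take the rank-1 result from each backend, then rank-2, etc.
    for rank in range(max(map(len, result_lists), default=0)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged[:k]
```

For example, with one backend returning ["a", "b"] and another returning ["b", "c"], the merged list is ["a", "b", "c"].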
But federation is no panacea, at least as it is implemented today.
Yesterday, I was fortunate to attend a presentation from a Google Engineering Director about Google Wave, an online communication and collaboration tool that Google recently unveiled at the Google I/O developer conference. For those who, like me, were unable to attend I/O, Google has posted the entire 80-minute presentation on YouTube (embedded above). For those of you without 80 minutes to spare, Gina Trapani has assembled a highlight reel.
The pitch is that email, the most popular technology for online communication, is 40 years old and needs an overhaul to reflect the opportunities of an always-on world. They also emphasize that everything they’ve done works inside the browser.
The video is sexy, showing off both the real-time updating capabilities of Wave (blurring the lines between email and instant messaging) and the ability to support structure more cleanly than email (e.g., responding to only part of an email). The conversation model is also nice: for example, participants can bring someone new into a conversation, and that new person can access the evolution of a conversation (a sort of retroactive cc). Indeed, Wave looks more like Basecamp than like email.
Google is pitching Wave to developers–they even stole a page from Oprah and gave every Google I/O attendee a new Android phone in order to develop applications using their early-access Wave accounts. I haven’t studied the APIs, but the object model seems reasonable, ranging from a “blip” (a low-level event associated with content, possibly as fine-grained as someone typing a single character) to “wavelets” (the sub-conversations that comprise a wave) to of course the wave itself. And, given that the team is led by the folks who developed Google Maps, I have no doubt that they understand how to play well with mash-ups.
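To make that hierarchy concrete, here’s a toy sketch of the object model in Python (the class names and fields are my own illustration of the concepts described in the talk, not Google’s actual API): blips nest inside wavelets, wavelets inside a wave, and adding a participant to a wavelet gives them its full history–the “retroactive cc” behavior.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Blip:
    """A fine-grained content event--possibly as small as one keystroke."""
    author: str
    content: str

@dataclass
class Wavelet:
    """A sub-conversation: a participant list plus a history of blips."""
    participants: List[str]
    blips: List[Blip] = field(default_factory=list)

    def add_participant(self, user: str) -> None:
        # A newcomer joins the wavelet and can see its entire blip
        # history--the "retroactive cc."
        self.participants.append(user)

@dataclass
class Wave:
    """The top-level container for a set of wavelets."""
    wavelets: List[Wavelet] = field(default_factory=list)
```

Under this (hypothetical) model, a new participant added mid-conversation simply reads the accumulated blips to catch up.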
But I’m left with two big questions.
The first is what it would feel like to access this rich structured history of conversation. The search interface feels a lot like Gmail’s–and I don’t mean that as a compliment. I use Gmail, but I curse every time I have to deal with managing search results that include large conversational threads. I think there will be a lot of challenges for managing search results, and I’m curious how Google, with its historically spartan approach to search interfaces, will address them.
The second is about interoperability. For all of the openness, I get the sense that everything can be brought into Wave and Waves can be embedded anywhere. That feels about as open as Facebook. What I’m missing is a sense of how (or even if!) Google Wave will interoperate with other communication platforms. They do show an example of building a Twitter client within Wave–perhaps that is representative of their interoperability strategy.
The Google Wave demo is impressive, and I have no doubt that developers will play with it and build cool demos of their own. But I believe the ultimate success of Google Wave will depend on how they address the above two questions. Time will tell.
I hope that regular readers forgive the recent sparsity of posts. I spent most of the last three days attending Discover, Endeca’s annual user conference. It might come as a shock to some (especially the PR folks who keep sending me press releases), but I’m not a professional blogger, and I actually hold down a day job at Endeca as Chief Scientist!
As promised, here are some of the highlights of that user conference, and my thoughts about what makes an event like this successful.
Outside a handful of general sessions, the conference agenda consisted of three parallel tracks: business, technical, and labs.
Most of the business sessions centered around case studies presented by customers (particularly those recognized as Navigator Award winners). I attended as many of these as I could, learning about what Scripps Networks is doing with Endeca’s Page Builder (check out the Food Network site!) and how Expedia and CHIP Online (a massively popular German site similar to CNET) are implementing SEO. The CHIP guys went as far as to perform live queries on Google.de to show off their SEO success–quite a tour de force! Of course, I also made a point of attending the Newssift (Financial Times) and ESPN sessions, since I’m especially proud of what they’ve done with text analytics.
I didn’t attend quite as many of the technical sessions, though I did manage to make it to the one I was presenting: “money for nothing and your tags for free”. I was lucky to have an 80s-friendly crowd, though someone in the audience did confuse Dee Snider with Roger Daltrey. But my favorite technical session was the one about the extensible Endeca: it featured some of our coolest lab-ware, some of it developed by my team. One of the demos even had a Google Squared sort of feel: it uses WordNet to support dynamic facet creation in response to queries. And it was built by one person on my team in 24 hours for a Guardian Hack Day!
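To give a flavor of that last technique, here’s a toy sketch of WordNet-driven dynamic facets (my own illustration of the idea, not the actual demo: a real version would query WordNet for hypernyms, e.g., via nltk, whereas here a tiny hardcoded taxonomy stands in):

```python
# Hypothetical mini-taxonomy mapping terms to their hypernym (category).
# A real implementation would look these up in WordNet.
HYPERNYMS = {
    "jazz": "music genre",
    "blues": "music genre",
    "rock": "music genre",
    "oak": "tree",
    "maple": "tree",
}

def dynamic_facet(query_term):
    """Derive a facet from a query term: its hypernym becomes the facet
    name, and sibling terms under that hypernym become the facet values.
    Returns (facet_name, values) or None if the term is unknown."""
    category = HYPERNYMS.get(query_term)
    if category is None:
        return None
    values = sorted(t for t, c in HYPERNYMS.items() if c == category)
    return category, values
```

So a query for “jazz” could surface a “music genre” facet offering blues, jazz, and rock as refinements–facets created on the fly from the query rather than baked into the index.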
Finally, I didn’t attend the labs, but I heard great feedback about them, especially from customers who had been nervous about using new product features. There’s nothing like hands-on training to build comfort with new technology.
In short, I had a blast, and I heard lots of positive feedback from attendees. I was impressed with the level of the presentations–especially since I’d seen a few fluffy ones in past years’ conferences. This year, I suspect some people will complain that the presentations had too much content! And, while Boston isn’t as nice a setting as Orlando (nor bowling quite as fun as the rides at Universal Studios), I suspect attendees appreciated that the focus was squarely on the substance of the event. I certainly did.
Marshall Kirkpatrick at ReadWriteWeb wrote a post arguing that the people working at Twitter aren’t using the service the way its power users do, and that this bodes ill for Twitter. His main arguments:
Twitter’s employees don’t twitter very much: an average of 2 to 3 tweets per person per day.
Twitter employees don’t follow very many other people: only 2 out of 49 Twitter team members follow more than 500 people, and none follows more than 1,000.
Twitter staff members aren’t following top Twitter developers in the community.
I can’t really address the third point, but the first two–and especially the second–are hardly helpful to Kirkpatrick’s case. To the contrary, they argue that the people who work at Twitter get it. And, to make sure Kirkpatrick got it, Twitter CEO Ev Williams even wrote him a letter, in which he said:
Many people fall into the trap that you should follow all or most people back out of a sense of politeness or so-called engagement with the community… At a certain point, you’re not actually reading any more tweets by following more people — you’re just dipping into the stream somewhat randomly and missing a whole lot of what people say. That’s fine, but I believe people will generally get more value out of Twitter by dropping the symmetrical relationship expectation and simply curating their following list based on the information and people they want to tune in to.
Amen! I’ve been hammering this point here in most of my posts about Twitter, and newer readers can find a handful of examples in my earlier coverage.
And of course the whole point of TunkRank is to discourage the vicious circle of reciprocity and fake following. That’s baked into the measure, which, like PageRank, divides the voting power by the number of out-links.
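For the curious, the core of that computation can be sketched in a few lines of Python (my own toy rendering of the idea, not the official implementation; the damping parameter p and the fixed-point iteration are assumptions on my part): each user’s attention is split evenly among everyone they follow, so following thousands of people dilutes the vote you pass along to each.

```python
def tunkrank(follows, p=0.5, iters=50):
    """Toy TunkRank-style influence scores.

    follows: dict mapping each user to the set of users they follow.
    Each follower contributes (1 + p * their own influence) divided by
    the number of people they follow--the PageRank-style dilution.
    """
    users = set(follows)
    for followed in follows.values():
        users |= followed
    influence = {u: 1.0 for u in users}
    for _ in range(iters):
        new = {u: 0.0 for u in users}
        for follower, followed in follows.items():
            if not followed:
                continue
            share = (1.0 + p * influence[follower]) / len(followed)
            for target in followed:
                new[target] += share
        influence = new
    return influence
```

In a toy graph where alice and carol each follow only bob, bob ends up with all the influence–exactly because his followers aren’t spreading their attention thin.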
The comments on Kirkpatrick’s post suggest that a lot of regular Twitter users also get it. I find that reassuring, especially given the hype around Twitter in the last several weeks. Twitter can be a useful tool, but only if people don’t devalue it by imposing cultural norms that undermine the social network. I’m glad the folks who have given us Twitter realize that.
As promised, here are my slides from the recently held Text Analytics Summit. Feel free to download them from SlideShare–some of the animation may not come through in this version (though I try to use such animation sparingly).
I enjoyed the conference, and was pleasantly surprised by the overall intellectual level of participants, who included a number of end-users of text analytics, as well as senior technologists from the leading text analytics vendors. Yes, there were sales and marketing people there too, and the occasional vendor fluff piece, but someone’s got to pay the bills. I believe that all of the presentations will be posted online in the next few weeks.
And, speaking of vendors, I hope to see some of you next week at Endeca Discover. I’ll be delivering an 80s-themed presentation entitled “money for nothing and your tags for free”.
Also, while I have your attention, I urge everyone to spread the word about SIGIR. If you are in industry and have little patience for academic conferences, I still encourage you to consider the one-day SIGIR Industry Track. For $300 (compare that to any other industry conference!), you get a chance to hear and meet a star-studded line-up, including:
Matt Cutts, Google: Web Spam and Adversarial IR: The Road Ahead
danah boyd, Microsoft Research: The Searchable Nature of Acts in Networked Publics
Vanja Josifovski, Yahoo! Research: Ad Retrieval – A new Frontier of Information Retrieval
Thomas (Tom) Tague, Thomson Reuters: Semantic Web and the Linked Data Economy
Tip House, OCLC: Alexandria 2.0: Search Innovations Keep Libraries Relevant in an Online World