Categories
Uncategorized

Blogs I Read: Peter Turney’s Apperceptual

The other day, Daniel Lemire posted a comment extolling Peter Turney as someone who does a great job blogging about his research. His blog, Apperceptual, is one of the highest-quality blogs I’ve seen in the information retrieval community.

Turney is a Research Officer at Canada’s National Research Council (NRC) Institute for Information Technology. His two decades of research cover a broad spectrum of topics in machine learning, information retrieval, and computational linguistics. Moreover, the practical orientation of the NRC helps ensure that Peter’s scholarly work is grounded in the real world.

The best way to get a feeling for Turney’s blog is to read it. Here are a few posts I’d suggest:

This last post, published today, offers a promising approach towards establishing analogies as the central problem in a theory of semantics. Or, as Turney quotes Douglas Hofstadter, that “all meaning comes from analogies”.

Turney’s writing isn’t always so heavy. In fact, two of his most popular posts are “Open Problems” and “How to Maximize Citations”, both of which I’d recommend to aspiring researchers.

Turney doesn’t crank out blog posts daily or even weekly–he sometimes goes for over a month between posts. But what he does write is well worth reading.

Categories
Uncategorized

The Future of Measurement

Over the past few days, Kate Niederhoffer put together a collection of thoughts about the future of measurement in social media. Contributors include:

I enjoyed being part of the collective writing process, and I hope you enjoy reading the results.

Categories
Uncategorized

If you can’t stand the links, get off the web

I don’t always agree with Jeff Jarvis, but he nailed it in “A danger to journalism”, a post in which he discusses the “GateHouseGate” controversy:

GateHouse has sued The New York Times Co., arguing that the Boston Globe’s new YourTown hyperlocal site for Newton is violating copyright laws by copying headlines and first sentences verbatim from GateHouse sites in Massachusetts and–horrors!–linking to the stories themselves on GateHouse’s pages.

As Jarvis put it, “If you can’t stand the links, Gatehouse, get off the web.” I am sympathetic to authors whose work is being unfairly used, as I discussed in my recent post on fair use and SEO. But suing people for copying two sentences and linking? I thought we were past that by now.

Categories
General

Is People-Powered Search Overrated?

I recently read an article by Matthew Shaer in the Christian Science Monitor entitled “The future of search: Do you ask Google or the gaggle?” and subtitled “To improve results, new search engines rely on users instead of computers.” The article goes on to talk about Google’s SearchWiki, Jimmy Wales’s Wikia Search, and a number of “people-powered” search tools.

I agree strongly with Wales on the value of transparency and that “because search is so secretive, and so proprietary, there are fewer checks and balances”. But I agree just as strongly with Shaer that “handing over control to a community could engender a flood of spam, or devolve into a mess of internecine backbiting among users”, both of which he’s observed on the Yahoo Answers site.

Wales ultimately sees the question as not whether humans make the decisions, but rather by what process, i.e., democratically vs. top-down. His Wikia Search effort is an attempt to repeat Wikipedia’s success for general web search.

But, while I like democracy as a political system (as Churchill said, it’s the worst form of government except all the others that have been tried), I’m not sold on Wikia Search or any of the crop of people-powered search engines.

Perhaps the problem is that, much as in electoral democracy, we need to be vigilant about attempts to game the system. The anonymity of web users is as much a problem as the secrecy of search ranking algorithms, since it allows people to game “people-powered” systems with impunity.

Would a transparent people-powered search system work? Perhaps, assuming it could address the privacy concerns of users. I’m all for transparent social navigation.

But let’s not forget the other part of people power: giving users meaningful control. Crowdsourcing might improve on the current crop of ranking algorithms, but what I really want is a search engine that provides me with transparency, control, and guidance. Let me get under the hood.

Categories
Uncategorized

The Evolution of Search Results: An SEO Perspective

Today’s post on SEOmozBlog explores how the evolution of search results is changing the landscape for search marketers, aka SEO professionals. Here’s a teaser:

I still believe we’re years (3-5) away from an SEO economy where links don’t play the primary role (and I doubt we can ever get away from keywords – that’s search at its most basic), but I do agree that we’re plodding slowly down that path.

The post also links to and excerpts a “thought paper” called “New Signals to Search Engines” by search marketing firm Acronym Media. It’s an interesting–and refreshingly intellectual–take on how the search engines of the future may move to a less link-centric approach.

Categories
Uncategorized

Guerilla Marketing Gone Wild

The Sunday before Festivus is surely a slow news day, but today’s top tech story is a doozy. Evidently College Prowler, a publishing company for guidebooks on top colleges and universities in the United States, was creating hundreds of “Class of 2013” groups on Facebook, using sock puppet accounts, for the purposes of self-promotion. Brad Ward, a recruiter for Butler University, sleuthed out this marketing strategy and posted an exposé at his blog, SquaredPeg. The story has spread like wildfire, including to the Chronicle of Higher Education.

The story is still evolving, but it looks pretty bad for College Prowler. Social networks, whether offline or online, are built on trust, and, as we’ve learned recently from the Madoff scandal, networks of trust are vulnerable. Perhaps universities should have been more proactive in establishing their own Class of 2013 Facebook groups, though that feels like blaming the victim.

In my view, this incident argues in favor of discouraging online anonymity, at least in contexts where we need to build trust. This is the one aspect in which Knol got it right.

Categories
General

Enterprise Search: Beset by Marketing and Hype

Given my role at Endeca, I am hardly objective about the competitive landscape of enterprise search. But, while reading an article about enterprise search in the latest issue of Information Age, I was pleasantly surprised to find myself agreeing with Autonomy CEO Mike Lynch that the enterprise search industry is beset by “marketing and hype”, and that the technologies available are far from equal.

Not surprisingly, the major enterprise search vendors offer a variety of perspectives on how best to address the challenges of enterprise search:

  • Autonomy promotes “meaning-based computing”, its branding of its information extraction and text mining techniques.
  • Dave Armstrong, a head of products and marketing for Google’s Enterprise division, questions the feasibility of structuring content and emphasizes the importance of search for unstructured data.
  • Martyn Christian, IBM’s VP of enterprise content management, asserts that search should not be used to address problems better served by classification and metadata.
  • Endeca (not mentioned in the article) emphasizes an interaction-centric “guided summarization” approach that readers here will recognize as human-computer information retrieval.
  • Microsoft’s FAST is mentioned, but the only quotation cited is from a disgruntled former customer.

Note that I am trying to convert vendor slogans into vendor-independent terms that have some traction in the information retrieval research community. My hope is that, through neutral forums like the SIGIR Industry Track, we can do a better job as vendors of keeping ourselves honest, as well as engaging academic researchers to help connect their work to the real world.

Above all, let’s strive to compete on technology and ideas, rather than on obfuscation through marketing.

Categories
General

Fair Use and SEO

The Huffington Post, one of the most prominent political blogs on the web, usually courts political controversy for its unapologetically liberal perspective. But now it finds itself in a different sort of controversy over the way it aggregates content from other sites.

It started with a complaint from Whet Moser at the Chicago Reader:

The Huffington Post’s local “aggregation” wing straight stole our entire Bon Iver Critic’s Choice–they didn’t ask permission (“read the whole article”? that is the whole article, dumbass),

This isn’t an isolated incident. As Henry Blodget puts it:

The Huffington Post’s news aggregation business drives enormous traffic to the third-party sites its editors link to (including, occasionally, this one). The Huffington Post also often excerpts liberally from third-party sites’ stories and uses this content to drive significant traffic to itself.

Ryan Singel presents both sides of the story at Wired, including Huffington Post co-founder Jonah Peretti’s contention that the excerpts drive traffic to the original sites from which they were aggregated.

What fascinates me is that, while the legal and ethical arguments are about what constitutes fair use, the driving concern is search engine optimization (SEO). In many cases, The Huffington Post is excerpting stories without adding any new content, but is then drawing a significant amount of search traffic to its site that, presumably, would have otherwise gone directly to the original articles. In other words, they’re putting themselves in the middle and taking a cut through the resulting advertising revenue.

I can certainly see how this behavior drives online news providers up the wall. Even if The Huffington Post is acting within the legal constraints of fair use, its actions certainly seem parasitical. Unless they are driving traffic to the sites they aggregate that would not have otherwise gone there directly, they are simply profiting from being better at the SEO game.

I see this scenario as a cautionary tale about our excessive dependence on traffic from search engines that promote an adversarial model. This is the dark side of SEO–a no-holds-barred fight for a piece of people’s scarce attention.

Categories
Uncategorized

Google Image Search Gets Style

Clip art

Line drawing

Google announced today that its image search now supports search-by-style. As someone who regularly uses Google’s image search to find fodder for my presentations, I am excited about this enhancement. Moreover, I think it’s a clever application of the various image analysis algorithms Google has been developing.

They now include a drop-down that allows you to restrict searches to images from news content, faces, clip art, line drawings, and photo content. It’s not 100% accurate, but it’s not bad.

What is unfortunate is that the interface, whether you’d like to explore images by style or by size, doesn’t give you any sort of preview of the content in each category. I, for one, find it annoying to have to keep clicking to explore the space. But this is at least a baby step towards supporting exploratory search, in a domain that cries out for it.

Categories
Uncategorized

How do people arrive at The Noisy Channel?

Like most bloggers, I diligently analyze my logs to see how readers are responding to my rambling. I use Clicky, which I’ve found quite nice even if it isn’t free (but it does provide real-time updates).

Here are the stats for the past month:

  • 48%: directly or through bookmarks
  • 25%: links from other sites
  • 15%: RSS readers and social media
  • 12%: (non-paid) search results

Note that I don’t find out who is reading the blog through RSS readers; I only see log entries for people who click through the readers, e.g., to read or post comments.
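For the curious, here is a rough sketch of how a breakdown like the one above could be computed from raw referrer strings. It is not how Clicky does it: the host lists, bucket names, and function names are all made up for illustration, and a real analytics tool would recognize far more sources.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical host lists, for illustration only.
SEARCH_HOSTS = {"google.com", "bing.com", "search.yahoo.com"}
FEED_AND_SOCIAL_HOSTS = {"feedly.com", "twitter.com", "facebook.com"}


def classify_referrer(referrer):
    """Map a raw referrer string to one of four traffic-source buckets."""
    if not referrer:
        return "direct"  # no referrer: typed URL or bookmark
    host = (urlparse(referrer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in SEARCH_HOSTS:
        return "search"
    if host in FEED_AND_SOCIAL_HOSTS:
        return "rss_social"
    return "link"  # any other site linking in


def source_breakdown(referrers):
    """Return the rounded percentage of visits per bucket."""
    counts = Counter(classify_referrer(r) for r in referrers)
    total = sum(counts.values())
    return {bucket: round(100 * n / total) for bucket, n in counts.items()}
```

Note that a sketch like this shares the limitation described above: it only sees visits that actually hit the site, so people who read the full feed inside an RSS reader never show up.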

The searches are certainly the most entertaining bits in the log. Here are a few I found particularly amusing:

  • channel for inspired people
  • english sex channel
  • how to make pipe quick
  • “keep yourself on the gravy train for life”
  • psychology of noisy people

I would be curious to know more about who is reading the blog through RSS readers. Anyone here have advice on how or whether it is feasible to do so?

Note: my eulogies for privacy notwithstanding, I respect the anonymity of my readers and will only disclose log data in forms like the above, which do not disclose any even remotely personally identifying information.