The Noisy Channel

 

All the News that’s Fit to Text Mine

October 31st, 2008

My friend Evan Sandhaus at the New York Times Company told me the other day that the paper of record would be releasing a large collection of its articles. Well, the New York Times Annotated Corpus is here!

For full details, check out this overview document, but here are some vital stats to whet your appetite:

  • Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
  • Over 650,000 article summaries written by the staff of The New York Times Index Department.
  • Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
  • Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at NYTimes.com.

LDC members can obtain the corpus for free; non-members pay $300.

This is an exciting development, and yet another encouraging sign that old media dogs can learn new tricks. Thanks to Jon and Panos for posting about it today.
