My friend Evan Sandhaus at the New York Times Company told me the other day that the paper of record would be releasing a large collection of its articles. Well, the New York Times Annotated Corpus is here!
For full details check out this overview document, but here are some vital stats to whet your appetite:
- Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
- Over 650,000 article summaries written by the staff of The New York Times Index Department.
- Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
- Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at NYTimes.com.
LDC members can obtain the corpus for free; non-members pay $300.
This is an exciting development, and yet another encouraging sign that old media dogs can learn new tricks. Thanks to Jon and Panos for posting about it today.