My friend Evan Sandhaus at the New York Times Company told me the other day that the paper of record would be releasing a large collection of its articles. Well, the New York Times Annotated Corpus is here!
For full details, check out this overview document, but here are some vital stats to whet your appetite:
- Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
- Over 650,000 article summaries written by the staff of The New York Times Index Department.
- Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
- Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at NYTimes.com.
LDC members can obtain the corpus for free; non-members pay $300.