News Title Data Corpus Stats

As part of a personal research effort, I wrote a batch download program to collect news articles over a five-week period. I then analyzed the titles of those articles. The highlights are below:

  • 79 directories of raw data with 7593 files that take up 370 MB on disk.
  • There are 48237 unique article titles in the data.
  • There are 33866 unique words in the data (give or take a few data cleanup issues).
  • Word frequency drops off very steeply: 24173 words, or 71%, occur fewer than 10 times (pretty much as expected).
  • 1505 words, 4%, appear 100 or more times.
  • ‘Obama’ appears 4621 times, putting it in the 99.9th percentile and edging out ‘is’, ‘at’ and ‘news’.
  • ‘iPhone’ has 2408 occurrences, also in the 99.9th percentile.
  • About 100 words start with ‘qu’.
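
For anyone who wants to reproduce numbers like these on their own set of titles, here is a minimal sketch in Python. It is not the program I used for the stats above. The tokenizer (lowercase, split on anything that is not a letter, digit or apostrophe), the function names and the sample titles are all assumptions made for illustration, and a different tokenizer will give somewhat different counts.

```python
import re
from collections import Counter


def word_counts(titles):
    """Count word occurrences across unique titles (case-insensitive)."""
    counts = Counter()
    for title in set(titles):  # drop duplicate titles first
        # Assumed tokenizer: runs of letters, digits and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", title.lower()))
    return counts


def share_below(counts, threshold):
    """Fraction of distinct words with fewer than `threshold` occurrences."""
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)


def percentile_rank(counts, word):
    """Percentage of distinct words that occur strictly less often than `word`."""
    target = counts[word]
    below = sum(1 for c in counts.values() if c < target)
    return 100.0 * below / len(counts)


# Made-up sample titles, only to show the calls.
titles = [
    "Obama speaks at news conference",
    "iPhone sales rise",
    "Obama visits iPhone factory",
    "Obama speaks at news conference",  # duplicate, counted once
]
counts = word_counts(titles)
print(len(counts), "unique words")
print(share_below(counts, 2), "of words appear fewer than 2 times")
print(percentile_rank(counts, "obama"), "percentile for 'obama'")
```

On the real corpus, `share_below(counts, 10)` would correspond to the 71% figure and `percentile_rank` to the percentile figures, assuming the same definition of a word.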

A few other interesting observations shared by a co-worker looking at the corpus:

  • peace(200) is less than war(734)
  • microsoft(725) is less than twitter(839) is less than facebook(948) is less than google(1023) is less than apple(1658)
  • microsoft(725) is less than good(726): close, but no cigar
  • small(346) is less than big(994); big is about 3x small
  • love(710) is less than money(711)
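
The comparisons above are easy to check with a few lines of Python. This is only a sketch: the counts are copied from the list above, and the helper names `rank_by_count` and `ratio` are made up for illustration.

```python
# Counts copied from the observations above.
counts = {
    "peace": 200, "war": 734,
    "microsoft": 725, "twitter": 839, "facebook": 948,
    "google": 1023, "apple": 1658,
    "good": 726, "small": 346, "big": 994,
    "love": 710, "money": 711,
}


def rank_by_count(counts, words):
    """Return `words` ordered from least to most frequent."""
    return sorted(words, key=lambda w: counts.get(w, 0))


def ratio(counts, a, b):
    """How many times more often `a` occurs than `b`."""
    return counts[a] / counts[b]


companies = ["apple", "google", "facebook", "twitter", "microsoft"]
print(rank_by_count(counts, companies))
print(round(ratio(counts, "big", "small"), 1))
print(round(ratio(counts, "war", "peace"), 1))
```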