As part of a personal research effort, I wrote a batch download program to collect news articles over a 5-week period. I then analyzed the titles of those articles. The highlights are below:
- 79 directories of raw data with 7593 files that take up 370 MB on disk.
- There are 48237 unique article titles in the data.
- There are 33866 unique words in the data (give or take a few data cleanup issues).
- Word frequency drops off very steeply: 24173 words, or 71%, occur fewer than 10 times (pretty much as expected).
- 1505 words, 4%, appear 100 or more times.
- ‘Obama’ appears 4621 times, which puts it in the 99.9th percentile, edging out ‘is’, ‘at’ and ‘news’.
- ‘iPhone’ has 2408 occurrences, also in the 99.9th percentile.
- About 100 words start with ‘qu’.
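For anyone who wants to produce similar numbers, here is a minimal sketch of how statistics like these could be computed from a list of titles. It is not the program I actually ran: the function names, the regex tokenizer, the case-insensitive counting, and the choice to de-duplicate titles before counting words are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_counts(titles):
    """Count word occurrences across unique titles (assumed: case-insensitive)."""
    counts = Counter()
    for title in set(titles):  # assumed: duplicate titles are counted once
        # Assumed tokenizer: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", title.lower()))
    return counts

def summarize(counts, rare_below=10, common_at_least=100, prefix="qu"):
    """Summarize a word-frequency table using the thresholds from the list above."""
    total = len(counts)
    rare = sum(1 for c in counts.values() if c < rare_below)
    common = sum(1 for c in counts.values() if c >= common_at_least)
    with_prefix = sum(1 for w in counts if w.startswith(prefix))
    return {
        "unique_words": total,
        "rare": rare,
        "rare_share": rare / total if total else 0.0,
        "common": common,
        "with_prefix": with_prefix,
    }

# Tiny made-up sample, only to show the shape of the input and output.
sample_titles = [
    "Obama speaks at summit",
    "Obama speaks at summit",  # duplicate title, ignored
    "Quick question on iPhone sales",
    "iPhone news: Obama is quoted",
]
counts = word_counts(sample_titles)
stats = summarize(counts, rare_below=2, common_at_least=2)
print(stats)
```

On the real corpus the thresholds would stay at their defaults (under 10 and 100 or more); the sample lowers them only because it has four titles.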
A few other interesting observations shared by a co-worker looking at the corpus:
- peace (200) is less than war (734)
- microsoft (725) is less than twitter (839) is less than facebook (948) is less than google (1023) is less than apple (1658)
- microsoft (725) is less than good (726): close, but no cigar
- small (346) is less than big (994), so big is about 3x small
- love (710) is less than money (711)
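Comparisons like these, and the percentile figures quoted earlier, are simple lookups once a word-frequency table exists. The sketch below is illustrative only: the helper names are hypothetical, the percentile definition (share of distinct words with a strictly lower count) is one of several reasonable choices and may not match the one used for the numbers above, and the table holds just four of the counts from this post.

```python
from collections import Counter

def percentile_rank(counts, word):
    """Percentage of distinct words whose count is strictly below `word`'s count."""
    target = counts[word]
    below = sum(1 for c in counts.values() if c < target)
    return 100.0 * below / len(counts)

def compare(counts, a, b):
    """Return (count of a, count of b, ratio of b to a)."""
    ca, cb = counts[a], counts[b]
    return ca, cb, (cb / ca if ca else float("inf"))

# Counts taken from the observations above; a real table would hold every word.
counts = Counter({"small": 346, "big": 994, "peace": 200, "war": 734})

small, big, ratio = compare(counts, "small", "big")
print(f"small={small} big={big} ratio={ratio:.2f}")
print(f"'big' percentile within this tiny table: {percentile_rank(counts, 'big'):.1f}")
```

With only four entries the percentile is not meaningful; it becomes useful when run against the full table of 33866 words.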