Lately I have been interested in writing a web crawler to help me find and exploit information on the web. I have two ideas that might be feasible. One is to write a program that wanders the web looking for pronunciation data. The other is to crawl domain-specific websites for keywords and conversations to help identify aliases and synonyms. For example, I went to a popular music site and found the following conversational thread.
Here is what I extracted from a single thread:
The track is caled trinity
music on it
Name that song!
What’s the name of the song from the second half of the trailer
remixed a track
So what is the name of the song at the second half of the trailer?
wait, where can you get “Tempest”?
I’m also interested in the music
what is the second song played
I’m looking for the song
“Born to be Dizzy” by The Starlite Desperation.
It starts with something like “All my heroes turned out to be …”
This process could be automated using NLP and used to collect vast amounts of linguistic information from thousands of websites. The best part is that it's free!
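To give a rough idea of what the automation might look like, here is a minimal Python sketch of the extraction step only. It is not a crawler and not real NLP: it assumes the thread text has already been downloaded and split into lines, and it just looks for a hand-picked set of seed words plus anything in double quotes. The names SEED_TERMS, QUOTED, and extract_mentions are made up for this illustration, and the seed list is my guess at what would suit a music site.

```python
import re
from collections import Counter

# Hypothetical seed terms for the music domain; a real run would need a better list.
SEED_TERMS = {"song", "track", "music"}

# Matches text inside straight or curly double quotes, e.g. “Tempest”.
QUOTED = re.compile(r'[“"]([^”"]+)[”"]')


def extract_mentions(lines):
    """Return (matching_lines, term_counts, quoted_titles) for one thread."""
    matches = []
    counts = Counter()
    titles = []
    for line in lines:
        # Crude tokenizer: lowercase runs of letters, so punctuation is dropped.
        words = set(re.findall(r"[a-z]+", line.lower()))
        found = SEED_TERMS & words
        if found:
            matches.append(line)
            counts.update(found)
        # Quoted strings are collected whether or not a seed term is present.
        titles.extend(QUOTED.findall(line))
    return matches, counts, titles


# A few lines taken from the thread above.
thread = [
    "Name that song!",
    "remixed a track",
    "I’m also interested in the music",
    "wait, where can you get “Tempest”?",
    "“Born to be Dizzy” by The Starlite Desperation.",
]
matches, counts, titles = extract_mentions(thread)
```

Seeing "song", "track", and "music" turn up in the same thread is the kind of evidence that could suggest they are aliases in this domain. Exact word matching is a shortcut here: plurals like "songs" would be missed, and a real version would want stemming or a proper NLP toolkit.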