All the news sources integrated so far now have date calculations. The web crawler seems to be working fine. Adding DailyNews was the biggest challenge, as it uses several different root URLs depending on how old the news articles are.
A few improvements can still be made to avoid fetching invalid pages. For example, for some news sources the URLs of valid articles always contain 'news', so URLs from those sources without it could be skipped.
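A minimal sketch of such a URL filter is shown here. The host name and the required path fragment are placeholders, not the crawler's actual configuration, and the function name looks_like_article is hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical per-source rules: a fragment that every valid article URL
# from that host is assumed to contain. Placeholder values only.
REQUIRED_PATH_PART = {
    "news.example.com": "/news/",
}


def looks_like_article(url):
    """Return True if the URL passes the rule for its host, or has no rule."""
    parts = urlparse(url)
    required = REQUIRED_PATH_PART.get(parts.netloc)
    if required is None:
        # No rule is known for this source, so the URL is not filtered out.
        return True
    return required in parts.path
```

The crawler could call this before fetching a page, so that sources without a known rule behave exactly as they do now.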
I tried performing clustering to see how it works and found some interesting results. For example, DailyNews and SLBC seem to have very similar articles, word for word at times. The agglomerative clustering did a decent job of grouping news articles on the same event together, but it also broke down a lot.
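For reference, the general idea of agglomerative clustering can be sketched as follows. This is only an illustration under assumptions: raw word counts, cosine similarity, average linkage and a fixed similarity threshold are choices made for the sketch, not necessarily the settings used in the experiment above.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(texts, threshold):
    """Merge the two most similar clusters until none reach the threshold."""
    vectors = [vectorize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]  # start with one cluster per text

    def linkage(c1, c2):
        # Average linkage: mean pairwise similarity between the two clusters.
        sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

The function returns lists of indices into the input list, one list per cluster. The threshold controls how readily articles are merged, so it would need tuning against real articles.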
One problem that came to mind was the difference in document length. I tried capping the number of valid tokens (excluding stopwords) taken from each article at 500, 250 and 100. I didn't have enough time to analyze the results closely. I will also have to investigate stemming, giving the title more weight, etc.
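The token cap can be sketched roughly as below. The stopword list here is a tiny illustrative one and the tokenizer is a simple regular expression; both are assumptions, as is the function name capped_tokens.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}


def capped_tokens(text, cap):
    """Return at most `cap` non-stopword tokens, in order of appearance."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(tokens) >= cap:
            break  # the cap counts only tokens that were kept
        if word not in STOPWORDS:
            tokens.append(word)
    return tokens
```

Calling it with a cap of 500, 250 or 100 gives the three variants described above. Because the cap is applied after stopword removal, a long article and a short one contribute a comparable number of content words.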