Monday, November 01, 2004

One more site added

Added another news source to the crawler -

Also tried to add two more news sources, but had some trouble with both. They bury the HTTP base URL in the page itself, and all the internal links are relative to it. I need to integrate grabbing the base URL, if one exists, for a news source. When I tried to implement it, both of these sites went down.
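A rough sketch of what grabbing the base could look like, assuming the sites declare it with a standard HTML `<base href="...">` tag (I have not confirmed that is what these two do). The names `LinkCollector` and `resolve_links` are made up for illustration and are not part of the crawler; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the first <base href> and every <a href> on a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            # Only the first <base> on the page counts.
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute link URLs, honoring a <base href> if one exists."""
    collector = LinkCollector()
    collector.feed(html)
    if collector.base_href:
        # The base itself may be relative, so resolve it against the page URL.
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in collector.hrefs]
```

If a page has no base tag, links simply resolve against the page's own URL, so the same function should work for the sources that are already in the crawler.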

One thing that still needs to be implemented is ignoring old articles. Most articles have a date embedded in them, so I just need to add some logic to retrieve it and compare it to the current date.
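Something like the following might do for the date check. It is only a sketch and makes a few assumptions: the date appears in the article text as "Month DD, YYYY" (real sources will likely need other patterns), an article counts as old after two days, and an article with no recognizable date is kept rather than dropped. `find_article_date` and `is_stale` are hypothetical helper names.

```python
import re
from datetime import date, timedelta

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches dates such as "November 01, 2004" (assumed format).
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\b")


def find_article_date(text):
    """Return the first date found in the text, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:
        # Something like "February 31, 2004": treat as no date.
        return None


def is_stale(text, today, max_age_days=2):
    """True if the article's embedded date is older than max_age_days."""
    found = find_article_date(text)
    if found is None:
        return False  # assumption: keep articles with no detectable date
    return (today - found) > timedelta(days=max_age_days)
```

The crawler would call `is_stale(article_text, date.today())` and skip the article when it returns True. Passing `today` in explicitly keeps the check easy to test.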

