Monday, November 01, 2004

One more site added

Added another news source to the crawler - http://www.tamilnet.com

I also tried to add http://www.dailynews.lk and http://www.sundayobserver.lk, but had some trouble with those two news sources. They bury a base URL (an HTML <base href>) in their pages, and all the internal links are relative to it. The crawler needs to grab that base, if a news source has one, and resolve links against it. When I tried to implement this, both sites went down.
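A rough sketch of what grabbing the base could look like, assuming the sites use a standard <base href> tag. This is not the crawler's actual code: BaseAndLinkParser and resolve_links are hypothetical names, and the example.com URLs are placeholders. It uses only Python's standard html.parser and urllib.parse.urljoin.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAndLinkParser(HTMLParser):
    """Hypothetical helper: collects the first <base href> and every <a href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            # Only the first <base> on a page counts.
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for all links, honouring <base href> if present."""
    parser = BaseAndLinkParser()
    parser.feed(html)
    # The base itself may be relative, so resolve it against the page URL first.
    if parser.base_href:
        base = urljoin(page_url, parser.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in parser.hrefs]
```

If a page has no base, the links simply resolve against the page's own URL, so the same function should work for the sources that are already in the crawler.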

One thing that still needs to be implemented is ignoring old articles. Most articles have a date embedded in them. I just need to add some logic to retrieve it and compare it to the current date.
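The date check might look something like the sketch below. The date format is an assumption: I am pretending articles carry a date like "25 October 2004", and each news source would probably need its own pattern. find_article_date, is_stale and the two-day cutoff are all made up for illustration.

```python
import re
from datetime import date, timedelta

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Assumed format: day, full English month name, four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTH_NAMES) + r") (\d{4})\b")


def find_article_date(text):
    """Return the first date found in the text, or None if there is none."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = match.groups()
    try:
        return date(int(year), MONTHS[month], int(day))
    except ValueError:
        # Matched the pattern but is not a real date, e.g. "31 February 2004".
        return None


def is_stale(text, today, max_age_days=2):
    """True if the article's embedded date is older than max_age_days."""
    published = find_article_date(text)
    if published is None:
        # No usable date: keep the article rather than drop it silently.
        return False
    return today - published > timedelta(days=max_age_days)
```

Passing today in as an argument (date.today() in the real crawler) keeps the comparison easy to test against a fixed date.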
