One more site added
Added another news source to the crawler - http://www.tamilnet.com
I also tried to add http://www.dailynews.lk and http://www.sundayobserver.lk, but had some trouble with those two news sources. Both declare a base URL inside the page itself, and all of their internal links are relative to that base rather than to the page address. The crawler needs to pick up the base URL whenever a news source defines one. While I was trying to implement this, both sites went down.
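Here is a rough sketch of how grabbing the base could work. It assumes the base is declared with an HTML <base href="..."> tag, which I have not confirmed for these two sites. The names BaseHrefParser and resolve_link are made up for illustration and are not part of the crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefParser(HTMLParser):
    """Records the href of the first <base> tag in a page, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; only the first <base> counts.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def resolve_link(page_url, html, link):
    """Resolve a link against the page's declared base, else the page URL."""
    parser = BaseHrefParser()
    parser.feed(html)
    if parser.base_href:
        # The base itself may be relative, so resolve it against the page.
        base = urljoin(page_url, parser.base_href)
    else:
        base = page_url
    return urljoin(base, link)
```

If a page has no base tag, the link simply resolves against the page's own address, so the same function can be used for every news source.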
One thing that still needs to be implemented is ignoring old articles. Most articles have a date embedded in them, so I just need to add some logic to retrieve it and compare it to the current date.
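A minimal sketch of that date check is below. It assumes dates appear in the article text in a form like "3 March 2006", which will not hold for every source, and the two-day cutoff is an arbitrary placeholder. The names extract_date and is_old are hypothetical.

```python
import re
from datetime import date, timedelta

# Assumed date format: day, English month name, four-digit year.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}
DATE_PATTERN = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def extract_date(text):
    """Return the first valid date found in the text, or None."""
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            return date(int(year), MONTHS[month], int(day))
        except ValueError:
            continue  # e.g. "31 February 2006"; try the next match
    return None


def is_old(text, today, max_age_days=2):
    """True if the article's embedded date is older than the cutoff."""
    published = extract_date(text)
    if published is None:
        return False  # no date found: keep the article rather than guess
    return today - published > timedelta(days=max_age_days)
```

The crawler would call is_old(article_text, date.today()) and skip the article when it returns True. Articles with no recognizable date are kept, which seemed safer than dropping them.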