Thursday, December 02, 2004

A few more news sources added...

Corrected the error when parsing www.hindu.com. Finished adding www.xinhuanet.com. The root page of this news source has a lot of links, and retrieving a web page from it is also very slow. Also added www.priu.gov.lk, which is the official website of the Government of Sri Lanka. Also tried to add www.peaceinsrilanka.org, but wasn't able to fully complete this as the articles from this news source use varying HTML templates.

Wednesday, December 01, 2004

www.hindu.com added and a few bugs fixed

Added a new news source today: www.hindu.com. The setup of this website is similar to www.dailynews.lk. It was working fine for a while, but the web crawler seemed to have trouble accessing the site during the final run. Will have to check up on this later.

Fixed a few major bugs while at it too. Some news sources report the date as 1st, 11th, etc. I used to assume that the number portion of such a date would always be 2 digits. But as the beginning of December rolled around, I ran into those 1-digit dates. This was a quick fix.
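A minimal sketch of this kind of fix in Python (not the crawler's actual code; the names ORDINAL_DAY and parse_day are made up for illustration). The idea is to accept either 1 or 2 digits in front of the ordinal suffix instead of assuming exactly 2:

```python
import re

# Hypothetical pattern: a 1- or 2-digit day followed by an ordinal suffix
# (st, nd, rd, th). The {1,2} is the point of the fix; a pattern that
# demands exactly two digits would miss "1st" through "9th".
ORDINAL_DAY = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th)\b", re.IGNORECASE)


def parse_day(text):
    """Return the day of the month found in text as an int, or None."""
    match = ORDINAL_DAY.search(text)
    return int(match.group(1)) if match else None
```

With this, both "1st December 2004" and "11th December 2004" yield a day number, and text without an ordinal date yields None.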

Related to the date was the dynamic creation of URLs containing dates, such as XXX.com/2004/12/01. When creating such a string, I was forgetting to add the leading 0 when 1-digit months and days were needed. This was also quickly fixed.
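For illustration, a small Python sketch of building such a URL with zero-padded fields. The function name dated_url is hypothetical and the base is just the XXX.com placeholder from above; the crawler's real code may look quite different:

```python
from datetime import date


def dated_url(base, day):
    """Build base/YYYY/MM/DD, padding month and day to two digits."""
    # The :02d format spec adds the leading 0 for 1-digit months and days.
    return f"{base}/{day.year:04d}/{day.month:02d}/{day.day:02d}"


# Example with the placeholder host used in this post.
url = dated_url("XXX.com", date(2004, 12, 1))
```

Here url comes out as XXX.com/2004/12/01 rather than XXX.com/2004/12/1.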

There was a bug in one of the main loops in the crawler where some of the root starting URLs were skipped: at times, the call to retrieve the next root URL was made one time too many inside this loop. This was also quickly fixed.
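One way a bug like this can arise (an assumption, since the original loop is not shown here) is fetching the next root URL both at the top of the loop and again in an error branch. A rough Python sketch of the safer shape, where the loop itself is the only place that advances; crawl_roots and fetch are hypothetical names:

```python
def crawl_roots(root_urls, fetch):
    """Try every root URL exactly once; return the ones fetched successfully."""
    fetched = []
    for url in root_urls:  # the only place the next root URL is taken
        try:
            fetch(url)
        except OSError:
            # Do not pull another root URL here. Doing so would silently
            # skip the URL that follows the failing one.
            continue
        fetched.append(url)
    return fetched
```

Because the for statement advances once per iteration, a failure on one root URL cannot cause the following one to be skipped.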