Pet Projects

A few more news sources added...

Corrected the error when parsing www.hindu.com. Finished adding www.xinhuanet.com. The root page of this news source has a lot of links, and retrieving a web page from it is also very slow. Also added www.priu.gov.lk, which is the Official Website of the Government of Sri Lanka. Also tried to add www.peaceinsrilanka.org, but wasn't able to fully complete this as the articles from this news source have varying HTML templates.

www.hindu.com added and a few bugs fixed

Added a new news source today: www.hindu.com. The setup of this website is similar to www.dailynews.lk. It was working fine for a while, but the web crawler seemed to have trouble accessing the site during the final run. Will have to check up on this later.

Fixed a few major bugs while at it too. Some news sources report the date as 1st, 11th, etc. I used to assume that the number portion of such a date would always be 2 digits. But as the beginning of December rolled around, I ran into those 1 digit dates. This was a quick fix.
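A minimal sketch of how the day number could be pulled out of such a date without assuming its width. The class name, method name and regular expression are my own illustration here, not the crawler's actual code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the day of month from text such as "1st" or "11th".
final class OrdinalDayParser {
    // One OR two digits followed by an English ordinal suffix.
    private static final Pattern ORDINAL_DAY =
            Pattern.compile("\\b(\\d{1,2})(?:st|nd|rd|th)\\b");

    /** Returns the first ordinal day found in the text, or -1 if there is none. */
    static int parseDay(String text) {
        Matcher m = ORDINAL_DAY.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

The point is simply that the digit count is left open (`\d{1,2}`) instead of being fixed at two.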

Related to the date was the dynamic creation of URLs containing dates, such as XXX.com/2004/12/01. When creating such a string, I was forgetting to add the leading 0 when single-digit months and days were needed. This was also quickly fixed.
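For illustration, zero-padded date URLs can be built with `String.format`. This is only a sketch: the class name and the base URL parameter are assumptions, and the real crawler may build the string differently.

```java
import java.util.Locale;

// Hypothetical helper: builds URLs of the form base/yyyy/MM/dd.
final class DateUrlBuilder {
    /** month is 1-12 and day is 1-31; %02d adds the leading zero when needed. */
    static String build(String base, int year, int month, int day) {
        return String.format(Locale.ROOT, "%s/%04d/%02d/%02d", base, year, month, day);
    }
}
```

With this, `DateUrlBuilder.build("http://XXX.com", 2004, 12, 1)` gives `http://XXX.com/2004/12/01`.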

There was a bug in one of the main loops in the crawler: at times, the call to retrieve the next root URL was made one time too many inside the loop, so some of the root starting URLs were skipped. This was also quickly fixed.
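The shape of the fixed loop is roughly the following. This is a sketch with made-up names, where adding to a list stands in for the actual crawl of each root.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of iterating over the root URLs without skipping any.
final class RootUrlLoop {
    /** Visits every root URL exactly once and returns them in visiting order. */
    static List<String> visitAll(List<String> rootUrls) {
        List<String> visited = new ArrayList<>();
        Iterator<String> it = rootUrls.iterator();
        while (it.hasNext()) {
            // Advance exactly once per iteration. A second it.next() anywhere
            // in this body would silently skip a root URL.
            String root = it.next();
            visited.add(root); // placeholder for crawling this root
        }
        return visited;
    }
}
```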

Fixed title gathering

Fixed the title gathering mechanism. The previous version used the TITLE tag in the HTML file. However, many news sources were not putting the title of the news article in the TITLE tag. Hence, I developed a new mechanism which retrieves the news article title from the body of the article's HTML file. This is done by looking for the specific places where each news source embeds the article title.
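Roughly, the idea can be sketched like this: each news source gets its own pattern for where the headline sits, with the TITLE tag kept only as a fallback. The class name, the fallback behaviour and the example pattern are assumptions for illustration; the real markers differ per source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of per-source title extraction.
final class TitleExtractor {
    private static final Pattern TITLE_TAG = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * sourcePattern must capture the headline in group 1.
     * Falls back to the TITLE tag, then to an empty string.
     */
    static String extract(String html, Pattern sourcePattern) {
        Matcher m = sourcePattern.matcher(html);
        if (m.find()) {
            return clean(m.group(1));
        }
        Matcher t = TITLE_TAG.matcher(html);
        return t.find() ? clean(t.group(1)) : "";
    }

    /** Strips nested tags and collapses whitespace. */
    private static String clean(String raw) {
        return raw.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }
}
```

A source that wraps its headline in, say, `<font class="headline">...</font>` would be given a pattern such as `Pattern.compile("<font class=\"headline\">(.*?)</font>", Pattern.DOTALL)`.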

The main benefit is that I no longer need to visit each news article individually when judging the quality of a particular clustering of news articles, which was cumbersome and time consuming. Now I can just inspect the title of a news article and get a good idea of what the article is about.

Stable version

All the news sources that have been integrated so far now have date calculations. The web crawler seems to be working fine. Adding DailyNews was the biggest challenge as it has several different root URLs depending on how old the news articles can be.

A few improvements can be made to prevent the fetching of invalid pages. For example, for some news sources, valid news articles always have 'news' in their URL, so URLs without it could be skipped.
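A possible shape for such a filter, as a sketch only. The class, the host `www.example.com` and the `/news/` fragment are placeholders, not rules taken from the actual news sources.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-fetch filter: some hosts only publish articles under a known path fragment.
final class UrlFilter {
    private final Map<String, String> requiredFragmentByHost = new HashMap<>();

    /** Registers a fragment that every valid article URL on this host must contain. */
    void require(String host, String fragment) {
        requiredFragmentByHost.put(host, fragment);
    }

    /** Returns false for malformed URLs and for URLs that break their host's rule. */
    boolean shouldFetch(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        String fragment = requiredFragmentByHost.get(uri.getHost());
        if (fragment == null) {
            return true; // no rule registered for this host
        }
        String path = uri.getPath();
        return path != null && path.contains(fragment);
    }
}
```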

I tried performing clustering to see how it works. Found some interesting results. For example, DailyNews and SLBC seem to have very similar articles, word for word at times. The agglomerative clustering was able to do a decent job of grouping news articles on the same event together. But it did break down a lot.

One problem that came to mind was the difference in document length. I tried capping the number of valid tokens (excluding the stopwords) used from a particular article at 500, 250 and 100. I didn't have much time to analyze the results closely. I will also have to investigate using stemming, giving the title more weight, etc.
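The capping itself can be sketched as below. The tokenizer (lowercase, split on non-alphanumerics) and the tiny stopword list are stand-ins I picked for the example, not what the clustering code actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: keep only the first `cap` non-stopword tokens of an article.
final class TokenCapper {
    // Tiny illustrative stopword list; a real one would be much longer.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "of", "in", "and", "to", "is"));

    static List<String> firstTokens(String text, int cap) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (kept.size() >= cap) {
                break; // cap reached; the rest of the article is ignored
            }
            if (token.isEmpty() || STOPWORDS.contains(token)) {
                continue; // stopwords do not count towards the cap
            }
            kept.add(token);
        }
        return kept;
    }
}
```

Calling `firstTokens(articleText, 100)` on every article means long and short articles contribute at most the same number of terms to the comparison.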

More news sources

Added a few more news sources to the crawler:

http://www.dailynews.lk

http://www.sundayobserver.lk

http://www.lankabusinessonline.com

http://www.alertnet.org/thenews/emergency/LK_CON.htm

http://news.yahoo.com/news?tmpl=index&cid=1534

http://www.hindustantimes.com/news/7170_0,00050002.htm

More date calculations added

Added date calculations for http://www.lankatruth.com, http://www.slbc.lk and http://www.bbc.co.uk/sinhala.

Date calculations working

Got the date calculations working. Date determination works for articles from http://www.tamilnet.com and http://www.asiantribune.com. Currently working on http://www.lankatruth.com.

Started to extract date

Started to extract the date from articles. I tried http://www.tamilnet.com first and was able to get the date from the article without a problem. However, I wasn't too successful in using the Java Calendar object to determine if the article date was older than a certain date. Need to look at a few examples and get the Calendar date calculations working.
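For reference, one way the "older than a certain date" check can be written with `java.util.Calendar`. The class and method names are hypothetical and this is a sketch, not the code that ended up in the crawler. Two things are easy to trip over: Calendar months are 0-based, and `add` modifies the object in place, so the sketch clones before subtracting days.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

// Hypothetical helper for comparing article dates against a cutoff.
final class ArticleAge {
    /** Builds a Calendar at midnight; month is 1-12 here, Calendar itself is 0-based. */
    static Calendar dateOf(int year, int month, int day) {
        return new GregorianCalendar(year, month - 1, day);
    }

    /** Returns a copy of the reference date moved back by the given number of days. */
    static Calendar daysBefore(Calendar reference, int days) {
        Calendar copy = (Calendar) reference.clone();
        copy.add(Calendar.DAY_OF_MONTH, -days);
        return copy;
    }

    /** True if the article date falls strictly before the cutoff. */
    static boolean isOlderThan(Calendar articleDate, Calendar cutoff) {
        return articleDate.before(cutoff);
    }
}
```

For example, with a reference date of `dateOf(2004, 11, 3)`, `daysBefore(reference, 7)` lands on October 27, 2004, and an article dated October 20 counts as older than that cutoff while one dated November 1 does not.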

Thursday, December 02, 2004

A few more news sources added...

Wednesday, December 01, 2004

www.hindu.com added and a few bugs fixed

Monday, November 29, 2004

Fixed title gathering

Saturday, November 27, 2004

Stable version

Monday, November 15, 2004

More news sources

More date calculations added

Thursday, November 04, 2004

Date calculations working

Wednesday, November 03, 2004

Started to extract date
