Pet Projects: November 2004

Fixed title gathering

Fixed the title gathering mechanism. The previous version used the TITLE tag in the HTML file. However, many news sources do not put the title of the news article in the TITLE tag. Hence, I developed a new mechanism which retrieves the title from the body of the article's HTML file. This is done by looking in the specific places where each news source embeds the article title.
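A minimal sketch of the idea in Java, using plain regular expressions instead of a real HTML parser. The two headline patterns are hypothetical examples of "specific places", not the ones my crawler actually uses, and falling back to the TITLE tag is an assumption as well.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: the marker patterns are made up for illustration.
class TitleExtractor {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // One pattern per place a news source might embed its headline (hypothetical).
    private static final List<Pattern> HEADLINE_PATTERNS = List.of(
        Pattern.compile("<h1[^>]*>(.*?)</h1>", FLAGS),
        Pattern.compile("<font[^>]*class=\"headline\"[^>]*>(.*?)</font>", FLAGS));

    // Assumed fallback: the TITLE tag the previous version relied on.
    private static final Pattern TITLE_TAG =
        Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);

    static Optional<String> extractTitle(String html) {
        for (Pattern p : HEADLINE_PATTERNS) {
            Optional<String> hit = firstGroup(p, html);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return firstGroup(TITLE_TAG, html);
    }

    // Returns the first captured group with nested tags and extra whitespace removed.
    private static Optional<String> firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String text = m.group(1)
            .replaceAll("<[^>]+>", " ")
            .replaceAll("\\s+", " ")
            .trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }
}
```

Calling TitleExtractor.extractTitle(html) on a page whose headline sits in an H1 element returns that headline, even when the TITLE tag only holds the name of the news source.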

The main benefit is that I no longer need to visit each news article individually when judging the quality of a particular clustering of news articles, which was cumbersome and time consuming. Now I can just inspect the title of a news article and get a good idea of what the article is about.

Stable version

All the news sources integrated so far now have date calculations. The webcrawler seems to be working fine. Adding DailyNews was the biggest challenge, as it has several different root URLs depending on how old the news articles are.

A few improvements can be made to prevent the fetching of invalid pages. For example, valid news articles from some news sources always have 'news' in their URL, so URLs from those sources without it could be skipped.
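Something like the following could do the filtering. This is only a sketch: the host name and the rule table are hypothetical, and I check the path rather than the whole URL so that a host name containing 'news' does not match by accident.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Sketch only: the host and its rule are hypothetical.
class UrlFilter {
    // Per-host rule: a substring that the path of every valid article URL contains.
    private static final Map<String, String> REQUIRED_IN_PATH =
        Map.of("www.example.lk", "news");

    static boolean looksLikeArticle(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URL
        }
        String host = uri.getHost();
        String path = uri.getPath();
        if (host == null || path == null) {
            return false;
        }
        String required = REQUIRED_IN_PATH.get(host.toLowerCase(Locale.ROOT));
        // Hosts without a rule are accepted as before.
        return required == null || path.toLowerCase(Locale.ROOT).contains(required);
    }
}
```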

I tried performing clustering to see how it works and found some interesting results. For example, DailyNews and SLBC seem to have very similar articles, word for word at times. The agglomerative clustering was able to do a decent job of grouping news articles on the same event together. But it did break down a lot.
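For reference, here is a small sketch of agglomerative clustering in Java. The details are assumptions made for illustration: raw term-frequency vectors, cosine similarity, single-link merging, and a similarity threshold as the stopping rule. My actual setup may differ on each of these.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: similarity measure, linkage and stopping rule are assumptions.
class AgglomerativeSketch {
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!tok.isEmpty()) {
                tf.merge(tok, 1, Integer::sum);
            }
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Single link: two clusters are as similar as their most similar pair of members.
    static double link(List<Integer> c1, List<Integer> c2, List<Map<String, Integer>> vecs) {
        double best = 0.0;
        for (int i : c1) {
            for (int j : c2) {
                best = Math.max(best, cosine(vecs.get(i), vecs.get(j)));
            }
        }
        return best;
    }

    // Starts with one cluster per document and repeatedly merges the most
    // similar pair until no pair reaches the threshold. Returns document indices.
    static List<List<Integer>> cluster(List<String> docs, double threshold) {
        List<Map<String, Integer>> vecs = new ArrayList<>();
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            vecs.add(termFrequencies(docs.get(i)));
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (true) {
            double best = threshold;
            int bi = -1, bj = -1;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double s = link(clusters.get(i), clusters.get(j), vecs);
                    if (s >= best) {
                        best = s;
                        bi = i;
                        bj = j;
                    }
                }
            }
            if (bi < 0) {
                return clusters;
            }
            clusters.get(bi).addAll(clusters.remove(bj));
        }
    }
}
```

Near-identical articles from two sources end up with a cosine similarity close to 1 and merge first, which matches what I saw with DailyNews and SLBC.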

One problem that came to mind was the difference in document length. I tried capping the number of valid tokens (excluding the stopwords) that would be used from a particular article at 500, 250 and 100. I didn't have much time to analyze the results closely. I will also have to investigate using stemming, giving the title more weight, etc.
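The capping itself is simple. A sketch, assuming the article has already been split into lowercase tokens and that the cap keeps the first tokens in document order (the class and method names are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: keeps the first maxTokens non-stopword tokens of an article.
class TokenCap {
    static List<String> capTokens(List<String> tokens, Set<String> stopwords, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (kept.size() >= maxTokens) {
                break; // cap reached, ignore the rest of the article
            }
            if (!stopwords.contains(token)) {
                kept.add(token); // stopwords do not count towards the cap
            }
        }
        return kept;
    }
}
```

With a cap of 100, a long feature article and a short wire report each contribute at most 100 tokens to the comparison.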

More news sources

Added a few more news sources to the crawler:

http://www.dailynews.lk

http://www.sundayobserver.lk

http://www.lankabusinessonline.com

http://www.alertnet.org/thenews/emergency/LK_CON.htm

http://news.yahoo.com/news?tmpl=index&cid=1534

http://www.hindustantimes.com/news/7170_0,00050002.htm

More date calculations added

Added date calculations for http://www.lankatruth.com, http://www.slbc.lk and http://www.bbc.co.uk/sinhala.

Date calculations working

Got the date calculations working. Date determination works for articles from http://www.tamilnet.com and http://www.asiantribune.com. Currently working on http://www.lankatruth.com.

Started to extract date

Started to extract the date from articles. I tried http://www.tamilnet.com first and was able to get the date from the article without a problem. However, I wasn't too successful in playing with the Java Calendar object to determine if the article date was older than a certain date. Need to look at a few examples and get the Calendar date calculations working.
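A sketch of the kind of Calendar comparison needed here. The date pattern passed to parse and the age limit are assumptions, since every news source formats its dates differently.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

// Sketch only: date patterns and the age limit are illustrative.
class ArticleAge {
    // Parses a date string such as "2004-11-03" with a pattern such as "yyyy-MM-dd".
    static Calendar parse(String text, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
        format.setLenient(false); // reject impossible dates like 2004-13-40
        try {
            Calendar cal = Calendar.getInstance();
            cal.setTime(format.parse(text));
            return cal;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    // True when the article date lies more than maxAgeDays before 'now'.
    static boolean isOlderThan(Calendar articleDate, Calendar now, int maxAgeDays) {
        Calendar cutoff = (Calendar) now.clone(); // do not modify the caller's Calendar
        cutoff.add(Calendar.DAY_OF_MONTH, -maxAgeDays);
        return articleDate.before(cutoff);
    }
}
```

Cloning before calling add matters because Calendar is mutable. Without the clone, the caller's 'now' would silently move back in time on every check.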

Stopwords

Added a stopword removal feature to the Indexer. It was a quick change, so I didn't get to comment the new code, even though the rest of the Indexer code base is well commented. Need to come up with a good stopword list. Need to add stemming, namely Porter stemming, soon as well.
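The feature boils down to something like this sketch. It is not the Indexer code itself: the tokenization rule (lowercase, split on anything that is not a letter or digit) and the names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Sketch only: tokenization and names are illustrative, not the Indexer's.
class StopwordFilter {
    static List<String> removeStopwords(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        // Lowercase first so the stopword list only needs lowercase entries.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```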

I indexed about 167 news articles the webcrawler had gathered from a few news sources. Surprisingly, it ran pretty fast. Still need to test it with a much larger input file. The speed also depends on the size of each news article.

One more site added

Added another news source to the crawler - http://www.tamilnet.com

Also tried to add http://www.dailynews.lk and http://www.sundayobserver.lk, but had some trouble with those two news sources. They bury the HTTP base in their webpages and all the internal links are relative to it. I need to add support for grabbing the HTTP base when a news source has one. When I tried to implement it, both these sites went down.
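A sketch of what grabbing the base could look like, assuming the "HTTP base" is an HTML BASE element with an href attribute. The regular expression is a simplification, and the example URLs in the note below are made up.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: assumes the base is declared as <base href="...">.
class BaseHrefResolver {
    private static final Pattern BASE_TAG = Pattern.compile(
        "<base\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the declared base if the page has one, otherwise the page URL.
    static URI effectiveBase(String html, URI pageUrl) {
        Matcher m = BASE_TAG.matcher(html);
        return m.find() ? pageUrl.resolve(m.group(1).trim()) : pageUrl;
    }

    // Resolves a link found in the page against the effective base.
    static URI resolveLink(String html, URI pageUrl, String href) {
        return effectiveBase(html, pageUrl).resolve(href);
    }
}
```

For a page at http://www.example.lk/index.html that declares the base http://www.example.lk/2004/11/15/, the link news01.html resolves to http://www.example.lk/2004/11/15/news01.html instead of http://www.example.lk/news01.html.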

One thing that needs to be implemented is ignoring old articles. Most articles have a date embedded in them. Just need to add some logic to retrieve this and compare it to the current date.

HTML Parser

Currently I am using HTMLParser to parse the HTML files. It seems to be working OK, but it doesn't look like the best parser out there. Once in a while it throws an exception. Need to investigate obtaining a better HTML parser.

Subscription sites

I also ran into a few sites that require a subscription:

http://www.island.lk

http://www.dailymirror.lk

http://www.sundaytimes.lk

http://www.thesundayleader.lk/passport/login.aspx?ReturnUrl=%2findex.htm

This reduces my number of news sources. But it looks like the subscriptions are free of charge, so I might have a way of getting around this problem.

Non-English sites

One of the good things the web has produced is access to news from back home in the local languages, Sinhala and Tamil. However, this became a problem for me when I examined a few Sri Lankan news sites.

The following sites have Sinhala font:

http://www.divaina.com

http://www.ravaya.lk

http://www.lankadeepa.lk

http://www.lakbima.lk

http://www.silumina.lk

http://www.navaliya.com

http://www.lakehouse.lk/budusarana

The following sites have Tamil font:

http://www.virakesari.lk/20041101/index.asp

http://www.uthayan.com

http://www.thinakural.com/2004/November/01/Index.htm

I also noted that http://www.silumina.lk/ and http://www.lakehouse.lk/budusarana had their content in PDF files, maybe to prevent people like me from using their content. Those two sites are in Sinhala font right now, so it really doesn't make a difference to me currently.

So it looks like I will not be able to use these news sources for this project. Maybe in the future I will be able to include news articles from news sources in other languages as well. However, this seems like a long stretch.

Pet Projects

Monday, November 29, 2004

Fixed title gathering

Saturday, November 27, 2004

Stable version

Monday, November 15, 2004

More news sources

More date calculations added

Thursday, November 04, 2004

Date calculations working

Wednesday, November 03, 2004

Started to extract date

Tuesday, November 02, 2004

Stopwords

Monday, November 01, 2004

One more site added

HTML Parser

Subscription sites

Non-English sites
