Sunday, October 31, 2004

Project Kickoff

Officially kicking off the project to create a site that provides a gateway to the latest news from Sri Lanka. I have been thinking of doing this for a long time; it was just a matter of getting it started. Getting the ball rolling. It is modeled after news.google.com, except it's only for Sri Lankan news. Searching for "Sri Lanka" on news.google.com does not seem to turn up enough articles. I believe this is because Google does not crawl most of the Sri Lankan news sites, which are small and have low traffic. My objective is to bring all these sites together.

Similar to news.google.com, I want to provide many sources for the same news event through clustering techniques. This will be a bit challenging since there aren't too many online news sources on Sri Lanka. www.infolanka.com/news and http://www.lankapage.com provide similar gateways, except they only offer a single source per news event, and their selection seems subjective.

This project is based on my CS678 final project, which did the same thing except it covered all topics, not just Sri Lanka. It was a research project and hence only used a few popular online news sources such as www.cnn.com, www.bbc.co.uk/news, etc.

Started out by modifying the web crawler from the CS678 final project to read articles about Sri Lanka from the following sites:

http://www.bbc.co.uk/sinhala
http://www.lankatruth.com/index.htm
http://www.slbc.lk
http://www.asiantribune.com


The crawler is written in Java and seems to be working fine. I am trying to make the crawler gather links from only the starting search page, but my attempt at implementing this is giving a weird error. I want to only gather links from the starting page because I only want to consider current events (less than 3 days old), and I think this is a good way of doing that. Eventually I will have to add something to get the date from the news article itself.
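The "links from only the starting page" idea amounts to a crawl depth of 1: parse the start page's HTML for links, fetch those articles, and never recurse. A minimal sketch of the link-extraction step might look like this — the class and method names are my own, not from the actual CS678 crawler, and the fetching is assumed to happen elsewhere:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: gather links from only the starting page (crawl depth 1).
// The caller would fetch each extracted link's article but never recurse further.
public class StartPageLinks {

    // Matches href="..." attributes; crude but enough for a first pass.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Extract every href target from the HTML of the start page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```

Relative URLs (like `/news/story.htm`) would still need to be resolved against the start page's base URL before fetching.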

Haven't worked on the Indexer for some time. It seemed to be working fine the last time I checked, but I am trying to get it into a stable condition where I can release the API.

This all started when I was doing my CS678 final project and couldn't find a document indexer written in Java. I ended up using the Lemur toolkit instead, which is written in C++. Since I wanted all the components of my model to be in Java, I started implementing my own indexer a few weeks back.

I even signed up on SourceForge.net so I could publicly distribute it; I figured there might be other people in my position. But I need to get the Indexer in a good enough state to do that. Right now I am working on storing the index once the documents have been parsed and then accessing it later. I am also concerned about the data structures I have used: I don't know how they scale for large document collections, and that's one thing I haven't tested.
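The store-then-access step could be as simple as round-tripping an inverted index through Java serialization. This is only a sketch under my own naming — the real Indexer's API and on-disk format may well end up different — but it shows the shape of the problem:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: a tiny inverted index (term -> set of doc IDs) that can
// be written out with Java serialization and loaded back for lookups.
public class SimpleIndex implements Serializable {

    private static final long serialVersionUID = 1L;

    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Tokenize on non-word characters and record which document each term appears in.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    public TreeSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    // Serialize the whole index to bytes (in practice this would go to a file).
    public byte[] save() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(this);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a previously saved index back into memory.
    public static SimpleIndex load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SimpleIndex) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The scaling worry is real: a `HashMap` of `TreeSet`s and full-object serialization are fine for a few thousand articles, but a large collection would want sorted on-disk postings that don't all have to live in memory at once.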

Once I am able to successfully gather news articles from a few sites, I will use the indexer to index them. I will have to make sure that the indexer is pretty stable at that point. Then I will cluster the articles according to some distance function — most likely agglomerative clustering with cosine similarity as the distance measure.
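The distance function itself is straightforward: treat each article as a term-frequency vector and compute the cosine of the angle between vectors. Here is a rough sketch — plain term counts, no tf-idf weighting, and a tokenizer that is a placeholder for whatever the indexer ends up producing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the clustering distance: cosine similarity over
// simple term-frequency vectors. 1.0 means identical direction, 0.0 means
// no terms in common.
public class Cosine {

    // Turn raw text into a term-frequency map.
    public static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|)
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Agglomerative clustering would then repeatedly merge the two most similar clusters until no pair is above some threshold; two write-ups of the same event from different sites should score high, while unrelated stories score near zero.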

Assuming these provide satisfactory results, I will add more news sources to the web crawler. Hopefully it all works out and I can actually put the results of the system online so people can make use of it. I am not sure how that will go; I guess I will need a server somewhere that runs the system periodically and updates a website.

The goal is to build the site a little bit at a time each day!!! I do want to create a site with a lot of traffic, which would mean adding more features such as categorized links on Sri Lanka, chat rooms, forums, advertising, classifieds, polls, etc. If I could create a site that a lot of people used, that would be really cool.