Thursday, January 24, 2013

Could the crawler build a web graph without overhead?

Today, I completed my new submission to JCDL 2013 about building a new system to extract and build a temporal web graph for web archives. My focus was on optimization: the different stages took plenty of time and space. Add to this the large scale of web archives nowadays (the Internet Archive has reached 240B pages and about 5PB of data), and the problem becomes significant.
We divided the process into four stages; one of them, of course, is the extraction of the link structure from the mementos (archived web pages). While evaluating different HTML parsers, I came across the parser the Heritrix crawler uses to extract URIs. Heritrix is an open-source crawler built and maintained by the Internet Archive [Mohr 2004]. It is a remote-harvesting crawler that starts with seed URIs and extracts/discovers new URIs from the hyperlinks. That led me to the question.

Could the crawler build a web graph without overhead?
The short answer is yes. To build a web graph with a temporal dimension, we need two main features: 1) duplicate detection and 2) link extraction. Both are already part of the crawler's functionality.

The crawler visits pages and calculates a checksum for each one. Based on the checksum, it decides whether the page should be processed or whether it hasn't changed since the last visit. If the crawler decides to crawl the page, it extracts all the links inside it in order to discover new URIs, adding this list to a queue for crawling later.
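To make that flow concrete, here is a minimal sketch of such a loop. It is in Python rather than Heritrix's Java, and `fetch` and `extract_links` are hypothetical stand-ins for the crawler's download and HTML-parsing components, not actual Heritrix APIs:

```python
import hashlib
from collections import deque

def crawl(seed_uris, fetch, extract_links):
    """Toy crawl loop: checksum-based change detection plus link extraction."""
    frontier = deque(seed_uris)   # URIs waiting to be crawled
    seen_checksums = {}           # URI -> checksum of the last crawled content

    while frontier:
        uri = frontier.popleft()
        content = fetch(uri)                      # hypothetical downloader
        checksum = hashlib.sha1(content).hexdigest()

        # Duplicate detection: skip pages unchanged since the last visit.
        if seen_checksums.get(uri) == checksum:
            continue
        seen_checksums[uri] = checksum

        # Link extraction: discovered URIs go back into the crawling queue.
        outlinks = extract_links(content)         # hypothetical HTML parser
        frontier.extend(outlinks)
```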


The suggested update is to write the extracted list of links to a new repository, besides the crawling queue; that repository will be the source of the web graph. We could use a NoSQL database, with the checksum as the key and the outlinks for that content as the value. The repository could then feed other pipeline processes that use the links further.
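As a rough illustration of that idea (not a Heritrix patch), the link-extraction step above could also persist each page's outlinks into a key-value store keyed by the content checksum. Here `graph_store` is a hypothetical NoSQL client exposing a simple `put(key, value)` interface:

```python
import json
import time

def record_outlinks(graph_store, uri, checksum, outlinks):
    """Persist extracted outlinks alongside the crawl, keyed by content checksum.
    `graph_store` is a hypothetical key-value / NoSQL client."""
    value = {
        "uri": uri,
        "crawl_time": time.time(),   # gives the web graph its temporal dimension
        "outlinks": outlinks,
    }
    graph_store.put(checksum, json.dumps(value))

# Inside the crawl loop, right after link extraction:
#     record_outlinks(graph_store, uri, checksum, outlinks)
# Later pipeline stages can read this repository to assemble the temporal web graph.
```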

I plan to give it a try and measure the performance overhead of this step. Of course, it would help only with newly crawled material; we still have to solve the problem for the previously archived Web.