We divided the process into four stages, one of which is the extraction of the link structure from the mementos (archived web pages). While evaluating different HTML parsers, I came across the parser that the Heritrix crawler uses to extract URIs. Heritrix is an open-source crawler built and maintained by the Internet Archive [Mohr 2004]. It is a remote-harvesting crawler that starts with seed URIs and extracts/discovers new URIs from the hyperlinks. That led me to the question:
Could the crawler build the web graph without overhead?
The short answer is yes. To build the web graph with a temporal dimension, we need two main features: 1) duplicate detection, and 2) link extraction. Both are already part of the crawler's functionality. The suggested update is to write the extracted link list to a new repository, alongside the crawling queue, and that repository will be the source of the web graph. We could use a NoSQL database, with the content check-sum as the key and the outlinks of that content as the value. The repository could also feed other pipeline processes that use the links for further analysis.
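Here is a minimal sketch of the idea, in Python rather than Heritrix's Java: extract the outlinks from a fetched page and write them to a key-value repository keyed by the content check-sum, skipping content that was already seen. The function names, the shelve-backed store (standing in for a NoSQL database), and the example URIs are illustrative placeholders, not Heritrix APIs.

```python
import hashlib
import shelve
from html.parser import HTMLParser
from urllib.parse import urljoin

class OutlinkExtractor(HTMLParser):
    """Collect absolute outlink URIs from anchor tags in the page HTML."""
    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.outlinks.append(urljoin(self.base_uri, value))

def store_outlinks(repo_path, uri, content):
    """Key the outlink list by the content check-sum; skip duplicates.

    Returns the outlinks on a first sighting, or None when the digest
    was already in the repository (duplicate content).
    """
    digest = hashlib.sha1(content).hexdigest()      # check-sum as the key
    with shelve.open(repo_path) as repo:            # stand-in for a NoSQL store
        if digest in repo:
            return None                             # duplicate detection
        extractor = OutlinkExtractor(uri)
        extractor.feed(content.decode("utf-8", errors="replace"))
        repo[digest] = extractor.outlinks           # value: outlinks of this content
        return extractor.outlinks

# Example: two captures with identical content yield a single repository entry.
html = b'<a href="/about">About</a> <a href="http://other.org/">Other</a>'
print(store_outlinks("link_repo", "http://example.com/", html))
print(store_outlinks("link_repo", "http://example.com/", html))  # None: duplicate
```

In a real crawl this write would happen in the link-extraction step the crawler already performs, so the only extra cost is the repository write itself.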
I plan to give it a try and measure the performance overhead of this step. Of course, it would help only with newly crawled material; we still have to solve the problem for the previously archived Web.