Wednesday, August 27, 2014

Killing Hung Threads in Heritrix

Since I started working with Heritrix, I have faced the problem of hanging threads. Sometimes a job is running but not doing anything. The main question that comes to mind is how to stop/kill these threads without affecting the rest of the crawl.

In this post, I will list the steps I used to enable jdb on Heritrix and to use it to kill a specific thread.

  1. You need to modify the bin/heritrix script to enable jdb attachment.
  2. JAVA_OPTS="${JAVA_OPTS} -Xmx256m -Xrunjdwp:transport=dt_socket,address=50100,server=y,suspend=n"
  3. Run Heritrix
  4. bin/heritrix -a username:password
  5. Run jdb
  6. jdb -attach 50100
  7. List the available thread groups with the threadgroups command, list the threads of a specific group with threads <group>, or list all available threads with the threads command.
  8. threads
  9. Select the thread that you want to kill. This is the tricky part: you need to look at the Heritrix web interface and the thread names to determine which thread should be killed.
  10. Example: (org.archive.crawler.framework.ToeThread)0xe4b ToeThread #24: http://humsci.stanford.edu/robots.txt running
  11. Kill the thread from jdb by doing the following steps, based on this blog post.
  12. Four steps
    • thread <thread_id>
    • suspend <thread_id>
    • step
    • kill <thread_id> new java.lang.Exception()
    > thread 0xe4b
    ToeThread #24: [1] suspend 0xe4b
    ToeThread #24: [1] step
    > Step completed: "thread=ToeThread #24: ", org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(), line=729 bci=492
    ToeThread #24: [1] kill 0xe4b new java.lang.Exception()
    killing thread: ToeThread #24:
    ToeThread #24: [1] instance of org.archive.crawler.framework.ToeThread(name='ToeThread #24: ', id=3659) killed

  13. If you go to the job web interface, you will find the following line in the crawl log.

  14. 2014-08-27T01:31:42.982Z SEVERE Fatal exception in ToeThread #24: (in thread 'ToeThread #24: ')
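The four jdb commands above can also be scripted instead of typed interactively. Below is a minimal sketch, assuming the thread id (e.g. 0xe4b) has already been read off the threads listing; JdbKillScript and kill.jdb are illustrative names of mine, not part of Heritrix or jdb:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch: emit the four jdb commands for a given thread id so they can be
// piped into a jdb session attached to Heritrix (jdb -attach 50100 < kill.jdb).
public class JdbKillScript {

    public static String commandsFor(String threadId) {
        return "thread " + threadId + "\n"
             + "suspend " + threadId + "\n"
             + "step\n"
             + "kill " + threadId + " new java.lang.Exception()\n"
             + "quit\n";
    }

    public static void main(String[] args) throws IOException {
        String threadId = args.length > 0 ? args[0] : "0xe4b";
        try (PrintWriter out = new PrintWriter(new FileWriter("kill.jdb"))) {
            out.print(commandsFor(threadId));
        }
    }
}
```

Usage would be `java JdbKillScript 0xe4b && jdb -attach 50100 < kill.jdb`, with the thread id taken from the `threads` output as described above.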

Thursday, June 5, 2014

IIPC GA 2014 - Open Day report

In beautiful Paris, the Bibliothèque nationale de France hosted the 2014 General Assembly of the International Internet Preservation Consortium. The first day, May 19th 2014, was the open day, which brought together IIPC members and a general audience, mostly researchers. The theme of the day was "Building Modern Research Corpora: the Evolution of Web Archiving and Analytics".
Personally, it was my first GA as a web archiving engineer rather than as a researcher or PhD student. It gave me the opportunity to understand the needs of researchers and the limitations faced by archivists. I can summarize the day in a few remarks. First, the Wayback Machine is not enough for researchers; they need a richer interface, both in the size of the results and in the ability to filter, e.g., by mimetype. Second, researchers need complete datasets; storage such as AWS, as used by Common Crawl or the Web Observatory, may be a good place for sharing the data. Third, computer scientists and technologists can't work alone in this area; they need feedback from librarians, historians, and digital humanities researchers to drive the technology development. Now, let's go through the details of the day.

The keynote speaker for the morning session was Prof. Dame Wendy Hall from the University of Southampton. Her presentation was entitled "The role of the Web Observatory in web archiving and analytics". Wendy began by explaining that Web Science focuses on the web in general and spans different disciplines beyond the technologists. She then described the Web Observatory initiative, which aims at a global partnership for sharing datasets, observations, and tools among researchers. But sharing datasets has its challenges; for example, a dataset may be open/closed and public/private. The technical challenges include how to store, share, and access this data; describing a dataset may be solved using Schema.org. More details about the initiative can be found in The Southampton University Web Observatory paper, which appeared in ACM Web Science 2013.

The next session was entitled "Web archives needs and projects across time and social sciences: topical studies" and had two presentations. The first was "The future of the digital heritage of WWI", presented by Valérie Beaudouin from Telecom ParisTech and Philippe Chevallier from BnF. They identified three problems in saving the digital objects related to WWI: epistemological, how to constitute a corpus of websites; technical, how to collect these websites; and juridical, the legal framework for collecting these sites. Web archiving was the solution; it helped them select, build, and explore the corpus of WWI websites. They then demonstrated this solution with two cases: "FORUM, a space of WWI discussion", an open discussion about the war that also hosts shared digital objects about it, and the "Valois albums", a set of 539 albums about the front line of WWI; putting them online will open new ways to share the data with a larger audience. The second presentation was "Exploring French language song in the web archive" by Dominic Forest from the University of Montreal. Forest discussed "Project IMAGE", which indexed music, lyrics, and other metadata on a large scale. Lyrics are used as a description of the songs, and the research question here is how to benefit from the music archives to mine the songs. Song metadata such as artist, year, and recording label may be useful for describing the songs, as it enriches the mining and machine learning tools.

After that we had a panel about "Scholarly use and issues of web archives", where the panelists shared their ideas on the topic. Valérie Schafer from CNRS presented "Internet historians and the “risks” of web archives". She mentioned that most users access web archives through the Wayback Machine, but this model, which was good for 1996, should be extended beyond one document at a time. Researchers can build collections by searching the archives, and the archives should provide services to filter the results on non-textual elements such as geolocation, date ranges, crawl dates, and the link graph.
Jean-Marc Francony from the University of Grenoble presented "Archive 1.0 to Archive 2.0". Archive 1.0 started with elections on the web, because candidates showed more interest in the web during their campaigns, and it depends on the Archive-It service to collect the data. Now they are working on Archive 2.0, which will depend on big data techniques. Anthony Cocciolo from Pratt Institute presented "Web archives of deleted youth". He watches companies that are about to shut down (a death watch) and connects them to the Internet Archive so their sites can be collected; such shutdowns affect a large portion of young people, who get depressed over losing their data. Anat Ben-David from the University of Amsterdam presented "Web Archive search as research". She spoke about one of the tools in the WebART project, called WebARTIST, a web archive search engine capable of searching, filtering, accessing, and exporting the data. In this research, they found that 10% of the pages in the collection are not part of the selection policy and were archived unintentionally.

After lunch, we had one more keynote, by Wolfgang Nejdl from the University of Hannover, entitled "Temporal retrieval, exploration and analytics in web archives". Nejdl discussed the ALEXANDRIA project, which aims to develop the models, tools, and techniques necessary to archive, index, retrieve, and explore the web. The project has nine research questions, such as: Q1, how to link web archive content against multiple entity and event collections evolving over time; Q4, how to aggregate social media streams for archiving; and Q6, how to improve result ranking and clustering for time-sensitive and entity-based queries.

The next session was entitled "Large scale solutions and initiatives". Lisa Green, Jordan Mendelson, and Kurt Bollacker from Common Crawl presented "Common Crawl: enabling machine-scale analysis of web data". Common Crawl is a non-profit organization that does periodic crawls of the web and makes them available to researchers via Amazon AWS. They started with the observation that web archive analytics is moving from the human scale of accessing one page through the Wayback Machine to the machine scale of accessing millions of pages using clouds. They then gave some examples of successful projects built on top of the Common Crawl corpus: Web Data Commons, Extracting Structured Data from the Common Crawl, WikiEntities, In What Context Is a Term Referenced?, Measuring the impact of Google Analytics, Data Publica: Finding French Open Data, and Improved Spell Checking. Robert Meusel from the University of Mannheim gave more details about the Web Data Commons project in his presentation "Mining a large web corpus". Web Data Commons builds on the Common Crawl corpus on AWS, where they adapted the extraction framework by modifying only one class. Meusel gave examples of the extracted datasets: a hyperlink graph of 3.5 billion pages connected by over 128 billion links; structured data, where 585 million of the 2.2 billion pages (26.3%) contain Microformat, Microdata, or RDFa data; and HTML tables, with 14B raw tables of which 154M (1.1%) are "good" relations.
The next presenter was Masayuki Asahara from the National Institute for Japanese Language and Linguistics, with the presentation "Web-based ultra large scale corpus", where they are compiling a ten-billion-word corpus of web texts for linguistic research. The project depends on four basic technologies: page collection, done by remote harvesting with Heritrix; linguistic annotation, done by applying character normalization, Japanese morphological analysis, and word segmentation; releasing the corpus by making it publicly available; and preserving the data in WARC format.
The last presentation in the session was "From web archiving services to web scale data processing" by Chloé Martin from Internet Memory Research. The goal is to make web data available for new business models by developing new technologies around a high-performance, large-scale crawler. The system depends on a large-scale crawler running on a scalable platform (big data components) and a set of analytics tools. Martin mentioned some interesting projects: ArchiveThe.Net, a tool to collect and preserve web material for historical, cultural, or heritage purposes; Mignify, a web data processing platform supported by a crawl-on-demand service; and Newstretto, an app to search and select your favorite news based on keywords.

The next session was "Building/integrating collections and tools". It started with a presentation from Helen Hockx-Yu from the British Library about building a national collection of the historical UK web. Helen divided the interaction with web archives into three categories: archive-driven, initiated by archival institutions; scholar-driven, initiated by scholars with specific research interests; and project-based, which varies in scope and scale. Scholarly interaction starts with building collections and then formulating research questions around this data. Helen defined the "go-to" state of scholarly interaction as independent use of web archives focused on scholarly requirements, which needs clear interfaces with open access; in general, web archives should not be the bottleneck for scholarly access because of their limited resources. Helen presented the UK websites datasets supported by JISC, divided into two sets: the 1990-2010 set with 30 TB and the 2010-2013 set with 27.5 TB. Some completed work on these datasets includes Visualising links (to and from bl.uk), Mapping the UK Webspace: Fifteen Years of British Universities on the Web, and The Three Truths of Margaret Thatcher: Creating and Analysing.
The next presentation was "Data analysis and resource discovery" from Tom Storrar and Claire Newing of the UK Government Web Archive (UKGWA). At UKGWA, they ran user surveys to determine who their users are and how they use the web archive. The surveys showed that UKGWA is visited by various types of users, and they found a limitation in the search function: it may suffer from duplicates in the results and relies on keyword matching. So UKGWA developed a semantic search service where they can model facts and define linkage between entities through time. The semantic search service comes with machine-readable APIs, and the Semantic Knowledge Base is now a dataset in itself.
After that, Martin Klein from Los Alamos National Laboratory presented "HIBERLINK: quantifying and addressing link rot in scholarly communications". Klein described Hiberlink, which aims to protect scholarly documents from reference rot, divided into link rot (the page can no longer be found, e.g., a 404) and content drift (the content changed over time). Hiberlink addresses reference rot by augmenting HTTP links with a temporal context that indicates the intended access time, along with a link to the memento closest to that time.
The last presentation in this session was "Proprioception: a toolkit for web archive datamining" by David Rapin from INA and Sophie Gebeil from Maison méditerranéenne des sciences de l'homme. They are studying the history of memories using web archives, moving beyond the traditional methods of libraries. To do that, they need a few filters for the data, a fast QA feedback loop, and readable results. They used Pig Latin for data analysis, but it was not enough. Their system depends on 5 numerical indicators (e.g., unique URL count and average size), 7 filter types (e.g., month and top-level domain), precomputation of the data using MapReduce and Impala, and an easy-to-understand interface for the users.

The closing remarks for the day were presented by Niels Brügger from Aarhus University. Niels divided research on the archived web into four stages. First, corpus creation, which includes searching for the seed URIs, removing duplicates, selecting the best candidates, and isolating them from the web archive for further processing. The second stage is analysis, which needs analytical tools as well as visualization techniques to help understand the data. The third stage is dissemination, where legal issues challenge the researcher's ability to publish the results. Finally, storage is a challenge: ensuring the long-term preservation of the data and providing suitable access methods to users. Niels presented two ongoing projects: "Probing a nation's web sphere", which aims to study the historical development of the Danish national web, and RESAW, a research infrastructure for the study of archived web materials.

 The complete list of presentations for the open day, members workshops, and the open workshops is available under IIPC GA 2014 presentations.

Tuesday, October 8, 2013

Power Regression in R

The power equation has the general form y = bX^m. I found a good example of calculating power regression using Excel, but I couldn't find a direct method for power regression in R. In this post, I will apply the same technique in R.
X      Y
0.021  505
0.062  182
0.202  55.3
0.523  22.2
1.008  11.3
3.320  4.17
7.290  1.75

The basic idea is to run a linear regression between log(X) and log(Y), then map the coefficients back to m and b: the slope is m, and b = 10^intercept. The code in R could be as follows:
x = c(0.021, 0.062, 0.202, 0.523, 1.008, 3.320, 7.290)
y = c(505, 182, 55.3, 22.2, 11.3, 4.17, 1.75)
# Regress log10(y) on log10(x): log10(y) = log10(b) + m*log10(x).
# (Note the direction: regressing log10(x) on log10(y) would swap the roles
# of the coefficients.)
fit = lm(log10(y) ~ log10(x))
b = 10^fit$coefficients[1]  # intercept maps to b
m = fit$coefficients[2]     # slope is m
plot(x, y, type="p", col="black")
lines(x, b * x^m, type="l", col="red")
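The same fit can also be done outside R. Here is a minimal Java sketch (the class and method names are mine, not from any library) that computes m and b by ordinary least squares on the log-log data:

```java
// Sketch: fit y = b * x^m by linear regression on (log10 x, log10 y).
public class PowerFit {

    // Returns {m, b} for the model y = b * x^m.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            double lx = Math.log10(x[i]);
            double ly = Math.log10(y[i]);
            sx += lx; sy += ly;
            sxx += lx * lx;
            sxy += lx * ly;
        }
        // Standard least-squares slope and intercept on the log-log points.
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - m * sx) / n;
        return new double[] { m, Math.pow(10, intercept) };
    }

    public static void main(String[] args) {
        double[] x = {0.021, 0.062, 0.202, 0.523, 1.008, 3.320, 7.290};
        double[] y = {505, 182, 55.3, 22.2, 11.3, 4.17, 1.75};
        double[] mb = fit(x, y);
        System.out.printf("m = %.3f, b = %.3f%n", mb[0], mb[1]);
    }
}
```

For the table above this gives roughly m ≈ -0.96 and b ≈ 12, i.e. y ≈ 12·x^-0.96.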

Sunday, September 29, 2013

JCDL 2013 Minute Madness

On July 24, 2013, I attended the poster session of the JCDL 2013 conference in Indianapolis, IN. In this session, I had a poster entitled "ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph" (also available as a tech report, arXiv:1305.5959). ArcLink is an extension to the Wayback Machine that can extract, preserve, and deliver the web graph for web archived data. ArcLink efficiently extracts the web graph (also known as the link structure) from archive data in WARC or WAT format using Hadoop and Pig Latin. Preservation relies on a Cassandra database, and the delivery phase uses Java web services to deliver the web graph in RDF/XML format. The ArcLink code is available on Google Code.

JCDL has its own convention for the Minute Madness: every poster author has 60 seconds to give the audience an idea about the poster and attract people to come and see it. These 60 seconds are open for madness. My minute madness idea came from the relation between the Wayback Machine and ArcLink: the Wayback Machine is the original tool and ArcLink is just an extension. As my son, Yousof, was with me at the conference, we prepared two t-shirts, one printed with the Wayback Machine logo and the other with the ArcLink logo. I shared the t-shirts with Yousof and we were ready for the madness.



I had the chance to present this work at IIPC GA 2013 in Ljubljana, Slovenia. The presentation video is also available on YouTube.

Sunday, March 3, 2013

Sample from DMOZ by Language

Open Directory (DMOZ) is an open directory of URIs. Sampling from DMOZ is a common technique in web research. In this post, I will describe how to extract a sample set per language instead of sampling randomly; for example, a sample set of URIs in Arabic.
  1. You need to export the open directory in XML/RDF format from the Open Directory RDF Dump (you should select content.rdf.u8.gz).
  2. Unzip the file: gzip -d content.rdf.u8.gz
  3. You may need to work around some bad encoding: sed -e 's/&#/#/g' content.rdf.u8 > content.rdf.u8.clean
  4. Using the attached code, you can extract all the languages or just your favorite one.
  • Usage: java DMOZSAXParser <dmoz in RDF> [Language]
  • Export Arabic URIs: java DMOZSAXParser content.rdf.u8.clean Arabic
  • Export All Languages: java DMOZSAXParser content.rdf.u8.clean 
Tip: to get the exact language name, copy the name as it appears in DMOZ World.

Then, we can use the Language Detection Library for Java from Cybozu Labs to check the accuracy of the extracted URI list.


import java.io.IOException;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

public class DMOZSAXParser extends DefaultHandler {

    private String curLanguage = null;
    private String templink = null;
    private String requiredLang = null;

    private void parseDocument(String fileName, String requiredLang) {
        this.requiredLang = requiredLang;
        SAXParserFactory spf = SAXParserFactory.newInstance();
        try {
            SAXParser sp = spf.newSAXParser();
            sp.parse(fileName, this);
        } catch (SAXException se) {
            se.printStackTrace();
        } catch (ParserConfigurationException pce) {
            pce.printStackTrace();
        } catch (IOException ie) {
            ie.printStackTrace();
        }
    }

    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equalsIgnoreCase("Topic")) {
            // Topic ids look like "Top/World/<Language>/..."; extract the language part.
            String category = attributes.getValue(0);
            if (category != null && category.startsWith("Top/World") && category.indexOf("/", 10) > 10) {
                String curVal = category.substring(10, category.indexOf("/", 10));
                if (requiredLang == null || requiredLang.equalsIgnoreCase(curVal)) {
                    curLanguage = curVal;
                }
            }
        } else if (qName.equalsIgnoreCase("link")) {
            templink = attributes.getValue(0);
        }
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (curLanguage != null) {
            if (qName.equalsIgnoreCase("Topic")) {
                curLanguage = null;
            } else if (qName.equalsIgnoreCase("link")) {
                System.out.println(curLanguage + "\t" + templink);
            }
        }
    }

    public static void main(String[] args) {
        DMOZSAXParser spe = new DMOZSAXParser();
        String lang = null;
        if (args.length > 1) {
            lang = args[1];
        }
        spe.parseDocument(args[0], lang);
    }
}

Thursday, February 7, 2013

"Digging into Data" Challenge Data Repositories Review

On Feb. 5, 2013, Round 3 of the Digging into Data Challenge was announced. The challenge invites researchers to explore and gain new insights from data, covering a wide range of areas such as social sciences, archival information, and libraries. The application deadline is May 15, 2013.

I reviewed the datasets and noted the following points:
  1. The list of datasets didn't cover any web materials; although it had a wide variety of born-digital materials, it didn't cover web pages. Even the web archives that participated in this competition preferred to join with other materials. For example, the Internet Archive, the largest and oldest web archive, participated with its films and books collections.
  2. In addition to web page collections being ignored, social media didn't appear either. The Library of Congress preferred to participate with the "Chronicling America" newspaper collection instead of its valuable Twitter archive; I'm not sure if there is a copyright limitation behind that.
  3. Most of the datasets have API access points, ranging across OAI-PMH, XML, and REST services.
In the attached table, I tried to summarize the type of data in each collection. I may revisit the table to add more information about the nature of the data (e.g., biomedical, historical, technical).
 
Data objects
Images Book text videos Newspaper Code projects biography Numeric Maps Metadata
The Archaeology Data Service (ADS) x
ARTstor x
Biodiversity Heritage Library x
The Centre for Contemporary Canadian Art Canadian Art Database Project x x x x
Chronicling America Library of Congress National Digital Newspaper Program x
Data-PASS x
The Digital Archaeological Record (tDAR) x x x
Digital Library for Earth System Education (DLESE) x
Early Canadiana Online x
FLOSSmole x
English Broadside Ballad Archive (EBBA) x
Great War Primary Documents Archive x x
Harvard Time Series Center (TSC) x
HathiTrust x
The History Data Service (HDS) x x
Infochimps.org x
Internet Archive x x
Inter-university Consortium for Political and Social Research x x
JISC MediaHub x x x x
JSTOR x
Marriott Library- University of Utah x x x x
Smithsonian/NASA Astrophysics Data System (ADS) x
National Archives, London x
National Library of Medicine (NLM) x x x
The National Library of Wales x x x x x x x
National Science Digital Library (NSDL) x
National Technical Information Service (NTIS) x
Nebraska Digital Newspaper Project x
New York Public Library x
The New York Times Article Search API x
Opening History x
PhilPapers x
Project MUSE x
PSLC DataShop
Scholarly Database at the Cyberinfrastructure for Network Science Center, Indiana University
ScholarSpace at the University of Hawai'i at Manoa x
Statistical Accounts of Scotland x x
University of Florida Digital Library Center x x x x x
University of North Texas x x x x

Thursday, January 24, 2013

Could the crawler build the web graph without overhead?

Today I completed my new submission to JCDL 2013 about building a new system to extract and build the temporal web graph for web archives. My focus was on optimization: the different stages took plenty of time and space. Add to this the large scale of web archives nowadays (IA has reached 240B pages and 5 PB) and the problem becomes significant.
We divided the process into four stages; one of them, of course, is the extraction of the link structure from the mementos (archived web pages). While evaluating different HTML parsers, I came across the Heritrix crawler's parser for extracting URIs. Heritrix is an open source crawler built and maintained by the Internet Archive [Mohr 2004]. It is a remote harvesting crawler that starts with seed URIs and extracts/discovers new URIs from the hyperlinks. Then I came to the question:

Could the crawler build the web graph without overhead?
The short answer is yes. To build the web graph with a temporal dimension, we need two main features: 1) duplicate detection, and 2) link extraction. Both are already part of the crawler's functionality.

The crawler visits the pages and calculates a checksum for each page. Based on the checksum, it decides whether the page should be crawled or whether it hasn't changed since the last visit. If the crawler decides to crawl the page, it extracts all the links inside the page in order to discover new URIs, adding this new list to a queue for later crawling.


The suggested update is to write the extracted link list to a new repository, besides the crawling queue, which will be the source of the web graph. We could use a NoSQL database, with the checksum as the key and the outlinks of that content as the value. The repository could then feed other pipeline processes that use the links for further processing.
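The idea can be sketched in a few lines of Java. This is a minimal sketch with illustrative names of mine (LinkRepository, record), not Heritrix code, and a HashMap stands in for the NoSQL store:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed side-channel: alongside the crawl queue, store the
// outlinks of each fetched page keyed by its content checksum.
public class LinkRepository {

    private final Map<String, List<String>> outlinksByChecksum = new HashMap<>();

    // Called when the crawler has computed the checksum and extracted the links.
    // Returns true if the content was new; identical content (same checksum)
    // is stored only once, which gives us the duplicate detection for free.
    public boolean record(String checksum, List<String> outlinks) {
        return outlinksByChecksum.putIfAbsent(checksum, outlinks) == null;
    }

    // Later pipeline stages read the outlinks back by checksum to build the graph.
    public List<String> outlinks(String checksum) {
        return outlinksByChecksum.getOrDefault(checksum, Collections.emptyList());
    }
}
```

In a real deployment the map would be replaced by a NoSQL store (e.g., Cassandra, as used elsewhere in ArcLink), and record() would be called at the point where the crawler has already decided the content is worth fetching.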

I plan to give it a try and measure the performance overhead of this step. Of course, it would help only with newly crawled material; we still have to solve the problem for the previously archived web.