Tuesday, October 8, 2013

Power Regression in R

The power equation has the general form y = b·X^m. I found a good example of calculating a power regression using Excel, but I couldn't find a direct method for it in R. In this post, I will explain the same technique using R.
X        Y
0.021    505
0.062    182
0.202    55.3
0.523    22.2
1.008    11.3
3.320    4.17
7.290    1.75

The basic idea is to fit a linear regression between log(X) and log(Y), then map the slope to the exponent m and the intercept to log10(b). The R code could be as follows:
x = c(0.021, 0.062, 0.202, 0.523, 1.008, 3.320, 7.290)
y = c(505, 182, 55.3, 22.2, 11.3, 4.17, 1.75)

# Fit log10(y) as a linear function of log10(x):
#   log10(y) = log10(b) + m * log10(x)
fit = lm(log10(y) ~ log10(x))
fit
# The slope of the fit is the exponent m, and 10^(intercept) is the coefficient b
# (for this data set, roughly m ≈ -0.96 and b ≈ 12).

plot(x, y, type = "p", col = "black")
lines(x, 10^fit$coefficients[1] * x^fit$coefficients[2], type = "l", col = "red")

Sunday, September 29, 2013

JCDL 2013 Minute Madness

On July 24, 2013, I attended the poster session at the JCDL 2013 conference in Indianapolis, IN. In this session, I had a poster entitled "ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph" (the poster is also available as a tech report on arXiv:1305.5959). ArcLink is an extension to the Wayback Machine that can extract, preserve, and deliver the web graph for web-archived data. ArcLink efficiently extracts the web graph (also known as the link structure) from the archived data, which may be in WARC or WAT format, using Hadoop and Pig Latin. The preservation depends on a Cassandra database. The delivery phase uses Java web services to deliver the web graph in RDF/XML format. The ArcLink code is available on Google Code.

JCDL has its own convention for the Minute Madness: every poster's author has 60 seconds to give the audience an idea about the poster in order to attract people to come and see it. These 60 seconds are open for madness. My Minute Madness idea came from the relation between the Wayback Machine and ArcLink: the Wayback Machine is the original tool and ArcLink is just an extension. As my son, Yousof, was with me at the conference, we prepared two t-shirts: one printed with the Wayback Machine logo and the other with the ArcLink logo. I shared the t-shirts with Yousof and we were ready for the madness.



I had the chance to present this work at IIPC GA 2013 in Ljubljana, Slovenia. The presentation video is also available on YouTube.

Sunday, March 3, 2013

Sample from DMOZ by Language

The Open Directory (DMOZ) is an open directory of URIs. Sampling from DMOZ is a common sampling technique in web research. In this post, I will describe how to extract a sample set per language instead of sampling randomly. For example, suppose I need a sample set of URIs in Arabic.
  1. You need to export the open directory in XML/RDF format from the Open Directory RDF Dump (you should select content.rdf.u8.gz).
  2. Unzip the file: gzip -d content.rdf.u8.gz
  3. You may need to clean up some bad encoding: sed -e 's/&#/#/g' content.rdf.u8 > content.rdf.u8.clean
  4. Using the attached code, you can extract all the languages or just your favorite one.
  • Usage: java DMOZSAXParser <dmoz in RDF> [Language]
  • Export Arabic URIs: java DMOZSAXParser content.rdf.u8.clean Arabic
  • Export All Languages: java DMOZSAXParser content.rdf.u8.clean 
Tip: to get the exact language name, copy the name as it appears in DMOZ World.

Then, we can use the Language Detection Library for Java from Cybozu Labs to determine the accuracy of the extracted URI list.
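For a quick check, the library can be pointed at the text of each sampled page and asked for the most likely language. Below is a minimal sketch of that idea; the "profiles" directory path, the class name, and the sample text are placeholders for illustration, not part of the DMOZ parser.

import com.cybozu.labs.langdetect.Detector;
import com.cybozu.labs.langdetect.DetectorFactory;
import com.cybozu.labs.langdetect.LangDetectException;

// Minimal sketch: verify the language of a sampled page with the
// Cybozu Labs language-detect library. The profile path is an assumption.
public class LanguageCheck {
    public static void main(String[] args) throws LangDetectException {
        // Load the language profiles shipped with the library (assumed local path).
        DetectorFactory.loadProfile("profiles");

        Detector detector = DetectorFactory.create();
        detector.append("text extracted from one of the sampled URIs");
        // Prints an ISO 639-1 code, e.g., "ar" for Arabic.
        System.out.println(detector.detect());
    }
}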


import java.io.IOException;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

public class DMOZSAXParser extends DefaultHandler {

    private String curLanguage = null;
    private String templink = null;
    private String requiredLang = null;

    private void parseDocument(String fileName, String requiredLang) {
        this.requiredLang = requiredLang;
        SAXParserFactory spf = SAXParserFactory.newInstance();
        try {
            SAXParser sp = spf.newSAXParser();
            sp.parse(fileName, this);
        } catch (SAXException se) {
            se.printStackTrace();
        } catch (ParserConfigurationException pce) {
            pce.printStackTrace();
        } catch (IOException ie) {
            ie.printStackTrace();
        }
    }

    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (qName.equalsIgnoreCase("Topic")) {
            // Topics under "Top/World/<Language>/..." carry the language name.
            String category = attributes.getValue(0);
            if (category != null && category.startsWith("Top/World") && category.indexOf("/", 10) > 10) {
                String curVal = category.substring(10, category.indexOf("/", 10));
                if (requiredLang == null || requiredLang.equalsIgnoreCase(curVal)) {
                    curLanguage = curVal;
                }
            }
        } else if (qName.equalsIgnoreCase("link")) {
            templink = attributes.getValue(0);
        }
    }

    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (curLanguage != null) {
            if (qName.equalsIgnoreCase("Topic")) {
                curLanguage = null;
            } else if (qName.equalsIgnoreCase("link")) {
                // Print "<language>\t<URI>" for every link inside a matching topic.
                System.out.println(curLanguage + "\t" + templink);
            }
        }
    }

    public static void main(String[] args) {
        DMOZSAXParser spe = new DMOZSAXParser();
        String lang = null;
        if (args.length > 1) {
            lang = args[1];
        }
        spe.parseDocument(args[0], lang);
    }
}

Thursday, February 7, 2013

"Digging into Data" Challenge Data Repositories Review

On Feb. 5, 2013, Round 3 of the Digging into Data Challenge was announced. The challenge invites researchers to explore data and gain new insights from it. Digging into Data covers a wide range of areas, such as social sciences, archival information, and libraries. The application deadline is May 15, 2013.

I reviewed the datasets and noted the following points:
  1. The list of datasets didn't cover any web materials. Although it included a wide variety of born-digital materials, it didn't cover web pages. Even the web archives that participated in this competition preferred to join with other materials. For example, the Internet Archive, the largest and oldest web archive, participated with its Films and Books collections.
  2. In addition to the absence of web page collections, social media didn't appear either. The Library of Congress preferred to participate with its "Chronicling America" newspaper collection instead of its valuable Twitter archive collection. I'm not sure if there is a copyright limitation behind that.
  3. Most of the datasets have API access points, ranging across OAI-PMH, XML, and REST services.
In the attached table, I tried to summarize the type of data in each collection. I may revisit the table to add more information about the nature of the data (e.g., biomedical, history, technical).
 
Data objects: Images | Book text | Videos | Newspaper | Code projects | Biography | Numeric | Maps | Metadata
The Archaeology Data Service (ADS) x
ARTstor x
Biodiversity Heritage Library x
The Centre for Contemporary Canadian Art Canadian Art Database Project x x x x
Chronicling America Library of Congress National Digital Newspaper Program x
Data-PASS x
The Digital Archaeological Record (tDAR) x x x
Digital Library for Earth System Education (DLESE) x
Early Canadiana Online x
FLOSSmole x
English Broadside Ballad Archive (EBBA) x
Great War Primary Documents Archive x x
Harvard Time Series Center (TSC) x
HathiTrust x
The History Data Service (HDS) x x
Infochimps.org x
Internet Archive x x
Inter-university Consortium for Political and Social Research x x
JISC MediaHub x x x x
JSTOR x
Marriott Library- University of Utah x x x x
Smithsonian/NASA Astrophysics Data System (ADS) x
National Archives, London x
National Library of Medicine (NLM) x x x
The National Library of Wales x x x x x x x
National Science Digital Library (NSDL) x
National Technical Information Service (NTIS) x
Nebraska Digital Newspaper Project x
New York Public Library x
The New York Times Article Search API x
Opening History x
PhilPapers x
Project MUSE x
PSLC DataShop
Scholarly Database at the Cyberinfrastructure for Network Science Center, Indiana University
ScholarSpace at the University of Hawai'i at Manoa x
Statistical Accounts of Scotland x x
University of Florida Digital Library Center x x x x x
University of North Texas x x x x

Thursday, January 24, 2013

Could the crawler build the web graph without overhead?

Today, I completed my new submission to JCDL 2013 about building a new system to extract and build the temporal web graph for web archives. My focus was on optimization: the different stages took plenty of time and space. Add to this the large scale of web archives nowadays (IA has reached 240B pages and 5 PB), and the problem becomes significant.
We divided the process into four stages; one of them, of course, is the extraction of the link structure from the mementos (archived web pages). During the evaluation of different HTML parsers, I came across the Heritrix crawler's parser for extracting URIs. Heritrix is an open source crawler that has been built and maintained by the Internet Archive [Mohr 2004]. Heritrix is a remote harvesting crawler that starts with seed URIs and extracts/discovers new URIs from the hyperlinks. Then, I came to the question.

Could the crawler build the web graph without overhead?
The short answer is yes. To build the web graph with a temporal dimension, we need two main features: 1) duplicate detection, and 2) link extraction. Both of them are already part of the crawler's functionality.

The crawler visits the pages and calculates a check-sum for each page. Based on the check-sum, it decides whether the page should be crawled again or it hasn't changed since the last visit. If the crawler decides to crawl the page, it extracts all the links inside the page in order to discover new URIs, adding this new list to a queue for crawling later.
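As a rough illustration of the duplicate-detection side, the idea is simply to digest the fetched content and compare it with the digest recorded at the previous visit. This is a minimal, self-contained sketch with illustrative names; it is not Heritrix code.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of check-sum based duplicate detection.
public class DuplicateDetectionSketch {

    // Last digest seen for each URI (stand-in for the crawler's history store).
    private final Map<String, String> lastDigest = new HashMap<String, String>();

    // Returns true if the content changed since the last visit and should be re-crawled.
    public boolean shouldCrawl(String uri, byte[] content) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] hash = md.digest(content);
        StringBuilder sb = new StringBuilder();
        for (byte b : hash) {
            sb.append(String.format("%02x", b));
        }
        String digest = sb.toString();
        boolean changed = !digest.equals(lastDigest.get(uri));
        lastDigest.put(uri, digest);
        return changed;
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        DuplicateDetectionSketch dedup = new DuplicateDetectionSketch();
        byte[] page = "<html>...</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.shouldCrawl("http://example.com/", page)); // true: first visit
        System.out.println(dedup.shouldCrawl("http://example.com/", page)); // false: unchanged
    }
}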


The suggested update is to write the extracted link list to a new repository, beside the crawling queue; that repository will be the source of the web graph. We could use a NoSQL database, with the check-sum as the key and the outlinks of that content as the value. The repository could then feed other pipeline processes that use the links further, as in the sketch below.
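A minimal sketch of that idea, using an in-memory map as a stand-in for the NoSQL store; the class and method names are illustrative, not Heritrix or Cassandra APIs.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the NoSQL link repository.
public class LinkRepositorySketch {

    // Key: content check-sum; value: the outlinks extracted from that content.
    private final Map<String, List<String>> linkStore = new HashMap<String, List<String>>();

    // Called once the crawler has computed the check-sum and extracted the outlinks
    // it is about to add to the frontier; the same list is written to the repository.
    public void recordOutlinks(String checksum, List<String> outlinks) {
        if (!linkStore.containsKey(checksum)) {
            linkStore.put(checksum, outlinks);
        }
    }

    public Map<String, List<String>> getLinkStore() {
        return linkStore;
    }

    public static void main(String[] args) {
        LinkRepositorySketch repo = new LinkRepositorySketch();
        repo.recordOutlinks("sha1:3f2a9c...",
                Arrays.asList("http://example.com/a", "http://example.com/b"));
        System.out.println(repo.getLinkStore());
    }
}

Because the key is the content check-sum rather than the URI, unchanged pages are written only once, which is exactly the duplicate detection the crawler already performs.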

I plan to give it a try and measure the performance overhead of this step. Of course, it will help only with newly crawled materials; we still have to solve the problem for the previously archived Web.