In beautiful Paris, the
Bibliothèque nationale de France hosted the
2014 General Assembly of the
International Internet Preservation Consortium (IIPC). The first day, May 19th 2014, was the open day, which brought together IIPC members and a general audience, mostly researchers. The theme of the day was "Building Modern Research Corpora: the Evolution of Web Archiving and Analytics".
Personally, it was my first GA as a web archiving engineer rather than as a researcher or PhD student. It gave me the opportunity to understand the needs of the researchers and the limitations faced by the archivists. I can summarize the day with three remarks. First, the Wayback Machine alone is not enough for researchers; they need a richer interface, both in the size of the results and in the ability to filter them (e.g., by MIME type). Second, researchers need complete datasets; storage such as AWS (as used by Common Crawl) or the Web Observatory may be good locations for sharing the data. Third, computer scientists and technologists can't work alone in this area; they need feedback from librarians, historians, and digital humanities researchers to drive the technology development. Now, let's get into the details of the day.
The keynote speaker for the morning session was
Prof. Dame Wendy Hall, from the University of Southampton. Her presentation was entitled "
The role of the Web Observatory in web archiving and analytics". Wendy began by explaining that Web Science focuses on the web in general, spanning many disciplines in addition to the technologists. She described the
Web Observatory initiative, which aims at a global partnership for sharing datasets, observations, and tools between researchers. But sharing the datasets has some challenges, e.g., they may be open or closed and public or private. The technical challenges include how to store, share, and access this data. Describing the datasets may be handled using
Schema.org. More details about the initiative can be found in
The Southampton University Web Observatory paper, which appeared in ACM Web Science 2013.
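To make the Schema.org idea a bit more concrete, here is a minimal sketch (my own illustration, with placeholder values rather than a real Web Observatory record) of describing a shared dataset using the Schema.org Dataset vocabulary, serialized as JSON-LD from Python:

```python
import json

# A minimal Schema.org "Dataset" description, serialized as JSON-LD.
# All values below are placeholders for illustration only.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example web crawl sample",
    "description": "A small sample of archived web pages shared for research.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "application/warc",
        "contentUrl": "https://example.org/data/sample.warc.gz",
    },
}

print(json.dumps(dataset, indent=2))
```

Publishing such a description alongside the data lets other observatories and tools discover what a dataset contains and how it may be accessed.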
The next session was entitled "Web archives needs and projects across time and social sciences: topical studies". It had two presentations. The first was "
The future of the digital heritage of WWI", presented by Valérie Beaudouin from Telecom ParisTech and Philippe Chevallier from BnF. They identified three problems in saving the digital objects related to WWI:
epistemological, how to constitute a corpus of websites;
technical, how to collect these websites; and
juridical, the legal framework for collecting these sites. Web archiving was the solution; it helped them to select, build, and explore the corpus of WWI websites. They then demonstrated this solution with two cases: "FORUM, A space of WWI discussion", an open discussion about the war that also hosts shared digital objects about the war, and
the "Albums Valois", a set of 539 albums about the front line of WWI; putting them online will open new ways to share the data with a larger audience. The second presentation was "
Exploring French language song in the web archive", presented by Dominic Forest from the University of Montreal. Forest discussed
"Project IMAGE",
which indexes music, lyrics, and other metadata on a large scale. Lyrics are used as a description of the songs, and the research question here is how to benefit from the music archives to mine the songs. Song
metadata, including artist, year, and recording label, may also be useful for describing the songs, as it enriches the mining and machine learning tools.
After that, we had a panel on "Scholarly use and issues of web archives", in which the panelists shared their ideas about the topic. Valérie Schafer from CNRS presented "
Internet historians and the “risks” of web archives". She mentioned that most users access web archives through the Wayback Machine, but this one-page-at-a-time model, adequate in 1996, should be extended beyond single documents. Researchers should be able to build collections by searching the archives, and the archives should provide services to filter the results based on non-textual elements such as geolocation, date ranges, crawl dates, and the link graph.
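As a concrete illustration of what filtered, programmatic access could look like, the sketch below queries the Internet Archive's public CDX API and restricts captures to a date range and MIME type; this is my own example of the idea, not a tool shown in the panel.

```python
import requests

# Query the Internet Archive's CDX API for captures of a site,
# restricted to a date range and filtered by MIME type.
params = {
    "url": "example.com/*",
    "from": "20140101",              # start of the date range
    "to": "20141231",                # end of the date range
    "filter": "mimetype:text/html",  # keep only HTML captures
    "fl": "timestamp,original,mimetype,statuscode",
    "output": "json",
    "limit": 20,
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
resp.raise_for_status()

rows = resp.json()
if rows:
    header, captures = rows[0], rows[1:]
    for capture in captures:
        print(dict(zip(header, capture)))
```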
Jean-Marc Francony from the University of Grenoble presented
Archive 1.0 to Archive 2.0. He described Archive 1.0, which started with archiving elections on the web because the candidates showed increasing interest in the web during their campaigns. Archive 1.0 depends on the
Archive-It service to collect the data. Now they are working on Archive 2.0, which will depend on big data techniques. Anthony Cocciolo from Pratt Institute presented "
Web archives of deleted youth". He watches for companies whose services are about to be shut down (a "death watch") and reports them to the Internet Archive so their content can be collected. These shutdowns affect a large number of young people, who are distressed at losing their data.
Anat Ben-David from the University of Amsterdam presented "
Web Archive search as research". She spoke about one of the tools in the
WebART project, called
WebARTIST, which is a web archive search engine capable of searching, filtering, accessing, and exporting the data. In this research, they found that 10% of the pages in the collection are not part of the selection policy and were archived unintentionally.
After lunch, we had one more keynote, by Wolfgang Nejdl from the University of Hannover, entitled
Temporal retrieval, exploration and analytics in web archives. Nejdl discussed the
ALEXANDRIA project, which aims to develop the models, tools, and techniques necessary to archive, index, retrieve, and explore the web. The
ALEXANDRIA project poses nine research questions, such as:
Q1, how to link web archive content against multiple entity and event collections evolving over time;
Q4, how to aggregate social media streams for archiving; and
Q6, how to improve result ranking and clustering for time-sensitive and entity-based queries.
The next session was entitled "Large scale solutions and initiatives". Lisa Green, Jordan Mendelson, and Kurt Bollacker from Common Crawl presented
Common Crawl: enabling machine-scale analysis of web data.
Common Crawl is a non-profit organization that performs periodic crawls of the web and makes them available to researchers via Amazon AWS. They started with the observation that web archive analytics is moving from the human scale of accessing one page at a time with the Wayback Machine to the machine scale of accessing millions of pages in the cloud. Then they gave some examples of successful projects built on top of the Common Crawl corpus:
Web Data Commons;
Extracting Structured Data from the Common Crawl;
WikiEntities: In What Context Is a Term Referenced?;
Measuring the impact of Google Analytics;
Data Publica: Finding French Open Data; and
Improved Spell Checking. Robert Meusel from the University of Mannheim gave more details about the Web Data Commons project in his presentation
Mining a large web corpus. Web Data Commons builds on the Common Crawl corpus on AWS, where reusing the extraction framework requires modifying only one class. Meusel gave examples of the extracted datasets: a Hyperlink Graph of 3.5 billion pages connected by over 128 billion links;
structured data, where 585 million of the 2.2 billion pages (26.3%) contain Microformat, Microdata, or RDFa data; and
HTML Tables, where out of 14 billion raw tables, 154 million (1.1%) are “good” relations.
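To give a flavor of machine-scale processing of such a corpus, here is a minimal sketch that iterates over the records of a locally downloaded Common Crawl WARC file using the warcio library and tallies content types; the file name is a placeholder, and this is not code from any of the presentations.

```python
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

# Placeholder path: one WARC file downloaded from the Common Crawl corpus.
WARC_PATH = "CC-MAIN-example.warc.gz"

content_types = Counter()
with open(WARC_PATH, "rb") as stream:
    for record in ArchiveIterator(stream):
        # Count only actual HTTP responses, skipping request/metadata records.
        if record.rec_type == "response":
            ctype = record.http_headers.get_header("Content-Type") or "unknown"
            content_types[ctype.split(";")[0]] += 1

for ctype, count in content_types.most_common(10):
    print(f"{count:8d}  {ctype}")
```

In practice this kind of loop would be distributed over many WARC files (e.g., with MapReduce on AWS), which is exactly the shift from human scale to machine scale described in the talk.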
The next presenter was Masayuki Asahara from the National Institute for Japanese Language and Linguistics, with the presentation
Web-based ultra large scale corpus, where they are compiling a ten-billion-word corpus of web texts for linguistic research. The project depends on four basic technologies:
page collection, done by remote harvesting with Heritrix; linguistic annotation, done through character normalization, Japanese morphological analysis, and word segmentation; release of the corpus by making it publicly available; and preservation of the data in WARC format.
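As a small illustration of the morphological analysis and word segmentation step, here is a sketch using the MeCab analyzer (via the mecab-python3 package); MeCab is my assumption of a representative tool, not necessarily the one the project uses.

```python
import MeCab  # provided by the mecab-python3 package (needs an installed dictionary)

# Segment a Japanese sentence into words with part-of-speech information.
tagger = MeCab.Tagger()
text = "ウェブアーカイブは研究に役立ちます。"  # "Web archives are useful for research."

for line in tagger.parse(text).splitlines():
    if line == "EOS":
        break
    surface, _, features = line.partition("\t")
    print(surface, features.split(",")[0])  # word and its part of speech
```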
The last presentation in the session was "
From web archiving services to web scale data processing" by Chloé Martin from Internet Memory Research. The goal is
to make web data available for new business models by developing new technologies around a high-performance, large-scale crawler. The system depends on a large-scale crawler running on a scalable platform (big data components) and a set of analytics tools. Martin mentioned some interesting projects:
ArchiveThe.Net, a tool to collect and preserve web material for historical, cultural, or heritage purposes;
mignify, a web data processing platform supported by a crawl-on-demand service; and
newstretto, an app to search and select your favorite news based on keywords.
The next session was "Building/integrating collections and tools". It started with a presentation from Helen Hockx-Yu from the British Library about
Building a national collection of the historical UK web. Helen divided the interaction with web archives into three categories: archive-driven, initiated by archival institutions; scholar-driven, initiated by scholars with specific research interests; and project-based, which varies in scope and scale. Scholarly interaction starts with building collections and then formulating research questions around this data. Helen defined the "go-to" state of scholarly interaction as independent use of web archives focused on the scholars' requirements, which needs clear interfaces and open access.
Generally,
web archives should not become the bottleneck for scholarly access because of their limited resources. Helen presented the UK websites datasets supported by JISC; they are divided into two sets, the 1990-2010 set with 30 TB and the 2010-2013 set with 27.5 TB. Some completed work on these datasets includes
Visualising links (to and from bl.uk),
Mapping the UK Webspace: Fifteen Years of British Universities on the Web, and
The Three Truths of Margaret Thatcher: Creating and Analysing.
The next presentation was
Data analysis and resource discovery by Tom Storrar and Claire Newing from the UK Government Web Archive (UKGWA). At UKGWA, they performed user surveys to determine who their users are and how they use the web archive. The surveys showed that UKGWA is visited by various types of users, who found the search function limited. UKGWA's search functionality may suffer from duplicates in the results and relies on keyword matching. So
UKGWA developed a semantic search service in which they can model facts and define linkage between entities through time. The semantic search service is exposed through machine-readable APIs. Now the
Semantic Knowledge Base is a dataset in its own right.
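For readers unfamiliar with this style of access, here is a sketch of querying such a semantic knowledge base with SPARQL from Python using the SPARQLWrapper library; the endpoint URL and predicate names below are hypothetical placeholders for illustration, not UKGWA's actual API.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint and vocabulary, for illustration only.
ENDPOINT = "https://example.gov.uk/sparql"

sparql = SPARQLWrapper(ENDPOINT)
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX ex: <http://example.org/schema/>
    SELECT ?org ?name ?start ?end WHERE {
        ?org a ex:GovernmentOrganisation ;
             ex:name ?name ;
             ex:existedFrom ?start .
        OPTIONAL { ?org ex:existedUntil ?end }
    }
    LIMIT 10
""")

# Each row models a fact about an entity and its lifetime through time.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"],
          row["start"]["value"],
          row.get("end", {}).get("value", "present"))
```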
After that, Martin Klein from Los Alamos National Laboratory presented "
HIBERLINK: quantifying and addressing link rot in scholarly communications". Klein described the
Hiberlink project, which aims to protect scholarly documents from reference rot, which comes in two forms: link rot (the page can no longer be found, e.g., a 404) and content drift (the content changes over time). Hiberlink can address reference rot by augmenting HTTP links with a temporal context that indicates the intended access time, together with a link to the memento corresponding to that time.
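As an illustration of resolving a link against a temporal context, the sketch below asks a Memento TimeGate (here the Internet Archive's Wayback Machine) for the capture of a URL closest to a given datetime using the Accept-Datetime header; this is a minimal sketch of the Memento protocol, not Hiberlink's own tooling.

```python
import requests

# Ask the Wayback Machine's Memento TimeGate for the capture of a page
# closest to the intended access time.
target = "http://www.example.com/"
accept_datetime = "Mon, 19 May 2014 12:00:00 GMT"  # the temporal context

resp = requests.get(
    "http://web.archive.org/web/" + target,
    headers={"Accept-Datetime": accept_datetime},
    allow_redirects=True,
)

print("Memento URI:     ", resp.url)
print("Memento-Datetime:", resp.headers.get("Memento-Datetime"))
```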
The last presentation in this session was "
Proprioception: a toolkit for web archive data mining" by
David Rapin from INA and Sophie Gebeil from the Maison méditerranéenne des sciences de l’homme. They are studying the history of memories using web archives, moving away from the traditional methods of libraries. To do that they need a small set of filters for the data, a fast QA feedback loop, and readable results. They used Pig Latin for data analysis, but it was not enough. Their system depends on 5 numerical indicators (e.g., unique URL count and
average size), 7 filter types (e.g., month and top-level domain), precomputation of the data using MapReduce and Impala, and an easy-to-understand interface for the users.
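To make the indicator/filter idea concrete, here is a toy sketch (entirely my own illustration, not INA's pipeline) that computes one indicator, unique URL count, grouped by two of the filter dimensions, month and top-level domain, over a small in-memory list of capture records.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy capture records: (URL, capture date as YYYY-MM-DD). Placeholder data.
captures = [
    ("http://example.fr/page1", "2014-05-19"),
    ("http://example.fr/page1", "2014-05-20"),
    ("http://example.org/a", "2014-05-19"),
    ("http://example.org/b", "2014-06-01"),
]

# Indicator: unique URL count, grouped by (month, top-level domain) filters.
unique_urls = defaultdict(set)
for url, date in captures:
    month = date[:7]                                  # e.g. "2014-05"
    tld = urlparse(url).hostname.rsplit(".", 1)[-1]   # e.g. "fr"
    unique_urls[(month, tld)].add(url)

for (month, tld), urls in sorted(unique_urls.items()):
    print(month, tld, len(urls))
```

At archive scale the same grouping and counting would be precomputed with MapReduce or Impala, as the presenters described, rather than in memory.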
The
closing remarks for the day were presented by Niels Brügger from the University of Aarhus. Niels divided research with web archives into four phases. First, corpus creation, which includes searching for the seed URIs, removing duplicates, selecting the best candidates, and isolating them from the web archive for further processing. The second phase is analysis, which calls for analytical tools as well as visualization techniques to help in understanding the data. The third phase is dissemination, where legal issues challenge the researcher's ability to publish results. Finally, storage is a challenge to ensure the long-term preservation of the data and to provide suitable access methods to users. Niels presented two ongoing projects: "Probing a nation’s web sphere", which aims to study the historical development of the Danish national web, and
RESAW, a research infrastructure for the study of archived web materials.
The complete list of presentations for the open day, the members' workshops, and the open workshops is available under
IIPC GA 2014 presentations.