Monday, September 26, 2016

2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

Trip Report for the IIPC Building Better Crawlers Hackathon in London, UK.                           

On September 22-23, 2016, I attended the IIPC Building Better Crawlers Hackathon (#iipchack) at the British Library in London, UK. Having been to London almost exactly 2 years ago for the Digital Libraries 2014 conference, I was excited to go back, but was more so anticipating collaborating with some folks I had long been in contact with during my tenure as a PhD student researcher at ODU.

The event was a well-organized yet loosely scheduled meeting that resembled more of an "Unconference" than a Hackathon in that the discussion topics were defined as the event progressed rather than a larger portion being devoted to implementation (see the recent Archives Unleashed 1.0 and 2.0 trip reports). The represented organizations were:

Day 0

As everyone arrived at the event from abroad and locally, the event organizer Olga Holownia invited the attendees to an informal get-together meeting at The Skinners Arms. There the conversation was casual but frequently veered into aspects of web archiving and brain picking, which we were repeatedly encouraged to "Save for Tomorrow".

Day 1

The first day began with Andy Jackson (@anjacks0n) welcoming everyone and thanking them for coming despite the short notice and announcement of the event over the Summer. He and Gil Hoggarth (@grhggrth), both of the British Library, kept detailed notes of the conference happenings as they progressed with Andy keeping an editable open document for other attendees to collaborate on building.

Tom Cramer (@tcramer) of Stanford, who mentioned he had organized hackathons in the past, encouraged everyone in attendance (15 in number) to introduce themselves and give a synopsis of their role and their previous work at their respective institutions. He also asked how we could go about making crawling tools accessible to non-web archiving specialists to stimulate conversation.

The responding discussion initiated a theme that ran throughout the hackathon -- that of web archiving from a web browser.

One tool to accomplish this is Brozzler from Internet Archive, which combines warcprox and Chromium to preserve HTTP content sent over the wire into the WARC format. I had previously attempted to get Brozzler (originally forked from Umbra) up and running but was not successful. Other attendees either had previously tried or had not heard of the software. This transitioned later into Noah Levitt (of Internet Archive) giving an interactive audience-participation walk through of installing, setting up, and using Brozzler.

Prior to the interactive portion of the event, however, Jefferson Bailey (@jefferson_bail) of Internet Archive started a presentation by speaking about WASAPI (Web Archiving Systems API), a specification for defining data transfer of web archives. The specification is a collaboration with the University of North Texas, Rutgers, Stanford via LOCKSS, and other organizations. Jefferson emphasized that the specification is not implementation specific; it does not get into issues like access control, parameters of a specific path, etc. The rationale was that the spec should be not just a preservation data transport tool but also a means of data transfer for researchers. Their in-development implementation takes WARCs, pulls out data to generate a derivative WARC, then defines a Hadoop job using Pig syntax. Noah Levitt added that the Jobs API requires you to supply an operation like "Build CDX" and the WARCs on which you want to perform the operation.

In a typical non-linear unconference fashion (also exhibited in this blog post), Noah then gave details on Brozzler (presentation slides). With a room full of Mac and Linux users, installation proved particularly challenging. One issue I had previously run into was latency in starting RethinkDB. This issue was also exhibited by Colin Rosenthal (@colinrosenthal), he on Linux and I on Mac. Noah's machine, which he showed in a demo to have the exact same versions of all dependencies as mine, did not exhibit this latency, so your mileage may vary with installation. In the end, both Colin and I (and possibly others) succeeded in crawling a few URIs using Brozzler.

Andy added to Noah's interactive session by referencing his effort in Dockerizing Brozzler and his other work in componentizing and Dockerizing the other roles and tasks of the web archiving process with his project Wren. While one such component is the Archival Acid Test project I had created for Digital Libraries 2014, the other sub-projects of Wren ease the use of tools that are otherwise difficult or time-consuming to configure.

One such tool that was lauded throughout the conference was Alex Osborne's (@atosborne) tinycdxserver; Andy has also created a Dockerized version of tinycdxserver. This tool was new to me, but the reported statistics on CDX querying speed and storage have the potential for significant improvement for large web archives. Per Alex's description of the tool, the indexes in tinycdxserver are stored compressed using Facebook's RocksDB and are about a fifth the size of a flat CDX file. Further, Wayback instances can simply be pointed at a tinycdxserver instance using the built-in RemoteResourceIndex field in the Wayback configuration file, which makes for easy integration.
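To make the integration concrete, a Wayback deployment can swap its local CDX index for a remote one with a small configuration change. The fragment below is a sketch against OpenWayback's Spring-bean configuration; the bean name, host, port, and index name are illustrative assumptions, not values from the event:

```xml
<!-- Point OpenWayback at a tinycdxserver instance instead of a flat CDX file.
     The host, port, and index name here are hypothetical. -->
<bean name="resourceIndex" class="org.archive.wayback.resourceindex.RemoteResourceIndex">
  <property name="searchUrlBase" value="http://localhost:8080/myindex" />
</bean>
```

With a fragment like this in place, Wayback issues its index queries over HTTP to tinycdxserver rather than scanning local CDX files, which is where the reported speed and storage gains come from.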


A wholly unconference discussion then commenced with topics we wanted to cover in the second part of the day. After coming up with and classifying various ideas, Andy defined three groups: the Heritrix Wish List, Brozzler, and Automated QA.

Each attendee could join any of the three for further discussion. I chose "Automated QA", given that archival integrity is closely related to my research topic.

The Heritrix group expressed challenges that the members had encountered in transitioning from Heritrix version 1 to version 3. "The Heritrix 3 console is a step back from Heritrix 1's. Building and running scripts in Heritrix 3 is a pain." was the general sentiment from the group. Another concern was scarce documentation, which might be remedied with funded efforts to improve it, as deep knowledge of the tool's workings is needed to accurately represent its capability. Kristinn Sigurðsson (@kristsi), who was involved in the development of H3 (and declined to give a history documenting the non-existence of H2), has since resolved some issues. He and others encouraged me to use his fork of Heritrix 3, my own recommendation inadvertently included.

The Brozzler group first identified the behavior of Brozzler versus a conventional crawler: it handles one page or site at a time (a la WARCreate) instead of adding discovered URIs to a frontier and seeding those URIs for subsequent crawls. Per above, Brozzler's use of RethinkDB as both the crawl frontier and the CDX service makes it especially appealing and more scalable. Brozzler allows multiple workers to pull URIs from a pool and report back to a RethinkDB instance. This worked fairly well in my limited but successful testing at the hackathon.

The Automated QA group first spoke about the National Library of Australia's Bamboo project. The tool consumes Heritrix's (and/or warcprox's) crawl output folder and provides in-progress indexes from WARC files before a crawl finishes. Other statistics can be added in, as well as automated generation of screenshots for on-the-fly comparison of captures. We also highlighted some particular items that crawlers and browser-based preservation tools have trouble capturing: for example, video formats that vary in support between browsers, URIs defined in the "srcset" attribute, responsive design behaviors, etc. I also referenced my work on Ahmed AlSum's (@aalsum) Thumbnail Summarization using SimHash, as presented at the Web Archiving Collaboration meeting.

After presentations by the groups, the attendees called it a day and continued discussions at a nearby pub.

Day 2

The second day commenced with a few questions that we had all decided upon and agreed to at the pub as good discussion topics for the day. The questions were:

  1. Given five engineers and two years, what would you build?
  2. What are the barriers in training for the current and future crawling software and tools?
Given Five...

Responses to the first included something like Brozzler's frontier, but redesigned to allow for continuous crawling instead of a single URI. With a segue toward Heritrix, Kristinn verbally considered the relationship between configurability and scalability. "You typically don't install Heritrix on a virtual machine", he said, "usually a machine for this use requires at least 64 gigabytes of RAM." Also discussed was getting the raw data for a crawl versus being able to get the data needed to replicate the experience, and the particular importance of the latter.

Additionally, there was talk of adapting the scheme used by Brozzler for an Electron application meant for browsing, with the ability to toggle archiving through warcprox (related: see the recent post on WAIL). On the flip side, Kristinn mentioned that it surprised him that we can potentially create a browser of this sort that can interact with a proxy but not build another crawler -- highlighting the lack of options in other Heritrix-like robust archival crawlers.

Barriers in Training

For the second question, those involved with institutional archives seemed to agree that if one were going to hire a crawl engineer, Java and Python experience is a prerequisite to exposure to the archive-specific concepts. For current institutional training practice, Andy stated that he turns new developers in his organization loose on ACT, which is simply a CRUD application, to introduce them to the web archiving domain. Others said it would be useful to have staff exchanges and internships for collaboration and for getting more employees familiar with web archiving.


Another topic arose from the previous conversation about future methods of collaboration. For future work on documentation, more fundamental Getting Started guides as well as test sites for tools would be welcomed. For future communication, the IIPC Slack channel as well as the newly created IIPC GitHub wiki will be the next iteration of the outdated IIPC Tools page and the COPTR initiative.

The whole-group discussion wrapped up with identifying concrete next steps from what was discussed at the event. These included creating setup guides for Brozzler, testing of any further use cases of Umbra versus Brozzler, future work on access control considerations as currently done by institutions and next steps regarding that, and a few other TODOs. A monthly online meeting is also planned to facilitate collaboration between meetings as well as more continued interaction via Slack instead of a number of outdated, obsolete, or noisy e-mail channels.

In Conclusion...

Attending the IIPC Building Better Crawlers Hackathon was invaluable for establishing contacts and gaining more exposure to the field and the efforts of others. Many of the conversations were open-ended, which led to numerous other topics being discussed and opened the doors to potential new collaborations. I gained a lot of insight from discussing my research topic and others' projects and endeavors. I hope to be involved with future IIPC hackathons-turned-unconferences and appreciate the opportunity I had to attend.

—Mat (@machawk1)

Kristinn Sigurðsson has also written a post about his takeaways from the event.

Wednesday, September 21, 2016

2016-09-20: The promising scene at the end of Ph.D. trail

From right to left, Dr. Nelson (my advisor),
Yousof (my son), Yasmin (myself), Ahmed (my husband)
August 26th marked my last day as a Ph.D. student in the Computer Science department at ODU, while September 26 marks my first day as a Postdoctoral Scholar in Data Curation for the Sciences and Social Sciences at UC Berkeley. I will lead research in the areas of software curation, data science, and digital research methods. I will be honored to work under the supervision of Dr. Erik Mitchell, the Associate University Librarian and Director of Digital Initiatives and Collaborative Services at the University of California, Berkeley. I will have an opportunity to collaborate with many institutions across UC Berkeley, including the Berkeley Institute for Data Science (BIDS) research unit. It is amazing to see the light at the end of the long tunnel. Below, I talk about the long trail I took to reach my academic dream position. I'll recap the topic of my dissertation, then I'll summarize lessons learned at the end.

I started my Ph.D. in January 2011 at the same time that the uprisings of the Jan 25 Egyptian Revolution began. I was witnessing what was happening in Egypt while I was in Norfolk, Virginia. I could not do anything during the 18 days except watch all the news and social media channels, witnessing the events. I wished that my son Yousof, who was less than 2 years old at that time, could know what was happening as I saw it. Luckily, I knew about Archive-It, a subscription service by the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. Each collection in Archive-It has two dimensions: time and URI. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in a paradox: the larger the collection, the harder it is to understand.

There are multiple collections in Archive-It about the Jan. 25 Egyptian Revolution 

There is more than one collection documenting the Arab Spring, and particularly the Egyptian Revolution. Documenting long-running events such as the Egyptian Revolution results in large collections that have thousands of URIs, with each URI having thousands of copies through time. It is challenging for my son to pick a specific collection to learn the key events of the Egyptian Revolution. My dissertation, entitled "Using Web Archives to Enrich the Live Web Experience Through Storytelling", focused on understanding the holdings of these archived collections.
Inspired by “It was a dark and stormy night”, a well-known storytelling trope:  
We named the proposed framework the Dark and Stormy Archive (DSA) framework, in which we integrate storytelling, social media, and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users are already familiar with, such as Storify. An example of the output is below. It shows three stories for the three collections about the Egyptian Revolution. The user can gain an understanding of the holdings of each collection from the snippets of each story.

The story of the Arab Spring Collection

The story of  the North Africa and the Middle East collection

The story of the Egyptian Revolution collection

With the help of the Archive-It team and partners, we obtained a ground truth data set for evaluating the stories generated by the DSA framework. We used Amazon Mechanical Turk to evaluate the automatically generated stories against stories created by domain experts. The results show that the stories automatically generated by the DSA are indistinguishable from those created by human domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories. I successfully defended my Ph.D. dissertation on 06/16/2016.

Generating persistent stories from themed archived collections will ensure that future generations will be able to browse the past easily. I’m glad that Yousof and future generations will be able to browse and understand the past easily through generated stories that summarize the holdings of the archived collections.


To continue WS-DLer’s habit in providing recaps, lessons learned, and recommendations, I will list some of the lessons learned for what it takes to be a successful Ph.D. student and advice for applying in academia. I hope these lessons and advice will be useful for future WS-DLers and grad students. Lessons learned and advice:
  • The first one, and the one I always put in front of me: You can do ANYTHING!!

  • Getting involved in communities in addition to your academic life is useful in many ways. I have participated in many women in technology communities such as the Anita Borg Institute and the Arab Women in Computing (ArabWIC) to increase the inclusion of women in technology. I was awarded travel scholarships to attend several well-known women in tech conferences: CRA-W (Graduate Cohort 2013), Grace Hopper Celebration of Women in Computing (GHC) 2013, GHC 2014, GHC 2015, and ArabWIC 2015. I am a member of the leadership committee of ArabWIC. Attending these meetings builds maturity and enlarges personal connections, preparing students for future careers. I also gained leadership skills from being part of the leadership committee of ArabWIC.
  • Publications matter! If you are in WS-DL, you will have to reach the targeted score 😉. You can learn more about the point system on the wiki. If you plan to apply in academia, the publication list is a big factor.
  • Teaching is important for applying in academia. 
  • Collaboration is key to increasing your connections and will also help develop your skills for working in teams.
  • And at last, being a mom holding a Ph.D. is not easy at all!!
The trail was not easy, but it was worth it. I have learned and changed much since I started the program. Having enthusiastic and great advisors like Dr. Nelson and Dr. Weigle is a huge support that results in a happy ending and an achievement to be proud of.


Tuesday, September 20, 2016

2016-09-20: Carbon Dating the Web, version 3.0

Due to API changes, the old Carbon Date tool is out of date and some modules, such as Topsy, no longer work. I have taken up the responsibility of maintaining and extending the service, beginning with the following, now available in Carbon Date v3.0.

Carbon date 3.0

What's new

New services have been added, such as Bing search, Twitter search, and pubdate parsing.

The new software architecture enables us to load given scripts or disable given services at runtime.

The server framework has been changed from CherryPy to Tornado, another minimalist Python web server, for better performance.

How to use the Carbon Date service

  • Through the website: given that carbon dating is computationally intensive, the site can only hold 50 concurrent requests, so the web service should be used only for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally. Note that the old link still works.
  • Through local installation: the project source can be found at the following repository; consult it for instructions on how to install the application.

Dockerizing the Carbon Date Tool

Carbon Date now only supports Python 3. Due to potential package conflicts between Python 2 and Python 3 (most machines have Python 2 installed by default), we recommend running Carbon Date in Docker.

Build docker image from source
  1. Install Docker.
  2. Clone the GitHub source to a local directory.
  3. Run 
  4. Then you can choose either server or local mode
    • server mode

      Don't forget to map your port to the server port in the container.
      Then in the browser visit

      for index page or
      in the terminal

      for direct query
    • local mode
Or get the deployed image automatically from Docker Hub:

System Design

In order to make Carbon Date tool easier to maintain and develop, the structure of the application has been refactored.  The system now has four layers:

When a query is sent to the application, it proceeds as follows:

Add new module to Carbon Date

Now all the modules are loaded and executed automatically. The module manipulator will search for and call the entry function of each module. A new module can be loaded and executed automatically, without altering other scripts, if it defines its entry function as described below.

Name the module's main script cdGet<Module name>.py and ensure the entry function is named correspondingly, or customize your own entry-function name by assigning a string value to the 'entry' variable at the beginning of your script.

For example, consider a new module that uses a search engine to find the potential creation date of a URI. The script should be named per the convention above, and the entry function defined accordingly.

The module manipulator will pass outputArray, indexOfOutputArray, and displayArray in the kwargs to the function. Note that outputArray is used to compute the earliest creation date, so only one value should be assigned there. The displayArray is for return values; it can contain the resulting creation date or anything else, in the form of an array of tuples.

In this example, when we get the result from the search engine, the module reports these values back through the arrays described above.
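As a concrete illustration of the conventions above, here is what a minimal module might look like. The file name (cdGetExample.py), entry-function name (getExample), and hard-coded date are all hypothetical; a real module would obtain its date from a search engine, an API, or page metadata:

```python
# cdGetExample.py -- a hypothetical Carbon Date module (illustrative only).

def getExample(url, **kwargs):
    """Entry function: estimate a creation date for `url` and report it
    back through the shared arrays passed in via kwargs."""
    # In a real module this date would come from an external source;
    # here it is hard-coded purely for illustration.
    creation_date = "2016-09-20T00:00:00"

    output_array = kwargs["outputArray"]
    index = kwargs["indexOfOutputArray"]
    display_array = kwargs["displayArray"]

    # outputArray holds one candidate date per module; the framework
    # uses it to compute the earliest creation date across all modules.
    output_array[index] = creation_date

    # displayArray carries whatever the module wants displayed,
    # as an array of tuples.
    display_array.append(("example-source", creation_date))
```

Because the framework only interacts with the module through its entry function and these kwargs, the module needs no knowledge of the other services.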

Source maintenance

Some web services may change, so some modules will need to be updated frequently.

For example, the Twitter module should be updated whenever Twitter changes its page hierarchy, because it currently crawls the Twitter search page and parses the timestamp of each tweet in the results. The old algorithm may not work if Twitter moves the tweets' timestamps to other tags in the future.

Thus the Twitter script should be updated periodically, until Twitter allows users to retrieve tweets older than one week through the Twitter API.

I am grateful to everyone who helped me with Carbon Date, especially Sawood Alam, who helped greatly with deploying the server and gave countless pieces of advice about refactoring the application, and John Berlin, who advised me to use Tornado instead of CherryPy. Further recommendations or comments about how this service can be improved are welcome and appreciated.


Tuesday, September 13, 2016

2016-09-13: Memento and Web Archiving Colloquium at UVa

Yesterday, September 12, I went to the University of Virginia to give a colloquium at the invitation of Robin Ruggaber to talk with her staff about Memento, Web Archiving, and related technologies. I also had the pleasure of meeting with Worthy Martin of the CS department and the Institute for Advanced Technology in the Humanities. I met Robin at CNI Spring 2016; she was intrigued by our work at using storytelling to summarize archival collections and was hoping to apply it to their Archive-It collections (which are currently not public). My presentation yesterday was more of an overview of web archiving, although the discussion did cover various details, including a proposal for Memento versioning in Fedora.


Sunday, September 11, 2016

2016-09-11: Web Archives and Popular Media

At the Old Dominion University Web Science and Digital Libraries Research Group we have been studying web archiving for a long time.  In the past few years, we have noticed a significant uptick in the use of web archives in mainstream media, both to support stories and as the subject.  This post presents articles from the popular media that use web archive holdings (mementos) as evidence and concludes with articles about web archives.

Articles that Reference Web Archives

Tabloid Facing $100 Million Lawsuit Pulls Michael Jackson Abuse Story

2016-09-06 • Radar Online is known for a lot of things in the tabloid world, but factual reporting apparently isn't one of them. We first reported back in June a laundry list of items supposedly found in Michael Jackson's Neverland Ranch by the Santa Barbara County Sheriff's Department back in 2003.

This article uses an Internet Archive memento as evidence that a tabloid (Radar Online) might be attempting to "bury this piece to avoid a huge payout…".
2016-06-21 19:36:45
PDF of Santa Barbara County Sheriff's Department report.
Warning: although redacted, the photographs in this report are unsuitable for children and most workplaces.

Clinton's Website Deleted Statement Saying Rape Victims Have the 'Right to Be Believed'

2016-08-15 • Hillary Clinton's presidential campaign has deleted a statement on its website that said that all rape victims have the "right to be believed." BuzzFeed reported Sunday that the change was ma

This article uses a memento from to show that "right to be believed" was removed from Hillary Clinton's speech.
2015-11-30 01:45:14
Campus sexual assault page from Hillary Clinton's website.

2016-08-23 • The Epidemic Archives of the Future Will Be Born Digital

Colorful AIDS education posters from the 1980s. Black-and-white photos of mid-20th-century anatomy lessons for midwives. Eighteenth-century instructions for the administration of patent medicines. While a paper archival collection in the U.S. National Library of Medicine might contain items like these-handwritten or typed journals, correspondence, educational materials, and official reports, some digitized many years after their creation-the next generation of health information lives online.

This article describes how a National Library of Medicine (NLM) team uses the Archive-It web archive to collect webpages, blog posts, and social media streams to capture online health information generated during health crises.

Panic Mode: Khizr Khan Deletes Law Firm Website that Specialized in Muslim Immigration - Breitbart

2016-08-02 • This development is significant, as his website proved – as Breitbart News and others have reported – that he financially benefits from unfettered pay-to-play Muslim migration into America. A snapshot of his now deleted website, as captured by the Wayback Machine which takes snapshots archiving various websites on the Internet, shows that as a lawyer he engages in procurement of EB5 immigration visas and other "Related Immigration Services."

This article uses Internet Archive mementos to bolster the claim that Khizr Khan deleted his website (currently accessible – 2016-09-04) from the Internet to avoid publicizing that he financially benefits from Muslim migration to America.
2016-08-02 12:14:11
Khan's E2/EB5 Immigration Services page, no longer present on Khan's website.
The deletion of the Khan law firm website may have been an administrative oversight. The memento below shows GoDaddy offering the domain name for sale.
2016-08-04 18:40:47
GoDaddy landing page for abandoned domains.

Vote Leave wipes homepage after Brexit result

2016-06-27 • In the wake of the EU referendum, the Vote Leave campaign has wiped its homepage. Visitors to the site are now greeted by the above image. The only active links are to the campaign's Privacy Policy and contact details.

This article shows that the UK Vote Leave campaign's deleted speeches can still be found in the Internet Archive.
2016-06-27 12:35:31
Vote Leave's deleted speech: foreign_secretary_getting_the_facts_clear_on_turkey

Melania Trump's Website, Biography Have Disappeared From The Internet

2016-07-28 • The professional website of Melania Trump, wife of the Republican presidential nominee, has apparently been deleted from the internet as of Wednesday afternoon. The disappearance of Trump's elaborate website comes just days after news outlets, including The Huffington Post, raised serious questions about whether she actually earned an undergraduate degree in architecture from the University of Ljubljana, which is in Trump's native Slovenia.

Just days after news outlets raised questions about the veracity of Melania Trump's undergraduate degree, her website and biography were taken down.  However, the Internet Archive had already captured her website over 250 times and her biography page 150 times.

2013-04-04 07:12:55

Melania Trump's undergraduate degree claim

Web evidence points to pro-Russia rebels in downing of MH17 (+video)

2014-07-14 • Igor Girkin, a Ukrainian separatist leader also known as Strelkov, claimed responsibility on a popular Russian social-networking site for the downing of what he thought was a Ukrainian military transport plane shortly before reports that Malaysian Airlines Flight MH17 had crashed near the rebel held Ukrainian city of Donetsk.

This article uses Internet Archive mementos of a Ukrainian separatist leader's (Igor Girkin) social media page as evidence that the separatists shot down Malaysian Airlines flight MH17.  The mementos below show the changes in Girkin's social media page as the news about MH17 unfolded.
2014-07-17 15:22:22
Shoot down claim of a Ukrainian AN-26 military transport.
2014-07-17 16:10:58
Both the original shoot down claim and denial of responsibility.
2014-07-17 16:56:38
Shoot down claim removed; denial of responsibility remains.

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

2013-11-21 • @Conservatives put speeches in Streisand's house: @UKWebArchive: via @lljohnston @hhockx - Michael L. Nelson (@phonedude_mln)November 13, 2013 Circulating the web last week the story of the UK's Conservative Party (aka the " Tories") removing speeches from their website (see Note 1 below).

This blog post discusses the UK Conservative Party's attempt to delete history by removing old speeches from their website.  The party also tried blocking display of the speeches using robots.txt.  However, as the post points out, several archives already had copies.

David Cameron 2009 speech returned a 404 (not found) on 2013-11-21.
2013-01-02 is one of several archives with copies of the Conservative Party's speeches.

Online Retailer Says If You Give It A Negative Review It Can Fine You $3,500

2013-11-13 • Lots of quasi-legal action has been taken over negative reviews left by customers at sites like Ripoff Report and Yelp. Usually, it takes the form of post-review threats about defamation and libel. Every so often, though, a company will make proactive...

The article discusses a lawsuit brought by Kleargear against a customer who left negative feedback.  The Internet Archive memento cited in the article has since been excluded (probably via robots.txt):
2013-08-17 14:44:17
Kleargear Terms of Use has been excluded from the Internet Archive.
2013-08-17 14:44:17
Fortunately, another archive has a copy.

Articles about Web Archives

Web archives have proven their use in journalism, law, and other research areas to the point that the New Yorker, Forbes, The Atlantic, U.S. News, and others have all published insightful articles recently.

2016-08-17 • U.S. News
Wayback Machine Won’t Censor Archive for Taste, Director Says After Olympics Article Scrubbed

Internet Archive removed article for safety of Olympians.
Screenshot of Forbes "Reimagining Libraries in the Digital Era" article.
2016-03-19 Forbes
Reimagining Libraries In The Digital Era

Lessons From Data Mining The Internet Archive.
Screenshot of New Yorker "Cobweb" article.
2015-01-26 New Yorker
The Cobweb

Can the Internet be archived?
Screenshot of The Atlantic "Raiders of the Lost Web" article.
2015-10-14 • The Atlantic
Raiders of the Lost Web
If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can.

The Internet's Dark Ages

2015-10-14 • The web, as it appears at any one moment, is a phantasmagoria. It’s not a place in any reliable sense of the word. It is not a repository. It is not a library. It is a constantly changing patchwork of perpetual nowness.

These lists of articles are just a beginning and will be expanded as new articles are discovered. Contributions and suggestions are welcome. Please send them by email or tweet them to @galsondor with hashtag #mementoinmedia.
— Scott G. Ainsworth

Friday, September 9, 2016

2016-09-09: Summer Fellowship at the Harvard Library Innovation Lab Trip Report

Myself standing at the main entrance of Langdell Hall
I was honored with the great opportunity of collaborating with the Harvard Library Innovation Lab (LIL) as a Fellow this Summer. Located at Langdell Hall, Harvard Law School, the Library Innovation Lab develops solutions to serious problems facing libraries. It consists of an eclectic group of lawyers, librarians, and software developers engaged in projects such as the Caselaw Access Project (CAP) and the Nuremberg Project, among many others.
The LIL Team
To help prevent link rot, creates permanent, reliable links for web resources. The Caselaw Access Project is an ambitious effort to make all US case law freely accessible online. The collection to be digitized currently stands at over 42,000 volumes (nearly 40 million pages). The Nuremberg Project is digitizing LIL's collection about the Nuremberg trials.
I started work on June 6, 2016 (through August 24) as one of seven Summer Fellows, supervised by Adam Ziegler, LIL's Managing Director. During the first week of the fellowship, we (the Summer Fellows) were given a tour of the Harvard Law School Library and had the opportunity to share our research plans in the first Fellows hour - a session in which Fellows reported research progress and received feedback from the LIL team as well as other Fellows. The Fellowship program was structured so that we had the flexibility to research subjects that interested us.
The 2016 LIL Summer Fellows
Harvard LIL 2016 Summer Fellows (See LIL's blog)
1. Neel Agrawal: Neel is a Law Librarian at the LA Law Library in Los Angeles, California. He is also a professional percussionist in various musical contexts such as Fusion and Indian and Western classical. He spent the Summer researching African drumming laws to understand why and how colonial governments controlled, criminalized, and regulated drumming in Western/Northern Nigeria, Ghana, Uganda, Malawi, The Gambia, and Seychelles.
2. Jay Edwards: Jay was the lead database engineer for Obama for America in 2012 and the ninth employee at Twitter. He spent the Summer working on the Caselaw Access Project, building a platform to enable non-programmers to use Caselaw data.
3. Sara Frug: Sara is the Associate Director of the Cornell Law School Legal Information Institute, where she manages the engineering team which designs various tools that improve the accessibility and usability of legal text. Sara spent the Summer further researching how to improve the accessibility of legal text by developing a legal text data model.
4. Ilya Kreymer: Ilya is the creator of Webrecorder, an interactive archiving tool which helps users create high-fidelity web archives of websites simply by browsing within the tool. Ilya spent the Summer improving Webrecorder.
5. Muira McCammon: Muira just concluded her M.A. in Comparative Literature/Translation Studies at the University of Massachusetts-Amherst and received her B.A. in International Relations and French from Carleton College. Her M.A. thesis was about the history of the Guantanamo Bay Detainee Library. She spent the Summer further expanding her GiTMO research: drafting a narrative nonfiction book, designing a tabletop wargame to model the interaction dynamics of various parties at GiTMO, and organizing a GiTMO conference.
6. Alexander Nwala: I am a computer science Ph.D. student at Old Dominion University under the supervision of Dr. Michael Nelson. I have worked on projects such as Carbon Date, What Did It Look Like?, and I Can Haz Memento. Carbon Date helps you estimate the birth date of a website, and What Did It Look Like? renders an animated GIF showing how a website changed over time. I spent the Summer expanding my current research, which is concerned with building collections for stories and events.
7. Tiffany Tseng: Tiffany is the creator of Spin and a Ph.D. graduate of the Lifelong Kindergarten group at the MIT Media Lab. Spin is a photography turntable system for capturing animations of the evolution of design projects. Her research at MIT primarily focused on supporting designers and makers in documenting and sharing their design process. Tiffany also has comprehensive knowledge of a wide range of snacks.
Interesting things happen when you bring together a group of scholars from different fields with different interests. The opportunity to learn about our various research projects from the different perspectives offered by the Fellows and the LIL team was constant. Progress was constant, as were scrum and button making.
A few buttons assembled during one of the many button making rituals at LIL
The 2016 LIL Summer Fellowship concluded with a Fellows share event in which the seven Summer Fellows presented the outcome of their work during the Fellowship.

During the presentation, Neel talked about his interactive African drumming laws website.

A paid permit was required by law in order to drum in the Western Nigeria District Councils
The website provides an online education experience by tracing the creation of about 100 drumming laws between the 1950s and 1970s in District Councils throughout Western Nigeria.

88 CPU Cores processing the CAP XML data
Jay talked about the steps he took to make the dense XML Caselaw data searchable: first, he validated the Caselaw XML files; second, he converted the files to a columnar data store format (Parquet); third, he loaded the preprocessed Caselaw data into Apache Drill to provide query capabilities.

Examples of different classification systems for legal text: Eurovoc (left), Library of Congress Subject Headings (center), and Legislative Indexing Vocabulary (right)
Sara talked about a general data model she developed which enables developers to harness information available in different legal text classification systems, without having to understand the specific details of each system. 

Ilya demonstrated the new capabilities in the new version of Webrecorder.
Muira talked about her investigation of GiTMO and other detainee libraries. She highlighted her work with the Harvard Law School Library to create a Twitter archive of the tweets of Carol Rosenberg (Miami Herald journalist). She also talked about her experiences filing Freedom of Information Act (FOIA) requests.
I presented the Geo derivative of the Local Memory Project, which maps zip codes to local news media outlets. I also presented a non-public prototype of the Local Memory Project Google Chrome extension. The extension helps users build/archive/share collections about local events or stories collected from local news media outlets.

Tiffany's work at Hatch Makerspace: Spin setup (left), PIx documentation station (center), and PIx whiteboard for sharing projects (right)
The presentations concluded with Tiffany's talk about her collaboration with HATCH - a makerspace run by Watertown Public Library. She also talked about her work improving Spin (a turntable system she created).

I will link to the Fellows share video presentation and booklet when LIL posts them.

Tuesday, August 30, 2016

2016-08-30: Memento at the W3C

We are pleased to report that the W3C has embraced Memento for versioning its specifications and its wiki. Completing this effort required collaboration between the W3C and the Los Alamos National Laboratory (LANL) Research Library Prototyping Team. Here we provide a brief history of this effort and an overview of the technical work done to implement Memento at the W3C.

Brief History of Memento Work with the W3C

The W3C uses Memento for two separate systems: its specifications and its wiki.
Memento was implemented on both of these systems in 2016, but there were a lot of discussions and changes in direction along the way.
In 2010, Herbert Van de Sompel presented Memento as part of the Linked Data on the Web Workshop (LDOW) at WWW. The presentation was met with much enthusiasm. In fact, Sir Tim Berners-Lee stated "this is neat and there is a real need for it". Later, he met with Herbert to suggest that Memento could be used on the W3C site itself, specifically for time-based access to W3C specifications.
That same year, Harihar Shankar had finished the first working version of the Memento MediaWiki Extension. Ted Guild of the W3C installed this extension on their wiki for easy access to prior versions of pages.
At the time, the W3C kept their specifications in CVS. LANL and the W3C began discussions about how to use Memento with their CVS system and other associated web server software. This attempt ran into problems due to permissions issues and other concerns.
Fast forward to 2013, when Shawn Jones joined the ODU Web Science and Digital Libraries Research Group. At this point, attempts to get the Memento MediaWiki Extension installed at Wikipedia had stalled, and the extension had ceased working with the version of MediaWiki then being used at the W3C. Shawn updated the extension, analyzed different design options, and evaluated their performance. He enlisted support from the MediaWiki development team in hopes that the extension would be acceptable for deployment at Wikipedia. Version 2.0.0 was released in 2014.
By 2014 Yorick Chollet had joined the LANL Prototyping Team. As part of work with the W3C, Yorick produced standalone TimeGate software that could be installed and run by anyone. The W3C had also started work on a web API for their specifications. The decision was made by both groups to develop the TimeGate as a microservice that would provide a Memento interface to the W3C API.
In 2015, Herbert notified the W3C that the latest version of the Memento MediaWiki Extension was available. After some planned updates to the W3C infrastructure, the updated extension was installed in January of 2016, restoring Memento support on their wiki.
By that time the W3C specifications API was nearing completion. Harihar and Herbert collaborated with José Kahan at the W3C to ensure that the W3C TimeGate microservice worked with the API. Once testing was complete, the W3C added the Memento-Datetime header and updated the Link headers of their resources to reference the new TimeGate. At the same time, the W3C moved its services to HTTPS, requiring HTTPS to be implemented at the TimeGate as well. Now both the W3C specifications and the W3C wiki use Memento.

Details of Memento Support for W3C Specifications

Work on Memento for the W3C specifications entailed coordination between three components: the W3C Apache web server, the W3C specifications API, and the LANL TimeGate microservice.
The diagram below provides an overview of the architecture of the Memento TimeGate microservice. The TimeGate accepts the Accept-Datetime header from Memento clients via HTTP. It then queries the W3C API using an API Handler. The result of that query is then used to discover the best revision of a specification that was active at the datetime expressed in the Accept-Datetime Header.

To demonstrate how these components work together, we will walk through Memento datetime negotiation using the specification for HTML 5 at URI-R and an Accept-Datetime value of Sat, 24 Apr 2010 15:00:00 GMT.
As shown in the curl request below, the W3C Apache Web server produces the appropriate TimeGate Link header for original resources. Memento clients use the timegate relation in this Link header to discover the URI-G of the TimeGate for this resource.
# curl -I ""
HTTP/1.1 200 OK
Date: Fri, 05 Aug 2016 20:41:42 GMT
Last-Modified: Fri, 24 Oct 2014 16:15:24 GMT
ETag: "20acd-5062d7cffff00"
Accept-Ranges: bytes
Content-Length: 133837
Cache-Control: max-age=31536000
Expires: Sat, 05 Aug 2017 20:41:42 GMT
P3P: policyref=""
Link: < TR/html5/>;rel="timegate"
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
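Clients locate the TimeGate by parsing this Link header for the timegate relation. As an illustration (this is a minimal sketch, not part of the W3C or LANL software, and the URIs below are hypothetical), the lookup can be done with standard-library pattern matching:

```python
import re

def find_timegate(link_header):
    """Return the URI whose rel parameter includes "timegate",
    or None if the Link header has no such entry."""
    # Each Link entry looks like: <uri>; param="value"; param="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[a-zA-Z-]+="[^"]*")*)', link_header):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r'rel="([^"]*)"', params)
        # rel may be multi-valued, e.g. "first memento"
        if rel and 'timegate' in rel.group(1).split():
            return uri
    return None

link = '<http://example.org/page>; rel="original", <http://example.org/tg>; rel="timegate"'
print(find_timegate(link))  # prints http://example.org/tg
```

A production client would use a full Link-header parser, but this is enough to follow the negotiation walkthrough below.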
To continue datetime negotiation, a Memento client would then issue an HTTP request like the one below to this TimeGate - maintained by LANL.
HEAD /w3c/timegate/ HTTP/1.1
Host:
Accept-Datetime: Sat, 24 Apr 2010 15:00:00 GMT
Connection: close
The Memento TimeGate microservice extracts the shortname from the original URI, html5 in this case. It then queries the W3C API for this shortname directly, receiving a JSON response like the abridged one below. This response contains the version history of the specification.
... ABRIDGED FOR BREVITY - SALIENT PARTS BELOW ...
"_embedded": {
    "version-history": [
        {
            "status": "Recommendation",
            "uri": "http:\/\/\/TR\/2014\/REC-html5-20141028\/",
            "date": "2014-10-28",
            "informative": false,
            "title": "HTML5",
            "shortlink": "http:\/\/\/TR\/html5\/",
            "editor-draft": "http:\/\/\/html\/wg\/drafts\/html\/master\/",
            "process-rules": "http:\/\/\/2005\/10\/Process-20051014\/",
            "_links": {
                "self": {
                    "href": "https:\/\/\/specifications\/html5\/versions\/20141028"
                },
                "editors": {
                    "href": "https:\/\/\/specifications\/html5\/versions\/20141028\/editors"
                },
                "deliverers": {
                    "href": "https:\/\/\/specifications\/html5\/versions\/20141028\/deliverers"
                },
                "specification": {
                    "href": "https:\/\/\/specifications\/html5"
                },
                "predecessor-version": {
                    "href": "https:\/\/\/specifications\/html5\/versions\/20141028\/predecessors"
                }
            }
        },
        ... MULTIPLE OTHER VERSIONS FOLLOW - ABRIDGED FOR BREVITY ...
From this JSON response, the TimeGate looks for the version-history array inside the _embedded object. From each entry in that array, it then extracts the uri and date. It then compares the value of the HTTP request's Accept-Datetime header with the URIs and dates from this version history to find the URI-M of the best memento that was active at the Accept-Datetime value.
In the case of our example, the datetime requested is Sat, 24 Apr 2010 15:00:00 GMT. Using the version history from the W3C API, the TimeGate discovers the URI-M of the best memento that was active at the Accept-Datetime value. This URI-M is then used as the value of the Location header of the TimeGate's response. Because the TimeGate has access to the entire version history, it easily generates additional Link relations in its response, filling in the first and last relations in addition to the URI of the TimeMap. The TimeGate's full response is shown below, with the Location and Link headers in bold.
# curl -I -H 'Accept-Datetime: Sat, 24 Apr 2010 15:00:00 GMT' ''
HTTP/1.1 302 Found
Server: nginx/1.8.0
Content-Type: text/plain; charset=UTF-8
Content-Length: 0
Connection: keep-alive
Date: Fri, 05 Aug 2016 21:18:29 GMT
Vary: accept-datetime
Location:
Link: <>; rel="original",
 <>; rel="timemap"; type="application/link-format",
 <>; rel="timemap"; type="application/json",
 <>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT",
 <>; rel="memento"; datetime="Thu, 04 Mar 2010 00:00:00 GMT",
 <>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"
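The selection step can be sketched in a few lines of Python. This is an illustrative reimplementation rather than the actual LANL TimeGate code; the dates mirror the HTML5 version history shown above, but the URI paths are abridged placeholders:

```python
from datetime import datetime, timezone

def best_memento(versions, accept_datetime):
    """Pick the version that was active at accept_datetime: the latest
    version dated on or before the target, or the earliest version if
    the target predates them all."""
    versions = sorted(versions, key=lambda v: v[0])  # oldest first
    best = versions[0]
    for date, uri in versions:
        if date <= accept_datetime:
            best = (date, uri)
        else:
            break
    return best

# Dates from the HTML5 version history; paths are abridged placeholders.
history = [
    (datetime(2008, 1, 22, tzinfo=timezone.utc), '/TR/2008/WD-html5-20080122/'),
    (datetime(2010, 3, 4, tzinfo=timezone.utc), '/TR/2010/WD-html5-20100304/'),
    (datetime(2014, 10, 28, tzinfo=timezone.utc), '/TR/2014/REC-html5-20141028/'),
]
target = datetime(2010, 4, 24, 15, 0, tzinfo=timezone.utc)
print(best_memento(history, target)[1])  # prints /TR/2010/WD-html5-20100304/
```

With the example Accept-Datetime of Sat, 24 Apr 2010 15:00:00 GMT, the version dated 04 Mar 2010 is selected, matching the memento relation in the Link header above.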
A Memento client would then interpret the HTTP 302 status code as a redirect and make a subsequent request to the URI-M from the Location header. In the response, the W3C Apache Web server provides the Memento-Datetime header, identifying this resource as a memento. Also provided are the timegate and original relations in the Link header, so further datetime negotiation can occur if necessary.
# curl -I ""
HTTP/1.1 200 OK
Date: Fri, 05 Aug 2016 21:19:07 GMT
Last-Modified: Tue, 08 Feb 2011 20:10:44 GMT
Memento-Datetime: Tue, 08 Feb 2011 20:10:44 GMT
ETag: "1d74a-49bcaf17c5900"
Accept-Ranges: bytes
Content-Length: 120650
Cache-Control: max-age=31536000
Expires: Sat, 12 Aug 2017 14:31:18 GMT
P3P: policyref=""
Link: < TR/html5/>;rel="timegate", <>;rel="original"
Vary: upgrade-insecure-requests
Access-Control-Allow-Origin: *
Content-Type: text/html; charset=utf-8
From this example, we see that datetime negotiation is now possible for W3C specifications, allowing users to find prior versions of any W3C specification for a given datetime. As seen in the datetime negotiation example above and in the link relations diagram below, the relations in the Link header make this possible, even though LANL maintains the TimeGate while the W3C maintains the original resource (the current version of the specification) and the mementos (past versions of the specification).

And, of course, TimeMaps work as well, with a TimeMap microservice using the W3C API to find the version history of the page. An example TimeMap is shown below.
# curl ''
<>; rel="original",
<>; rel="timegate",
<>; rel="self"; type="application/link-format",
<>; rel="timemap"; type="application/json",
<>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 10 Jun 2008 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 12 Feb 2009 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 23 Apr 2009 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 25 Aug 2009 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 04 Mar 2010 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 24 Jun 2010 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 19 Oct 2010 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 13 Jan 2011 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 05 Apr 2011 00:00:00 GMT",
<>; rel="memento"; datetime="Wed, 25 May 2011 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 29 Mar 2012 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 25 Oct 2012 00:00:00 GMT",
<>; rel="memento"; datetime="Mon, 17 Dec 2012 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 29 Apr 2014 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 17 Jun 2014 00:00:00 GMT",
<>; rel="memento"; datetime="Thu, 31 Jul 2014 00:00:00 GMT",
<>; rel="memento"; datetime="Tue, 16 Sep 2014 00:00:00 GMT",
<>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"
Contrast this TimeMap of 19 versions with the 1,243 observations made by the Internet Archive for the same page. If studying the evolution of a standard, 19 explicit versions are easier to work with than more than 1,000 observations, many of which capture the same version.
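A TimeMap in application/link-format can be consumed with the same kind of pattern matching used for Link headers. The sketch below is illustrative only (the URIs are hypothetical, and this is not the LANL implementation); it extracts a (datetime, URI) pair for every entry whose rel includes memento:

```python
import re
from email.utils import parsedate_to_datetime

def parse_timemap(timemap):
    """Extract (datetime, uri) pairs for every entry in an
    application/link-format TimeMap whose rel includes "memento"."""
    mementos = []
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[a-zA-Z-]+="[^"]*")*)', timemap):
        uri, params = match.group(1), match.group(2)
        rel = re.search(r'rel="([^"]*)"', params)
        dt = re.search(r'datetime="([^"]*)"', params)
        # rel is multi-valued for "first memento" / "last memento"
        if rel and dt and 'memento' in rel.group(1).split():
            mementos.append((parsedate_to_datetime(dt.group(1)), uri))
    return mementos

# A short excerpt with hypothetical URIs:
tm = ('<http://example.org/tr/html5>; rel="original", '
      '<http://example.org/v/20080122>; rel="first memento"; datetime="Tue, 22 Jan 2008 00:00:00 GMT", '
      '<http://example.org/v/20141028>; rel="last memento"; datetime="Tue, 28 Oct 2014 00:00:00 GMT"')
for dt, uri in parse_timemap(tm):
    print(dt.isoformat(), uri)
```

The quoted-value pattern matters here: datetime values contain commas (e.g., "Tue, 22 Jan 2008 ..."), so a naive split on commas would mangle the entries.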

Details of Memento Support on the W3C Wiki

The W3C is also running the full Memento MediaWiki Extension on their wiki. The extension provides TimeGates and TimeMaps, as well as additional information in the Link headers of its HTTP responses. Shown below is an example HTTP response for the original resource.
# curl -I ""
HTTP/1.1 200 OK
X-Powered-By: PHP/5.4.45-0+deb7u4
X-Content-Type-Options: nosniff
Link: <>; rel="original latest-version",
 <>; rel="timegate",
 <>; rel="timemap"; type="application/link-format"; from="Mon, 14 Mar 2011 19:25:12 GMT"; until="Thu, 21 Jul 2011 22:24:53 GMT",
 <>; rel="first memento"; datetime="Mon, 14 Mar 2011 19:25:12 GMT",
 <>; rel="last memento"; datetime="Thu, 21 Jul 2011 22:24:53 GMT"
Content-language: en
Vary: Accept-Encoding,Cookie
Cache-Control: s-maxage=18000, must-revalidate, max-age=0
Last-Modified: Wed, 03 Aug 2016 04:40:32 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 24053
Accept-Ranges: bytes
Date: Wed, 03 Aug 2016 19:27:11 GMT
X-Varnish: 877421307 877181026
Age: 35199
Via: 1.1 varnish
X-Cache: HIT
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
Content-Security-Policy-Report-Only: default-src *; img-src * data:; style-src * 'unsafe-inline'; script-src * 'unsafe-inline'; frame-ancestors *; report-uri
And also for prior versions of the same resource, we see that the Memento-Datetime and Link headers are returned.
# curl -I ""
HTTP/1.1 200 OK
X-Powered-By: PHP/5.4.45-0+deb7u4
X-Content-Type-Options: nosniff
Memento-Datetime: Thu, 21 Jul 2011 22:24:53 GMT
Link: <>; rel="original latest-version",
 <>; rel="timegate",
 <>; rel="timemap"; type="application/link-format"; from="Mon, 14 Mar 2011 19:25:12 GMT"; until="Thu, 21 Jul 2011 22:24:53 GMT",
 <>; rel="first memento"; datetime="Mon, 14 Mar 2011 19:25:12 GMT",
 <>; rel="last memento"; datetime="Thu, 21 Jul 2011 22:24:53 GMT"
Content-language: en
Vary: Accept-Encoding,Cookie
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Cache-Control: private, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8
Content-Length: 24966
Accept-Ranges: bytes
Date: Sat, 06 Aug 2016 19:12:58 GMT
X-Varnish: 878886405
Age: 0
Via: 1.1 varnish
X-Cache: MISS
Strict-Transport-Security: max-age=15552000; includeSubdomains; preload
Content-Security-Policy: upgrade-insecure-requests
Content-Security-Policy-Report-Only: default-src *; img-src * data:; style-src * 'unsafe-inline'; script-src * 'unsafe-inline'; frame-ancestors *; report-uri
For more information on the extension, we suggest consulting its GitHub and MediaWiki sites.


Since its inception, we have identified many use cases for Memento, from reconstructing web pages from many existing archives to avoiding spoilers in fiction to managing the temporal nature of semantic web data. We are happy that the W3C has adopted Memento for use in their work as well.
Even though the W3C maintains the Apache server holding mementos and original resources, and LANL maintains the systems running the W3C TimeGate software, it is the relations within the Link headers that tie everything together. It is an excellent example of the harmony possible with meaningful Link headers. Memento allows users to negotiate in time with a single web standard, making web archives, semantic web resources, and now W3C specifications all accessible the same way. Memento provides a standard alternative to a series of implementation-specific approaches.
We have been trying to bring Memento support to Wikipedia for the past few years: demonstrating the technology at conferences, working with their development team, and even getting direct feedback on the software from MediaWiki developers such as LegoTKM, Jeroen De Dauw, and ricordisamoa. Unfortunately, we have so far been unsuccessful in getting it deployed to Wikipedia. Perhaps they can be our next major adopter?

Herbert Van de Sompel
- and -
Harihar Shankar
- and -