Monday, January 23, 2017

2017-01-23: Finding URLs on Twitter - A simple recommendation


A prompt from Twitter indicating no search results
As part of a research experiment, I needed to find URLs embedded in tweets via Twitter's web search service. Most of the URLs were much older than 7 days, so the Twitter search API was not an option (it only searches a sample of tweets published in the past 7 days); instead, I used the web search service.
I began the experiment by pasting URLs from tweets into the search box on twitter.com:
Searching Twitter for a URL by pasting the URL into the search box
I noticed I was able to find some URLs embedded in tweets, but this was not always the case. Based on my observations, finding the URLs was not correlated with the age of the tweet. I discussed this observation with Ed Summers and he recommended adding a "url:" prefix to the URL before searching. For example, if the search URL is: 
      "http://www.cnn.com", 
he recommended searching for
      "url:http://www.cnn.com"
I observed that prepending search URLs with the "url:" prefix improved my search success rate. For example, the search URL: "http://www.motherjones.com/environment/2016/09/dakota-access-pipeline-protest-timeline-sioux-standing-rock-jill-stein" was not found except with the "url:" prefix.
Example of a URL that was not found except with the "url:" parameter
Example of a URL that was not found with the "url:" parameter, but found without
Based on these observations, and considering that there was no apparent protocol switching or URL canonicalization, I scaled up the experiment to gain better insight into this search behavior. I wanted to know the proportion of URLs that are:
  1. found exclusively with the "url:" prefix
  2. found exclusively without the "url:" prefix
  3. found both with and without the "url:" prefix.
I issued 3,923 URL queries to Twitter and observed the following proportions:
  1. Count of URLs found exclusively with the "url:" prefix: 1,519
  2. Count of URLs found exclusively without the "url:" prefix: 129
  3. Count of URLs found both with and without the "url:" prefix: 853
  4. Count of URLs not found: 1,422
My initial non-automated tests gave the false impression that the "url:" prefix was the only consistent method for finding URLs embedded in tweets, but these test results show that even though the "url:" prefix search method has a higher hit rate (2,372 of the 3,923 URLs, about 60%, were found with the prefix, versus 982, about 25%, without it), it is not sufficient on its own.
Consequently, to find a URL "U" via Twitter web search, I recommend first searching for "url:U" and, if "U" is not found, then searching for U; combining the two queries promises a higher hit ratio.
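
The following Python sketch (my own illustration, not code from the experiment) builds the two web-search queries in the recommended order; is_found is a hypothetical placeholder for however the caller decides that Twitter's web search returned a matching tweet:

# A minimal sketch of the recommended query order: try the "url:" prefix
# first, then fall back to the bare URL.
from urllib.parse import quote

def twitter_search_urls(url):
    return [
        "https://twitter.com/search?q=" + quote("url:" + url, safe=""),
        "https://twitter.com/search?q=" + quote(url, safe=""),
    ]

def find_tweet_query(url, is_found):
    # is_found(query_url) is a caller-supplied check for whether the web
    # search returned any matching tweets (hypothetical placeholder).
    for query_url in twitter_search_urls(url):
        if is_found(query_url):
            return query_url
    return None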
--Nwala

Friday, January 20, 2017

2017-01-20: CNN.com has been unarchivable since November 1st, 2016

CNN.com has been unarchivable since 2016-11-01T15:01:31, at least by the common web archiving systems employed by the Internet Archive, archive.is, and webcitation.org. The last known correctly archived page in the Internet Archive's Wayback Machine is from 2016-11-01T13:15:40, with all versions since then producing some kind of error (including today's, 2017-01-20T09:16:50). This means that the most popular web archives have no record of the time immediately before the presidential election through at least today's presidential inauguration.
Given the political controversy surrounding the election, one might conclude this is part of some grand conspiracy equivalent to those found in the TV series The X-Files. But rest assured, this is not the case; the page was archived as is, and the reasons behind the archival failure are not as fantastical as those found in the show. As we will explain below, other archival systems have successfully archived CNN.com during this period (e.g., Perma.cc).

To begin the explanation of this anomaly, let's consider the raw HTML of the memento on 2016-11-01T15:01:31. At first glance, the HTML appears normal, with few apparent differences (disregarding the Wayback-injected tags) from the live web when comparing the two using only the browser's view-source feature. Only by looking closely at the body tag will you notice something out of place: the body tag has several CSS classes applied to it, one of which seems oddly suspicious.

<body class="pg pg-hidden pg-homepage pg-section domestic t-light">

The class that should jump out is pg-hidden, which is defined in the external style sheet page.css. Its definition, seen below, can be found on lines 28625-28631.
.pg-hidden { display: none }
As the definition is extremely simple, a quick fix would be to remove it. So let's remove it.


What is revealed after removing the pg-hidden class is a skeleton page, i.e., a template page sent by the server that relies on client-side JavaScript to do the bulk of the rendering. A hint confirming this can be found in the number of errors thrown when loading the archived page.
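
Before walking through those errors, here is a quick way (a sketch of my own, not part of the original analysis; it assumes the requests and beautifulsoup4 packages) to check whether a given capture serves this hidden skeleton:

# Fetch the archived HTML and check whether the body tag carries the
# pg-hidden class that causes the whiteout. The memento URL is constructed
# from the 2016-11-01T15:01:31 timestamp discussed above.
import requests
from bs4 import BeautifulSoup

def is_whiteout(memento_url):
    html = requests.get(memento_url, timeout=30).text
    body = BeautifulSoup(html, "html.parser").body
    return body is not None and "pg-hidden" in (body.get("class") or [])

print(is_whiteout("https://web.archive.org/web/20161101150131/http://www.cnn.com/"))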


The first error occurs when JavaScript attempts to change the domain property of the document.

Uncaught DOMException: Failed to set the 'domain' property on 'Document':
'cnn.com' is not a suffix of 'web.archive.org'. at (anonymous) @ (index):8 

This is commonly done to allow a page on a subdomain to load resources from another page on the superdomain (or vice versa) in order to avoid cross-origin restrictions. In the case of cnn.com, it is apparent that this is done in order to communicate with their CDN (content delivery network) and several embedded iframes in the page (more on this later). To better understand this consider the following excerpt about Same-origin policy from the MDN (Mozilla Developer Network):
A page may change its own origin with some limitations. A script can set the value of document.domain to a suffix of the current domain. If it does so, the shorter domain is used for subsequent origin checks. For example, assume a script in the document at http://store.company.com/dir/other.html executes the following statement:
document.domain = "company.com";
After that statement executes, the page would pass the origin check with http://company.com/dir/page.html. However, by the same reasoning, company.com could not set document.domain to othercompany.com.
There are four other exceptions displayed in the console from three JavaScript files (brought in from the CDN)
  • cnn-header.e4a512e…-first-bundle.js
  • cnn-header-second.min.js
  • cnn-footer-lib.min.js 
that further indicate that JavaScript is loading and rendering the remaining portions of the page.

Seen below is the relevant portion of JavaScript that does not get executed after the document.domain exception.

This portion of code sets up the global CNN object with the necessary information on how to load the assets for each section (zone) of the page and the manner in which to load them. What is not shown are the configurations for the sections, i.e., the explicit definitions of the content contained in them. This is important because these definitions are never added to the global CNN object: the exception thrown above (at window.document.domain) halts execution of the remaining portions of the script tag before reaching them. Shown below is another inline script, further in the document, which does a similar setup.
This tag defines how the content model (the news stories contained in the sections) is to be loaded, along with further assets to load. This code block does get executed in its entirety, which is important to note because the "lazy loading" definitions seen in the previous code block are added here. Since the content is declared to be lazily loaded (loadAllZonesLazy), and the previous code block's definitions were never added to the global CNN object, the portion of JavaScript responsible for revealing the page will not execute. The section of code (from cnn-footer-lib.min.js) that does the reveal is seen below.

As you can see, the reveal code depends on two things: the zone configuration defined in the section of code that was not executed, and information added to the global CNN object by the cnn-header files responsible for constructing the page. These files (along with the other cnn-*.js files) were un-minified and their assignments to the global CNN object reconstructed to make this determination. For those interested, the results of this process can be viewed in this gist.

At this point, you must be wondering what changed between the time when the CNN archives could be viewed via the Wayback Machine and now. These changes can be summarized by considering the relevant code sections from the last correctly archived memento on 2016-11-01T13:15:40, seen below.

In the non-whiteout archives, CNN did not require all zones to be lazily loaded, and intelligent loading was not enabled. From this, we can assume the page did not wait for the more dynamic sections to begin loading, or to finish loading, before it was shown.

As you can see in the above image of the memento on 2016-11-01T13:15:40, the headline of the page and the first image from the top stories section are visible. The remaining sections of the page are missing, as they are the lazily loaded content. Now compare this to the first incorrectly archived memento on 2016-11-01T15:01:31. The headline and the first image from the top stories are now part of the lazily loaded sections (loadAllZonesLazy); thus, they contain dynamic content. This is confirmed when the pg-hidden CSS class is removed from the body tag, revealing that only the footer of the page is rendered, without any of the contents.

Even today the archival failure is still happening, as seen in the memento on 2017-01-20T16:00:45 below.

In short, the archival failure is caused by changes CNN made to their CDN; these changes are reflected in the JavaScript used to render the homepage. The Internet Archive is not the only archive experiencing the failure; archive.is and webcitation.org are also affected. Viewing a capture on 2016-11-29T23:09:40 from archive.is, the preserved page once again appears to be an about:blank page.
Removing the pg-hidden definition reveals that only the footer is visible, the same result as the Internet Archive's memento on 2016-11-01T15:01:31.
But unlike the Internet Archive's capture, the archive.is capture is only the body of CNN's homepage with the CSS styles inlined (style="...") on each tag. This happens because archive.is does not preserve any of the JavaScript associated with the page and performs the transformation previously described in order to archive it. This means that cnn.com's JavaScript will never be executed when replayed; thus the dynamic contents will not be displayed.
WebCitation, on the other hand, does preserve some of the page's JavaScript, but this is not immediately apparent due to how pages are replayed. When viewing a capture from WebCitation on 2016-11-13T33:51:09, the page appears to be rendered "properly", albeit without any CSS styling.
This happens because WebCitation replays the page using PHP and a frame. The replay frame's PHP script loads the preserved page into the browser; then, any of the preserved CSS and JavaScript is served from another PHP script. However, using this process of serving the preserved contents may not work successfully as seen below.
WebCitation sent the CSS style sheets with the MIME type text/html instead of text/css, which would explain why the page looks as it does. But cnn.com's JavaScript was executed with the same errors that were present when replaying the Internet Archive's capture. This raises the question, "How can we preserve cnn.com, since it is unarchivable, at least by the most commonly used means?"
The solution is not as simple as one may hope, but a preliminary (albeit band-aid) solution would be to archive the page using tools such as WARCreate, Webrecorder, or Perma.cc. These tools are effective since they preserve a fully rendered page along with all network requests made when rendering the page. This ensures that the JavaScript-requested content and rendered sections of the page are replayable. Replaying the page without the effects of that line of code is possible but requires the page to be replayed in an iframe. This method of replay is employed by Ilya Kreymer's PyWb (a Python implementation of the Wayback Machine) and is used by Webrecorder and Perma.cc.
This is a fairly old hack used to avoid cross-origin restrictions. The guest page, brought in through the iframe, is free to set document.domain, thus allowing the offending line of code to execute without issue. A more detailed explanation can be found in this blog post, but the proof is in the pudding by preservation and replay. I have created an example collection through Webrecorder that contains two captures of cnn.com.
The first is named "Using WARCreate", which used WARCreate for preservation on 2017-01-18T22:59:43, and the second is named "Using Webrecorder", which used Webrecorder's recording feature as the preservation means on 2017-01-13T04:12:34.

A capture of cnn.com on 2017-01-19T16:57:05 using Perma.cc for preservation is also available for replay here.

All three captures are replayed using PyWb, and when bringing up the console, the document.domain exception is no longer seen.
The CNN archival failure highlights some of the issues faced when preserving online news and was a topic addressed at Dodging the Memory Hole 2016. The whiteout, a function of the page itself and not of the archives, raises two questions: "Is using web browsers for archiving the only viable option?" and "How much modification of the page is required in order to make replay feasible?"

- John Berlin

Sunday, January 15, 2017

2017-01-15: Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data"

Example: original URI vs. trusty URI
Based on the paper:

Kuhn, T., Dumontier, M.: Trusty URIs: Verifiable, immutable, and permanent digital artifacts for linked data. Proceedings of European Semantic Web Conference (ESWC) pp. 395–410 (2014).

A trusty URI is a URI that contains a cryptographic hash value of the content it identifies. The authors introduced this technique of using trusty URIs to make digital artifacts, especially those related to scholarly publications, immutable, verifiable, and permanent. With the assumption that a trusty URI, once created, is linked from other resources or stored by a third party, it becomes possible to detect whether the content that the trusty URI identifies has been tampered with or manipulated in transit (e.g., trusty URIs can prevent man-in-the-middle attacks). In addition, trusty URIs can verify the content even if it is no longer found at the original URI but can still be retrieved from other locations, such as Google's cache or web archives (e.g., the Internet Archive).

The core contribution of this paper is the ability to create trusty URIs for different kinds of content. Two modules are proposed: in module F, the hash is calculated on the byte-level file content, while in module R it is calculated on RDF graphs. The paper introduced an algorithm to generate the hash value for RDF graphs independent of any serialization syntax (e.g., N-Quads or TriX). Moreover, they investigated how trusty URIs work on structured documents (nanopublications). Nanopublications are small RDF graphs (named graphs -- one of the main concepts of the Semantic Web) that describe information about scientific statements. The nanopublication, itself a named graph, consists of multiple named graphs: the "assertion" has the actual scientific statement, like "malaria is transmitted by mosquitos" in the example below; the "provenance" has information about how the statement in the "assertion" was originally derived; and the "publication information" has information like who created the nanopublication and when.

A nanopublication: basic elements from http://nanopub.org
Nanopublications may cite other nanopublications, resulting in a complex citation tree. Trusty URIs are designed not only to validate nanopublications individually but also to validate the whole citation tree. The nanopublication example shown below, which is about the statement "malaria is transmitted by mosquitos", is from the paper "The anatomy of a nanopublication" and is in TriG format:

@prefix swan: <http://swan.mindinformatics.org/ontologies/1.2/pav.owl> .
@prefix cw: <http://conceptwiki.org/index.php/Concept> .
@prefix swp: <http://www.w3.org/2004/03/trix/swp-1/>.
@prefix : <http://www.example.org/thisDocument#> .

:G1 = { cw:malaria cw:isTransmittedBy cw:mosquitoes }
:G2 = { :G1 swan:importedBy cw:TextExtractor,
:G1 swan:createdOn "2009-09-03"^^xsd:date,
:G1 swan:authoredBy cw:BobSmith }
:G3 = { :G2 ann:assertedBy cw:SomeOrganization }

In addition to the two modules, they are planning to define new modules for more types of content (e.g., hypertext/HTML) in the future.

The example below illustrates the general structure of trusty URIs:



The artifact code, everything after r1, is the part that makes this URI a trusty URI. The first character in this code (R) identifies the module; in the example, R indicates that this trusty URI was generated from an RDF graph. The second character (A) specifies the version of the module. The remaining characters (5..0) represent the hash value of the content. All hash values are generated with the SHA-256 algorithm. I think it would be more useful to allow users to select any preferred cryptographic hash function instead of enforcing a single one. This might result in adding more characters to the artifact code to represent the selected hash function. The InterPlanetary File System (IPFS), for example, uses Multihash as a mechanism to prefix the resulting hash value with an id that maps to a particular hash function. Similar to trusty URIs, resources in the IPFS network are addressed based on hash values calculated on the content. For instance, the first two characters "Qm" in the IPFS address "/ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V" indicate that SHA-256 is the hash function used to generate the hash value "ZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V".
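
To make that structure concrete, here is a rough Python sketch of my own (assuming the artifact code is the URL-safe Base64 encoding of the SHA-256 digest, with the padding stripped; the trustyuri libraries are the authoritative implementations) of how an artifact code could be assembled, here for module F (byte-level file content):

# Sketch only: module character, module version, then the URL-safe Base64
# encoding of the SHA-256 digest of the file's bytes.
import base64
import hashlib

def module_f_artifact_code(path, module="F", version="A"):
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    hash_chars = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return module + version + hash_chars

# e.g., module_f_artifact_code("tpdl-2015.pdf") returns a 45-character code
# such as "FA..." that gets appended to the file name and URI.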

Here are some differences between the approach of using trusty URIs and other related ideas as mentioned in the paper:

  • Trusty URIs can be used to identify and verify resources on the web, while in systems like the Git version control system, hash values are there to verify commits within Git repositories only. The same applies to IPFS, where hashes in addresses (e.g., /ipfs/QmZTR5bcpQD7cFgTorqxZDYaew1Wqgfbd2ud9QqGPAkK2V) are used to verify files within the IPFS network only.
  • Hashes in trusty URIs can be generated for different kinds of content, while in Git or ni-URIs, hash values are computed only on the byte level of the content.
  • Trusty URIs support self-references (i.e., when trusty URIs are included in the content).

The same authors published a follow-up version of their ESWC paper ("Making digital artifacts on the web verifiable and reliable") in which they described in some detail how to generate trusty URIs for content of type RA (multiple RDF graphs) and RB (a single RDF graph; RB was not included in the original paper). In addition, in this more recent version, they graphically described the structure of trusty URIs.

While calculating the hash value for content of type F (byte-level file content) is a straightforward task, multiple steps are required to calculate the hash value for content of type R (RDF graphs), such as converting any serialization (e.g., N-Quads or TriG) into RDF triples, sorting the RDF triples lexicographically, serializing the graph into a single string, replacing newline characters with "\n", and dealing with self-references and blank nodes.
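
The sketch below is my own simplification of those steps using rdflib; it omits quads, blank-node handling, self-references, and the exact escaping rules, so it will not reproduce the library's actual artifact codes, but it illustrates the gist of serialization-independent hashing:

# Parse an RDF file, serialize each triple in a syntax-independent way, sort
# the lines lexicographically, and hash the result (simplified illustration).
import base64
import hashlib
from rdflib import Graph

def sketch_module_r_code(rdf_file, rdf_format="trix"):
    graph = Graph()
    graph.parse(rdf_file, format=rdf_format)
    lines = sorted(
        " ".join(term.n3() for term in triple) + " .\n" for triple in graph
    )
    digest = hashlib.sha256("".join(lines).encode("utf-8")).digest()
    return "RA" + base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")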

To evaluate their approach, the authors used the Java implementation to create trusty URIs for 156,026 small structured data files (nanopublications) in different serialization formats (N-Quads and TriX). When these files were tested, again using the Java implementation, all of them were successfully verified as matching their trusty URIs. In addition, they tested modified copies of these nanopublications. Results are shown in the figure below:


Examples of using trusty URIs:

[1] Trusty URI for byte-level content

Let's say that I have published my paper on the web at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.pdf, and somebody links to it or saves the link somewhere. Now, if I intentionally (or not) change the content of the paper, for example, by modifying some statistics, adding a chart, correcting a typo, or even replacing the PDF with something completely different (read about content drift), anyone who downloads the paper after these changes by dereferencing the original URI will not be able to tell that the original content has been tampered with. Trusty URIs may solve this problem. For testing, I used Trustyuri-python, the Python implementation, to first generate the artifact code for the PDF file "tpdl-2015.pdf":

%python ProcessFile.py tpdl-2015.pdf

The file (tpdl-2015.pdf) is renamed to (tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf), containing the artifact code (FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao) as a part of its name -- in the paper, they call this file a trusty file. Finally, I published this trusty file on the web at the trusty URI (http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf). Anyone with this trusty URI can verify the original content using the library Trustyuri-python, for example:

%python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao

As you can see, the output "Correct hash: FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao" indicates that the hash value in the trusty URI is identical to the hash value of the content, which means that this resource contains the correct, desired content.
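
In essence, the check recomputes the artifact code from the downloaded bytes and compares it with the code embedded in the URI. A self-contained sketch of that idea (my own, under the same URL-safe Base64/SHA-256 assumption as above, not the trustyuri library itself):

# Extract the module-F artifact code from the trusty URI, recompute it from
# the downloaded content, and compare the two.
import base64
import hashlib
import re
import urllib.request

def verify_module_f(trusty_url):
    claimed = re.search(r"FA[A-Za-z0-9_-]{43}", trusty_url).group(0)
    content = urllib.request.urlopen(trusty_url).read()
    recomputed = "FA" + base64.urlsafe_b64encode(
        hashlib.sha256(content).digest()).decode("ascii").rstrip("=")
    return "Correct hash" if claimed == recomputed else "*** INCORRECT HASH ***"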

To see how the library detects any changes in the original content available at http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf, I replaced all occurrences of the number "61" with the number "71" in the content. Here are the commands I used to apply these changes:

%pdftk tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf output tmp.pdf uncompress
%sed -i 's/61/71/g' tmp.pdf
%pdftk tmp.pdf output tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf compress

The figures below show the document before and after the changes:

Before changes
After changes
The library detected that the original resource has been changed:

$python CheckFile.py http://www.cs.odu.edu/~maturban/pubs/tpdl-2015.FAofcNax1YMDFakhRQvGm1vTOcCqrsWLKeeICh9gqFVao.pdf
*** INCORRECT HASH ***

[2] Trusty URI for RDF content

I downloaded this nanopublication serialized in XML from "https://github.com/trustyuri/trustyuri-java/blob/master/src/main/resources/examples/nanopub1-pre.xml":




This nanopublication (RDF file) can be transformed into a trusty file using:

$python TransformRdf.py nanopub1-pre.xml http://trustyuri.net/examples/nanopub1

The Python script "TransformRdf.py" performed multiple steps to transform this RDF file into the trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml". The steps, as mentioned above, include generating RDF triples, sorting those triples, handling self-references, etc. The Python library used the second argument "http://trustyuri.net/examples/nanopub1", considered the original URI, to manage self-references by replacing all occurrences of "http://trustyuri.net/examples/nanopub1" with "http://trustyuri.net/examples/nanopub1. " in the original XML file. You may have noticed that this ends with a '.' and a blank space. Once the artifact code is generated, the new trusty file "nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml" is created. In this trusty file, all occurrences of "http://trustyuri.net/examples/nanopub1. " are replaced with "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo". The trusty file is shown below:



To verify this trusty file, we can use the following command, which results in "Correct hash" -- the content is verified to be correct. Again, to handle self-references, the Python library replaces all occurrences of "http://trustyuri.net/examples/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg#pubinfo" with "http://trustyuri.net/examples/nanopub1. " before recomputing the hash.

%python CheckFile.py nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

Or, if the trusty file is published on the web, by the following command:

%python CheckFile.py http://www.cs.odu.edu/~maturban/nanopub1.RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg.xml
Correct hash: RAq2P3suae730r_PPkfdmBhBIpqNO6763sJ0yMQWm6xVg

What we are trying to do with trusty URIs:

We are working on a project, funded by the Andrew W. Mellon Foundation, to automatically capture and archive the scholarly record on the web. One part of this project is to come up with a mechanism through which we can verify the fixity of archived resources, to ensure that these resources have not been tampered with or corrupted. In general, we collect information about the archived resources and generate a manifest file. This file will then be pushed into multiple archives so that it can be used later. Herbert Van de Sompel, from Los Alamos National Laboratory, pointed us to this idea of using trusty URIs to identify and verify web resources. In this way, we have the manifest files to verify archived resources, and trusty URIs to verify these manifests.

Resources:

    --Mohamed Aturban


    Sunday, January 8, 2017

    2017-01-08: Review of WS-DL's 2016

    Sawood and Mat show off the InterPlanetary Wayback poster at JCDL 2016

    The Web Science and Digital Libraries Research Group had a productive 2016, with two Ph.D. students and one M.S. student graduating, one large research grant awarded ($830k), 16 publications, and 15 trips to conferences, workshops, hackathons, etc.

    For student graduations, we had:
    Other student advancements:
    We had 16 publications in 2016:

     
    In late April, we had Herbert, Harish Shankar, and Shawn Jones visit from LANL.  Herbert has been here many times, but this was the first visit to Norfolk for Harish.  It was also on this visit that Shawn did his breadth exam.


    In addition to the fun road trip to JCDL 2016 in New Jersey (which included beers on the Cape May-Lewes Ferry!), our group traveled to:
    WS-DL at JCDL 2016 Reception in Newark, NJ
    Alex shows off his poster at JCDL 2016
    Although we did not travel to San Francisco for the 20th Anniversary of the Internet Archive, we did celebrate locally with tacos, DJ Spooky CDs, and a series of tweets & blog posts about the cultural impact and importance of web archiving.  This was in solidarity with the Internet Archive's gala which featured taco trucks and a lecture & commissioned piece by Paul Miller (aka DJ Spooky). We write plenty of papers, blog posts, etc. about technical issues and the mechanics of web archiving, but I'm especially proud of how we were able to assemble a wide array of personal stories about the social impact of web archiving.  I encourage you to take the time to go through these posts:


    We had only one popular press story about our research this year, with Tech.Co's "You Can’t Trust the Internet to Continue Existing" citing Hany SalahEldeen's 2012 TPDL paper about the rate of loss of resources shared via Twitter.

    We released several software packages and data sets in 2016:
    In April we were extremely fortunate to receive a major research award, along with Herbert Van de Sompel at LANL, from the Andrew Mellon Foundation:
    This project will address a number of areas, including: Signposting, automated assessment of web archiving quality, verification of archival integrity, and automating the archiving of non-journal scholarly output.  We will soon be releasing several research outputs as a result of this grant.

    WS-DL reviews are also available for 2015, 2014, and 2013.  We're happy to have graduated Greg, Yasmin, and Justin; and we're hoping that we can get Erika back for a PhD after her MS is completed.  I'll close with celebratory images of me (one dignified, one less so...) with Dr. AlNoamany and Dr. Brunelle; may 2017 bring similarly joyous and proud moments.

    --Michael



    Saturday, January 7, 2017

    2017-01-07: Two WS-DL Classes Offered for Spring 2017



    Two WS-DL classes are offered for Spring 2017:

    Information Visualization is being offered both online (CRNs 26614/26617 (HR), 26615/26618  (VA), 26616/26619 (US)) and on-campus (CRN 24698/24699).  Web Science is offered on-campus only (CRNs 25728/25729).  Although it's not a WS-DL course per se, WS-DL member Corren McCoy is also teaching CS 462/562 Cybersecurity Fundamentals this semester (see this F15 offering from Dr. Weigle for an idea about its contents).

    --Michael

    Tuesday, December 20, 2016

    2016-12-20: Archiving Pages with Embedded Tweets

    I'm from Louisiana and used Archive-It to build a collection of webpages about the September flood there (https://www.archive-it.org/collections/7760/).

    One of the pages I came across, Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy', highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page (click for full page):

    Live page, screenshot generated on Sep 9, 2016

    To me, the most important resources here were the tweets and their pictures, so I'll focus on how well they were archived.

    First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."

    Here's the first embedded tweet (https://twitter.com/vernonernst/status/765398679649943552), with a picture of a long line of trucks pulling their boats to join the Cajun Navy.
    First embedded tweet - live web

    Here's the source for this embedded tweet:
    <blockquote class="twitter-tweet" data-width="500"> <p lang="en" dir="ltr">
    <a target="_blank" href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a> </p> — Vernon Ernst (@vernonernst) <a href="https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a> </blockquote>
    <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

    When the widgets.js script executes in the browser, it transforms the <blockquote class="twitter-tweet"> element into a <twitterwidget>:
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552">

    Now, let's consider how the various archives handle this.

    Archive-It

    Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

    http://wayback.archive-it.org/7760/20160818180453/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/
    Archive-It, captured on Aug 18, 2016
    Here's how the first embedded tweet displayed in Archive-It:
    Embedded tweet as displayed in Archive-It


    Here's the source (as rendered in the DOM) upon playback in Archive-It's Wayback Machine:
    <blockquote class="twitter-tweet twitter-tweet-error" data-conversation="none" data-width="500" data-twitter-extracted-i1479916001246582557="true">
    <p lang="en" dir="ltr"> <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/CajunNavy?src=hash" target="_blank" rel="external nofollow">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/hashtag/LouisianaFlood?src=hash" target="_blank" rel="external nofollow">#LouisianaFlood</a> <a href="http://wayback.archive-it.org/7760/20160818180453/https://t.co/HaugQ7Jvgg" target="_blank" rel="external nofollow">pic.twitter.com/HaugQ7Jvgg</a> </p> <p>— Vernon Ernst (@vernonernst) <a href="http://wayback.archive-it.org/7760/20160818180453/https://twitter.com/vernonernst/status/765398679649943552" target="_blank" rel="external nofollow">August 16, 2016</a></p></blockquote>
    <p> <script async="" 
    src="//wayback.archive-it.org/7760/20160818180453js_/http://platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script> </p>

    Except for the links being re-written to point to the archive, this is the same as the original embed source, rather than the transformed version.  Upon playback, although widgets.js was archived (http://wayback.archive-it.org/7760/20160818180456js_/http://platform.twitter.com/widgets.js?4fad35), it is not able to modify the DOM as it does on the live web (widgets.js loads additional JavaScript that was not archived).

    webrecorder.io

    Next up is the on-demand service, webrecorder.io. Webrecorder.io is able to replay the embedded tweets as on the live web.

    https://webrecorder.io/mweigle/south-louisiana-flood---2016/20160909144135/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/

    Webrecorder.io, viewed Sep 29, 2016

    The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for re-written links):
    <blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world!  <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a></p>&mdash; Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote>
    <script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

    Upon playback, we see that webrecorder.io is able to fully execute the widgets.js script, so the transformed HTML looks like the live web (with the inserted <twitterwidget> element):
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;" data-tweet-id="765398679649943552"></twitterwidget>
    <script async="" src="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>

    Note that widgets.js is archived and is loaded from webrecorder.io, not the live web.

    archive.is

    archive.is is another on-demand archiving service.  As with webrecorder.io, the embedded tweets are shown as on the live web.

    http://archive.is/5JcKx
    archive.is, captured Sep 9, 2016

    archive.is executes and then flattens JavaScript, so although the embedded tweet looks similar to how it's rendered in webrecorder.io and on the live web, the source is completely different:
    <article style="direction:ltr;display:block;">
    ...

    <a href="https://archive.is/o/5JcKx/twitter.com/vernonernst/status/765398679649943552/photo/1" style="color:rgb(43, 123, 185);text-decoration:none;display:block;position:absolute;top:0px;left:0px;width:100%;height:328px;line-height:0;background-color: rgb(255, 255, 255); outline: invert none 0px; "><img alt="View image on Twitter" src="http://archive.is/5JcKx/fc15e4b873d8a1977fbd6b959c166d7b4ea75d9d" title="View image on Twitter" style="width:438px;max-width:100%;max-height:100%;line-height:0;height:auto;border-width: 0px; border-style: none; border-color: white; "></a>
    ...

    </article>
    ...
    <blockquote cite="https://twitter.com/vernonernst/status/765398679649943552" style="list-style: none outside none; border-width: medium; border-style: none; margin: 0px; padding: 0px; border-color: white; ">
    ...

    <span>#</span><span>CajunNavy</span></a>
    on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://archive.is/o/5JcKx/https://twitter.com/hashtag/LouisianaFlood?src=hash" style="direction:ltr;background-color: transparent; color:rgb(43, 123, 185);text-decoration:none;outline: invert none 0px; "><span>#</span><span>LouisianaFlood</span></a>
    </p>
    ...
    </blockquote>


    WARCreate

    WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser.  It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

    The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer:

    WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer
    Upon replay, WARCreate is not able to display the tweet at all.  Here's the close-up of where the tweets should be:

    WARCreate capture replayed in webarchiveplayer, with tweets missing
    Examining both the WARC file and the source of the archived page helps to explain what's happening.

    Inside the WARC, we see:
    <h4>In stepped a group known as the <E2><80><9C>Cajun Navy<E2><80><9D>:</h4>
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
    <p><script async="" src="//platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


    This is the same markup that's in the DOM upon replay in webarchiveplayer, except for the script source being rewritten to localhost:
    <h4>In stepped a group known as the “Cajun Navy”:</h4>
    <twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget>
    <p><script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>


    WARCreate captures the HTML after the page has fully loaded.  So what's happening here is that the page loads, widgets.js is executed, the DOM is changed (thus the <twitterwidget> tag), and then WARCreate saves the transformed HTML. But what we don't get is the widgets.js script needed to properly display <twitterwidget>. Our expectation is that with fixes to allow WARCreate to archive the loaded JavaScript, the embedded tweet would be displayed as on the live web.
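
    As a quick way to see which form of the embed a capture preserved, a small script of my own (a sketch, assuming the requests and beautifulsoup4 packages; not one of the tools discussed here) can count the un-transformed blockquotes versus the transformed twitterwidget elements in the archived HTML:

    # Count the original <blockquote class="twitter-tweet"> embeds versus the
    # <twitterwidget> elements that widgets.js inserts after it runs.
    import requests
    from bs4 import BeautifulSoup

    def embedded_tweet_forms(archived_url):
        html = requests.get(archived_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        return {
            "blockquotes": len(soup.select("blockquote.twitter-tweet")),
            "twitterwidgets": len(soup.find_all("twitterwidget")),
        }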

    Discussion
     
    Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.
    • Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML
    • Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser
    • Archive.is - fully loads the webpage, executes JavaScript, rewrites the resulting HTML, and archives the rewritten HTML
    • WARCreate - fully loads the webpage, executes JavaScript, and archives the transformed HTML
    It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media.  Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.

    -Michele