Tuesday, April 24, 2018

2018-04-24: Why we need multiple web archives: the case of blog.reidreport.com

This story started in December 2017 with Joy-Ann Reid (of MSNBC) apologizing for "insensitive LGBT blog posts" that she wrote on her blog many years ago when she was a morning radio talk show host in Florida. This apology was, at least in some quarters, (begrudgingly) accepted. Today's update was news that Reid and her lawyers had claimed in December that her blog, the Internet Archive's record of the blog, or both had been hacked (Mediaite, The Intercept). Later today, the Internet Archive issued a blog post denying the claim that it was hacked, stating:
This past December, Reid’s lawyers contacted us, asking to have archives of the blog (blog.reidreport.com) taken down, stating that “fraudulent” posts were “inserted into legitimate content” in our archives of the blog. Her attorneys stated that they didn’t know if the alleged insertion happened on the original site or with our archives (Reid’s claim regarding the point of manipulation is still unclear to us).
At some point after our correspondence, a robots.txt exclusion request specific to the Wayback Machine was placed on the live blog. That request was automatically recognized and processed by the Wayback Machine and the blog archives were excluded, unbeknownst to us (the process is fully automated). The robots.txt exclusion from the web archive remains automatically in effect due to the presence of the request on the live blog.   
Checking the Internet Archive for robots.txt, we can see that on 2018-02-16 blog.reidreport.com had a standard robots.txt page that blocked the admin section of WordPress, but by 2018-02-21 they had a version that blocked all robots, and as of today (2018-04-24) they had a version that specifically blocked only the Internet Archive's crawler ("ia_archiver").  As of about 5pm EDT, the robots.txt file had been removed (probably because of the Internet Archive's blog post calling out the presence of the robots.txt; cf. a similar situation in 2013 with the Conservative Party in the UK), but it may take a while for the Internet Archive to register its absence.
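
To illustrate how the three robots.txt variants described above would treat the Internet Archive's crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt bodies below are my own illustrative reconstructions from the description, not the actual files served by blog.reidreport.com, and the "Googlebot" agent is only used as a contrasting example.

```python
from urllib.robotparser import RobotFileParser

# Illustrative reconstructions only; the real file contents are an assumption.
ADMIN_ONLY = "User-agent: *\nDisallow: /wp-admin/\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_IA = "User-agent: ia_archiver\nDisallow: /\n"


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    home = "http://blog.reidreport.com/"
    for label, body in [("admin-only", ADMIN_ONLY),
                        ("block-all", BLOCK_ALL),
                        ("block-ia", BLOCK_IA)]:
        print(label,
              "ia_archiver:", allowed(body, "ia_archiver", home),
              "Googlebot:", allowed(body, "Googlebot", home))
```

Under these assumed files, only the last variant distinguishes between crawlers: it excludes "ia_archiver" while leaving other robots free to fetch the page.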

2018-04-25 update: Thanks to Peter Sterne for pointing out that www.blog.reidreport.com/robots.txt still exists, even though blog.reidreport.com/robots.txt does not.  They technically can be two different URLs though the convention is for them to canonicalize to the same URL (which is what the Wayback Machine does).  HTTP session info provided below, but the summary is that robots.txt is still in effect and the need for other web archives is still paramount. 
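
As a rough illustration of why the www and non-www URLs can collapse to the same record, the sketch below reduces a URL to a lookup key by dropping the scheme and a leading "www.". This is a deliberately simplified stand-in of my own; it is not the Wayback Machine's actual canonicalization code, and the function name `canonical_key` is hypothetical.

```python
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Reduce a URL to a simplified lookup key (host without 'www.' plus path).

    Simplified assumption: scheme and a leading 'www.' are ignored, so
    http/https and www/non-www variants map to the same key.
    """
    if "://" not in url:
        url = "http://" + url  # tolerate scheme-less input such as the URLs in this post
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + (parts.path or "/")


if __name__ == "__main__":
    for u in ("www.blog.reidreport.com/robots.txt",
              "blog.reidreport.com/robots.txt",
              "https://blog.reidreport.com/robots.txt"):
        print(u, "->", canonical_key(u))
```

With a rule like this, a robots.txt that survives at any one of the variants would be treated as applying to all of them.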

Until the Internet Archive begins serving blog.reidreport.com again, this is a good time to remind everyone that there are web archives other than the Internet Archive.  The screen shot above shows the Memento Time Travel service, which searches about 26 public web archives.  In this case, it found mementos (i.e., captures of web pages) in five different web archives: Archive-It (a subsidiary of the Internet Archive), Bibliotheca Alexandrina (the Egyptian Web Archive), the National Library of Ireland, the archive.is on-demand archiving service, and the Library of Congress.  For a machine readable service, below I list the TimeMap (list of mementos) generated by our MemGator service; the details aren't important but it is the source of the URLs that will appear next.  
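
For readers who want to work with a TimeMap programmatically, here is a minimal sketch that reads a link-format TimeMap and counts mementos per archive host. The sample text is a hand-written fragment that reuses URI-Ms cited in this post; it is not the actual MemGator output, and the regular expression is a simplification that assumes each entry lists `rel` before `datetime`.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime
from urllib.parse import urlsplit

# Simplified pattern: assumes <uri>; rel="..."; datetime="..." in that order.
ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]*)"\s*;\s*datetime="([^"]+)"')

# Hand-written sample for illustration, not real MemGator output.
SAMPLE = '''<http://blog.reidreport.com/>; rel="original",
<http://webarchive.loc.gov/all/20060608144033/http://blog.reidreport.com/>; rel="first memento"; datetime="Thu, 08 Jun 2006 14:40:33 GMT",
<http://webarchive.loc.gov/all/20060615134635/http://blog.reidreport.com/>; rel="memento"; datetime="Thu, 15 Jun 2006 13:46:35 GMT",
<http://archive.is/20060615134635/http://blog.reidreport.com/>; rel="last memento"; datetime="Thu, 15 Jun 2006 13:46:35 GMT",
'''


def mementos(timemap_text):
    """Yield (URI-M, memento-datetime) pairs for entries whose rel includes 'memento'."""
    for uri, rel, dt in ENTRY.findall(timemap_text):
        if "memento" in rel.split():
            yield uri, parsedate_to_datetime(dt)


def count_by_archive(timemap_text):
    """Count mementos per archive hostname."""
    return Counter(urlsplit(uri).hostname for uri, _ in mementos(timemap_text))


if __name__ == "__main__":
    for host, n in count_by_archive(SAMPLE).items():
        print(host, n)
```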

Beginning with the original tweets by @Jamie_Maz (2017-11-30 thread, 2018-04-18 thread), I scanned through the screen shots (no URLs were given) and looked for screen shots that had definitive datetimes (most images did not have them). The datetimes are listed below; the ones for which we have evidence are in bold, and the ones that we inferred by matching text are marked with "(inferred)":

2006-01-20 (inferred)
2006-06-13 (inferred)

Most of those dates are pretty early in web archiving times, when the Internet Archive was the only archive commonly available, and many (all?) of the mementos in other web archives were surely originally crawled by the Internet Archive, even if on a contract basis (e.g., for the Library of Congress).  Nonetheless, with multiple copies geographically and administratively dispersed throughout the globe, an adversary would have had to hack multiple web archives and alter their contents (cf. lockss.org), or have hacked the original site (blog.reidreport.com) approximately 12 years ago for adulterated pages to have been hosted at all the different web archives.  While both scenarios are technically possible, they are extraordinarily unlikely.  

While we don't know the totality of the hacking claims, we can offer five archived web pages, hosted at the Library of Congress web archive (webarchive.loc.gov), that corroborate at least some of @Jamie_Maz's claims.


Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20060125004941/http://blog.reidreport.com/ 


Evidence for this tweet can be found at (approximately 2/3 down): http://webarchive.loc.gov/all/20060608144033/http://blog.reidreport.com/


I'm not sure this evidence maps directly to one of the tweets, but it fits the general theme of anti-Charlie Crist: http://webarchive.loc.gov/all/20060615134635/http://blog.reidreport.com/

This memento also exists at archive.is; it is a copy of the Internet Archive's copy but it is not blocked by robots.txt because it is in another archive: http://archive.is/20060615134635/http://blog.reidreport.com/


Evidence for this tweet can be found at (approximately midway): http://webarchive.loc.gov/all/20061010125903/http://blog.reidreport.com/


Evidence for this tweet can be found at (approximately 1/3 down): http://webarchive.loc.gov/all/20081018020856/http://blog.reidreport.com/ 

In summary, of the many examples that @Jamie_Maz provides, I can find five copies in the Library of Congress's web archive.  These crawls were probably performed on behalf of the Library of Congress by the Internet Archive (for election-based coverage); even though there are many different (and independent) web archives now, in 2006 the Internet Archive was pretty much the only game in town.  Even though these mementos are not independent observations, there is no plausible scenario for these copies to have been hacked in multiple web archives or at the original blog 10+ years ago.  There may be additional evidence in the other web archives, but I haven't exhaustively searched them.
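
The URI-Ms cited above follow the common Wayback-style pattern of a 14-digit capture timestamp followed by the original URL. As a small sketch, the function below splits such a URI-M into its capture datetime and original URI. The pattern is my own simplification that covers the two URL shapes used in this post, and treating the timestamp as UTC is an assumption.

```python
import re
from datetime import datetime, timezone

# Matches http://host/[collection/]YYYYMMDDhhmmss/original-url (simplified assumption).
URI_M = re.compile(r"^https?://[^/]+/(?:[^/]+/)?(\d{14})/(.+)$")


def split_uri_m(uri_m: str):
    """Return (capture datetime, original URI) for a Wayback-style URI-M."""
    match = URI_M.match(uri_m)
    if match is None:
        raise ValueError("not a Wayback-style URI-M: " + uri_m)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    # Assumption: the embedded timestamp is expressed in UTC.
    return captured.replace(tzinfo=timezone.utc), match.group(2)


if __name__ == "__main__":
    for u in ("http://webarchive.loc.gov/all/20060125004941/http://blog.reidreport.com/",
              "http://archive.is/20060615134635/http://blog.reidreport.com/"):
        when, original = split_uri_m(u)
        print(when.isoformat(), original)
```

Comparing the parsed datetimes against the dates visible in the screen shots is one way to decide which capture to inspect first.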

We don't know the full details of what Reid's lawyers alleged, so perhaps there are details that we don't know. But the analysis from the Internet Archive crawl engineers, plus the evidence in separate web archives, suggests that the claim has no merit.

The case of blog.reidreport.com is another example of why we need multiple web archives.  


Thanks to Prof. Michele Weigle and John Berlin for bringing this issue to my attention and uncovering some of the examples.   

Memento TimeMap for blog.reidreport.com:

2018-04-25 update: As noted above, Peter Sterne brought to my attention that the non-standard URL of www.blog.reidreport.com/robots.txt still exists (and is blocking "ia_archiver") even though the more standard blog.reidreport.com/robots.txt is 404. 

Another 2018-04-25 update: The NYT has covered the story ("MSNBC Host Joy Reid Blames Hackers for Anti-Gay Blog Posts, but Questions Mount"), and there was an interview with Reid's computer security expert ("Should We Believe Joy Reid’s Blog Was Hacked? This Security Consultant Says We Should"), Jonathon Nichols.  

 I embed a statement from Nichols (released by Erik Wemple), and a tweet from Nichols clarifying that they were not suggesting that Wayback Machine's mementos were hacked, but rather the hacked blog was crawled by the Internet Archive.  

This is where it's important to note that there may be a discrepancy between the posts that Nichols is concerned with and those that @Jamie_Maz surfaced. There is (semi-)independent evidence of @Jamie_Maz's pages, with the ultimate implication that for those pages to have been the result of a hack, blog.reidreport.com would have had to have been hacked as many as 12 years ago -- and for nobody to have noticed at the time.

Reid (& Nichols) could always unblock the Internet Archive and share the evidence of the hack. 

Yet another 2018-04-25 update: Apparently there are some holes in the http vs. https canonicalization wrt robots.txt blockage, allowing some of the posts to surface. Here's an example (via @YanceyMc):

Also, @wvualphasoldier deleted his tweets then protected his account, so that's the reason the above embed no longer formats correctly. 

2018-04-24: Let's Get Visual and Examine Web Page Surrogates

Why visualize individual web pages? A variety of visualizations of individual web pages exist, but why do we need them when we can just choose a URI from a list and put it in our web browser? URIs are intended to be opaque: text from the underlying web resource does not need to exist in the URI.

Consider http://dx.doi.org/10.1007/s00799-016-0200-8. Where does it go? Should we click on it? What content exists under the veil of the URI? Will it meet our needs?

Now consider this web page surrogate produced by embed.ly for the same URI:

Avoiding spoilers: wiki time travel with Sheldon Cooper

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if...
If we were looking for research papers about avoiding spoilers for TV shows, then we know that clicking on this surrogate will take us to something that meets our information needs. If we were searching for marine mammals, then this surrogate shows us that the underlying page will not be very satisfying. In this case, the surrogate is intended to give the user enough information to answer the question: should I click on this?

Last year, when I reviewed a number of live web curation and social media tools, I was primarily focused on tools that produce social cards like the one above. This was because social cards appeared to be the lingua franca of web page surrogates. Social cards are not the only surrogate in use today and definitely not the only surrogate evaluated in literature. In this post, I cover several surrogates that have been evaluated and then talk about the studies in which they played a part. I was curious as to which surrogate might be best for collections of mementos.

Different Web Page Surrogates

Text Snippet

Text snippets are one of the earliest surrogates. They only require fetching a given web page before selecting the text to be used in the snippet. The text selection can be done via many different methods like El-Beltagy's "KP-Miner: A Keyphrase Extraction System for English and Arabic Documents" and Chen's "A Practical System of Keyphrase Extraction for Web Pages". Text snippets are typically used by search engines for displaying results.

The Google search result text snippet for Michele Weigle's ODU CS page.
The Bing search result text snippet for Michele Weigle's ODU CS homepage. Note that Bing did not capture the last modified date, but does list a series of links on the bottom of the snippet, drawn from the menu of the homepage.
The DuckDuckGo search result for Michele Weigle's ODU CS homepage. Note that DuckDuckGo displays the favicon and generates a different text snippet from Google and Bing.

In the above search results for Michele Weigle's ODU CS homepage, the text snippets are slightly different depending on the search engine. Because there is a lot of variation in web pages, there are a lot of possibilities when building text snippets.

Text snippets still receive a bit of research, with Maxwell evaluating the effectiveness of snippet length in 2017 as part of "A Study of Snippet Length and Informativeness" (university repository copy).

As a group, text snippets are listed one per row on a web page. This is optimal for search results, as the position of the result conveys its relevancy. This format affects how many surrogates can be viewed at once. Where text snippets are viewed one per row, more thumbnails can fit into the same amount of space.


Thumbnail

A thumbnail is produced by loading the given page in a browser and taking a screenshot of the contents of the browser window. Thumbnails have been used in many forms. The Safari web browser uses them to display the content of tabs.

The Safari web browser uses thumbnails to show surrogates for web pages  that are currently loaded in its tabs.
In "Visual preview for link traversal on the World Wide Web", Kopetzky demonstrated that thumbnails could be used to provide a preview of a linked page via a mouseover effect so that users could decide if a link was worth clicking. In "Data Mountain: Using Spatial Memory for Document Management" (Microsoft Research copy), Robertson proposed using a 3D virtual environment for organizing a corpus of web pages where each page is visualized as a thumbnail. Outside of the web, file management tools, such as macOS's Finder, use thumbnails to provide visual previews of documents.

An example of the interface for Data Mountain, a 3D environment for browsing web pages via thumbnails.

macOS Finder displaying thumbnails of file contents.

In the web archiving world, the UK Web Archive uses thumbnails to show a series of mementos so one can compare the content of each memento, effectively viewing the content drift over time. Thumbnails are also used in our own What Did It Look Like?, a platform that animates thumbnails so one can watch the changes to a web page over the years. Our group is also investigating the use of thumbnails for summarizing how a single webpage has changed over time, using three different visualizations: an animation, a grid view, and an interactive timeline view.

The UK Web Archive uses thumbnails to show different mementos for the same resource, allowing the user to view web page changes over time.

What Did It Look Like? allows the user to watch a web page change over time by animating the thumbnails of the mementos of a resource.

The size of thumbnails has a serious effect on their utility. If the thumbnail is too large, it does not provide room for comparison of surrogates. If the thumbnail is too small, users cannot see what is in the image. Thumbnails are also difficult for users to understand if a page consists mostly of text or has no unique features. In "How People Recognize Previously Seen Web Pages from Titles, URLs and Thumbnails", Kaasten established that the optimal thumbnail size is 208x208 pixels.

The viewport of a thumbnail is also an important part of its construction. Depending on what we want to emphasize on a web page, we may need to generate a thumbnail from content "below the fold". Aula evaluated the use of thumbnails that were the same size, but had magnified a portion of a web page at 20% versus 38%. She found that users performed better with thumbnails at a magnification of 20%.

Enhanced Thumbnail

In 2001, Woodruff introduced the enhanced thumbnail in "Using Thumbnails to Search the Web" (author copy). Prior to taking the screenshot of the browser as with a normal thumbnail, the HTML of the page is modified to make certain terms stand out. In the example below, changes in font size and background color emphasize certain terms of a page. The goal is to draw attention to these terms in hopes that search engine users could find relevant pages faster.

Examples of Thumbnails and Enhanced Thumbnails:
(a) Plain thumbnail
(b) Enhanced Thumbnail using HTML modification to emphasize the words "Recipe" and "Pound Cake"
(c) Enhanced Thumbnail using HTML and image modification to make "Recipe" and "Pound Cake" stand out more
(d) Emphasis on "MiniDisc Player"
(e) Emphasis on "hybrid", "car", and "mileage"
(f) Emphasis on "Hellerstein"
(g) Plain thumbnail of a page only consisting of text
(h) Enhanced thumbnail emphasizing specific terms in the text page

Even though enhanced thumbnails have performed well, they are computationally expensive to create. This likely explains why they have not been seen in use outside of laboratory studies.

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali developed something similar by adding a tag cloud to each thumbnail and named the concept a "visual tag".

Internal Image

An internal image is an image embedded within the web page. For some web pages, like news stories and product pages, these internal images can be good surrogates because of their uniqueness. Pinterest uses internal images as surrogates.

Pinterest uses internal images as surrogates for web pages.

The key is identifying which embedded image is best for representing the page. Hu identified the issues with solving this problem as part of "Categorizing Images in Web Documents", identifying a number of features such as using the text surrounding an image and evaluating the number of colors in the image. Maekawa worked on classifying images and achieved an 83.1% accuracy in "Image Classification for Mobile Web Browsing" (conference copy). While these studies provided solutions for classifying images, we really need to know which images are unique and relevant to the web page. Research does exist to address this issue, such as the work described in Li's "Improving relevance judgment of web search results with image excerpts" (conference copy). These solutions are imperfect, which may be why Pinterest and other sites ask the user to choose an image from those embedded in the page.

Visual Snippet

In 2009, Teevan introduced visual snippets as part of "Visual snippets: summarizing web pages for search and revisitation" (Microsoft Research copy, conference slides). Teevan gave 20 web pages to a graphic designer and asked him to generate a small 120x120 image representing each page. She observed a pattern in the resulting images and derived a template to use as a surrogate. These surrogates combine the internal image, placed within the background of the surrogate, with a title running across the top of the page, and a page logo.

Examples of thumbnails on the bottom and their corresponding visual snippets on top.
She used machine learning to choose a good internal image and logo. This is more complex than merely selecting a salient internal image as noted in the previous section. Not only does the visual snippet require two images, but two different types of images.

External Image

In 2010, Jiao put forth the idea of using external images in "Visual Summarization of Web Pages". Jiao notes that detecting the internal image may be difficult if not impossible for some pages. Instead, he suggests using image search engines to find a representative image to use as a surrogate.

A simplified version of his algorithm is:

  1. Extract key phrases from the target web page using Chen's KEX algorithm
  2. Use these phrases as queries for an image search engine
  3. Rerank the search engine results based on textual similarity to the target web page
  4. Choose the top ranked image
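
The steps above can be sketched as follows. This is only an outline under stated assumptions: the key phrase step uses simple term frequency as a stand-in for Chen's KEX algorithm, the image search engine is a caller-supplied function (`search` is hypothetical, as is the `ImageResult` type), and the reranking uses bag-of-words cosine similarity, which may differ from the similarity measure in Jiao's paper.

```python
import math
import re
from collections import Counter
from typing import Callable, List, NamedTuple, Optional


class ImageResult(NamedTuple):
    url: str      # location of the candidate image
    context: str  # text that accompanied the image in the search results


STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}


def tokens(text: str) -> List[str]:
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def key_phrases(page_text: str, k: int = 3) -> List[str]:
    """Step 1 (stand-in for KEX): take the k most frequent terms."""
    return [term for term, _ in Counter(tokens(page_text)).most_common(k)]


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def pick_external_image(page_text: str,
                        search: Callable[[str], List[ImageResult]],
                        k: int = 3) -> Optional[ImageResult]:
    """Steps 2-4: query with each phrase, rerank by similarity, keep the top image."""
    candidates: List[ImageResult] = []
    for phrase in key_phrases(page_text, k):
        candidates.extend(search(phrase))
    if not candidates:
        return None
    return max(candidates, key=lambda c: cosine(page_text, c.context))
```

Passing the search engine in as a function keeps the sketch independent of any particular image search API.
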
Though this would likely work well for live web pages about products, it may be a poor fit for mementos due to the temporal nature of words. Consider a memento from the late 1990s where one of the key phrases extracted contains the word Clinton. In the 1990s, the document was likely referring to US President Bill Clinton. If we use a search engine in 2018, it may return an image of 2016 presidential candidate Hillary Clinton. Some of these temporal issues have been detailed as part of the Longitudinal Analytics on Web Archive Data (LAWA) project.

Text + Thumbnail

In "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages" (Google Research copy) by Aula and "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" by Dziadosz, the authors consider the combination of text with a thumbnail as a surrogate.

The Internet Archive uses text and thumbnails for its search results, seen in the screenshot below.

The Internet Archive uses thumbnails and text together as part of its search results.
Al Maqbali further extended this concept with text + visual tags.

Social Card

The social card goes by many names: rich link, snippet, social snippet, social media card, Twitter card, embedded representation, rich object, or social card. The social card typically consists of an image, a title, and a text snippet from the web page it visualizes.

The data within the social card is typically drawn from data within the meta tags of the HTML of the target web page. As an artifact of social media, different social media platforms consult different meta tags within the target page.

For example, for Twitter, I used the following tags to produce the card below:

Social card for https://www.shawnmjones.org as seen on Twitter.

For Facebook, I used the following tags to produce the card below:
Social card for https://www.shawnmjones.org as seen on Facebook.

Note how the HTML tags are different for each service. Facebook supports the Open Graph Protocol, developed around 2009 (according to the CarbonDate service) whereas Twitter's features were developed around 2010 (according to CarbonDate). There are pages that lack this kind of assistive markup. To produce those cards, social media platforms will often use other methods, like those mentioned above, to extract a text snippet and an internal image. Any mementos captured prior to 2009 will not have the benefit of this assistive markup.
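
To make the difference between the two vocabularies concrete, here is a minimal sketch that pulls Open Graph (`og:`) and Twitter (`twitter:`) meta tags out of a page using Python's standard `html.parser`. The sample markup in the usage section is a hypothetical page of my own, not the tags used on shawnmjones.org, and a real platform would likely fall back to other heuristics when the tags are absent.

```python
from html.parser import HTMLParser


class CardMetaParser(HTMLParser):
    """Collect og:* and twitter:* meta tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.cards = {"og": {}, "twitter": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags typically use property=..., Twitter tags use name=...
        key = attributes.get("property") or attributes.get("name") or ""
        content = attributes.get("content")
        if content is None:
            return
        for prefix in self.cards:
            if key.startswith(prefix + ":"):
                self.cards[prefix][key] = content


def card_metadata(html: str) -> dict:
    parser = CardMetaParser()
    parser.feed(html)
    return parser.cards


if __name__ == "__main__":
    # Hypothetical markup for illustration only.
    sample = """<html><head>
    <meta name="twitter:card" content="summary">
    <meta name="twitter:title" content="Example title">
    <meta property="og:title" content="Example title" />
    <meta property="og:image" content="https://example.org/img.png" />
    </head></html>"""
    print(card_metadata(sample))
```

A memento that returns two empty dictionaries from a function like this is one that predates, or simply lacks, the assistive markup.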

Though most platforms generate social cards in landscape form, some do generate a portrait form as well. The intended use of the social cards on the platform and the nature of other visual cues on the platform often drive the decision as to which form the social card should take. All of the studies in this blog post evaluated social cards in their landscape form.
A landscape social card from Facebook.
A portrait social card from Google+.

Social cards are not just used by social media. Wikipedia uses social cards to provide a preview of links if the user hovers over the link, like what Kopetzky had envisioned with thumbnails. Google News often uses social cards for individual stories. Social cards sometimes include additional information beyond text snippet and image. In "What's Happening and What Happened: Searching the Social Web" Omar Alonso detailed the use of social cards in a prototype for Bing search results. Those cards also incorporated lists of users who shared the target web page as well as associated hashtags.

When a user hovers over an internal link, Wikipedia uses social cards  to display a preview of the linked web page.
Google News often uses social cards to list individual news articles.

There are similar concepts that are not instances of the social card. Some of the cards used by Google News are not social cards because each is a surrogate for a news story spanning multiple resources, rather than a single resource. Likewise, search engines use entity cards to display information about a specific entity drawn from multiple sources. Entity cards have been found to be useful by Bota's 2016 study "Playing Your Cards Right: The Effect of Entity Cards on Search Behaviour and Workload". I do not consider entity cards to be social cards because each social card is a surrogate for a single web resource, whereas an entity card is a surrogate for a conceptual entity and is drawn from multiple sources.
This card used by Google News is not a surrogate for a single web resource, and hence I do not consider it a social card.
This card format, used by Google, is also not a surrogate for a single web resource. This is an entity card, drawing from multiple web resources.

The creation of social cards can also be a lucrative market, with Embed.ly offering plans for web platforms ranging from $9 to $99 per month. They provide embedding services for the long form blogging service Medium, supporting a limited number of source websites. Individual cards can be made on their code generation page.

Evaluations of these Surrogates

Web page surrogates have been of great interest to those studying search engine result pages. I have reviewed eight studies on web surrogates, most mentioned above. I focused on how these studies compared surrogates with each other.

Author & Year Text
Thumbnail Enhanced
Visual Tags
Text + Thumbnail Social Card
Woodruff 2001 X X X
Dziadosz 2002 X X X
Li 2008 X X
Teevan 2009 X X X
Jiao 2010 X X X
Aula 2010 X X X
Al Maqbali 2010 X X X X X
Loumakis 2011 X X X
Capra 2013 X X X

As noted above Woodruff introduced the concept of enhanced thumbnails in "Using Thumbnails to Search the Web". To evaluate their effectiveness, she generated questions based on tasks users commonly perform on the web. The questions were divided into 4 categories and 3 questions per category were each given to 18 participants. The participants were presented with search engine result pages consisting of 100 text snippets, thumbnails, or enhanced thumbnails. In their attempt to find web resources that would address their assigned questions, participants were evaluated based on their response times. The results indicated that enhanced thumbnails provided the fastest response times overall, but the results varied depending on the type of task. For locating an entity's homepage, text snippets and enhanced thumbnails performed roughly the same. For finding the picture of an entity, thumbnails and enhanced thumbnails performed roughly the same. All three surrogate types performed just as well for e-commerce or medical side-effect questions.

Dziadosz tested the concept of text snippets combined with thumbnails in "Do Thumbnail Previews Help Users Make Better Relevance Decisions about Web Search Results?" In this study of 35 participants, each was given 2 queries and 2 tasks. Each participant was given a different surrogate type. Their first task was to identify all search engine results on the page that they assumed to be relevant to their query. Their second task was to visit the pages behind the surrogates and identify which were actually relevant. The number of correct decisions for text snippets combined with thumbnails was higher than just for text or just for thumbnails. Aula, in "A Comparison of Visual and Textual Page Previews in Judging the Helpfulness of Web Pages", also evaluated text snippets, thumbnails, and a combination. She discovered that both were effective in making relevance judgements.

Teevan evaluated the effectiveness of visual snippets in "Visual snippets: summarizing web pages for search and revisitation". Her study consisted of 276 participants who were each given 12 search tasks and a set of 20 search results, with 4 of the 12 tasks completed with different surrogates. She discovered that text snippets required the fewest clicks compared to thumbnails, which required the most. This indicates a lot of false positive matches for participants when using thumbnails. Participants preferred visual snippets or text snippets equally over thumbnails and preferred visual snippets for shopping tasks. Most participants found thumbnails to be too small to be useful.

Jiao introduced the concept of using external images as a surrogate in "Visual Summarization of Web Pages". He compared the use of internal images, external images, thumbnails, and visual snippets. Like Dziadosz's study, participants were asked to guess the relevance of the web page behind the surrogate and then later evaluate if their earlier guess was correct. To generate search results, they randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Bing. His results show that none of the surrogates works for all types of pages. Overall internal images were best for pages that contained a dominant image whereas thumbnails or external images were best for understanding pages that did not contain a dominant image.

In "Improving relevance judgment of web search results with image excerpts", Li was interested in identifying dominant images in web pages. I focus here on the second study in his work which compares text snippets and social cards. They randomly sampled 100 queries from the KDD CUP '05 dataset and submitted them to Google. The search engine results were then evaluated and reformatted into either text snippets or social cards. Two groups of 12 students each were given the queries either classified by their functionalities or semantic categories. The participants were evaluated based on the number of clicks of relevant results and also on the amount of time they took with each search. Social cards were the clear winner over text snippets in terms of time and clicks.

Loumakis, in "This Image Smells Good: Effects of Image Information Scent in Search Engine Results Pages" (university copy) attempted to compare the performance of images, text snippets, and social cards. Using preselected queries and 81 participants, Loumakis also reformatted Google search results. He did not get the same level of performance in his study, noting that "Adding an image to a SERP result will not significantly help users in identifying correct results, but neither will it significantly hinder them if an image is placed with text cues where the scents may conflict."

In "Evaluating the Effectiveness of Visual Summaries for Web Search", Al Maqbali explored the use of different image augmentations for visual snippets, text + thumbnail, social card, text + visual snippet, and a text + tag cloud/thumbnail combination. Al Maqbali had 65 participants evaluate the relevance of search engine result pages as in the prior studies. This study reached the same conclusion as Loumakis: adding images to text snippets does not appear to make a difference to the performance of search engine users.

To further understand the disagreement between the results of Loumakis, Al Maqbali, and Li, in "Augmenting web search surrogates with images", Capra explored the effectiveness of text snippets and social cards. He wanted to determine if the quality or relevance of the image used in the social card had any effect on performance. Prior to any relevance study, he had one set of participants rate individual internal images for a social card as good, bad, and mixed. For individual surrogates, Capra discovered that text snippets with good images have a slightly higher statistically significant accuracy score than just text snippets alone, at the cost of judgement duration for each surrogate. The accuracy for text snippets was 0.864, the accuracy for social cards with bad images was also 0.864, and the accuracy for social cards with good images was 0.884. If the search engine result pages were evaluated overall, then there was evidence that good images showed improvement in accuracy with ambiguous queries (e.g., jaguar the car or the cat?), but in this case the improvements were not statistically significant.

Deciding on the best surrogate for use with web pages depends on a number of factors, and the studies comparing these surrogates have some disagreement. Text snippets continue to endure for search results likely due to Capra's, Al Maqbali's, and Loumakis' results. Social cards are preferred by users, but the minor improvement in search time and relevance accuracy does not warrant the effort necessary to select a good internal image for the card. This means that social cards are effectively relegated for use in social media where each can be generated individually rather than with hundreds of search results. This also means that thumbnails are relegated to other tasks, such as a surrogate for a file on a filesystem or within a browser's interface. As most of these studies focused primarily on search engine results, it is likely that many of these surrogates work better with other use cases.

Surrogates for Mementos

There are more uses for surrogates than search engine results. When grouped together, some surrogates provide more information than the answer to the question "should I click on this?"

Enhanced thumbnails often reflect the search terms of the query provided by the user. Most memento applications do not have a query, and hence there are no words or phrases to enhance within the thumbnail. Al Maqbali's tag cloud concept may be of interest here. I am examining other ways to expose words and phrases of interest from archived collections, so this surrogate type may find new life in mementos.

Internal images are often used as part of social cards. If one could expose the images that tie to a particular theme in a web archive collection, then it is possible that we could select images for use as memento surrogates within the theme of the collection. This would likely require some form of machine learning to be viable. This same process goes for visual snippets.

As noted above, external images are problematic surrogates for mementos due to the temporal nature of words. If we could divide a web archive into specific time periods, then external images could be extracted from pages around the same time, limiting the amount of temporal shift.

Thumbnails are often useful in groups to demonstrate the content drift of a single web resource. For this surrogate group to be useful, the consumer of such a thumbnail gallery needs to understand the direction that time flows in the visualization. Thumbnails are not limited to the "one-per-row" paradigm of landscape social cards or text snippets, and hence thumbnails can be presented in a grid formation. This can be confusing to the user trying to compare the content drift of a resource, but textual cues, such as the memento-datetime, placed above or below the thumbnail can clear up this confusion.

Storytelling often uses surrogates in the form of social cards to tell a story. In this case, the surrogates are visualizations of the underlying web pages. When provided as a series of social cards, one per row, in order of publication date or memento-datetime, collections of these surrogates can convey information about an unfolding news story, such as in AlNoamany's collection summarization work (preprint version, dissertation). Many mementos do not have the metadata that might assist in finding a good internal image. This means that any service providing social cards for mementos must instead rely upon a number of image selection algorithms with differing levels of success. Because text snippets are essentially social cards lacking an image, is it possible that they, too, would be suitable in this context?


I started on this journey looking for the best surrogate for use with mementos. I discovered many different surrogates for web resources. The studies evaluating these different surrogates focused on the success of users finding relevant information in search engine results. It appears that the search engine industry has largely focused on text snippets as they are the least expensive surrogate to produce and studies indicate that the addition of images has minimal impact on their effectiveness. Mementos have many different uses and it is possible that one or more of these surrogates may be better fit for their temporal nature. Now that I am developing a vocabulary for these surrogates, I can start to explore how they might best be used with mementos, bringing other useful visualizations to web archive collections.

-- Shawn M. Jones