Tuesday, October 25, 2016

2016-10-25: Paper in the Archive

Mat reports on his journalistic experience and how we can relive it through Internet Archive (#IA20)                            

We have our collections, the things we care about, the mementos that remind us of our past. Many of these things reside on the Web. For those we want to recall and should have (in hindsight) saved, we turn to the Internet Archive.

As a computer science (CS) undergrad at the University of Florida, I worked at the student-run university newspaper, The Independent Florida Alligator. This experience has become particularly relevant given my recent scholarship on preserving online news. At the paper, we reported mostly on the university community, but also on news that catered to the ACRs through reports about Gainesville (e.g., city politics).

News is compiled late in the day to maximize temporal currency. I started at the paper as a "Section Producer" and eventually became a Managing Editor. I was in charge of the online edition, the "New Media" counterpart of the daily print edition -- Alligator Online. The late shift fit well with my already established coding schedule.

Proof from '05, with the 'thew' still intact.

The Alligator is an independent newspaper -- the content we published could conflict with the university without fear of censorship by the university. Typical university-affiliated college newspapers have this conflict of interest, which potentially limits their content to only that which is approved. This was part of the paper's draw for me and, I imagine, for the student readers seeking less biased reporting. The orange boxes were often empty well before day's end. Students and ACRs read the print paper. As a CS student, I preferred Alligator Online.

With a unique technical perspective among my journalistic peers, I introduced a homebrewed content management system (CMS) into the online production process. This allowed Alligator Online to focus on porting the print content and not on futzing with markup. This also made the content far more accessible and, as time has shown thanks to Internet Archive, preservable.

Internet Archive's capture of Alligator Online at alligator.org over time with my time there highlighted in orange.

After graduating from UF in 2006, I continued to live and work elsewhere in Gainesville for a few years. Even then technically an ACR, I still preferred Alligator Online to print. A new set of students transitioned into production of Alligator Online and eventually deployed a new CMS.

Now, as a CS PhD student studying the past Web, I have observed a decline in the site's archivability that occurred after I had moved on from the paper. This corresponds with our work On the Change in Archivability of Websites Over Time (PDF). Thankfully, adaptations at Alligator Online and possibly at IA have allowed the preservation rate to recover (see above, after my tenure).

alligator.org before (2004) and after (2006) I managed, per captures by Internet Archive.

With Internet Archive celebrating 20 years in existence (#IA20), IA has provided the means for me to see the aforementioned trend over time. My knowledge of web standards and accessibility in the mid-2000s facilitated preservation. Because of this, and with special thanks to IA, the collections of pages I care about -- the mementos that remind me of my past -- are accessible and well-preserved.

— Mat (@machawk1)

Monday, October 24, 2016

2016-10-24: 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) Trip Report

"Dad, he is pushing random doorbell buttons", Dr. Herzog's daughter complained about her brother while we were walking back home late night after having dinner in the city center of Potsdam. Dr. Herzog smiled and replied, "it's rather a cool idea, let's all do it". Repeating the TPDL 2015 tradition, Dr. Michael Herzog's family was hosting me (Sawood Alam) at their place after the TPDL 2016 conference in Hannover. Leaving some cursing people behind (who were disturbed by false doorbells), he asked me, "how was your conference this year?"

Day 1

Between the two parallel sessions of the first day, I attended the Doctoral Consortium session as a participant. The chair Kjetil Nørvåg, Norwegian University of Science and Technology, Norway, began the session with the formal introduction of the session structure and timeline. Out of the seven accepted Doctoral Consortium submissions, only five could make it to the workshop.
My talk was mainly praised for its well-organized content, an easy-to-follow story for the problem description, a tiered approach to solving problems, and the inclusion of work and publication plans. Konstantina's talk on political bias identification generated the most discussion during the QA session. I owe her references to A visual history of Donald Trump dominating the news cycle and Text analysis of Trump's tweets confirms he writes only the (angrier) Android half.

Each presenter was assigned a mentor to give more in-depth feedback on their work and to provide an outsider's perspective that would help define the scope of the thesis and identify parts that might need more elaboration. After the formal presentation session, the presenters spread out for one-on-one sessions with their corresponding mentors. Nattiya Kanhabua, from Aalborg University, Denmark, was my mentor. She provided great feedback and some useful references relevant to my research. We also talked about possibilities for future collaboration where our research interests intersect.

After the conclusion of the Doctoral Consortium Workshop we headed to the Technische Informationsbibliothek (TIB), where Mila Runnwerth welcomed us to the German National Library of Science and Technology. She gave us an insightful presentation followed by a guided tour of the library facilities.

Day 2

The main conference started on the second day with David Bainbridge's keynote presentation on "Mozart's Laptop: Implications for Creativity in Multimedia Digital Libraries and Beyond". He introduced a tool named Expeditee that provides a universal UI for interacting with text, images, and music. The talk was full of interesting references and demonstrations, such as querying music by humming. Following the keynote, I attended the Digital Humanities track while missing the other two parallel tracks.
Then I moved to another track for Search and User Aspects sessions.
The Posters and Demos session followed the regular presentation tracks. It came as a surprise to me that all the Doctoral Consortium submissions were automatically included in the Posters session (apart from the regular poster and demo submissions) and assigned reserved places in the hall, which meant I had to do something for the traditional Minute Madness event that I was not prepared for. So I ended up reusing the #IAmNotAGator gag I had prepared for the JCDL 2016 Minute Madness and used the poster time to advertise MemGator and Memento.

Day 3

On the second day of the main conference I had two papers to present, so I decided to wear business formal attire. As a consequence, the conference photographer stopped me at the building entrance and asked me to pose for him near the information desk. The lady at the information desk tried to explain routes to various places in the city to me, but the modeling session went on so long that it became awkward and we both started smiling.

The day began with Jan Rybicki's keynote talk on "Pretty Things Done with (Electronic) Texts: Why We Need Full-Text Access". It was here that I first came across the term stylometry. His slides were full of beautiful visualizations; the tool used to generate the data for them is published as an R package called stylo. After the keynote, I attended the Web Archives session.

After the lunch break I moved to the Short Papers track where I had my second presentation of the day.

After the coffee break I attended the Multimedia and Time Aspects track while missing the panel session on Digital Humanities and eInfrastructures.
In the evening we headed to the XII Apostel Hannover for the conference dinner. The food was good. During the dinner they announced Giannis Tsakonas and Joffrey Decourselle as the best paper and the best poster winners respectively.

Day 4

On the last day of the main conference I decided to skip the panel and tutorial tracks in favor of the Digital Library Evaluation research track.
After a brief coffee break everyone gathered for the closing keynote presentation by Tony Veale on "Metaphors All the Way Down: The many practical uses of figurative language understanding". The talk was very informative, interesting, and full of hilarious examples. He mentioned the Library of Babel, which reminded me of a digital implementation of it and a video talking about it. His slides looked more like a comic strip, very much in line with the theme of the talk, which ended with various Twitter bots such as MetaphorIsMyBusiness and MetaphorMirror.

Following the closing keynote, the main conference was concluded with some announcements. TPDL 2017 will be hosted in Thessaloniki, Greece, September 17-21, 2017. TPDL is willing to expand its scope and is encouraging young researchers to come forward with session ideas, chair events, and take the lead. People who are active on social media and in scientific communities are encouraged to spread the word to bring more awareness and participation. This year's Twitter hashtag was #TPDL2016, where all the relevant tweets can be found.

I spent the rest of the afternoon in the Alexandria Workshop.

Day 5

It was my last day in Hannover. I checked out of the conference hotel, Congress Hotel am Stadtpark Hannover. The hotel was located next to the conference venue and the views from it were good. However, the experience there was not: it was far from the city center and there were no restaurants nearby. Despite my complaints, I found an insect jumping on my laptop and bed on the fifteenth floor, late at night, two nights in a row. The basic Wi-Fi was useless and unreliable; nowadays, high-speed Wi-Fi in hotels should not count as a luxury amenity, especially for business visitors. The hotel was not cheap either. Organizers should consider these factors when choosing a conference venue and hotel.

I realized I still had some time to spare before beginning my journey, so I went back to the conference venue, where the Alexandria Workshop was ongoing. I was able to catch the keynote by Jane Winters, in which she talked about many familiar Web archiving projects. Then I headed to the Hannover city center to catch the train to Stendal.

"I know the rest of the story, since I received you in Stendal", Dr. Herzog interrupted me. We have reached home and it was already very late, hence, we called it a night and went to our beds.

Post-conference Days

After the conference, I spent a couple of days with Dr. Herzog's family on my way back. We visited Stendal University of Applied Sciences, met some interesting people for lunch at Schlosshotel Tangermünde, explored Potsdam on foot and by bike, did some souvenir shopping and kitchen experiments, visited Dr. Herzog's daughter's school and the Freie Universität Berlin campus along with many other historical places on our way, and had dinner in Berlin, where I finally revealed the secret of the disappearing-earphone magic trick to Mrs. Herzog. On Sunday morning Dr. Herzog dropped me off at the Berlin airport.

Dr. Herzog is a great host and tour guide, and he has a beautiful, lovely, and welcoming family. Visiting them is reason enough for me to visit Germany anytime.

Sawood Alam

2016-10-24: Are My Favorite Arabic Websites Archived?

In this work, I collected the top 20 Arabic websites that I like and browse and, in my personal judgment, consider popular (shown in Table 1). For each, I checked its global and local ranking based on Alexa Ranking. Then I used the MemGator tool to check whether it is archived, and estimated its creation date based on the date of its first memento. After that, I checked who archived the webpage first (shown in Table 2).
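As a rough illustration, here is a minimal sketch of this kind of check, assuming the public MemGator deployment at memgator.cs.odu.edu (any MemGator instance exposing the /timemap/link/ route works the same way) and the standard RFC 7089 link-format TimeMap. It reports the memento count and the first memento's datetime for a given URI:

```python
import re
import urllib.error
import urllib.request

# Public MemGator aggregator (an assumption; substitute your own instance)
MEMGATOR = "https://memgator.cs.odu.edu/timemap/link/"

def memento_summary(uri):
    """Return (memento count, first memento datetime) for a URI."""
    try:
        with urllib.request.urlopen(MEMGATOR + uri) as resp:
            timemap = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return 0, None  # MemGator answers 404 when no archive holds the URI
    # Memento entries in a link-format TimeMap (RFC 7089) look like:
    #   <http://web.archive.org/web/...>; rel="first memento"; datetime="..."
    datetimes = re.findall(r'rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"',
                           timemap)
    return len(datetimes), (datetimes[0] if datetimes else None)

count, first = memento_summary("http://aljazeera.net/")
print(count, "mementos; first at", first)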

In my previous work, How Well Are Arabic Websites Archived?, we evaluated how well Arabic websites in general are archived and indexed. We sampled 300,646 Arabic-language pages and found that 46% were not archived and 31% were not indexed by Google.

Table 1: My favorite Arabic websites and their descriptions

maktoob.yahoo.com (Yahoo! مكتوب) -- A major internet portal and email service provider in the Arabic language.
aljazeera.net (الجزيرة نت) -- A news channel covering politics, economics, and opinion.
6arab.com (طرب موقع طرب اغاني طرب كوم) -- An Arabic singers and music directory.
alarabiya.net -- An Arabic-language news network featuring breaking news along with videos, photo galleries, and In Focus sections on major news topics.
kooora.com (كووورة: الموقع العربي الرياضي الأول) -- The first Arabic football website, featuring world championships with Arab follow-up and analysis of all football events.
hawaaworld.com (منتديات عالم حواء) -- A women's network concerned with women's affairs, life, family, cooking, children, and beauty.
mbc.net (ترفيه، جدول البرامج، مشاهير،أفلام، مسلسلات، برامج تلفزيونية) -- The Middle East Broadcasting Center Group, the first private free-to-air satellite broadcasting company in the Arab world.
alriyadh.com (جريدة الرياض) -- The first daily newspaper published in Arabic in the capital of Saudi Arabia.
ksu.edu.sa (جامعة الملك سعود) -- Established in 1957, King Saud University is the largest educational institution in the Kingdom of Saudi Arabia and is generally considered the premier institute for academics and research in Arab and Muslim countries.
abunawaf.com (شبكة أبو نواف - المتعة والفائدة) -- A site specializing in multimedia and entertainment; it also hosts extensive mailing-list content.
eqla3.com (شبكة الإقلاع) -- A comprehensive collection of sites and forums, updated daily with what is new in all fields.
samba.com (سامبا: خدمات الافراد و المصرفية الالكترونية) -- Samba Financial Group (formerly known as The Saudi American Bank), a large banking firm in Saudi Arabia.
sabq.org (صحيفة سبق الإلكترونية) -- A Saudi newspaper founded in 2007, working in electronic media and covering the most important local events in particular and Arab and international events in general.
ar.wikipedia.org (ويكيبيديا، الموسوعة الحرة) -- A free encyclopedia built collaboratively using wiki software, in the Arabic language.
mekshat.com (مكشات - الصفحة الرئيسة) -- A website interested in trips and camping.
cksu.com (تجمع طلبة جامعة الملك سعود) -- A King Saud University student gathering, a community working toward a good environment for purposeful dialogue between students and faculty.
uoh.edu.sa (جامعة حائل) -- The University of Ha'il, officially established in 2006 and located in the north of Saudi Arabia.
ar-sa.namshi.com (موقع نمشي للأزياء, وجهتك الأولى لتسوق الأزياء في السعودية) -- A fashion and online shopping website.
arabtravelersforum.com (منتديات المسافرون العرب الاصلي) -- An Arab travelers' forum interested in tourism and travel.
vanilla.sa (موقع فانيلا) -- A fashion and online shopping website.

Table 2: Alexa Ranking and Archiving Results of My Favorite Websites
Website | Global Alexa Ranking (Oct 2016) | Local Alexa Ranking (Oct 2016, country)=rank | Memento count | First memento date | Who archived first
maktoob.yahoo.com 5 (US)=5 28,866 2009-08-31 IA
aljazeera.net 1,673 (SA)=174 20,468 1998-11-11 IA+BA
6arab.com 113,624 (Egypt)=6,120 12,991 1999-11-27 IA+BA+Archive Today
alarabiya.net 1,548 (SA)=37 9,737 2003-11-26 IA+BA
kooora.com 517 (Algeria)=15 4,658 2002-10-19 IA+BA
hawaaworld.com 10,026 (SA)=166 4,149 2001-01-10 IA+BA
mbc.net 1,195 (SA)=57 3,924 1999-10-13 IA+BA
alriyadh.com 5,136 (SA)=45 3,415 2000-02-29 IA+BA
ksu.edu.sa 6,093 (SA)=87 3,025 2000-03-02 IA
abunawaf.com 24,238 (SA)=413 2,446 2002-05-23 IA+BA
eqla3.com 9,741 (SA)=107 1,906 2000-05-10 IA+BA
samba.com 10,756 (SA)=129 1,451 1999-01-17 IA+BA
sabq.org 793 (SA)=5 1,170 2007-02-23 IA
ar.wikipedia.org 6 (US)=6 1,106 2003-02-09 IA+BA
mekshat.com 24,343 (SA)=404 828 2001-04-28 IA+BA
cksu.com 47,807 (SA)=708 643 2004-02-21 IA+BA
uoh.edu.sa 88,397 (SA)=806 210 2006-07-16 IA
ar-sa.namshi.com 10,968 (SA)=279 85 2012-04-05 IA
arabtravelersforum.com 31,259 (SA)=960 43 2014-11-29 IA
vanilla.sa 118,442 (SA)=1,053 13 2015-03-27 IA

Alexa calculates the global and local ranking of a website based on its traffic statistics. However, it calculates traffic at the level of the registered domain. For example, if we check the ranking of the Arabic Wikipedia, ar.wikipedia.org, the tool returns the statistics for wikipedia.org instead. With this in mind, we note that the top two globally ranked sites in my list have English domains: maktoob.yahoo.com with a global ranking of 5, and ar.wikipedia.org with a global ranking of 6. The third-highest globally ranked website is kooora.com, with a global rank of 517 and a local rank of 15 in Algeria. It is followed by sabq.org, with a global ranking of 793 and a high local ranking of 5 in Saudi Arabia.

In my list, I found that 4 of the 20 websites were created before 2000. However, when looking in the archive I found that the mbc.net domain was created in 1999 serving content in Korean; it became the Arabic broadcaster's website, written in English, in 2003, and the Arabic version of the website was created in 2004.

mbc.net in 1999
mbc.net in 2003
mbc.net in 2004

Also, as expected, I found that the Internet Archive was the first to archive each webpage. However, for some websites the Bibliotheca Alexandrina (BA) archive had copies of the exact same memento records; that is because the BA holds duplicate records of the IA. Only 6arab.com was first archived by three separate archives: archive.is, the IA, and the BA.

As for the memento count, I would expect the websites that existed before 2000 to have more mementos. However, two of the four, samba.com and mbc.net, have only 1,451 and 3,924 mementos, respectively, which seems low considering how long they have existed. On the other hand, aljazeera.net's first memento was in 1998 and it has around 20,468 mementos, the second-highest memento count in my list after maktoob.yahoo.com with 28,866.

The website 6arab.com is currently blocked in Saudi Arabia (and access via the IA is blocked as well) for violating the regulations of the Saudi Ministry of Culture and Information, so it has no local ranking in Saudi Arabia; instead, its top local ranking is in Egypt.

-Lulwah M. Alkwai

2016-10-24: Fun with Fictional Web Sites and the Internet Archive

As we celebrate the 20th anniversary of the Internet Archive, I realize that using Memento and the Wayback Machine has become second nature when solving certain problems, not only in my research, but also in my life. Those who have read my Master's Thesis, Avoiding Spoilers on Mediawiki Fan Sites Using Memento, know that I am a fan of many fictional television shows and movies. URIs are discussed in these fictional worlds, and sometimes the people making the fiction actually register them, as seen in the example below, creating an additional vector for fans to find information on their favorite characters and worlds.
Real web site at http://www.piedpiper.com/ for the fictional company Pied Piper from HBO's TV series Silicon Valley
Unfortunately, interest in maintaining these URIs fades once the television show is cancelled or the movie is no longer showing. As noted in my thesis, the advent of services like Netflix and Hulu allows fans to watch old television shows for the first time, sometimes years after they have gone off the air. Those first-time fans might want to visit a URI they encountered in one of these shows, but instead encounter the problems of link rot and content drift shown in the examples below.
Link rot for http://www.starkexpo2010.com/, showing the StarkExpo, a fictitious technology fair from the Marvel Studios film Iron Man 2 (left - memento), which now leads to a dead link (right - current dead site)

Content drift for http://www.richardcastle.net/, the fictional character's web site (left - memento), which now leads to an advertisement for the cancelled ABC television show Castle (right - live site)
Fortunately, the Internet Archive can come to the rescue. Below is a chart listing some fictional URIs, the television shows or movies in which they occur, and the current status of each URI-R. The content at these URIs is no longer available live, but it is still available thanks to the efforts of the Internet Archive; example URI-Ms from the Internet Archive for each of these URI-Rs show how fans can indeed go back and visit these URIs.
  • The Simpsons (FOX) -- http://www.dorks-gone-wild.com/ -- Link Rot: no HTTP server at hostname
  • True Blood (HBO) -- http://www.americanvampireleague.com/ -- Content Drift: 301 redirect to HBO.com
  • 30 Rock (NBC) -- http://jdlutz.com/karen/proof/ -- Link Rot: 500 HTTP status
  • Iron Man 2 (Marvel Studios) -- http://www.starkexpo2010.com/ -- Link Rot: hostname does not resolve
  • Castle (ABC) -- http://www.richardcastle.net/ -- Content Drift: 301 redirect to ABC.com Castle page
  • LOST (ABC) -- http://www.oceanic-air.com/ -- Link Rot: 301 redirect and 404 HTTP status
  • Jurassic World (Universal Studios) -- http://www.jurassicworld.com/ -- Content Drift: was fictional content, now advertises the movie and games
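The statuses above are easy to check programmatically. Here is a minimal sketch (not the method used to build this chart) that classifies a URI-R's live-web status using only the Python standard library: DNS failures correspond to "hostname does not resolve", connection errors to "no HTTP server at hostname", and everything else reports the final HTTP status and post-redirect URL.

```python
import socket
import urllib.error
import urllib.request

def diagnose(uri_r):
    """Classify the live-web status of a URI-R."""
    try:
        with urllib.request.urlopen(uri_r, timeout=10) as resp:
            # Redirects are followed automatically; resp.url is the final URL
            return f"HTTP {resp.status}, final URL {resp.url}"
    except urllib.error.HTTPError as e:
        return f"HTTP {e.code}"                       # e.g., a 500 status
    except urllib.error.URLError as e:
        if isinstance(e.reason, socket.gaierror):
            return "hostname does not resolve"        # DNS failure
        return f"no usable HTTP server ({e.reason})"  # e.g., connection refused

for uri_r in ["http://www.starkexpo2010.com/", "http://www.oceanic-air.com/"]:
    print(uri_r, "->", diagnose(uri_r))
```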
The practice of publishing content at these fictional URIs shows no signs of abating. For example, the HBO TV series Silicon Valley is a comedy about the lives of tech entrepreneurs working in Silicon Valley. The show features several fictional companies that have real web sites fans can visit, such as http://www.piedpiper.com, http://www.hooli.com/, and http://www.bachmanity.com. Because the show is about software developers, there is even a real GitHub account for one of the fictional characters, shown in the screenshot below. Using the "Save Page Now" feature, I just created a URI-M for it today in the Internet Archive.
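Such captures can also be requested programmatically. Below is a minimal sketch assuming the Internet Archive's public "Save Page Now" endpoint at https://web.archive.org/save/<URI-R>; the assumption that the new URI-M's path comes back in the Content-Location header may vary as the service evolves.

```python
import urllib.request

def save_page_now(uri_r):
    """Ask the Wayback Machine to capture uri_r; return the URI-M path."""
    req = urllib.request.Request("https://web.archive.org/save/" + uri_r,
                                 headers={"User-Agent": "spn-sketch"})
    with urllib.request.urlopen(req) as resp:
        # The archived copy's path is typically echoed in Content-Location
        # (an assumption; the response format is not formally documented)
        return resp.headers.get("Content-Location")

print(save_page_now("http://www.piedpiper.com/"))
```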
This concept will become more important over time. As historians and sociologists study our past, some of these resources may be important to understanding these fictional worlds and how they fit into the time period in which they were developed. This makes improved archivability and reduction in Memento damage important even for these pages.
As to the meaning of the content, that's up to the fans to evaluate and discuss.
-- Shawn M. Jones

2016-10-23: Institutional Repositories, OAI-PMH, and Anonymous FTP

Richard Poynder's recent blog post "Q&A with CNI’s Clifford Lynch: Time to re-think the institutional repository?" has generated a lot of discussion, including a second post from Richard to address the comments and the always insightful commentary from David Rosenthal ("Why Did Institutional Repositories Fail?").  There surely have been enough articles about institutional repositories to fill an institutional repository, but of particular interest to me are discussions about the technical and aspirational goals of OAI-PMH.

A year ago Herbert and I reflected on OAI-PMH and other projects ("Reminiscing About 15 Years of Interoperability Efforts"), which I wish Richard had referenced in his discussion, along with the original SFC and UPS papers (although Cliff does allude to this in his interview; MLN edit: Richard points out that I missed his quoting of that paper in his second blog post). For his response to Richard, Herbert posted a series of tweets, which I collected:

I also put forward my own perspective in a series of tweets, which I summarize below. To me, OAI-PMH is the logical conclusion of the computer science department tradition of publishing technical reports on anonymous FTP servers. These servers held both pre- and post-print versions, and whereas arXiv.org was based on a centralized approach (due in part to its SMTP origins), the anonymous FTP approach was inherently distributed: each server was a departmental-level institutional repository.

Within the CS community, the CS-TR project (which produced Dienst) and WATERS project evolved into NCSTRL, which was arguably one of the first open source institutional repository systems.  An unrelated effort that is often overlooked was the Unified Computer Science Technical Report Index (UCSTRI), whose real innovation was that it provided a centralized interface to the distributed anonymous FTP servers without requiring them to do anything.  It would cleverly crawl and index known FTP servers, parse the README files, and construct URLs from the semi-structured metadata.  The parsing results weren't always perfect, but for 1993 it was highly magical and presaged the idea of building centralized services on top of existing, uncoordinated servers.

At NASA Langley Research Center in 1993, I brought the anonymous FTP culture to NASA technical reports (mostly their own report series, but some post-prints; see NASA TM-4567), followed later that year by a web interface (NASA TM-109162). In 1994, we integrated several of these web interfaces into the NASA Technical Report Server (NTRS, AIAA-95-0964), which continues in name to this day (ntrs.nasa.gov) as an institutional repository that largely goes unrecognized as such (albeit covering a smaller range of subjects than a typical university). NTRS is a centralized operation today, but due in part to the limited number of NASA Centers, projects, and affiliated institutes (there were probably never more than a dozen in NTRS), it was initially a distributed search architecture.

By 1999 there was a proliferation of both subject-based and institutional repositories, which led to the UPS experiment and ultimately to OAI-PMH itself. The proliferation of the web made it possible to greatly enhance the functionality of the anonymous FTP server (searching, better browsing, etc.). But at the same time, the web also killed the CS departmental technical report series and the servers that hosted them. Although some may exist somewhere, off the top of my head I'm not aware of any CS department with an active technical report series, at least not like in the '80s and '90s.

The web made it possible for individuals to list their pre- and post-prints on their own page (e.g., my publication page, Herbert's publication page), and systems like CiteSeer, Google Scholar, and others -- much like UCSTRI before them -- evolved to discover these e-prints linked from individuals' home pages and centrally index them with no administrative or author effort.

In summary, I believe any discussion of institutional repositories (and OAI-PMH) has to acknowledge that while the web allowed repository systems to evolve to their current advanced state, it also obsoleted many of the models and assumptions that drove the development of repository systems in the first place. The web allowed for "fancy" anonymous FTP servers, but it also meant that we no longer needed them. Or perhaps we need them differently and a lot less: institutional repositories still have a functional role, but they need to be operated more like Google Scholar et al.

-- Michael L. Nelson

Saturday, October 22, 2016

2016-10-13: Dodging The Memory Hole 2016 Trip Report (#dtmh2016)

Dodging the Memory Hole 2016, held at UCLA's Charles Young Research Library in Los Angeles, California, was a two-day event to discuss and highlight potential solutions to the issue of preserving born-digital news. Organized by Edward McCain (digital curator of journalism at the Donald W. Reynolds Journalism Institute and University of Missouri Libraries), this event brought together technologists, archivists, librarians, journalists, and fourteen graduate students who had won travel scholarships to attend. Among the attendees were four members of the WS-DL group (l-r): Mat Kelly, John Berlin, Dr. Michael Nelson, and Shawn Jones.

Day 1 (October 13, 2016)

Day one started off at 9am with Edward McCain welcoming everyone to the event and then turning it over to Ginny Steel, UCLA University Librarian, for opening remarks.
In the opening remarks, Ginny reflected on her career as a lifelong librarian and the evolution of printed news to digital, and in closing she summarized the role archiving has to play in the born-digital news era.
After opening remarks, Edward McCain went over the goals and sponsors of the event before transitioning to the first speaker Hjalmar Gislason.

In the talk, Hjalmar touched on the amount of data currently being generated, how to determine the context of data, and the importance of knowing which data matters: data lost because no one realized it was important could mean losing someone's life's work. Hjalmar ended his talk with two takeaway points: "There is more to news archiving than the web: there is mobile content" and "Television news is also content that is important to save".

After a short break, panel one which consisted of Chris Freeland, Matt Weber, Laura Wrubel, and moderator Ana Krahmer addressed the question of "Why Save Online News".

Matt Weber started off the discussion by talking about the interactions between web archives and news media, stating that digital-only media has no offline surrogate and that it is becoming increasingly difficult to do anything but look at it as it now exists. Following Matt Weber were Laura Wrubel and Chris Freeland, who both talked about the large share Twitter has in online news. Laura Wrubel brought up that in 2011 journalists primarily used Twitter to direct people to articles rather than for conversation. Chris Freeland stated that Twitter was the primary source of information during the Ferguson protests in St. Louis and that the local news outlets were far behind in reporting the organic story as it happened.
Following panel one was Tim Groeling (professor and former chair of the UCLA Department of Communication Studies) giving presentation one entitled "NewsScape: Preserving TV News".

The NewsScape project, led by Tim Groeling, is currently migrating analog recordings of TV news to digital for archiving. The collection contains recordings dating back to the 1950s and is the largest collection of TV news and public affairs programs, containing a mix of U-matic, Betamax, and VHS tapes.

Currently, the project is working its way through the collection's tapes, having completed 36,000 hours of encoding this year. Tim Groeling pointed out that the VHS tapes, despite being the newest, are the most threatened.
After lunch, the attendees were broken up into fifteen groups for the first of two breakout sessions. Each group was tasked with formulating three things that could be included in a national agenda for news preservation and to come up with a project to advance the practice of online news preservation.

Each group sent up one person who briefly went over what they had come up with. Despite the diverse backgrounds of the attendees at dtmh2016, the ideas each group came up with had a lot in common:
  • A list of tools/technologies for archiving (awesome-memento)
  • Identifying broken links in news articles
  • Increasing awareness of how much or how little is archived
  • Working with news organizations to increase their involvement in archiving
  • More meetups, events, and hackathons that bring together technologists with journalists and librarians
The final speaker of the day was Clifford Lynch giving a talk entitled "Born-digital news preservation in perspective".
In his talk, Clifford Lynch spoke about problems that plague news preservation such as link rot and the need for multiple archives.

He also spoke on the need to preserve other kinds of media, like data dumps, and noted that archival record keeping goes hand in hand with journalism.
After his talk was over, Edward McCain gave final remarks for day one and transitioned us to the reception for the scholarship winners. The scholarship winners proposed projects (to be completed by December 2016) that would aid in digital news preservation; three of these students were WS-DL members (Shawn Jones, Mat Kelly, John Berlin).

Day 2 (October 14, 2016)

Day two of dodging the memory hole 2016 began with Sharon Farb welcoming us back.

The first presentation of the day followed, by our very own Dr. Nelson, titled "Summarizing archival collections using storytelling techniques".

The presentation highlighted the work done by Yasmin AlNoamany in her doctoral dissertation, in particular, The Dark and Stormy Archives (DSA) Framework.
Up next was Pulitzer Prize-winning journalist Peter Arnett, who presented "Writing The First Draft of History - and Saving It!", talking about his experiences while covering the Vietnam War and how he saved the Associated Press's Saigon office archives.
Following Peter Arnett was the second-to-last panel of dtmh2016, "Kiss your app goodbye: the fragility of data journalism", featuring Ben Welsh, Regina Roberts, and Meredith Broussard, and moderated by Martin Klein.

Meredith Broussard spoke about how archiving of news apps has become difficult as their content does not live in a single place.
Ben Welsh was up next speaking about the work he has done at the LA Times Data Desk.
In his talk, he stressed the need for more tools to be made that allowed people like himself to make archiving and viewing of archived news content easier.
Following Ben Welsh was Regina Roberts, who spoke about the work done at Stanford on archiving and adding context to the data sets that live beside the codebases of research projects.
The last panel of dtmh2016, "The future of the past: modernizing The New York Times archive", featured members of The New York Times technology team, Evan Sandhaus, Jane Cotler, and Sophia Van Valkenburg, with moderator Edward McCain.

Evan Sandhaus presented The New York Times' own take on the Wayback Machine, called TimesMachine, which allows users to view the microfilm archive of The New York Times.
Sophia Van Valkenburg spoke about how the New York Times was transitioning its news archives into a more modern system.
After Sophia Van Valkenburg was Jane Cotler, who spoke about the gotchas encountered during the migration process. Most notable was that the way the articles were displayed (i.e., their visual aesthetics) was not preserved in the migration, in favor of a "better user experience", and that links to the old pages would no longer work after the move to the new system.
Lightning rounds were up next.

Mark Graham of the Internet Archive was up first with a presentation on the Wayback Machine and how it would be getting site search later this year.
Jefferson Bailey also of the Internet Archive spoke on the continual efforts at the Internet Archive to get the web archives into the hands of researchers.
Terry Britt spoke about how social media over time establishes "collective memory".
Katherine Boss presented "Challenges facing the preservation of born-digital news applications" and how they end up in dependency hell.
Eva Revear presented a tool to discover the frameworks and software used to build news apps.
Cynthia Joyce talked about a book on Hurricane Katrina and its use of archived news coverage of the storm.
Jennifer Younger presented the work being done by the Catholic News Archive.
Kalev Leetaru talked about the work he and the GDELT Project are doing in web archiving.
The last presentation of the event was by Kate Zwaard, titled "Technology and community: Why we need partners, collaborators, and friends".

Kate Zwaard talked about the success of web archiving events such as the recent Collections as Data and Archives Unleashed 2.0 held at the Library of Congress, the web archive collection at the Library of Congress, how they are putting Jupyter notebooks on top of database dumps, and the diverse skill sets required of today's librarians.
The final breakout sessions of dtmh2016 consisted of four topic discussions.

Jefferson Bailey's session, Web Archiving For News, was an informal breakout in which he asked the attendees about collaboration between the Archive and other organizations. A notable response came from the NYTimes representative Evan Sandhaus, with a counter-question about whether organizations or archives should be responsible for the preservation of news content. Jefferson Bailey responded that he wished organizations were more active in practicing self-archiving. Others described how their own organizations, or ones they knew of, approach self-archiving.

Ben Welsh's session, News Apps, discussed issues in archiving news apps, which are online web applications providing rich data experiences. An example app illustrating this was California's War Dead, which was archived by the Internet Archive but with diminished functionality. In spite of this "success", Ben Welsh brought up the difficulty of preserving the full experience of such an app, as web crawlers only interact with client-side code, not the server-side code that is also required. To address this issue, he suggested solutions such as the Python library django-bakery for producing flat, static versions of news apps based on database queries, as sketched below. These static versions can be more easily archived while still providing a fuller experience when replayed.
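As a rough illustration of that pattern, here is a minimal sketch following django-bakery's documented BuildableDetailView approach; the app, model, field, and template names are all hypothetical.

```python
# settings.py (hypothetical project):
#   BUILD_DIR = "/tmp/build"
#   BAKERY_VIEWS = ("newsapp.views.CasualtyDetailView",)

from django.db import models
from bakery.views import BuildableDetailView


class Casualty(models.Model):
    # Hypothetical model standing in for a news app's database rows
    name = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)

    def get_absolute_url(self):
        # Determines each object's flat-file path under BUILD_DIR
        return "/casualties/%s/" % self.slug


class CasualtyDetailView(BuildableDetailView):
    # `python manage.py build` renders one static HTML page per object
    model = Casualty
    template_name = "newsapp/casualty_detail.html"
```

With those settings in place, running `python manage.py build` writes the flattened site to BUILD_DIR, where it can be served, crawled, and archived like any static site.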
Eric Weig's session, Working with CMS, started out with him sharing his experience of migrating the CMS of one of the University of Kentucky Libraries Special Collections Research Center newspaper sites from a local data center using sixteen CPUs to a less powerful cloud-based solution using only two. One of the biggest performance gains came when he switched from dynamically generating pages to serving static HTML pages. Generating the static HTML pages for the eighty-two thousand issues in this CMS took only three hours on the two-CPU cloud instance. After sharing this experience, the rest of the time was spent hearing from the audience about their experiences with CMSs and in an impromptu roundtable discussion.

Kalev Leetaru's session, The GDELT Project: A Look Inside The World's Largest Initiative To Understand And Archive The World's News, was a more in-depth version of his lightning talk. Kalev Leetaru shared experiences The GDELT Project has had with archival crawling of non-English news sites, his work with the Internet Archive on monitoring news feeds and broadcasts, the untapped opportunities for exploring the Internet Archive, and a vision of the role and future of web archives. He also shared two questions he is currently pondering: "Why are archives checking certain news organizations more than others?" and "How do we preserve GeoIP-generated content, especially in non-western news sites?".
The last speaker of dtmh2016 was Katherine Skinner with "Alignment and Reciprocity". In her talk, Katherine Skinner called for volunteers to carry out some of the actions mentioned at dtmh2016 and reflected on the past two days.
Closing out dtmh2016 was Edward McCain, who thanked everyone for coming and expressed how enjoyable the event was, especially with the graduate students present, before Todd Grappone's closing remarks. In the closing remarks, Todd Grappone reminded attendees of the pressing problems in news archiving and how they require both academic and software solutions.
Video recordings of DTMH2016 can be found on the Reynolds Journalism Institute's Facebook page. Chris Aldrich recorded audio along with a transcription of days one and two. NPR's Research, Archive & Data Strategy team created a Storify page of tweets covering topics they found interesting.

-- John Berlin