The Future of Digital Archives

Digital Libraries

Ever since the announcement that portions of the Library's new Library of Congress Experience initiative would have some user-interface components implemented in Microsoft's Silverlight, in exchange for some $3 million worth of kiosk hardware from Microsoft, the net has been ablaze with fury over a perceived "selling-out" of the organization. For some reason, people think that www.loc.gov will shortly be 301ing to loc.microsoft.com, or something equally ridiculous.

(Disclaimer: These are my opinions, based on my experience and work as a digital librarian. They are not necessarily those of the Library of Congress, or anybody else. Also, I am deliberately avoiding the topic of content born-digital in a proprietary format. It's an especially painful area for digital archivists, as the content is still important notwithstanding the problems reading it both now and in the future, and it really doesn't bear on the current complaints, anyway.)

It's a tricky problem, to be sure. The cost of digital conversion, preservation, and access is extremely high. The cost outlay by the National Endowment for the Humanities for digitizing the 1/2 million (and growing!) newspaper pages in Chronicling America runs into the millions of dollars. And that doesn't even account for the cost of keeping that data around for any length of time! Funding has to come from somewhere, and so, as Ars says, there will likely be more and more of these quid pro quo arrangements in the future.

But I'll let you in on a little secret: It doesn't matter. The future of digital libraries, and of organizations like the Library of Congress, isn't in the web site that people visit. It isn't in the technology choices made for viewing content. And it certainly isn't going to be controlled by any particular company.

See, though the user interface may be written in something proprietary, like Flash or Silverlight, the archived bits aren't stored that way. Librarians are an insanely conservative bunch, made that way through hundreds of years of experience in attempting to keep old stuff and make it available for new people. The guidelines for digital preservation reflect that. We're still working with TIFFs for a good reason: TIFF has been around for decades, and it'll continue to be around for decades. We know we'll be able to read them in a hundred years. In the library world, lack of shininess on a file format isn't a bug - it's a feature!

But the real reason it doesn't matter is because the digital library as a web site you visit to view content won't exist in a decade. Instead, we'll be serving out content via web-exposed APIs, opening the inner archives to anyone who cares to look. Sure, there will still be curated special presentations of content just like the new Experience program, and they might even use some new, fancy, proprietary technology, maybe even in exchange for $200 million of computer equipment (adjusting for inflation, here). But those special presentations will just be the tip of the content iceberg, and the rest will be there to look at in very standard and open formats.


Chronicling America - Now With More Page!

Digital Libraries

I spend a significant fraction of my time working at the Library on Chronicling America. We went live with the latest release a little while ago, finally breaching the half-million-page mark. We were just shy of it last time, but as of now we have 568,261 pages for your perusal.


Waterboarding in the Press - 106 Years Ago

Politics Digital Libraries

The current debate over the use of waterboarding seems new and fresh and relevant. But if working on a digital archive of historical newspapers has taught me anything, it's that we've pretty much already done everything.

I'd never thought to search for waterboarding in Chronicling America, but the post on boingboing this morning led me to do a search for water cure, with some fascinating results.

It turns out that the practice was just as fraught with controversy then as it is now. It prompted a Congressional investigation, and the court-martial of a general (although he was acquitted).

It was even used as a discipline tool in a Kansas mental hospital! The article states that the torture was used on patients who failed to follow the orders of the institution's head, Miss Houston, who administered the punishment herself. The practice ceased when Miss Houston was replaced as head of the hospital by the ostensibly more gentle Miss Gower. She merely had the problem patients "strapped to a bench and whipped."


Chronicling America Update

Digital Libraries

As usual, I've been crazy busy. We released a new version of my project at the Library of Congress on June 14th. The cut-over was nice and smooth, with only a few minutes of downtime while we switched over from the live application to the staged application. Our development team did a bang-up job on this release, with a bunch of new features, enhancements, clean-up, and about 80,000 more newspaper pages.

So, what's new in Chronicling America version 1.1? For starters, we've put in place permanently-resolvable URLs, based on data derived from the paper's and page's metadata. Thus, we're putting an explicit guarantee out there that the URL http://www.loc.gov/chroniclingamerica/lccn/sn84026749/1904-01-30/ed-1/seq-1 will always get you back to page 1 of the January 30, 1904 edition of the Washington Times. (For completeness' sake, I should note that it actually gets you to image 1, not page 1. They aren't always the same thing!) Similarly, stripping off various parts of the URL will lead to the appropriate places. For example, http://www.loc.gov/chroniclingamerica/lccn/sn84026749 will get you to the directory record for the 1902-1939 Washington Times.
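To make the URL scheme concrete, here is a minimal sketch of building and parsing these addresses. It is modeled only on the two example URLs above; the function names, the regular expression, and the assumption that the intermediate levels (date only, or date plus edition) also resolve are mine, not a description of how the site itself is implemented.

```python
import re
from datetime import date

BASE = "http://www.loc.gov/chroniclingamerica"

# Hypothetical pattern inferred from the example URLs in this post.
# Each deeper level is optional, mirroring "stripping off parts of the URL".
_URL_PATTERN = re.compile(
    r"^" + re.escape(BASE)
    + r"/lccn/(?P<lccn>[a-z0-9]+)"
    + r"(?:/(?P<date>\d{4}-\d{2}-\d{2})"
    + r"(?:/ed-(?P<edition>\d+)"
    + r"(?:/seq-(?P<sequence>\d+))?)?)?/?$"
)


def build_url(lccn, issue_date=None, edition=None, sequence=None):
    """Build a URL from the title's LCCN down to an image sequence number."""
    parts = [BASE, "lccn", lccn]
    if issue_date is not None:
        parts.append(issue_date.isoformat())
        if edition is not None:
            parts.append(f"ed-{edition}")
            if sequence is not None:
                # The sequence number identifies an image, not a printed page.
                parts.append(f"seq-{sequence}")
    return "/".join(parts)


def parse_url(url):
    """Split a URL back into its metadata; missing levels come back as None."""
    match = _URL_PATTERN.match(url)
    if match is None:
        raise ValueError(f"Not a recognized URL: {url}")
    fields = match.groupdict()
    return {
        "lccn": fields["lccn"],
        "date": date.fromisoformat(fields["date"]) if fields["date"] else None,
        "edition": int(fields["edition"]) if fields["edition"] else None,
        "sequence": int(fields["sequence"]) if fields["sequence"] else None,
    }
```

Calling `build_url("sn84026749", date(1904, 1, 30), 1, 1)` reproduces the Washington Times URL above, and `build_url("sn84026749")` reproduces the directory-record URL.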

If you clicked on that page link, you'll have no doubt noticed our new and shiny dynamic page viewer, inspired by similar UIs like Google Maps. It has all of the functionality of our old, clunky Flash viewer, while simultaneously being faster, prettier, and easier to use.

On the back-end, we made some pretty drastic improvements to the speed of our image rendering. Not only do we simply render images faster, but we also make more careful choices about how we ask for the images to be rendered, based on the information specific to each image's JPEG 2000 service copy. Hopefully you'll find the system quite a bit more responsive.
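I won't go into the details of our renderer here, but one kind of choice a JPEG 2000 service copy makes possible is worth illustrating. A JPEG 2000 file stores the image at several resolution levels, each roughly half the width and height of the one before, so a decoder can be asked for a smaller version without decoding the full image. The sketch below picks the smallest level that still covers a requested size. The function name and parameters are hypothetical, and this is an illustration of the general idea rather than our actual code.

```python
def pick_reduction_level(full_width, full_height,
                         target_width, target_height, levels):
    """Return (reduction, (width, height)) for the smallest stored
    resolution that is still at least as large as the target.

    `levels` is the number of halvings the file supports (an assumed
    input; a real system would read it from the image's metadata).
    """
    reduction = 0
    width, height = full_width, full_height
    while reduction < levels:
        # Each level halves both dimensions, rounding up.
        next_width = (width + 1) // 2
        next_height = (height + 1) // 2
        if next_width < target_width or next_height < target_height:
            break  # going one level further would be too small
        width, height = next_width, next_height
        reduction += 1
    return reduction, (width, height)
```

For a 6000 x 8000 scan with five levels and a 750 x 1000 viewport, this asks for three halvings instead of decoding all 48 million pixels and scaling down afterwards.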

And, of course, there are roughly 80,000 new pages in the system. That brings our grand total up to over 300,000 - and counting! Expect even more in the future.


Chronicling America

Digital Libraries

Ever wonder what I do for a living? Check it out: Chronicling America is now live!

That site is the direct public access portion of the National Digital Newspaper Program. The program details are all there on the Library's NDNP page. Feel free to check it out if it suits you; I won't go into it much more here.

But I will talk about the technical aspects of the program. It has consumed me for the last few years now, so I have a lot to say! Those posts will come out in the future, under the Digital Libraries category here on my site. If you're interested only in that, you can subscribe to the Digital Libraries RSS feed.

For now, though, let's take a tour of some of my favorites among the quarter-million-odd pages now online:

  • The Washington Times: March 31, 1905 - I found this page while doing a test search for cheese, and the headline in the middle of the page immediately struck me: "Wanted Men-Haters to Act as Teachers". Oh how far we've come in a mere 102 years! The differences in culture between then and now astound me, and the Chronicling America repository is replete with examples. Words, articles, and pictures that today would be considered racist, misogynist, and classist abound. The title of The Colored American might seem out of place today, but in 1905 it was just a descriptive newspaper title.
  • The San Francisco Call: June 10, 1900 - Bathing suits were indistinguishable from cocktail dresses! How did anybody swim in these things?
  • The San Francisco Call: September 7, 1901 - The assassination of President McKinley is big news. What a front page!
  • The Washington Herald: July 11, 1909 - It seems somebody who lived in the same building I do - almost 100 years ago! - lost a hat pin, and would like it returned. I'll keep an eye out for it. I wonder how much the reward was?

Adventures in Enormous Lucene Indexes on AIX

Java Digital Libraries

I have been working hard over the last several weeks to port our system at work from our x86 Linux development environment to the PowerPC AIX production environment. Fortunately for us, most of the platform differences are well hidden because our code is generally platform independent: Java, XSLT, and JavaScript. There are a few cases where we make calls to a JNI library, but the libraries exist and are supported for the varying platforms, and we have had no trouble with those.

What we unexpectedly had trouble with, though, was our fulltext Lucene index. It weighs in at a massive 55 GB and is only expected to get bigger, so we were duly impressed by our development environment's ability to process the index with no hiccups and with consistently speedy search times. When I moved it to AIX, however, something went amiss. We started receiving this exception, which the stack trace revealed was coming from Lucene's index reading code:

java.io.IOException: Unknown format version:-16056063

We confirmed with MD5 hashes that the files were identical in both environments, and we confirmed that the Lucene libraries were all correct. That left us with some obscure platform difference we had to track down.
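For anyone wanting to run the same check on files this large, here is a minimal sketch of hashing a file in fixed-size chunks so it never has to fit in memory. The function name and chunk size are my own choices for illustration; on the actual servers a command-line checksum tool does the same job.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it one chunk at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running it against the same file on both machines and comparing the two strings is enough to rule out a corrupted copy.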

Using a smaller test index, we were able to confirm that Lucene was able to successfully open an index on AIX, confirming Lucene's own touted endian agnosticism. We also lifted file write size ulimits on certain users to confirm that that limit didn't unintentionally affect the ability to read files as well.

Finally, we discovered through some documentation (of all places!) that 32-bit IBM programs are limited to file reads of no more than ~2 GB - that magic 2^31 - 1 limit - and our Java virtual machine was only 32-bit! Simply upgrading to the 64-bit JVM solved the problem.
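The arithmetic behind that limit is easy to check. The sketch below (in Python rather than Java, purely for brevity) shows the largest value a signed 32-bit integer can hold and what happens to a larger file size when it is forced into one. The 55 GB figure is the size of our whole index, used here only as a convenient example of a number that does not fit; the helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647 bytes, just under 2 GB
INT32_MIN = -(2**31)


def fits_in_int32(value):
    """True if value can be stored in a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def wrap_to_int32(value):
    """Reinterpret the low 32 bits of value as a signed integer,
    which is what a 32-bit offset or length silently becomes."""
    low_bits = value & 0xFFFFFFFF
    return low_bits - 2**32 if low_bits >= 2**31 else low_bits


index_size = 55 * 1024**3  # 55 GB expressed in bytes

print(fits_in_int32(index_size))   # False
print(wrap_to_int32(INT32_MAX + 1))  # wraps around to the most negative value
```

Any code that tracks file lengths or offsets in a signed 32-bit value will misbehave somewhere past that boundary, whether by raising an error or by quietly producing nonsense.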

We hadn't thought of this because we were using a 32-bit JVM in development, with no problems, but the crucial difference is that it was the Sun JVM. We later installed the 32-bit IBM JVM onto a development environment and confirmed that it cannot open our index file there, either. Notably, however, it provided a much more useful error message:

java.io.IOException: Value too large for defined data type
    at java.io.RandomAccessFile.length(Native Method)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:440)

Rather than throwing an IOException from the java.io code, the IBM JVM on AIX simply returned bogus data. This caused Lucene's index reader to throw an exception because, coincidentally, the number it was trying to read at that magic signed integer limit was expected to be a file version number. It was expecting to see -1, but instead got -16056063.
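To see why bogus bytes turn into that particular complaint, here is a rough sketch of the kind of check involved, written in Python for brevity. It assumes the version is stored as a 4-byte big-endian signed integer; the function names are hypothetical, and the "bad" bytes in the usage note are simply the encoding of the number from our error message, not data recovered from the real index.

```python
import struct

EXPECTED_FORMAT = -1  # the value the index reader expected to find


def read_format_version(header):
    """Decode the first four bytes as a big-endian signed 32-bit integer."""
    (version,) = struct.unpack(">i", header[:4])
    return version


def check_format(header):
    """Return the version if it is the expected one, else raise like Lucene did."""
    version = read_format_version(header)
    if version != EXPECTED_FORMAT:
        raise IOError(f"Unknown format version: {version}")
    return version
```

Four `0xFF` bytes decode to -1 and pass the check. The bytes `FF 0B 01 01` decode to -16056063, so whatever four bytes the read handed back are reported as an unknown format version rather than as a failed read.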

And so everything seems to be running swimmingly now. The moral of the story is: Beware of big files on 32-bit machines.


IFLA Newspaper Conference: Gerardo Valencia - National Library of Mexico

Digital Libraries

This is a country report by Gerardo Valencia from the National Library of Mexico.

In the last 30-40 years, they have been microfilming papers - spotty at first, but more consistent in the last 20 years. One of the goals is to minimize access to the original material in order to preserve it physically, while at the same time making the content available to as many people as possible. They are scanning at 300 DPI as 1-bit B&W TIFFs, converted to PDF on the fly, with full-text search of the uncorrected OCR, with hit highlighting.

They provide simple search with phrasing and boolean operators. Proximity and fuzzy features are still in development. They also have browsing options by title, publication location, or catalogue reference. Other options like timelines are still in development.

He's doing a live demo of their Browse by State option now. (It's password protected. Boo.) It starts with a map of Mexico broken down by state, and from there it lists the cities, and below that the dates. They have added watermarks depending on user profiles, so an ordinary user will get a watermark on the printout, but researchers inside might not. They don't display much of the metadata, but they preserve much of it.

They have 814 titles in their digitized list. They have some modern collections, as well, with 7 million searchable pages of two modern papers. (Sorry, I missed the names.)

Now he's demoing their full-text search. The collection is page-accessed. The search results go to the image. It provides a magnifying glass showing a way-zoomed-in view with the highlight. (The highlighting is pretty off.) You can then browse the search results one at a time. When you click the thumbnail, they download and display the PDF. The advanced search allows filtering by location, date, and title.

Overall, they've got a pretty damn good system going there!
