Life and code.
  • Adventures in Enormous Lucene Indexes on AIX

    Posted on September 30th, 2006 Brian No comments

    I have been working hard over the last several weeks to port our system at work from our x86 Linux development environment to the PowerPC AIX production environment. Fortunately for us, most of the platform differences are well hidden because our code is generally platform independent: Java, XSLT, and JavaScript. There are a few cases where we make calls to a JNI library, but the libraries exist and are supported for the varying platforms, and we have had no trouble with those.

    What we unexpectedly had trouble with, though, was our full-text Lucene index. Weighing in at a massive 55 GB, and only expected to get bigger, the index had duly impressed us in development, where it was processed without a hiccup and searches stayed consistently speedy. When I moved it to AIX, however, something went amiss. We started receiving this exception, which the stack trace revealed was coming from Lucene’s index-reading code:

    java.io.IOException: Unknown format version:-16056063

    We confirmed with MD5 hashes that the files were identical in both environments, and we confirmed that the Lucene libraries were all correct. That left us with some obscure platform difference we had to track down.

    Using a smaller test index, we were able to confirm that Lucene was able to successfully open an index on AIX, confirming Lucene’s own touted endian agnosticism. We also lifted the file write size ulimits on certain users to confirm that those limits weren’t unintentionally affecting the ability to read files as well.

    Finally, we discovered through some documentation (of all places!) that 32-bit IBM programs are limited to file reads of no more than ~2 GB – that magic 2^31 – 1 limit – and our Java virtual machine was only 32-bit! Simply upgrading to the 64-bit JVM solved the problem.
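
    In hindsight, a tiny pre-flight check would have flagged this before the index ever shipped. Here’s a sketch of my own (not part of our actual system) that lists the files in an index directory and flags anything larger than 2^31 – 1 bytes; ironically, on the affected 32-bit IBM JVM the length query itself is what misbehaves, so the check is only trustworthy on a runtime that handles large files correctly.

    import java.io.File;

    // BigFileCheck: a hedged diagnostic sketch, not part of our real build.
    // It flags any file larger than 2^31 - 1 bytes, the point at which a
    // 32-bit runtime can get into trouble.
    public class BigFileCheck {
        private static final long LIMIT = Integer.MAX_VALUE; // 2^31 - 1 bytes

        public static void main(String[] args) {
            // "sun.arch.data.model" reports 32 vs. 64 on Sun JVMs; other vendors may not set it.
            System.out.println("JVM data model: "
                    + System.getProperty("sun.arch.data.model", "unknown"));

            File dir = new File(args.length > 0 ? args[0] : "index");
            File[] files = dir.listFiles();
            if (files == null) {
                System.err.println("Not a directory: " + dir);
                return;
            }
            for (File f : files) {
                String flag = f.length() > LIMIT ? "  <-- exceeds the 32-bit limit" : "";
                System.out.printf("%,15d  %s%s%n", f.length(), f.getName(), flag);
            }
        }
    }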

    We hadn’t thought of this because we were using a 32-bit JVM in development, with no problems, but the crucial difference is that it was the Sun JVM. We later installed the 32-bit IBM JVM onto a development environment and confirmed that it cannot open our index file there, either. Notably, however, it provided a much more useful error message:

    java.io.IOException: Value too large for defined data type
    at java.io.RandomAccessFile.length(Native Method)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:440)

    Rather than throwing an IOException from the java.io code, the IBM JVM on AIX simply returned bogus data. This caused Lucene’s index reader to throw an exception because, coincidentally, the number it was trying to read at that magic signed integer limit was expected to be a file version number. It was expecting to see -1, but instead got -16056063.
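
    To make that concrete, here is a simplified reconstruction of the kind of check that produces the message (my own sketch, not Lucene’s actual source): the reader pulls the leading int out of the segments file and expects a known version marker, so four bytes of garbage turn directly into an “Unknown format version” complaint.

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    // FormatCheckSketch: a simplified stand-in for the version check, reading
    // the leading int of a segments file and comparing it to the expected
    // marker (-1 in our case).
    public class FormatCheckSketch {
        private static final int EXPECTED_FORMAT = -1;

        public static void main(String[] args) throws IOException {
            DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
            try {
                // DataInput is defined as big-endian, so this reads identically on any platform.
                int format = in.readInt();
                if (format != EXPECTED_FORMAT) {
                    throw new IOException("Unknown format version:" + format);
                }
                System.out.println("Format version looks sane: " + format);
            } finally {
                in.close();
            }
        }
    }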

    And so everything seems to be running swimmingly now. The moral of the story is: Beware of big files on 32-bit machines.

  • IFLA Newspaper Conference: Gerardo Valencia – National Library of Mexico

    Posted on May 25th, 2006 Brian No comments

    This is a country report by Gerardo Valencia from the National Library of Mexico.

    In the last 30-40 years, they have been microfilming papers – spotty at first, but more consistent in the last 20 years. One of the goals is to minimize access to the original material in order to preserve it physically, while at the same time making the content available to as many people as possible. They are scanning at 300 DPI as 1-bit B&W TIFFs, converted to PDF on the fly, with full-text search of the uncorrected OCR, with hit highlighting.

    They provide simple search with phrase and boolean operators. Proximity and fuzzy features are still in development. They also have browsing options by title, publication location, or catalogue reference. Other options like timelines are still in development.

    He’s doing a live demo of their Browse by State option now. (It’s password protected. Boo.) It starts with a map of Mexico broken down by state, and from there it lists the cities, and below that the dates. They have added watermarks depending on user profiles, so an ordinary user will get a watermark on the printout, but researchers inside might not. They don’t display much of the metadata, but they preserve much of it.

    They have 814 titles in their digitized list. They have some modern collections, as well, with 7 million searchable pages of two modern papers. (Sorry, I missed the names.)

    Now he’s demoing their full-text search. The collection is page-accessed. The search results go to the image. It provides a magnifying glass showing a way-zoomed-in view with the highlight. (The highlighting is pretty off.) You can then browse the search results one at a time. When you click the thumbnail, they download and display the PDF. The advanced search allows filtering by location, date, and title.

    Overall, they’ve got a pretty damn good system going there!

  • IFLA Newspaper Conference: Kenning Arlitsch and John Herbert – University of Utah

    Posted on May 18th, 2006 Brian No comments

    This panel is on Statewide Digitization Initiatives, with Kenning Arlitsch and John Herbert.

    First up is Kenning: The Mountain West Digital Library provides political and technical support for statewide cooperative digitization in Utah and Nevada. They have 400,000 objects excluding newspapers.

    They have four different CONTENTdm servers among four different partners, and then an aggregating CONTENTdm server at the University of Utah harvests from the partners and then provides searching and access. This allows local control and identity for partners and a common metadata standard.
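
    Kenning didn’t say which protocol the aggregator uses to pull metadata from the partner servers; a common choice for this kind of harvest is OAI-PMH. As a rough illustration only (the endpoint URL below is made up, and this may not be how MWDL actually does it), a minimal ListRecords request looks something like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // HarvestSketch: a hypothetical OAI-PMH pull against a made-up partner
    // endpoint, dumping the raw XML response. A real harvester would parse
    // the records, follow resumption tokens, and store the metadata centrally.
    public class HarvestSketch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://partner.example.edu/oai"
                    + "?verb=ListRecords&metadataPrefix=oai_dc");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            } finally {
                in.close();
            }
        }
    }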

    The University of Utah has been microfilming since 1951. In 2001, they were awarded a grant to attempt digitization of 30,000 newspaper pages. They scanned the pages as 1-bit B&W TIFFs, converted them to MrSID, and loaded them into CONTENTdm, adding index information as they went. You could search by directory information (no full text) and then view the page images. Once they realized they could not do all that work themselves, they began working with iArchives to handle a large portion of it. Some of the lessons learned: bad film, a company that was sold and moved to El Paso, taking the master microfilm with it, and the need for more funding. Also, it was really popular. Their second grant was to improve the process and do 100,000 more pages.

    John is now speaking. The main site is http://digitalnewspapers.org. It consists of Utah newspapers from 1850 – 1961. It is full-text searchable, with 48 titles and 450,000 pages and 5 million articles. Content is distributed across the state.

    Their archival format is 4-bit TIFFs (~28 MiB), so that works out to about 14 TiB for 500,000 pages. They attempt to scan from paper whenever possible. Scanning from paper gives better quality images, but the size of the papers necessitates local scanners, which can be hard to find. Film is cheaper and easier to ship. Articles are all segmented and are viewed stand-alone from the page with no zoom and pan. They find that the OCR is more accurate due to the smaller corpus size. Also, iArchives keys in the headlines manually, leading to near perfect data there.

    For OCR options, their testing shows that single-word searching is optimized for two alternatives. However, adding additional words often causes bizarre search result scenarios. Things like “butch/dutch cassidy” are difficult to overcome, but he’s hopeful that new proximity searches will help with this. Also, they filter through dictionaries and lists of surnames, and then throw out all numbers greater than 100 except for four-digit numbers beginning with 18, 19, or 20.
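
    Since that number rule is concrete enough to pin down, here’s my own reconstruction of it (not iArchives’ code), applied token by token to the OCR output:

    // OcrNumberFilter: my reconstruction of the rule described above. Keep a
    // numeric token only if it is 100 or less, or if it is a four-digit number
    // starting with 18, 19, or 20 (i.e. a likely year).
    public class OcrNumberFilter {
        static boolean keep(String token) {
            if (!token.matches("\\d+")) {
                return true; // non-numeric tokens go on to the dictionary and surname filters
            }
            if (token.length() == 4 && (token.startsWith("18")
                    || token.startsWith("19") || token.startsWith("20"))) {
                return true; // looks like a year
            }
            if (token.length() > 3) {
                return false; // any other number with four or more digits is greater than 100
            }
            return Integer.parseInt(token) <= 100;
        }

        public static void main(String[] args) {
            for (String t : new String[] {"42", "100", "101", "1873", "2006", "3456"}) {
                System.out.println(t + " -> " + (keep(t) ? "keep" : "drop"));
            }
        }
    }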

    Imaging trade-offs are the classic size vs. quality problem.

    Cost per unit page is low, usually less than $2 per page, but the number of pages is enormous, making it very expensive to digitize. They’ve been digitizing since 2002, and have “only” 500,000 pages.

    Demo time. He amusingly does a search for digital newspapers in order to razz the group that Utah comes up first, and NDNP second.

    Time for lunch!

  • IFLA Newspaper Conference: Maritza Failla – National Library of Chile

    Posted on May 18th, 2006 Brian No comments

    This is a country report by Maritza Failla from the National Library of Chile. Chilean law requires a deposit of 15 copies of each paper or book, along with two copies in a non-paper version, like on a CD. (Sweet! These kinds of laws are so awesome at preserving knowledge for the public good.)

    They want to reduce the amount of handling of the physical copies in order to reduce damage for long-term preservation. With that said, they want to make it available as freely as possible. To that end, they focus on very deteriorated papers, as well as those with high cultural value.

    Their strategy for preservation is to maintain a reserve collection, microfilm them, and then digitize them. Filming has been the primary preservation scheme. They currently have 8 million pages filmed. Some challenges are the lack of expertise among Chilean companies and library officials in filming historical periodicals. Creating copies is more difficult with film, and it requires specialized equipment to read, taking up valuable space in reading rooms.

    Their digitization priorities call only for conversion of Chilean national newspapers. They focus on newspapers of the year (What does that mean?) and highly requested papers. Digitization problems include high cost, lack of expertise, highly complex bibliographic material, large-sized papers that require non-standard standards, rapid obsolescence, and the ever-present copyright problems.

    The Memoria Chilena site has about 100,000 newspaper images scanned and available. Papers are chosen based on their importance to Chile, as well as papers covering important Chilean and international news stories. There are currently six different electronic newspapers in Chile, although only one is solely electronic. According to the law, they are required to deposit CDs. However, even the storage and access of CDs has its issues. The problems seem to center around the constant change in technology: training and upgrades, for example. She mentions the lack of library-oriented databases and meta-databases.

  • IFLA Newspaper Conference: David Adams – National Library of New Zealand

    Posted on May 18th, 2006 Brian No comments

    This is a country report from David Adams, from the National Library of New Zealand. He shows Google Maps, and says, “There will be Google Newspapers. Do we even need to do this work at all? Can we just wait for Google to come around and do this for us?”

    New Zealand has 1 million newspaper pages online. They have been running since 2001 in a sustainable program. There has been no funding for filming un-filmed papers, or re-filming poorly filmed content. This is a mistake, and the approach of the British Library – where the cost of repair and such is factored in – should always be taken. He also notes that the cost of filming is rising, and so the number of pages is decreasing.

    The Papers Past project has 41 titles, from 1840 to 1915, across all of New Zealand. They have no agreements with current publishers to digitize their current work, however. Currently, there is only browsing by date, title, etc., and only in large TIFF formats. This makes access difficult. They are working to fix these problems.

    Their survey of the current environment for the digitization of newspapers found that there are more and more vendors providing capability for this, and there are many emerging standards like METS and ALTO. (I missed some here.) Their report on users showed that page access vs. article access is a big issue that always comes up. Almost 50% of their user base self-describes as “Family Historian.”

    They put out an RFP in February for this work. They want to have 100,000 pages available online in a demo site by July 2006. The work comes in three parts:

    1. Microform and Digitization
    2. OCR
    3. Online Delivery via a Customized Application

    After the new application exists, they will move their existing Papers Past collection into the new system in order to maintain one point of access. They will slowly OCR the old papers according to priority and budget over the next several years, and then perhaps add new content for other institutions. Finally, they want to move beyond just newspapers.

    In the end, he hopes they’ll have more newspapers than sheep. 🙂

  • IFLA Newspaper Conference: Sandra Burrows – Library and Archives Canada

    Posted on May 18th, 2006 Brian No comments

    This is a country report from Sandra Burrows from Library and Archives Canada. Their site has a list of mostly free online Canadian-related newspaper content that can be full-text searched. They also have a similar international list, sorted by country.

    So far in Canada, beyond the Paper of Record project, there is very little digitization due to copyright restrictions and issues. Some projects that do exist include Canadian Newstand via ProQuest, Paper of Record, and the Virtual News Library.

    They are hoping to begin two projects in the near future. The first is to scan multicultural papers across Canada, published generally prior to 1915 (to avoid copyright problems). They hope that initial success will lead to other partners who are interested in providing permission for their own papers to be digitized. One problem is that Canada does not microfilm papers itself. Instead, they rely on third-party microfilming and archiving, purchasing microfilm as necessary or desired. Thus, it may not be easy to obtain a master film for scanning, and the quality of the existing films varies widely.

    A second, smaller, more feasible project is the Engine of Immortality. They are scanning first pages of papers from around Canada for a museum near Niagara Falls. This is being extended to Canadian newspapers in general, from 1752 to the present. (I’m not sure what the difference in subset is between this and the previous project.)

    There is Canadian legislation to require deposit of online and born-digital newspapers. (Sweet!) They are currently examining how to archive and index this stuff as a prelude to the legislation.

    Archiving must continue in the microform and physical form. Electronic archiving cannot be a substitution for the physical, and it will not be cheaper.

  • IFLA Newspaper Conference: Utah Olympic Park

    Posted on May 18th, 2006 Brian No comments

    We took a trip up to the Utah Olympic Park in Park City for a dinner and show last night. The drive up the mountain was quite gorgeous, although I spent most of the bus ride chatting with Alison from New Zealand.

    Dinner was sponsored by iArchives, with the highlight being a beautiful rainbow over the mountains in the distance. After dinner, some past and future members of the U.S. Olympic Freestyle Skiing team and the U.S. Olympic Trampoline team did some crazy flips and tricks on trampolines. It was really cool, but unfortunately I hadn’t brought my camera.

    I did buy a shot glass, though.

  • IFLA Newspaper Conference: James Simon – Center for Research Libraries

    Posted on May 17th, 2006 Brian No comments

    The title of the presentation is “Cooperative Digitization and Dissemination of World Newspapers: A Proposal.” James is from the Center for Research Libraries (CRL), which is an association of institutions dedicated to preservation of content for use by academics. CRL has a fair number of historic newspapers, and about 25 million pages of newspaper content are distributed through library loan every year, constituting about 75% of the distribution. The limiting nature of physical library loan does not apply to digital formats, so it’s exciting.

    There is a huge corpus of historical newspaper data, spanning four centuries. Microform is most commonly used for long-term preservation, but it is crappy at providing discovery and use. Unfortunately, less-developed regions are unlikely to see their content digitized any time soon.

    The International Coalition on Newspapers (ICON), funded by the NEH, includes a database of international newspapers, coordination of cooperative preservation microfilming efforts, and coordination of centralized and distributed cataloging.

    The World News Archive (“as Icarus flying towards the sun, we have boldly called it…”) is attempting to digitize and make available newspapers from around the world, to the extent of the funding and energy available. The scope is currently “deliberately vague” because they are still in planning. Their long-long-term vision is for a fully comprehensive newspaper repository. The CRL will accept world-wide standards, whenever those become determined. (I don’t think that top-down standards are the right way to do this. Use individual standards with semantic mapping as appropriate.)

    The balance between not-for-profit, scholarly interests and for-profit, commercial interests is very important to a successful repository. CRL would take care of clearing copyrights for digital use. The owner must obtain an escrow copy, with future rights to “any and all use” of the content once some term expires. Transparency in the rights process is key to ensuring that the system works – they don’t want any crazy digital rights restriction schemes. Finally, pricing rights – who sets prices, for whom, and for how much – are important as well.

  • IFLA Newspaper Conference: Hartmut Walravens – Berlin State Library

    Posted on May 17th, 2006 Brian No comments

    This is a report on the IFLA Newspaper Section and other worldwide newspaper programs, presented by Hartmut Walravens from the Berlin State Library. The IFLA Newspaper Section meets in 2006 in Seoul, and they will be discussing the archiving of the east Asian press.

    Low-quality microfilming degrades quickly, and often requires re-filming of the source material after maybe twenty years. Furthermore, bad film really causes problems when it comes to digitization. IFLA has a publication talking about microfilming standards. (It really seems like poor filming choices made a generation ago are coming back to haunt our present digital efforts.)

    There are several digitization projects going on around the world that aren’t represented at this conference:

    • The Deutsche Bibliothek project has digitized exiled serials, so-called because they were published by German refugees escaping from the Nazi regime in the mid-1900s. An interesting problem is that these papers aren’t really newspapers in the proper sense. Due to the lack of funds, their papers were often very short – in the example only one page.
    • The Compact Memory project is focused on Jewish serials, and is similar technically to the previous project.
    • The Austrian Newspapers Online (ANNO) project covers approximately 2 million pages currently. They are suffering from lack of funding presently. No full-text search, just date searching, and only the images are available.
    • The Luxemburgensia Online project had five papers digitized last year. This year, they now have ten papers on the net. Like ANNO, there is no full-text search, and only the images are available.
    • The National Library of Greece has about five titles digitized, including one from Egypt, with about 220,000 pages total. Again there is no full-text searching.
    • The National Library of Latvia has approximately 200,000 pages available online.
    • The University of Hawaii is working to digitize Hawaiian Language newspapers. There are at least fifteen digitized, but again there is no full-text searching. In this case, the image quality is so bad as to preclude perhaps ever doing OCR.
    • Japan is not doing much digitization at all, preferring to work on preservation microfilming for the time being.
    • Russia has a lot of newspapers, but there is no funding for performing any mass digitization. There is some equipment out there, but they only digitize on demand for specific, paid-for subsets.

    Finally, the technological gap between the rest of the world and the U.S. is highlighted by the fact that a Belgian paper has done a pilot of some e-paper in their publication. How long until the U.S. gets this?

  • IFLA Newspaper Conference: Morgan Cundiff

    Posted on May 17th, 2006 Brian No comments

    Morgan Cundiff, from the Library of Congress, is the final presenter for the Technical Panel. I’m friends with Morgan – we work at the Library together. He works in the NetDev office, working on many of the standards the LOC produces.

    This is an XML game now. Echoing some comments from George earlier, standards are definitely the most important thing in creating interoperable digital libraries, and XML is the de facto standard on the web. The lack of standards has been the biggest barrier to creating joint libraries.

    (Morgan is doing a technical overview of METS and MODS right now, which I’m skipping over. My fingers need a rest. Plus I already know it.)

    A METS Profile describes a class of METS documents, giving programmers and authors the guidance to create and process METS documents of a particular profile. A sufficiently detailed METS Profile can be considered a standard, and there is a schema for creating profiles. However, it is human-readable, not machine-actionable. (The NetDev office really needs to get some system in place for doing that, like the earlier Schematron-based profile descriptor and validator I wrote.)

    Morgan has created a draft for the digital newspapers, found here.

    (He’s going through the parts of the profile document now, which can just be viewed from the link above. As much as I like the newspaper profile and the related work, I think spending too much time on a particular standard like this is not too useful. People will always have different requirements, and the standard will not always work for all needs. A more valuable approach is to define semantic correlations between different models and then use semantic tools to translate between them as appropriate. Far more along the lines of what Alison Stevenson & Elizabeth Styron are doing.)

    (One really clever thing done with the mets:fptr is the use of the mets:area to point a region to a portion of the image held in the mets:fileSec. Cool!)

    The newspaper community has a major “quality vs. quantity” problem. People talk in terms of millions of pages, and so that requires machine processing. That means lower quality, which is just a law of nature.