Perry Willett from the University of Michigan is the first of the technical panel presenters. He is discussing the university’s experiences digitizing newspapers. He begins with an overview of their DLXS system, which models books/journals, finding aids, and images, using standard XML/XSLT/CSS processing. Search is handled by their XPAT engine.
Their book model, which was the basis of their newspaper model, is:
- One image per page, 100-200 KiB
- Uncorrected OCR, about 1-3 KiB
- Usually no more than 1-2 columns per page
The usual usage scenario is “search the OCR text, and display a page image.” They also offer browsing by author and title, along with simple sequence-based navigation. Searches can also be limited by year.
Their first project was digitizing British Library newspapers. (This was one of the projects Ed mentioned earlier.) Their next project was to digitize their student newspaper, the Michigan Daily. The Daily began publication in 1890, but the preserved run is very spotty, often with narrow margins. Starting in the 1950s, there is some locally produced microfilm, but it is generally very low quality, even though the microfilming was done in-house.
Newspapers turn out to be very different from books, though. The problems seem to stem more from page size than anything else, although the multiplicity of columns is a notable exception.
- No equipment to digitize such large sized pages
- Pages are much larger, from 3-6 MiB
- Many, irregular columns
- Poor paper, with discolorations, tears, etc.
- Poor print quality, leading to really crappy OCR, about 1-3 KiB
One attempt to decrease the file size is to crop the articles out of the pages. Problems with that include jumping to the rest of the article when it is continued on another page. Another problem occurs when attempting to do full-text hit highlighting: if your word coordinates are relative to the page origin, you have to translate them into the coordinate space of the newly cropped article. It’s not a hard problem to solve, but the lack of standardization proves annoying.
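As he says, the translation itself is trivial. A minimal sketch of what it involves — all function and parameter names here are hypothetical, not anything from DLXS:

```python
# Word boxes whose origin is the full page must be shifted when an
# article is cropped out, so that hit highlighting still lines up.

def translate_word_boxes(word_boxes, crop_box):
    """Shift page-origin word boxes into the coordinate space of a
    cropped article. Boxes are (x0, y0, x1, y1) tuples in pixels;
    crop_box is the article's bounding box on the original page."""
    cx0, cy0, cx1, cy1 = crop_box
    translated = []
    for (x0, y0, x1, y1) in word_boxes:
        # Keep only words that fall entirely inside the crop;
        # words continued on another page are someone else's problem.
        if x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1:
            translated.append((x0 - cx0, y0 - cy0, x1 - cx0, y1 - cy0))
    return translated

# A word at (500, 1200) on the page lands at (100, 200) inside an
# article cropped at (400, 1000).
boxes = translate_word_boxes([(500, 1200, 560, 1230)],
                             (400, 1000, 900, 2000))
```

The arithmetic is just a subtraction; the annoyance he describes is that every vendor picks a different origin and units, so there is no one translation to write.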
Other problems include: often no authors for the articles, titles are not usually important, date browsing requires more specificity, and legal problems associated with web publication of wire service stories. He is also discussing some of the problems in the general community. This includes:
- Lack of standardization on column/page/article support
- Lack of standardization between vendors and libraries and publishers on coordinates and relationships between parts
- Copyright issues
It seems to me that their attempt to shoehorn newspapers into the book model is really their downfall. Their system was not designed to be a general-purpose repository, so fitting in new content models is painful.