Life and code.
  • IFLA Newspaper Conference: Majlis Bremer-Laamanen – National Library of Finland

    Posted on May 17th, 2006

    In 2005, paper newspapers are a way of life – delivered to the front door, read on the subway, on the trains, and so on. In Finland, unlike a large portion of the world, 86% of the population, and 71% of the youth (!), read at least one newspaper per day. The permeation is complete: schools commonly hold “newspaper weeks” and “newspaper days.” Their high literacy rate seems to be related, and interestingly, those most connected to technology tend to read print the most. On average, each person spends about 48 minutes per day reading newspapers, and almost 90% of them are free. There are a total of 64 electronic papers.

    She mentions some of the problems with newspapers in Finland. Distinct language groups tend to cause papers to focus on specific regions. It feels to me almost like she’s describing the ultra-niche-style presentation and content that the rest of the world is trying to move towards. Except that they are accidentally there by default!

    The Finnish project has 165 titles from 1771 to 1890, for a total of 1 million pages. They also have references to the articles available. They are worried about four things: the user, long-term preservation, funding, and mass digitization.

    The user is looking for readability of easy-to-find content using an easy-to-use system. That requires high image quality from high quality microfilm. It requires fulltext and article searching, along with hit highlighting, and a PDF format for downloading. Finally, it must be easy to use: “Google-like.”

    For preservation, constant collection of metadata is important: things like document structure, word location, or scanning parameters. That metadata must be held in a standard format, though, like METS. Document structure is very time consuming to determine and document.
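
    To make the word-location idea concrete, here is a minimal sketch of how stored word coordinates could drive hit highlighting. Everything in it is my own assumption for illustration: the `Word` and `PageRecord` classes, their field names, and the `highlight_boxes` function are hypothetical, and this is not the METS schema or anything the Finnish project showed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Word:
    """One OCR'd word and its bounding box on the page image (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class PageRecord:
    """Hypothetical per-page metadata: scan settings plus word locations."""
    title: str
    date: str          # ISO date of the issue, e.g. "1877-08-09"
    scan_dpi: int      # an example of a scanning parameter worth keeping
    words: List[Word] = field(default_factory=list)


def highlight_boxes(page: PageRecord, query: str) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, width, height) boxes for each word of a phrase match."""
    terms = query.lower().split()
    if not terms:
        return []
    # Normalize OCR tokens: lowercase and drop surrounding punctuation.
    tokens = [w.text.lower().strip('.,;:!?"') for w in page.words]
    boxes = []
    for i in range(len(tokens) - len(terms) + 1):
        if tokens[i:i + len(terms)] == terms:
            for w in page.words[i:i + len(terms)]:
                boxes.append((w.x, w.y, w.width, w.height))
    return boxes
```

    A viewer could draw these boxes over the page image to highlight the hit. A real system would more likely read the coordinates from a standard OCR output format than from hand-built objects like these.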

    Financially, digitizing lots of content is expensive. They’ve done amazing things on very limited budgets! (Only 10,000 Euros the first year!!) A major problem is character recognition of both Fraktur and Roman text, both on the same page next to each other. The multi-lingual aspect is certainly something we don’t often think about here in the U.S.

    Mass digitization requires automation. Cost per page has decreased as time has gone on and the automation has improved. They’re working with CCS to do a lot of their work.

    She’s demoing their site now. There’s a simple browse capability, divided by title, and then by year, month, and day. The basic search functionality is a search box with two boxes for date ranges and the list of newspapers. A search for “salt lake city” produced some 500 results, and the page it produced was from August 9, 1877. The view seems to be simple pre-rendered images, such as “small image,” “medium image,” and “large image.” There’s plenty of laughs as she reads the article describing a much smaller Salt Lake City, where the cows wander between the houses and drink water from the ditches. There’s also an article index that was created in the late 1800s, which is browsable online. It doesn’t follow standard indexing now, but it provides browsing by people’s names, and regions, and such.
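
    The search form described above (a phrase, two date boxes, and a list of newspapers) boils down to a simple filter. Here is a rough sketch of that logic, assuming an in-memory list of articles. The `Article` class, the `search` function, and its parameters are hypothetical names of mine, not the site’s actual implementation, which surely sits on a real full-text index.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class Article:
    """Hypothetical search record: one OCR'd article."""
    newspaper: str
    published: date
    text: str


def search(
    articles: Iterable[Article],
    phrase: str,
    start: Optional[date] = None,
    end: Optional[date] = None,
    newspapers: Optional[Iterable[str]] = None,
) -> List[Article]:
    """Case-insensitive phrase search with optional date and title filters."""
    needle = phrase.lower()
    wanted = set(newspapers) if newspapers else None
    hits = []
    for article in articles:
        if needle not in article.text.lower():
            continue
        if start is not None and article.published < start:
            continue
        if end is not None and article.published > end:
            continue
        if wanted is not None and article.newspaper not in wanted:
            continue
        hits.append(article)
    # Oldest first, so results read chronologically.
    return sorted(hits, key=lambda a: a.published)
```

    Leaving the date boxes or the newspaper list empty simply skips that filter, which matches how a form like this usually behaves.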

    When you’re scanning from microfilm, the quality of the film is the most important factor. Because their old filming was so bad, they decided to re-film their titles. They are doing the filming and digitization simultaneously. There’s a neat picture of a tech scanning some huge pages.

    Optical character recognition is handled separately for Fraktur and Roman. The Fraktur typefaces are totally different from the Roman ones.

    The feedback they have gotten has been positive. It’s free to use, and their users describe it as “great Cultural Service.” It has also started receiving references from researchers and other newspapers. They get about 160,000 visitors annually, with searches ranging from the local economy to Napoleon. They want automated text translation from Nordic to English. Maybe they could use the Amazon Mechanical Turk. Also, they want to incorporate more newspapers, including those protected by copyright. I think that last issue is going to be a big one for the future – our laws are so antiquated! Finally, they want to provide more personal service, but budgets are tight, so they can’t really staff it with people. Automated help and recommendation systems like Amazon’s might prove useful.

