Life and code.
  • IFLA Newspaper Conference: Alison Stevenson & Elizabeth Styron

    Posted on May 17th, 2006 Brian No comments

    Alison and Elizabeth from the New Zealand Electronic Text Centre, part of the Victoria University of Wellington, are the next presenters in the technical panel. They are working to put content online using semantic web frameworks and topic maps. This could be really cool….

    Alison starts. Challenges of newspaper content are:

    • How should content be delivered? Images? Transcribed text? Pages are nice because they show the placement on the page and other context, but text is good for searching and delivery in other formats. Poor OCR quality is understood to be a reason why text is often not used.
    • What level of access? (page vs. article)
    • How is it accessed? What kind of search? What pathways for browsing? (Date, region, topic, etc.) How do you provide access to machines? Machine access is often overlooked, but the value increases dramatically if you make it available. They didn’t mention Metcalfe’s Law, but they should have.
    • How extensible or customizable? Branding for third-party aggregators?
    • Usability

    So, on to the Topic Map stuff. It’s based on XML and TEI, as well as Topic Maps and CIDOC CRM. They use Apache Cocoon and Apache Lucene. (w00t!) The early versions were based on the work done at the University of Virginia. Due to costs, they went open source immediately. At UVA, the collections were cross-searchable, but there was no cross browsing, and there was no auto-linking between content. “You don’t know that the author whose book you’re reading has letters that could be browsed in the collection.”

    Now Elizabeth. “Topic Maps are an ISO standard for subject-based indexing.” They are the idea of a “back-of-the-book” index, but translated into the modern web-based world using modern standards. They layer on top of the resources. They build some ontologies based on “what they do,” and then describe the content using that. (This is precisely the kind of stuff I want to do with the RDC relation indexing stuff! Babak, can we get these two to work with us?) Some of the benefits they achieved are:

    • Related resources enable “serendipity” (Mentioned by Jim Wall this morning.)
    • Browsing across collections
    • Navigation and interface for all resources
    • Integration of lists of authority (What does this mean?)
    • Better visibility, such as accidental users (most of them) via Google. A hit onto an author’s page presents a generated resource index for that author, linking to all his content. Cool!
    • Some other stuff I missed.
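The “index layered on top of the resources” idea can be sketched minimally. This is not the NZETC implementation; the class, topic names, and resource IDs are all invented for illustration.

```python
# Minimal sketch of a topic-map-style subject index layered over
# resources. Topic names and resource IDs are hypothetical.

class Topic:
    def __init__(self, name, topic_type):
        self.name = name              # e.g. "Katherine Mansfield"
        self.topic_type = topic_type  # e.g. "person"
        self.occurrences = []         # resources where this topic appears

def link(topic, resource_id):
    """Record that a resource is about (or mentions) this topic."""
    topic.occurrences.append(resource_id)

author = Topic("Katherine Mansfield", "person")
link(author, "letters/km-1908")
link(author, "books/km-collected-stories")

# A generated "resource index" page for the author just lists every
# occurrence -- the Google-landing-page benefit mentioned above.
assert author.occurrences == ["letters/km-1908", "books/km-collected-stories"]
```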

    Back to Alison. The slide is titled “The TAO of Topic Maps.” These two are awesome! They used CIDOC CRM, which is an “event-based” model that works “really well for text-based” content. They have Publication Events in their model, which is a really interesting idea. They have nested publication events, one for year, month, day, which they use for browsing purposes. (The big problem with the semantic frameworks is defining semantics for everything. Tagging and folksonomies would work really well for dynamically adding that data by harnessing the collective intelligence of the users.)
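The nested publication events could plausibly back a browse tree like this. A quick sketch, assuming events nest year → month → day; the issue IDs and dates are made up.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sketch: nested publication events (year -> month -> day)
# used as a browse tree. Issue IDs and dates are invented.
def build_browse_tree(issues):
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for issue_id, d in issues:
        tree[d.year][d.month][d.day].append(issue_id)
    return tree

issues = [
    ("evening-post-1907-05-01", date(1907, 5, 1)),
    ("evening-post-1907-05-02", date(1907, 5, 2)),
]
tree = build_browse_tree(issues)

# Browsing drills down through the nested events:
assert tree[1907][5][1] == ["evening-post-1907-05-01"]
```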

    Work in the future includes: adding new navigation schemes, merging in topic maps from other external organizations, and “Subject topics.” Their current topics include names of people, places, and things. They are doing some AI work to classify topics, such as “earthquakes.”

    Very cool.

  • IFLA Newspaper Conference: Perry Willett

    Posted on May 17th, 2006 Brian No comments

    Perry Willett from the University of Michigan is the first of the technical panel presenters. He is discussing the university’s newspaper experiences. He begins with an overview of their DLXS system, which models books/journals, finding aids, and images, using standard XML/XSLT/CSS processing. They use their XPAT search engine.

    Their book model, which was the basis of their newspaper model, is:

    • One image per page, 100-200 KiB
    • Uncorrected OCR, about 1-3 KiB
    • Usually no more than 1-2 columns per page

    The usual usage scenario is “search the OCR text, and display a page image.” They also offer browsing by author and title, along with simple sequence-based navigation. Searches can also be limited by year.

    Their first project was digitizing British Library newspapers. (This was one of the projects Ed mentioned earlier.) Their next project was to digitize their student newspaper, the Michigan Daily. Publication began in 1890, but the preserved run is very spotty, often with narrow margins. Starting in the 1950s they have some locally made microfilm, but it is generally very low quality, despite coming from their own in-house microfilming processes.

    Newspapers are actually much different than books, though. The problems seem much more related to page size than anything else, although the multiplicity of columns is a notable exception.

    • No equipment to digitize such large sized pages
    • Pages are much larger, from 3-6 MiB
    • Many, irregular columns
    • Poor paper, with discolorations, tears, etc.
    • Poor print quality, leading to really crappy OCR, about 1-3 KiB

    One attempt to decrease the file size is to crop the articles out of the pages. Problems with that include jumping to the rest of the article when it is continued on another page. Another problem occurs when attempting to do full-text hit highlighting. If your word coordinate origin is at the page, then you have to re-work your coordinates for the newly cropped article. It’s not a hard problem to solve, but the lack of standardization proves annoying.
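The coordinate re-work itself is just a translation of the origin. A minimal sketch, assuming page-origin boxes of the form (x, y, width, height); the numbers are invented:

```python
# Sketch of re-basing word coordinates when an article is cropped out
# of a full page image. Boxes are (x, y, width, height) with the origin
# at the top-left of the page; all values here are hypothetical.

def rebase(word_box, crop_origin):
    """Translate a page-origin box to the cropped article's origin."""
    x, y, w, h = word_box
    cx, cy = crop_origin
    return (x - cx, y - cy, w, h)

# A word at (540, 1220) on the page, inside an article cropped at (500, 1200):
assert rebase((540, 1220, 80, 24), (500, 1200)) == (40, 20, 80, 24)
```

The annoyance is not the math but that every vendor picks its own origin and units, so this translation has to be renegotiated per data source.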

    Other problems include: often no authors for the articles, titles are not usually important, date browsing requires more specificity, and legal problems associated with web publication of wire service stories. He is also discussing some of the problems in the general community. These include:

    • Lack of standardization on column/page/article support
    • Lack of standardization between vendors and libraries and publishers on coordinates and relationships between parts
    • Copyright issues

    It seems to me that their attempt to shoehorn newspapers into the book model is really their downfall. Their system was not designed to be a general-purpose repository, and so it is painful to fit new content models into it.

  • IFLA Newspaper Conference: Ed King – British Library

    Posted on May 17th, 2006 Brian No comments

    Ed is giving a general overview of the state of newspaper publishing in the UK. It seems that growth is being seen in regional papers, often with news becoming less and less important. The free Metro London, with many different editions published throughout the day, is very highly circulated. The Sunday papers are showing decreases in circulation, still only single digits, but increasingly common. Those so-called “quality” Sundays that have changed formats or used some other gimmick have seen some increases. In general, it seems that the papers are fearful for their future in print.

    There have been rises in the online UK newspapers, like the Guardian. However, there are hundreds of regional newspapers, and Ed notes that, surprisingly, they have not been slow to take up web-based publishing. The BBC is the big one overall. It gets 1.5 million hits per day, and is often the default news source for Britons.

    Ed sees editing as the big purpose of modern newspapers. Newspapers are different than blogs because the latter is raw, whereas the newspapers are fact-checked. This mirrors what Jim Wall said earlier, that reliable sources will drive the trust in the publication. Ed calls it “thought-through” journalism.

    There’s a noted difference in how news providers are allowing access to their archives. The New York Times, for example, requires payment for access to its most popular archived articles. Others, like the Los Angeles Times, have opened up their access, assuming more access drives more advertising revenue.

    So where is the future going? There’s a quote from Rupert Murdoch discussing the always important role of “great journalism.” Further, he mentions Bill Gates’s major drive towards tablets as a personal information source. Somebody in the audience mentioned that tablets are already becoming a common standardized commodity within many libraries, but despite the increasing digitization, there is a future in print. I’m surprised to learn that print readership is actually up in India, which is a fairly technocentric society. Or maybe that growth is occurring on the non-technical fringe?

    The British Library is working on digitizing some newspapers. The Penny Illustrated Paper from 1860-1918 was the proof-of-concept, at about 50,000 pages. After that prototype, they started work on the British Newspaper Project, digitizing 2 million newspaper pages from 1800-1900. They seem to be approaching things similarly to NDNP, digitizing from microfilm, etc. And again, similar to NDNP, an existing collection of newspapers – the Burney Collection of Newspapers – is about to begin the process of digitization.

  • IFLA Newspaper Conference: Stephen Abram – VP of Innovation for Sirsi-Dynix

    Posted on May 17th, 2006 Brian No comments


  • XServe RAID vs. The Batmobile

    Posted on March 12th, 2006 Brian No comments

    When you think of Apple, you probably think of iPods and Powerbooks and iTunes. Apple is well known for making chic and sexy hardware and software. It’s a consumer company, right? Not entirely. Apple also makes server hardware, including some very high-quality storage arrays, and it seems that I’m not the only person who likes their storage solutions.

    We’re using XServe RAID arrays as part of the early storage for the NDNP project. Currently, we have four machines, each with dual, independent fibre channel SAN busses, with each bus servicing seven disks. Each machine has roughly 4 TiB of usable storage, giving us a total of 16 TiB of very sexy storage.

    Yes, even Apple’s rack-mounted hardware is sexy. The case has a fine, brushed metal finish. There are numerous useful status LEDs on the front, including blue LEDs for the disk activity lights. Further, there are blue LED meters on the front that show real-time storage bus utilization, reminiscent of the CPU load meters on the once-red-hot BeBox.

    The only problem with Apple’s hardware is the density. Our hardware is located on Capitol Hill, and space is at quite a premium. As part of our experimentation, we recently added a NexSan SATABeast to our collection. This monster, affectionately called The Batmobile, has 20 TiB of storage in a mere 4u, for a ratio of 5 TiB/u. Each of the XServes is 3u in height, so compared to the XServe RAID ratio of 1.3 TiB/u, the NexSan is the clear winner. We have yet to really put the Batmobile through its paces, though, so performance and maintenance may be a differentiating factor.
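For the record, the density arithmetic works out like this (capacities in TiB and rack heights in U are the figures from this post):

```python
# Storage density comparison from the post: capacity (TiB) / height (U).
xserve_density = 4 / 3      # one XServe RAID: 4 TiB in 3u
satabeast_density = 20 / 4  # one SATABeast: 20 TiB in 4u

assert round(xserve_density, 1) == 1.3
assert satabeast_density == 5.0  # nearly 4x denser per rack unit
```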