IFLA Newspaper Conference: Alison Stevenson & Elizabeth Styron

May 17, 2006

Alison and Elizabeth from the New Zealand Electronic Text Centre, part of the Victoria University of Wellington, are the next presenters in the technical panel. They are working to put content online using semantic web frameworks and topic maps. This could be really cool….

Alison starts. Challenges of newspaper content are:

How should content be delivered? Images? Transcribed text? Page images are nice because they show the placement on the page and other context, but text is good for searching and delivery in other formats. Poor OCR quality is a well-understood reason why text is often not used.
What level of access? (page vs. article)
How is it accessed? What kind of search? What pathways for browsing? (Date, region, topic, etc.) How do you provide access to machines? Machine access is often overlooked, but the value increases dramatically if you make it available. They didn’t mention Metcalfe’s Law, but they should have.
How extensible or customizable? Branding for third-party aggregators?
Usability

So, on to the Topic Map stuff. It’s based on XML and TEI, as well as Topic Maps and CIDOC CRM. They use Apache Cocoon and Apache Lucene. (w00t!) The early versions were based on the work done at the University of Virginia. Due to costs, they went open source immediately. At UVA, the collections were cross-searchable, but there was no cross-browsing, and there was no auto-linking between content. “You don’t know that the author whose book you’re reading has letters that could be browsed in the collection.”

Now Elizabeth. “Topic Maps are an ISO standard for subject-based indexing.” They are the idea of a “back-of-the-book” index, but translated into the modern web-based world using modern standards. They layer on top of the resources. They build some ontologies based on “what they do,” and then describe the content using that. (This is precisely the kind of stuff I want to do with the RDC relation indexing stuff! Babak, can we get these girls to work with us?) Some of the benefits they achieved are:

Related resources enable “serendipity” (Mentioned by Jim Wall this morning.)
Browsing across collections
Navigation and interface for all resources
Integration of lists of authority (What does this mean?)
Better visibility, such as accidental users (most of them) via Google. A hit onto an author’s page presents a generated resource index for that author, linking to all his content. Cool!
Some other stuff I missed.
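
To make the "index layered on top of the resources" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the class names, the topic identifiers, and the example.org URLs are made up, and this is not NZETC's actual data model or code. It only shows how topics, occurrences (links to resources), and associations (links between topics) could be enough to generate an author's resource index.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A subject in the index (hypothetical: a person, place, or work)."""
    topic_id: str
    name: str
    # Occurrences point at resources that are about this topic.
    occurrences: list = field(default_factory=list)


class TopicMap:
    """A tiny in-memory topic map: topics, occurrences, associations."""

    def __init__(self):
        self.topics = {}
        # topic_id -> list of (association type, other topic_id)
        self.associations = defaultdict(list)

    def add_topic(self, topic_id, name):
        self.topics[topic_id] = Topic(topic_id, name)

    def add_occurrence(self, topic_id, url):
        self.topics[topic_id].occurrences.append(url)

    def associate(self, assoc_type, a, b):
        # Store both directions so either topic can be the starting point.
        self.associations[a].append((assoc_type, b))
        self.associations[b].append((assoc_type, a))

    def resource_index(self, topic_id):
        """Collect every resource reachable from a topic in one hop."""
        index = list(self.topics[topic_id].occurrences)
        for _, other in self.associations[topic_id]:
            index.extend(self.topics[other].occurrences)
        return index


# Made-up sample data: one author linked to a book and a letter.
tm = TopicMap()
tm.add_topic("author-1", "Example Author")
tm.add_topic("book-1", "Example Book")
tm.add_topic("letter-1", "Example Letter")
tm.add_occurrence("book-1", "http://example.org/texts/book-1")
tm.add_occurrence("letter-1", "http://example.org/letters/letter-1")
tm.associate("authorship", "author-1", "book-1")
tm.associate("authorship", "author-1", "letter-1")

author_index = tm.resource_index("author-1")
```

Because the book and the letter live in the map rather than in separate collections, the author's page can list both, which is the cross-collection linking that was missing at UVA.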

Back to Alison. The slide is titled “The TAO of Topic Maps.” These two are awesome! They used CIDOC CRM, which is an “event based” model, and works “really well for text-based” content. They have Publication Events in their model, which is a really interesting idea. They have nested publication events, one each for year, month, and day, which they use for browsing purposes. (The big problem with the semantic frameworks is defining semantics for everything. Tagging and folksonomies would work really well for dynamically adding that data by harnessing the collective intelligence of the users.)
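
The nested publication events are easy to picture in code. Below is a rough Python sketch of how year, month, and day events might nest to support date browsing. The function names and the sample articles are hypothetical, and the nested dictionaries are just my stand-in for whatever the CIDOC CRM-based model really looks like.

```python
from collections import defaultdict
from datetime import date


def build_publication_events(items):
    """Nest titles under year -> month -> day publication events."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for title, published in items:
        tree[published.year][published.month][published.day].append(title)
    return tree


def browse(tree, year, month=None, day=None):
    """Return titles under a year, or narrowed to a month or a single day."""
    # .get() avoids creating empty entries for dates with no content.
    months = tree.get(year, {})
    if month is not None:
        months = {month: months.get(month, {})}
    titles = []
    for m in sorted(months):
        days = months[m]
        if day is not None:
            days = {day: days.get(day, [])}
        for d in sorted(days):
            titles.extend(days[d])
    return titles


# Made-up sample data for illustration only.
items = [
    ("Article A", date(1906, 5, 17)),
    ("Article B", date(1906, 5, 18)),
    ("Article C", date(1906, 6, 1)),
    ("Article D", date(1907, 1, 3)),
]
tree = build_publication_events(items)

whole_year = browse(tree, 1906)
one_month = browse(tree, 1906, 5)
one_day = browse(tree, 1906, 5, 17)
```

Each level of the tree acts as its own browsing entry point, so a reader can start at a year and drill down to a single day's issue.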

Work in the future includes: adding new navigation schemes, merging in topic maps from external organizations, and “Subject topics.” Their current topics include names of people, places, things. They are doing some AI work to classify topics, such as “earthquakes.”

Very cool.

Brian Vargas