Life and code.
  • Hadoop World 2009

    Posted on October 5th, 2009 Brian No comments

    I had the privilege of attending Hadoop World 2009 on Friday.  It was amazing to meet, listen to, and pick the brains of so many smart people.  The quantity of good work being done on this project is simply stunning, but it is equally stunning how much farther there remains to go.  Some interesting points for me include:

    Yahoo’s Enormous Clusters

    Eric Baldeschwieler from Yahoo gave an impressive talk about what they’re doing with Hadoop.  Yahoo is running clusters at a simply amazing scale.  They have several different clusters, totaling some 86 PB of disk space, but their largest is a 4000-node cluster with 16 PB of disk, 64 TB of RAM, and 32,000 CPU cores.  One of the most compelling points they made was that Yahoo’s experiences prove that Hadoop really does scale as designed.  If you start with a small grid now, you can be sure that it will scale up – way up.

    Eric made it clear that Yahoo uses Hadoop because it so vastly improves the productivity of their engineers.  He noted that, though the hardware is commodity, the grid isn’t necessarily a cheaper solution; however, it easily pays for itself through the increased turnaround on problems.  In the old days, it was difficult for engineers to try out new ideas, but now you can try out a Big Data idea in a few hours, and see how it goes.

    A great example is the search suggestions on the Yahoo front page.  Using Hadoop, they cut the time to generate those suggestions from 26 days to 20 minutes.  Wow!  For the icing on the cake, the code was converted from C++ to Python, and development time went from 2-3 weeks to 2-3 days.

    HDFS For Archiving

    HDFS hasn’t been used much as an archival system yet, especially not with the time horizons of someplace like my employer.  When I asked him about it, Eric told me that the oldest data on Yahoo’s clusters is not much more than a year old.  Ironically, they tend to be more concerned about removing data from the servers due to legal mandates and privacy requirements than about keeping it around for a Very Long Time.  But he sees the need to hold some data for longer periods coming soon, and has promised he’ll be thinking about it.

    Facebook, though, is already making moves in this area.  They currently “back up” their production HDFS grid using Hive replication to a secondary grid, but they are working on (or already have – it wasn’t quite clear how far along this all was) an “archival cluster” solution.  A daemon would scan for least-recently-used files and opportunistically move them to a cluster built with more storage-heavy nodes, leaving a symlink stub in place of each file.  When a request for that stub file comes in, the daemon intercepts it and begins pulling the data back off the archive grid.  This is quite similar to how SAM-QFS works today.  I had a chance to speak with Dhruba Borthakur for a bit afterwards, and he had some interesting ideas about modifying the HDFS block scheduler to make it friendly for something like MAID.
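
    To make that flow a bit more concrete, here’s a rough sketch of what such an archiver could look like against Hadoop’s FileSystem API.  To be clear, this is my own illustration, not Facebook’s code: the class name, the age threshold, and the stub-file convention are all made up, and the recall path is only hinted at in the comments.

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Hypothetical archiver: move least-recently-used files to a storage-heavy
    // "archive" cluster and leave a small stub behind on the primary cluster.
    public class ArchiverSketch {

        private static final long AGE_THRESHOLD_MS = 90L * 24 * 60 * 60 * 1000;  // made-up cutoff

        public static void archiveColdFiles(Configuration conf, Path dir) throws IOException {
            FileSystem primary = FileSystem.get(URI.create("hdfs://primary-nn/"), conf);
            FileSystem archive = FileSystem.get(URI.create("hdfs://archive-nn/"), conf);
            long cutoff = System.currentTimeMillis() - AGE_THRESHOLD_MS;

            for (FileStatus status : primary.listStatus(dir)) {
                if (status.isDir() || status.getAccessTime() >= cutoff) {
                    continue;  // only plain files that have not been read recently
                }
                Path src = status.getPath();
                Path dst = new Path("/archive" + src.toUri().getPath());

                // Copy the file to the archive cluster and remove the original.
                FileUtil.copy(primary, src, archive, dst, true, conf);

                // Leave a stub at the original path recording where the data went;
                // a read against the stub would trigger a recall from the archive grid.
                FSDataOutputStream stub = primary.create(src);
                stub.writeUTF(dst.toString());
                stub.close();
            }
        }
    }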

    Jaesun Han from NexR gave a talk on Terapot, a system for long-term storage and discovery of emails due to legal requirements and litigation.  I asked him whether they were relying on HDFS as their primary storage mechanism, or if they “backed up” to some external solution.  He laughed, and said that they weren’t using one now, but would probably get some sort of tape solution in the near future.  He also said that he believed HDFS was quite capable of being the sole archival solution, and I believe he was implying that it was fear from the legal and/or management folks that was driving a “back up” solution.  At this point, the Cloudera CTO noted that both Yahoo and Facebook had no “back up” solution for HDFS, except for other HDFS clusters.  It certainly seems like at least a couple of multi-million-dollar companies are willing to put their data where their mouth is on the reliability of HDFS.

    What’s Coming

    There is a tremendous sense that Hadoop has really matured in the last year or so.  But it’s also been noted that the APIs are still thrashing a bit, and it’s still awfully Java-centric.  Now that the underlying core is pretty solid, it seems like a lot of the work is moving towards making your Hadoop grid accessible to the rest of the company – not just the uber-geek Java coders.

    Doug Cutting talked about how they’re working on building some solid, future-proof APIs for 0.21.  Included in this is switching the RPC format to Avro, which is intended to solve some of the underlying issues with Thrift and Protocol Buffers while opening up the RPC and data format to a broader class of languages.  It’s worth noting that Avro and JSON are pretty easily transcoded to one another.  Also, they’ll finally be putting some serious thought into a real authentication and authorization scheme.  Yahoo (I think) mentioned Kerberos – let’s hope we get some OpenID up in that joint, too.
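
    As a tiny illustration of that Avro/JSON affinity, here’s a sketch of writing a record out as JSON with the generic Java API from a recent Avro release.  The schema and field names are made up for illustration; the point is that swapping the JSON encoder for a binary one changes the wire format without touching the record.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.Encoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroJsonSketch {
        public static void main(String[] args) throws Exception {
            // A made-up record schema, purely for illustration.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Page\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"fetchTime\",\"type\":\"long\"}]}");

            GenericRecord record = new GenericData.Record(schema);
            record.put("url", "http://example.org/");
            record.put("fetchTime", 1254700800000L);

            // Write the record as JSON; a binary encoder would serialize the same record.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
            Encoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
            writer.write(record, encoder);
            encoder.flush();

            System.out.println(out.toString("UTF-8"));
        }
    }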

    There is a sudden push towards making Hadoop accessible via various UIs.  Cloudera introduced their Hadoop Desktop, Karmasphere gave a whirlwind tour of their Netbeans-based IDE, and IBM was showing off a spreadsheet metaphor on top of Hadoop called M2 (I can’t find any good links for it).  I hadn’t thought about a spreadsheet interface before, and it seemed so simple it was brilliant; Doug Cutting mentioned the idea, too, so it has some cachet.

    Final Thoughts

    It is worth noting that Facebook seems to be driving a lot of the really cool backend stuff, and people are noticing.  That’s not to say other organizations aren’t doing cool things, but during the opening presentations, Facebook got all the questions.  I mean, Dhruba recently posted a patch adding error-correcting codes on top of HDFS.  How cool is that?!

  • Multi-Threading with VFS

    Posted on May 14th, 2009 Brian No comments

    One of the new features in the BagIt Library will be multi-threading CPU-intensive bag processing operations, such as bag creation and verification.  Modern processors are all multi-core, but because the current version of the BagIt Library is not utilizing those cores, bag operations take longer than they should.  The new version of BIL should create and verify bags significantly faster than the old version.  Of course, as we add CPUs, we shift the bottleneck to the hard disk and IO bus, but it’s an improvement nonetheless.

    Writing proper multi-threaded code is a tricky proposition, though.  Threading is a notorious minefield of subtle errors and difficult-to-reproduce bugs.  When we turned on multi-threading in our tests, we ran into some interesting issues with the Apache Commons VFS library we use to keep track of file locations.  It turns out that VFS is not really designed to be thread-safe.  Some recent list traffic seems to indicate that this might be fixed sometime in the future, but it’s certainly not the case now.

    Now, we don’t want to lose VFS – it’s a huge boon.  Its support for various serialization formats and virtual files makes modeling serialized and holey bags a lot easier.  So we had to figure out how to make VFS work cleanly across multiple threads.

    The FileSystemManager is the root of one’s access to the VFS API.  It does a lot of caching internally, and the child objects coming from its methods often hold links back to each other via the FileSystemManager.  If you can isolate a FileSystemManager object per-thread, then you should be good to go.

    The usual way of obtaining a VFS is through the VFS.getManager() method, which returns a singleton FileSystemManager object.  Our solution was to replace the singleton call with a ThreadLocal variable, with the initialValue() method overridden to create and initialize a new StandardFileSystemManager.  The code for that looks like this.

    // Requires commons-vfs 1.x and commons-logging; imports shown for completeness.
    import org.apache.commons.logging.LogFactory;
    import org.apache.commons.vfs.FileSystemException;
    import org.apache.commons.vfs.FileSystemManager;
    import org.apache.commons.vfs.VFS;
    import org.apache.commons.vfs.impl.StandardFileSystemManager;

    // One FileSystemManager per thread, so no VFS internals are ever shared across threads.
    private static final ThreadLocal<FileSystemManager> fileSystemManager = new ThreadLocal<FileSystemManager>() {
        @Override
        protected FileSystemManager initialValue() {
            StandardFileSystemManager mgr = new StandardFileSystemManager();
            mgr.setLogger(LogFactory.getLog(VFS.class));
            try {
                mgr.init();
            } catch (FileSystemException e) {
                log.fatal("Could not initialize thread-local FileSystemManager.", e);
            }
            return mgr;
        }
    };

    The downside is that we lose the internal VFS caching that the manager does (although it still caches inside of a thread).  But that’s a small price to pay for it working.
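
    For completeness, usage from a worker thread then looks roughly like this (the path is made up):

    // Each worker thread transparently gets its own manager from the ThreadLocal.
    // (FileObject is org.apache.commons.vfs.FileObject.)
    FileObject bagFile = fileSystemManager.get().resolveFile("zip:/tmp/mybag.zip");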

  • Funny Smelling Code – Endlessly Propagating Parameters

    Posted on May 8th, 2009 Brian No comments

    We’re currently working on a new version of the BagIt Library: adding some new functionality, making some bug fixes, and refactoring the interfaces pretty heavily.  If you happen to be one of the people currently using the programmatic interface, the next version will likely break your code.  Sorry about that.

    The BagIt spec is pretty clear about what makes a bag valid or complete, and it might seem a no-brainer to strictly implement validation based on the spec.  Unfortunately, the real world is not so simple.  For example, the spec is unambiguous about the required existence of the bagit.txt, but we have real bags on disk (from before the spec existed) that lack the bag declaration and yet need to be processed.  As another example, hidden files are not mentioned at all by the spec, and the current version of the code treats them in an unspecified manner.  On Windows, when the bag being validated has been checked out from Subversion, the hidden .svn folders cause unit tests to fail all over the place.

    It seems an easy enough feature to add some flags to make the bag processing a bit more lenient.  In fact, the checkValid() method already had an overloaded version which took a boolean indicating whether or not to tolerate a missing bagit.txt.  I began by creating an enum which contained two flags (TOLERATE_MISSING_DECLARATION and IGNORE_HIDDEN_FILES), and began retrofitting the enum in place of the boolean.

    And then I got a whiff.

    I found that, internally, the various validation methods call one another, passing the same parameters over and over.  Additionally, the validation methods weren’t using any privileged internal information during processing – only public methods were being called.

    I called Justin this morning to discuss refactoring the validation operations using a Strategy pattern.  This would allow us to:

    1. Encapsulate the parameters to the algorithm, making the code easier to read and maintain.  No more long lists of parameters passed from function call to function call.
    2. Vary the algorithm used for processing based on the needs of the caller.
    3. Re-use standard algorithm components (either through aggregation or inheritance), simplifying one-off cases.

    He had also come to the same conclusion, although driven by a different parameter set.  It’s a good sign you’re headed in the right direction when two developers independently hacking on the code come up with the same solution to the same problem.
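
    For the curious, here’s a rough sketch of the shape this refactoring could take.  The names below are hypothetical, not the final BIL API; the point is simply that the flags travel with the strategy object instead of riding along on every method call.

    // Hypothetical sketch only; the real BagIt Library names will differ.
    interface Bag {
        Object getBagItTxt();   // the bag declaration, or null if it is missing
    }

    interface ValidationStrategy {
        boolean isValid(Bag bag);
    }

    class StandardValidationStrategy implements ValidationStrategy {
        private final boolean tolerateMissingDeclaration;
        private final boolean ignoreHiddenFiles;

        StandardValidationStrategy(boolean tolerateMissingDeclaration, boolean ignoreHiddenFiles) {
            this.tolerateMissingDeclaration = tolerateMissingDeclaration;
            this.ignoreHiddenFiles = ignoreHiddenFiles;
        }

        public boolean isValid(Bag bag) {
            // The flags are encapsulated here rather than in a long parameter list.
            if (!tolerateMissingDeclaration && bag.getBagItTxt() == null) {
                return false;
            }
            // ... manifest and payload checks would go here, skipping hidden
            // files when ignoreHiddenFiles is set ...
            return true;
        }
    }

    // A caller picks (or composes) the algorithm it needs:
    //   ValidationStrategy strategy = new StandardValidationStrategy(true, true);
    //   boolean ok = strategy.isValid(bag);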

  • What You Learn

    Posted on May 5th, 2009 Brian 1 comment

    Version 2.0 of Chronicling America went online yesterday.  Congratulations are in order to David, Ed, Dan, Curt, and everybody else on the team!

    The new version looks almost the same as the old version, but is entirely different on the back-end.  The 1.x series attempted to build an end-to-end digital repository using XML-centric technologies — FEDORA and Cocoon — with a more traditional Django and MySQL web application.  The original version was complicated, slow, scaled poorly, and suffered stability problems from day one.  We had a very restrictive robots.txt in place because search crawlers would regularly crash the application.

    The new version finally has navigable permalinks, makes some vast improvements in the ingest workflow, and sports some RDF data linked in from the HTML pages.  It scales more predictably, is a lot more stable, and has a vastly smaller codebase.  I had very little to do with the development of the new version, mostly providing advice and historical perspective.

    The retirement of the old code is a little bittersweet and definitely humbling.  After all, my primary contribution to the 2.0 release was lessons in how not to build a repository system.  Fortunately, much of the knowledge gained from the first release has made its way into other projects — spawning some, improving others.  Useful tools and concepts like BagIt, workflow, transfer, and bit storage have all been informed by anecdotes, scenarios, situations, and problems from Chronicling America 1.0.

    It’s true that you learn more from failure than from success.  But it sure ain’t pretty on the ego.

  • Good News For Libraries: DMCA Not Applicable

    Posted on August 5th, 2008 Brian 1 comment

    Sure, it’s a brain-dead piece of legislation that simultaneously undermines the longevity of our modern cultural heritage and turns graduate students into criminals, but at least Congress got one part of the DMCA right – even if only by accident. The US Court of Appeals ruled last week that the DMCA doesn’t apply to the government.

    So after you get over your revulsion at the apparent hypocrisy involved, stop for a minute and think about what great news this is! If the government is excluded from the DMCA, then that means that public libraries are excluded from the DMCA. So while the Air Force might give a half-hearted Huzzah! for their minor victory in cheating one of their own out of the fruits of his labor, the entire preservation community can throw a big party. Libraries can now legally pay to create DRM-circumvention software, finally giving them a chance to legally create and archive copies that might still be playable in fifty years. We might finally have a chance to save our digital heritage!

    And lest you think this is merely academic, let me point you at my recent tour of the National Audio/Visual Conservation Center. The NAVCC is charged with preserving both analog and digital movies. The latter often come on DVDs, and preserving those today means either capturing and re-encoding an analog version of the movie (not so good) or deciphering the video files with (up until now) illegal software. The difficulty they face really hit home when we toured through a room filled with dozens of DVD drives, used to capture movies.

  • The Future of Digital Archives

    Posted on April 12th, 2008 Brian No comments

    Ever since the announcement that portions of the Library’s new Library of Congress Experience initiative would have some user-interface components implemented using Microsoft’s Silverlight, in exchange for some $3 million worth of kiosk hardware from Microsoft, the net has been ablaze with fury over some perceived “selling-out” of the organization. For some reason, people think that www.loc.gov will shortly be 301ing to loc.microsoft.com, or something ridiculous.

    (Disclaimer: These are my opinions, based on my experience and work as a digital librarian. They are not necessarily those of the Library of Congress, or anybody else. Also, I am deliberately avoiding the topic of content born-digital in a proprietary format. It’s an especially painful area for digital archivists, as the content is still important notwithstanding the problems reading it both now and in the future, and it really doesn’t bear on the current complaints, anyway.)

    It’s a tricky problem, to be sure. The cost of digital conversion, preservation, and access is extremely high. The cost outlay by the National Endowment for the Humanities for digitizing the 1/2 million (and growing!) newspaper pages in Chronicling America runs into the millions of dollars. And that doesn’t even account for the cost of keeping that data around for any length of time! Funding has to come from somewhere, and so, as Ars says, there will likely be more and more of these quid pro quo arrangements in the future.

    But I’ll let you in on a little secret: It doesn’t matter. The future of digital libraries, and of organizations like the Library of Congress, isn’t in the web site that people visit. It isn’t in the technology choices made for viewing content. And it certainly isn’t going to be controlled by any particular company.

    See, though the user interface may be written in something proprietary, like Flash or Silverlight, the archived bits aren’t stored that way. Librarians are an insanely conservative bunch, made that way through hundreds of years of experience in attempting to keep old stuff and make it available for new people. The guidelines for digital preservation reflect that. We’re still working with TIFFs for a good reason: TIFF has been around for decades, and it’ll continue to be around for decades. We know we’ll be able to read them in a hundred years. In the library world, lack of shininess on a file format isn’t a bug – it’s a feature!

    But the real reason it doesn’t matter is because the digital library as a web site you visit to view content won’t exist in a decade. Instead, we’ll be serving out content via web-exposed APIs, opening the inner archives to anyone who cares to look. Sure, there will still be curated special presentations of content just like the new Experience program, and they might even use some new, fancy, proprietary technology, maybe even in exchange for $200 million of computer equipment (adjusting for inflation, here). But those special presentations will just be the tip of the content iceberg, and the rest will be there to look at in very standard and open formats.

  • Chronicling America – Now With More Page!

    Posted on March 14th, 2008 Brian No comments

    I spend a significant fraction of my time working at the Library on Chronicling America. We went live with the latest release a little while ago, finally crossing the half-million-page mark. We were just shy of it last time, but as of now we have 568,261 pages for your perusal.

  • Waterboarding in the Press – 106 Years Ago

    Posted on February 21st, 2008 Brian No comments

    The current debate over the use of waterboarding seems new and fresh and relevant. But if working on a digital archive of historical newspapers has taught me anything, it’s that we’ve pretty much already done everything.

    I’d never thought to search for waterboarding in Chronicling America, but the post on boingboing this morning led me to do a search for water cure, with some fascinating results.

    It turns out the practice was just as fraught with controversy then as it is now. It prompted a Congressional investigation, and the court-martial of a general (although he was acquitted).

    It was even used as a discipline tool in a Kansas mental hospital! The article states that the torture was used on patients who failed to follow the orders of the institution’s head, Miss Houston, who administered the punishment herself. The practice ceased when Miss Houston was replaced as head of the hospital by the ostensibly more gentle Miss Gower. She merely had the problem patients “strapped to a bench and whipped.”

  • Chronicling America Update

    Posted on July 1st, 2007 Brian No comments

    As usual, I’ve been crazy busy. We released a new version of my project at the Library of Congress on June 14th. The cut-over was nice and smooth, with only a few minutes of downtime while we switched over from the live application to the staged application. Our development team did a bang-up job on this release, with a bunch of new features, enhancements, clean-up, and about 80,000 more newspaper pages.

    So, what’s new in Chronicling America version 1.1? For starters, we’ve put in place permanently-resolvable URLs, based on data derived from the paper’s and page’s metadata. Thus, we’re putting an explicit guarantee out there that the URL http://www.loc.gov/chroniclingamerica/lccn/sn84026749/1904-01-30/ed-1/seq-1 will always get you back to page 1 of the January 30, 1904 edition of the Washington Times. (For completeness’ sake, I should note that it actually gets you to image 1, not page 1. They aren’t always the same thing!) Similarly, stripping off various parts of the URL will lead to the appropriate places. For example, http://www.loc.gov/chroniclingamerica/lccn/sn84026749 will get you to the directory record for the 1902-1939 Washington Times.
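
    (If you ever want to build one of these permalinks yourself from the metadata, it’s just string assembly. A hypothetical helper, not part of our actual codebase, might look like this.)

    // Hypothetical helper: assemble a page permalink from its metadata components.
    static String pageUrl(String lccn, String issueDate, int edition, int sequence) {
        return String.format(
            "http://www.loc.gov/chroniclingamerica/lccn/%s/%s/ed-%d/seq-%d",
            lccn, issueDate, edition, sequence);
    }

    // pageUrl("sn84026749", "1904-01-30", 1, 1)
    //   -> http://www.loc.gov/chroniclingamerica/lccn/sn84026749/1904-01-30/ed-1/seq-1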

    If you clicked on that page link, you’ll have no doubt noticed our new and shiny dynamic page viewer, inspired by similar UIs like Google Maps. It has all of the functionality of our old, clunky Flash viewer, while simultaneously being faster, prettier, and easier to use.

    On the back-end, we made some pretty drastic improvements to the speed of our image rendering. Not only do we simply render images faster, but we also make more careful choices about how we ask for the images to be rendered, based on the information specific to each image’s JPEG 2000 service copy. Hopefully you’ll find the system quite a bit more responsive.

    And, of course, there are roughly 80,000 new pages in the system. That brings our grand total to over 300,000 – and counting! Expect even more in the future.

  • Chronicling America

    Posted on March 20th, 2007 Brian No comments

    Ever wonder what I do for a living? Check it out: Chronicling America is now live!

    That site is the direct public access portion of the National Digital Newspaper Program. The program details are all there on the Library’s NDNP page. Feel free to check it out if it suits you; I won’t go into it much more here.

    But I will talk about the technical aspects of the program. It has consumed me for the last few years now, so I have a lot to say! Those posts will come out in the future, under the Digital Libraries category here on my site. If you’re interested only in that, you can subscribe to the Digital Libraries RSS feed.

    For now, though, let’s take a tour of some of my favorites among the quarter-million-odd pages now online:

    • The Washington Times: March 31, 1905 – I found this page while doing a test search for cheese, and the headline in the middle of the page immediately struck me: “Wanted Men-Haters to Act as Teachers”. Oh how far we’ve come in a mere 102 years! The differences in culture between then and now astound me, and the Chronicling America repository is replete with examples. Words, articles, and pictures that today would be considered racist, misogynist, and classist abound. The title of The Colored American might seem out of place today, but in 1905 it was just a descriptive newspaper title.
    • The San Francisco Call: June 10, 1900 – Bathing suits were indistinguishable from cocktail dresses! How did anybody swim in these things?
    • The San Francisco Call: September 7, 1901 – The assassination of President McKinley is big news. What a front page!
    • The Washington Herald: July 11, 1909 – It seems somebody who lived in the same building I do – almost 100 years ago! – lost a hat pin, and would like it returned. I’ll keep an eye out for it. I wonder how much the reward was?