Life and code.
RSS icon Email icon Home icon
  • Multi-Threading with VFS

    Posted on May 14th, 2009 Brian No comments

    One of the new features in the BagIt Library will be multi-threading CPU-intensive bag processing operations, such as bag creation and verification.  Modern processors are all multi-core, but because the current version of the BagIt Library is not utilizing those cores, bag operations take longer than they should.  The new version of BIL should create and verify bags significantly faster than the old version.  Of course, as we add CPUs, we shift the bottleneck to the hard disk and IO bus, but it’s an improvement nonetheless.

    Writing proper multi-threaded code is a tricky proposition, though.  Threading is a notorious minefield of subtle errors and difficult-to-reproduce bugs.  When we turned on multi-threading in our tests, we ran into some interesting issues with the Apache Commons VFS library we use to keep track of file locations.  It turns out that VFS is not really designed to be thread-safe.  Some recent list traffic seems to indicate that this might be fixed sometime in the future, but it’s certainly not the case now.

    Now, we don’t want to lose VFS – it’s a huge boon.  Its support for various serialization formats and virtual files makes modeling serialized and holey bags a lot easier.  So we had to figure out how to make VFS work cleanly across multiple threads.

    The FileSystemManager is the root of one’s access to the VFS API.  It does a lot of caching internally, and the child objects coming from its methods often hold links back to each other via the FileSystemManager.  If you can isolate a FileSystemManager object per-thread, then you should be good to go.

    The usual way of obtaining a VFS is through the VFS.getManager() method,which returns a singleton FileSystemManager object.  Our solution was to replace the singleton call with a ThreadLocal variable, with the initialValue() method overloaded to create and initialize a new StandardFileSystemManager.  The code for that looks like this.

    private static final ThreadLocal fileSystemManager = new ThreadLocal() { @Override protected FileSystemManager initialValue() { StandardFileSystemManager mgr = new StandardFileSystemManager(); mgr.setLogger(LogFactory.getLog(VFS.class)); try { mgr.init(); } catch (FileSystemException e) { log.fatal("Could not initialize thread-local FileSystemManager.", e); } return mgr; } };

    The downside is that we lose the internal VFS caching that the manager does (although it still caches inside of a thread).  But that’s a small price to pay for it working.

  • Funny Smelling Code – Endlessly Propagating Parameters

    Posted on May 8th, 2009 Brian No comments

    We’re currently working on a new version of the BagIt Library: adding some new functionality, making some bug fixes, and refactoring the interfaces pretty heavily.  If you happen to be one of the people currently using the programmatic interface, the next version will likely break your code.  Sorry about that.

    The BagIt spec is pretty clear about what makes a bag valid or complete, and it might seem a no-brainer to strictly implement validation based on the spec.  Unfortunately, the real-world is not so simple.  For example, the spec is unambiguous about the required existence of the bagit.txt, but we have real bags on-disk (from before the spec existed) that lack the bag declaration and yet need to be processed.  As another example, hidden files are not mentioned at all by the spec, and the current version of the code treats them in an unspecified manner.  On Windows, when the bag being validated has been checked out from Subversion, the hidden .svn folders cause unit tests to fail all over the place.

    It seems an easy enough feature to add some flags to make the bag processing a bit more lenient.  In fact, the checkValid() method already had an overloaded version which took a boolean indicating whether or not to tolerate a missing bagit.txt.  I began by creating an enum which contained two flags (TOLERATE_MISSING_DECLARATION and IGNORE_HIDDEN_FILES), and began retrofitting the enum in place of the boolean.

    And then I got a whiff.

    I found that, internally, the various validation methods call one another, passing the same parameters over and over.  Additionally, the validation methods weren’t using any privileged internal information during processing – only public methods were being called.

    I called Justin this morning to discuss refactoring the validation operations using a Strategy pattern.  This would allow us to:

    1. Encapsulate the parameters to the algorithm, making the code easier to read and maintain.  No more long lists of parameters passed from function call to function call.
    2. Vary the algorithm used for processing based on the needs of the caller.
    3. Re-use standard algorithm components (either through aggregation or inheritance), simplifying one-off cases.

    He had also come to the same conclusion, although driven by a different parameter set.  It’s a good sign you’re headed in the right direction when two developers independently hacking on the code come up with the same solution to the same problem.