Multi-Threading with VFS
One of the new features in the BagIt Library (BIL) will be multi-threaded execution of CPU-intensive bag processing operations, such as bag creation and verification. Modern processors are all multi-core, but because the current version of the library does not use those extra cores, bag operations take longer than they should. The new version of BIL should create and verify bags significantly faster than the old version. Of course, as we add CPUs, we shift the bottleneck to the hard disk and IO bus, but it’s an improvement nonetheless.
Writing proper multi-threaded code is a tricky proposition, though. Threading is a notorious minefield of subtle errors and difficult-to-reproduce bugs. When we turned on multi-threading in our tests, we ran into some interesting issues with the Apache Commons VFS library we use to keep track of file locations. It turns out that VFS is not really designed to be thread-safe. Some recent list traffic seems to indicate that this might be fixed sometime in the future, but it’s certainly not the case now.
Now, we don’t want to lose VFS – it’s a huge boon. Its support for various serialization formats and virtual files makes modeling serialized and holey bags a lot easier. So we had to figure out how to make VFS work cleanly across multiple threads.
The FileSystemManager is the root of one’s access to the VFS API. It does a lot of caching internally, and the child objects coming from its methods often hold links back to each other via the FileSystemManager. If you can isolate a FileSystemManager object per-thread, then you should be good to go.
The usual way of obtaining a FileSystemManager is through the VFS.getManager() method, which returns a singleton FileSystemManager object. Our solution was to replace the singleton call with a ThreadLocal variable whose initialValue() method is overridden to create and initialize a new StandardFileSystemManager. The code for that looks like this:
    private static final ThreadLocal<FileSystemManager> fileSystemManager =
            new ThreadLocal<FileSystemManager>() {
        @Override
        protected FileSystemManager initialValue() {
            // Runs once per thread, on that thread's first call to get().
            StandardFileSystemManager mgr = new StandardFileSystemManager();
            mgr.setLogger(LogFactory.getLog(VFS.class));

            try {
                mgr.init();
            } catch (FileSystemException e) {
                // The manager is still returned below, but it is uninitialized.
                log.fatal("Could not initialize thread-local FileSystemManager.", e);
            }

            return mgr;
        }
    };
The downside is that the managers no longer share VFS’s internal cache: each thread’s manager caches only what that thread has touched, so the same file may be resolved once per thread. But that’s a small price to pay for it working.
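To see the per-thread behavior in isolation, here is a minimal sketch that runs without VFS on the classpath. FakeManager is a hypothetical stand-in for StandardFileSystemManager, and the class and method names are ours, not part of VFS or the BagIt Library. It uses ThreadLocal.withInitial (Java 8 and later) in place of the anonymous subclass above, which should behave the same way: each thread gets its own manager on first access and reuses it afterwards.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class PerThreadManagerDemo {

    // Hypothetical stand-in for StandardFileSystemManager; not a VFS class.
    static final class FakeManager {
        private static final AtomicInteger CREATED = new AtomicInteger();
        final int id = CREATED.incrementAndGet(); // unique per instance
    }

    // One manager per thread, created lazily on that thread's first get().
    private static final ThreadLocal<FakeManager> MANAGER =
            ThreadLocal.withInitial(FakeManager::new);

    static FakeManager manager() {
        return MANAGER.get();
    }

    // Starts threadCount threads and counts how many distinct managers they saw.
    static int countDistinctManagers(int threadCount) {
        Set<Integer> ids = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                ids.add(manager().id);
                ids.add(manager().id); // second call reuses the same instance
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ids.size();
    }

    // Within a single thread, repeated lookups return the same object.
    static boolean reusedWithinThread() {
        return manager() == manager();
    }

    public static void main(String[] args) {
        System.out.println("distinct managers: " + countDistinctManagers(4));
        System.out.println("reused within thread: " + reusedWithinThread());
    }
}
```

Four worker threads end up with four separate managers, while repeated calls on one thread keep returning the same object. That is the trade-off described above: isolation between threads, with caching preserved only inside each thread.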