Mercurial: Super-Undo for MarcEdit

MarcEdit is a powerful software suite for working with MARC records.  It can convert MARC records between various MARC formats and character encodings, and help you harvest new records through its Z39.50 client.

One of the more popular uses of MarcEdit is “breaking” binary MARC records into a mnemonic text format, editing them, and then “making” them back into binary MARC where they can be loaded into a library catalog. For example, you might download a batch of ebook MARC records from a vendor, and do edits like the following:

  • Add a 099 that is your local universal call number for ebooks.
  • Delete that weird 912 that your tag table doesn’t recognize.
  • Hop through the file, looking at each 520, making sure it isn’t an excessively long ad from the back of the book.
  • Delete all of the links you don’t need.
  • Add a proxy prefix to all of the links you do need.
  • Insert an added entry for the aggregator, only if it isn’t there already.

Near the end of this process you might realize, “Oh shoot! I forgot that the links can have more than one format, and deleted too many of them!” Unfortunately, you deleted them near the beginning of the process, so to easily retrieve them you have to start over, losing all that time you spent looking at 520s. Even if you saved your work frequently, you only have the latest revision saved. Can you hit ‘Undo’ enough times to get back far enough?  Probably not, and how would you know when you were done?

Next time, you want to avoid something like this happening again, so you rename the file every time you save it. How should you name all the different revisions? ebooks1, ebooks2, ebooks3? ebooks-after-099, ebooks-after-912, ebooks-520-ok?

Maybe you don’t do your edits in the same order every time. How do you tell which file is the best version and how much work is left to do (or redo)?

If those files are large, how long until IT calls to ask why you have so many copies of the same MARC records cluttering the system?

Heaven forbid you work on the project with other people. They could learn your file renaming scheme, but may just invent their own instead, so you end up with files like ebooks2-after-520s-send-to-kathryn-final-revised-broken.

This can be made to work, but there is an easier way.

Version Control

What you are looking for is a version control system. Programmers often use version control systems to manage their source code, but such systems also work well for any kind of text file.  With proper version control you can keep every previous version of your file, while keeping your directory tidy and using not much more space than a single copy of the file would take.

There are lots of version control systems to choose from, but the one I use most often with MarcEdit is Mercurial. It is free, cross-platform (and easy to install on Windows), and takes minimal hassle to set up new repositories. I create a new repository at the beginning of each cleanup project (even if it is only one file), do the cleanup work while saving checkpoints along the way, batch load the cleaned-up records, and then delete the repository. No sticky residue.

Setting up Mercurial

You can install Mercurial (its command is hg) on Windows with Cygwin. (If you do not already have Cygwin installed, you should, but that’s for another blog post!)

After it is installed, you can configure Mercurial by creating a file in your home directory indicating your username, so it knows whom to credit in the edit history.  (Even if you know you are the only one using the repository, this step is required.)

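You can create this file by typing lines like the following in the Cygwin terminal. (The first line overwrites any existing ~/.hgrc, so skip this step if you already have one. The quotes keep the shell from treating [ui] as a filename pattern.)

echo "[ui]" > ~/.hgrc
echo "username = zemkat" >> ~/.hgrc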
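If you would rather see exactly what will end up in the file before touching your real ~/.hgrc, a here-document is one way to do it. This is only a sketch: the variable name demo_hgrc is made up for illustration, and the file is written to a temporary location rather than your home directory.

```shell
# Write the two-line config to a scratch file, not the real ~/.hgrc
demo_hgrc="$(mktemp)"
cat > "$demo_hgrc" <<'EOF'
[ui]
username = zemkat
EOF

# Show what was written
cat "$demo_hgrc"
```

Once it looks right, and assuming you do not already have a ~/.hgrc, you could copy it into place with cp "$demo_hgrc" ~/.hgrc.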

That’s it!  If you prefer a graphical interface, there are GUI options for Mercurial (such as TortoiseHg and MacHg), but the Cygwin command line is pretty easy to use, especially for a simple project.

Working with Repositories

I create a new Mercurial repository for each new batch of MARC records I work with.  To create a new repository, make a new directory, say C:\local\marc\EEBO.

In Cygwin, change to that directory, and initialize the repository:

cd /cygdrive/c/local/marc/EEBO
hg init

In MarcEdit, break the binary MARC records and save the .mrk file to the directory you just created.

In Cygwin, add the file to the repository and do your first commit by typing the following:

hg add
hg commit -m "initial import"

Now you will always be able to access (or revert back to) the file as it looked before you edited it. Edit your file in MarcEdit, and commit after every major (or minor) edit:

hg commit -m "099 added"
hg commit -m "912 deleted"
hg commit -m "520s checked up through OCLC#12345678"
hg commit -m "520s all checked"

Rather than storing full copies of all of your different versions, Mercurial stores the changes between each one, so you are not using significantly more space by committing often.

At any time, you can view the history of your file by typing:

hg log

which will show a list of entries that look something like:

changeset: 4:6a1f5e638a91
user: zemkat
date: Thu Jan 04 14:25:07 2012 -0500
summary: 520s all checked

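If you want to undo all of your changes back to a given step, you can revert to that revision (the number before the : in the changeset). Mercurial expects you to say which files to revert, so use --all to restore everything. For example, to return to the file as it was right after the 912s were deleted (revision 2 in the sequence of commits above):

hg revert --all -r 2
hg commit -m "backing up a few steps"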

Once you are sure you are done, and your file is loaded, you can (if you like) remove the repository data (about the previous revisions) by deleting the .hg directory in the project directory:

rm -rf .hg

The final version of the .mrk file will remain.

Mercurial has many more features, but the ones described above should be enough to get you started and using it as a super-undo function behind MarcEdit.  For more information, check out the Mercurial guide on its official website, or the excellent tutorial at HgInit.com.