I'm in the middle of a long-overdue project to import all my old home-made CD-R/DVD-R/DVD+Rs of data onto my NAS. This page summarizes my approach and what I think are the right tools and processes for the job.
- Introduction and rationale
- Gather the discs & preparation
- Identify types and varieties of disc
- Initial import of the discs
- Step 3: Figuring out what is on each disc
- Retrying damaged/degraded discs
- Organising the extracted files
From the late 1990s until relatively recently, I used to archive digital material onto CD-Rs, and later DVD-Rs. I had about half a dozen spindles of hundreds of home-made discs with all sorts of stuff on them. If you are in a similar position, consider a project like this as soon as possible: based on my experiences it might already be too late for some of them. The adjacent pictures were created with the ddrescueview software and show both a degraded home-made DVD-R and a degraded commercially pressed CD-ROM. I came across both as I embarked on this project.
The stages of this guide can be run as a pipeline: you can carry out stage 3 onwards for some images while you are starting stage 2 for others, which is useful if you have a lot of discs.
The guide is structured to offer my recommendation for tools or techniques, without spelling out the rationale. The tools I recommend are available on GNU/Linux systems. They are likely available on other UNIX-based systems (including Mac OS X), but you may need to look elsewhere for help with Windows.
Fetch all the discs you want to read together into one place.
Mine were spread across jewel cases and spindle-tubs at home, at work and some in boxes in my parents' house. I consolidated them down to tubs in most cases, leaving only those discs that looked particularly valuable in jewel cases. I kept a couple of empty jewel cases for transporting discs to other people and threw the rest out.
This consolidation exercise on its own reduced the space occupied by my discs (and their cases) tremendously.
I labelled an empty tub "imported to NAS" with a label maker. As I progressed with the following steps, I moved discs into that tub. The intention will be to throw away the tub once I'm completely done. I labelled another empty tub "needs attention" for discs that could not be read on the first attempt.
Don't trust disc labels. When sorting your discs, don't throw a disc out on the basis of the label alone. I had a bad habit of topping up a disc which was mostly for one thing with other data if there was space left over. Mistakes can also be made, and I had plenty of unlabelled discs anyway. If in doubt, put the disc in your "to image" pile.
You're going to need a computer with sufficient storage space upon which to store the disc images, metadata and/or the data within them, once you start organising it. You're also going to need an optical drive to read them. If you haven't yet got a system in place for reliably storing your data and managing backups, it would be worth sorting that out first before embarking on a project like this.
Finally, this is going to take time. In the best case, discs read quickly and reliably, and the time is spent simply inserting them and ejecting them. In the worst case, you might have troublesome discs that you really want to attempt to read everything from, which can take a great deal of (unattended) time.
For single-session data discs ("normal", every-day home-made discs), use GNU ddrescue to create an image of the disc contents, and a logfile of the imaging process:
ddrescue -n -b2048 /dev/cdrom cdimage.iso cdimage.log
This will create a cdimage.iso file, hopefully containing your data, and a cdimage.log, describing what ddrescue managed to achieve. You should archive both. When you stumble over the disc image, possibly years later, the log file gives you confidence in how well it was extracted.
This will either complete reasonably quickly (within one to two minutes), or will run with no sign of terminating. Once you've got a feel for how long a successful extraction takes, terminate any attempt that lasts much longer than that, putting those discs to one side in a "needs attention" pile, to be re-attempted later. If ddrescue does finish, it will tell you if it couldn't read any of the disc. If so, put that disc in the "needs attention" pile too.
Above, I wrote that I recommend this approach for home-made data discs. Broadly, I am assuming that such discs use a limited set of the options and features available to disc authors: they'll either be single session, or multisession but you aren't interested in any files that are masked by later sessions; they won't be mixed mode (no audio tracks); there won't be anything unusual or important stored in the disc metadata, title, or subcodes; and so on.
This is not always the case for commercial discs, or audio CDs or video DVDs. For those, you may wish to recover more information than is available to you via ddrescue. These aren't my focus right now, so I don't have much advice on how to handle them, although I might in the future.
If your discs are labelled as poorly or inconsistently as mine, it might not be obvious what filename to give each disc image. For my project I decided to append a new label to all imported discs, something like "blahX", where X is an incrementing number. So, for a fifth disc being imported with the label "my files", the image name would be my_files.blah5.iso. If you are keeping the physical discs after importing them, you could also mark the disc with "blah5".
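Picking the next number by hand gets error-prone after a few dozen discs. The following is a minimal sketch of automating it, assuming all images live in one directory and follow the "label.blahN.iso" pattern described above. The function name next_image_name is made up for illustration, and "blah" is just the placeholder tag from the text.

```shell
# Sketch: propose the next image filename in the "<label>.blahN.iso" scheme.
# next_image_name is a hypothetical helper; "blah" is the placeholder tag
# used in the text. Assumes existing numbers are not zero-padded.
next_image_name() {
    local label dir max f n
    label=$(printf '%s' "$1" | tr ' ' '_')   # "my files" -> "my_files"
    dir=${2:-.}                              # directory holding the images
    max=0
    for f in "$dir"/*.blah*.iso; do
        [ -e "$f" ] || continue              # glob matched nothing
        n=${f##*.blah}                       # strip everything up to ".blah"
        n=${n%.iso}                          # strip the extension
        case $n in
            ''|0*|*[!0-9]*) continue ;;      # ignore anything that is not a plain number
        esac
        if [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    printf '%s.blah%d.iso\n' "$label" "$(( max + 1 ))"
}
```

Called as next_image_name "my files" /path/to/images, it prints a filename you can pass straight to ddrescue.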
If you have managed to at least partially read data from the disc, you should hopefully be able to pull a list of files that are stored on the image. Collect and store this list alongside the image and the ddrescue log file.
Mount the disc image as a "loopback mount", and use summain to collect a list of the contents of the image, along with the corresponding metadata:
sudo mount -o loop ./my_files.iso /mnt
( cd /mnt && summain ./ ) > my_files.summain.list
It's valuable to collect a list of the contents of ISO images so that you can easily search through the collection of lists (via grep etc.) when looking for a file, rather than having to access the ISO images themselves. Accessing the images is slower, and you may have deleted them if you decided there was nothing worth keeping on them, or if you have sorted and moved the contents to other storage locations.
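A search across the saved lists can be as small as one grep call. The sketch below assumes the lists are kept together in one directory and named with the .summain.list suffix used earlier. find_in_lists is a hypothetical helper name, and nothing here depends on the exact layout of summain's output.

```shell
# Sketch: report which saved listings mention a pattern (case-insensitive).
# find_in_lists is a made-up name; the *.summain.list suffix follows the
# naming used for the listing created above.
find_in_lists() {
    local pattern dir
    pattern=$1
    dir=${2:-.}                      # directory holding the saved lists
    # -l prints only the names of the lists that contain a match
    grep -il -- "$pattern" "$dir"/*.summain.list
}
```

The output is the set of list files that mention the pattern, which tells you which disc image (or extracted directory) to look in.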
Using a loopback mount is the most reliable way, in my experience, of ensuring that the disc's metadata are properly interpreted. Other tools that can read the contents without mounting the image include
isoinfo (from the Debian package
iso-info (from the Debian package
p7zip, but they all have separate (and potentially serious) shortcomings.
xorriso is a very capable tool that can be used for this purpose and can interpret Joliet and RockRidge metadata extensions as well as lots of other things.
For capturing the contents of images, I think it's important to record files, their metadata (size, last modified date, etc.) and checksums of their contents. Approaches such as find -ls get the metadata but not the checksums; sha1sum outputs the checksums but not the metadata. xorriso can output each of these, with e.g. -find . -exec lsdl or -find . -exec make_md5, but I could not convince it to do both at the same time.
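If summain is not available, metadata and checksums can still be gathered into a single listing by combining stat and sha1sum over a mounted image. This is only a sketch, assuming GNU coreutils and findutils. The name list_with_checksums and the tab-separated column order are made up for illustration and are not summain's output format.

```shell
# Sketch: one line per regular file with checksum, size, mtime and path.
# list_with_checksums is a hypothetical helper; it relies on GNU stat
# (-c format strings) and sha1sum.
list_with_checksums() {
    local root
    root=$1
    find "$root" -type f -exec sh -c '
        for f do
            size=$(stat -c %s "$f")                  # size in bytes
            mtime=$(stat -c %Y "$f")                 # mtime, seconds since epoch
            sum=$(sha1sum < "$f" | cut -d" " -f1)    # checksum of the contents
            printf "%s\t%s\t%s\t%s\n" "$sum" "$size" "$mtime" "$f"
        done' sh {} +
}
```

For example, list_with_checksums /mnt > my_files.list would record the contents of a loopback-mounted image. The lines come out in whatever order find visits the files.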
If you have a partial disc image, it can sometimes be useful to know which files you have all the data for and which you do not, from which you can make a decision whether to continue trying to read the disc.
My badiso tool reads in an ISO image from the command line (e.g. image.iso) and a corresponding ddrescue log file (image.log) and prints out a file listing, indicating complete files with a green tick and incomplete files with a red cross.
…
✔ './joes/allstars.zip
✗ './joes/ban.gif
✔ './joes/eur-mgse.zip
✔ './joes/gold.zip
✗ './joes/graphhack.txt
✗ './joes/machines.zip
✔ './joes/md.zip
✔ './joes/midi.zip
…
badiso is currently written in Python and builds on top of
Using badiso, I have managed to sort partially read disc images into two groups:
- the files I care about are not damaged. I can extract them, delete the disc image and throw away the disc.
- the files I care about are damaged: I can find out which, and by how much.
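The source of badiso is not reproduced here, but the kind of check such a tool needs can be sketched: given the byte range a file occupies in the image, see whether it overlaps any region that the ddrescue log does not mark as read. The sketch assumes the standard ddrescue log layout (comment lines starting with #, one status line, then "position size status" lines where + means fully read) and that you already know the file's start offset and length in bytes. Finding those means parsing the ISO directory structure, which is not shown. extent_is_complete is a hypothetical name.

```shell
# Sketch: succeed if the byte range [start, start+length) lies entirely in
# regions the ddrescue log marks as read ("+"). Hypothetical helper, not
# taken from badiso.
extent_is_complete() {
    local log start length end pos size status seen_status_line
    log=$1
    start=$2
    length=$3
    end=$(( start + length ))
    seen_status_line=0
    while read -r pos size status; do
        case $pos in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if [ "$seen_status_line" -eq 0 ]; then
            seen_status_line=1            # first data line is the status line
            continue
        fi
        [ "$status" = "+" ] && continue   # fully read block, nothing to check
        # Unread block [pos, pos+size) overlaps the requested range?
        if [ "$(( $pos ))" -lt "$end" ] && [ "$(( $pos + $size ))" -gt "$start" ]; then
            return 1
        fi
    done < "$log"
    return 0
}
```

For example, extent_is_complete image.log 1048000 1000 && echo complete tests a 1000-byte file starting at byte 1048000.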
Systems and software for organising specific types of files are beyond the scope of this article. For the most part, I'm simply using a filesystem for all types of data.
Once discs have been successfully imported and their contents identified, the next action for each disc can be identified.
In some cases, I had a place to put the contents of the discs. For example, some of my discs have FLAC copies of music. I have a place to put that now (/archive/Music/Lossless), so I decide whether I want to keep the data, check whether I already have it in my archive, copy it into place, delete the disc image and (eventually) discard the disc.
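The "do I already have it" check can be done by content rather than by filename. Below is a minimal sketch, assuming GNU find, stat and sha1sum. already_archived is a made-up helper name, and the approach (compare only files of identical size, then compare SHA-1 sums) is one option among several, not necessarily what I run myself.

```shell
# Sketch: print archive files whose contents match a candidate file.
# Hypothetical helper. Only files of exactly the same size are hashed.
# Paths containing newlines or backslashes are not handled, because
# sha1sum escapes them in its output.
already_archived() {
    local candidate archive want size matches
    candidate=$1
    archive=$2
    want=$(sha1sum < "$candidate" | cut -d' ' -f1)
    size=$(stat -c %s "$candidate")
    matches=$(find "$archive" -type f -size "${size}c" -exec sha1sum {} + |
        awk -v want="$want" '($1 "") == (want "") { sub(/^[^ ]+  /, ""); print }')
    # Succeed (and print the matches) only if at least one copy was found
    [ -n "$matches" ] && printf '%s\n' "$matches"
}
```

Used as already_archived ./track01.flac /archive/Music/Lossless, it prints the paths of any identical copies and returns a non-zero status if there are none. The track01.flac name is only an example.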
In other cases there's a wide variety of stuff on the disc. I find it helpful to extract the disc image to a path on my NAS and then delete the disc image. As and when, I can then move or remove files from those extracted paths to the proper places. It also means the contents of the images get indexed by locate, which periodically runs on my system, so I can write queries to look for files across all extracted images quickly and easily.
I always keep the ddrescue log file and the summain index, because these tell me the provenance of the files, and the confidence that I extracted them from a disc without damage.
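A kept log can be checked for damage long after the image itself is gone. The following is a minimal sketch, assuming the standard ddrescue log layout: comment lines starting with #, one status line, then one "position size status" line per block, where + marks a block that was read in full. The function name unread_bytes is made up for illustration and is not part of ddrescue.

```shell
# Sketch: sum the bytes that a ddrescue log does not mark as rescued ("+").
# unread_bytes is a hypothetical helper name, not part of ddrescue.
unread_bytes() {
    local log pos size status seen_status_line total
    log=$1
    seen_status_line=0
    total=0
    while read -r pos size status; do
        case $pos in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if [ "$seen_status_line" -eq 0 ]; then
            seen_status_line=1            # first data line is the status line
            continue
        fi
        if [ "$status" != "+" ]; then
            total=$(( total + $size ))    # sizes are hexadecimal, e.g. 0x1000
        fi
    done < "$log"
    echo "$total"
}
```

Running unread_bytes cdimage.log prints 0 when every block is marked as read, and the number of unread bytes otherwise.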