This page is a draft.

I'm in the middle of a long-overdue project to import all my old home-made CD-R/DVD-R/DVD+Rs of data onto my NAS. This page summarizes my approach and what I think are the right tools and processes for the job.

Introduction and rationale

Example of a degraded DVD-R

Moiré-like degradation on a commercial CD-ROM

From the late 1990s until relatively recently, I archived digital material onto CD-Rs, and later DVD-Rs. I had about half a dozen spindles holding hundreds of home-made discs with all sorts of stuff on them. If you are in a similar position, consider a project like this as soon as possible: based on my experience, it might already be too late for some of your discs. The adjacent pictures were created with the ddrescueview software and show a degraded home-made DVD-R and a degraded commercially pressed CD-ROM. I came across both as I embarked on this project.

The process I've refined can be roughly divided into stages. The stages can run like a pipeline, where you perform stage 3 onwards for some images whilst you are beginning stage 2 for others, which could be useful if you have a lot of discs.

The guide is structured to offer my recommendation for tools or techniques, without spelling out the rationale. The tools I recommend are available on GNU/Linux systems. They are likely available on other UNIX-based systems (including Mac OS X), but you may need to look elsewhere for help with Windows.

Gather the discs & preparation

a pile of CDs and DVDs

discs in progress

Fetch all the discs you want to read together into one place.

Mine were spread across jewel cases and spindle-tubs at home, at work and in boxes at my parents' house. I consolidated them down to tubs in most cases, leaving only those discs that looked particularly valuable in jewel cases. I kept a couple of empty jewel cases for transporting discs to other people and threw the rest out.

This consolidation exercise on its own reduced the space occupied by my discs (and their cases) tremendously.

I labelled an empty tub "imported to NAS" with a label maker. As I progressed with the following steps, I moved discs into that tub. The intention is to throw away the tub once I'm completely done. I labelled another empty tub "needs attention" for discs that could not be read on the first attempt.

Don't trust disc labels. When sorting your discs, don't throw a disc out on the basis of the label alone. I had a bad habit of topping up a disc which was mostly for one thing with other data if there was space left over. Mistakes can also be made, and I had plenty of unlabelled discs anyway. If in doubt, put the disc in your "to image" pile.

You're going to need a computer with sufficient storage space upon which to store the disc images, metadata and/or the data within them, once you start organising it. You're also going to need an optical drive to read them. If you haven't yet got a system in place for reliably storing your data and managing backups, it would be worth sorting that out first before embarking on a project like this.

Finally, this is going to take time. In the best case, discs read quickly and reliably, and the time is spent simply inserting them and ejecting them. In the worst case, you might have troublesome discs that you really want to attempt to read everything from, which can take a great deal of (unattended) time.

Identify types and varieties of disc

This section is a draft.

coming soon

Initial import of the discs

conventional, home-made data discs

For single-session data discs ("normal", every-day home-made discs), use GNU ddrescue to create an image of the disc contents, and a logfile of the imaging process:

ddrescue -n -b2048 /dev/cdrom cdimage.iso cdimage.log

This will create a cdimage.iso file, hopefully containing your data, and a map file cdimage.log, describing what ddrescue managed to achieve. You should archive both. When you stumble over the disc image, possibly years later, the log file gives you confidence that it was extracted successfully.

This will either complete reasonably quickly (within one to two minutes), or will run with no sign of terminating. Once you've got a feel for how long a successful extraction takes, terminate any attempt that lasts much longer than that, putting those discs to one side in a "needs attention" pile, to be re-attempted later. If ddrescue does finish, it will tell you whether it failed to read any part of the disc. If so, put that disc in the "needs attention" pile too.
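The time limit and the pile decision described above could be scripted roughly as follows. This is only a sketch: the 10-minute limit, the function names and the use of coreutils timeout are my assumptions, not part of the process described here. It reads the map file that ddrescue writes and treats any block not marked "+" (finished) as a reason to put the disc in the "needs attention" pile.

```shell
# Sketch: image one disc with a time limit, then sort it into a pile.
# Assumptions (not from the article): the 600-second limit, the function
# names, and using coreutils `timeout` to interrupt a stuck attempt.

# Succeeds only if every data block in a ddrescue map file is marked "+".
# Comment lines start with "#"; the first non-comment line is the status
# line, so it is skipped.
mapfile_complete() {
    awk '
        /^#/ || NF == 0 { next }
        !seen_status    { seen_status = 1; next }
                        { blocks++; if ($3 != "+") bad = 1 }
        END             { exit ((bad || blocks == 0) ? 1 : 0) }
    ' "$1"
}

image_disc() {
    name=$1
    # Same ddrescue options as above, interrupted if the run drags on.
    timeout --signal=INT 600 \
        ddrescue -n -b2048 /dev/cdrom "$name.iso" "$name.log"
    if mapfile_complete "$name.log"; then
        echo "$name: done"
    else
        echo "$name: needs attention"
    fi
}
```

With a disc in the drive, `image_disc cdimage` would write cdimage.iso and cdimage.log and print which pile the disc belongs in.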

unconventional, multisession, mixed-mode, or commercial discs

Above, I recommended this approach for home-made data discs. Broadly, I am assuming that such discs use a limited set of the options and features available to disc authors: they'll either be single-session, or multisession but you aren't interested in any files that are masked by later sessions; they won't be mixed-mode (no audio tracks); there won't be anything unusual or important stored in the disc metadata, title, or subcodes; and so on.

This is not always the case for commercial discs, or audio CDs or video DVDs. For those, you may wish to recover more information than is available to you via ddrescue. These aren't my focus right now, so I don't have much advice on how to handle them, although I might in the future.

labelling and storing images

If your discs are labelled as poorly or inconsistently as mine, it might not be obvious what filename to give each disc image. For my project I decided to append a new label to all imported discs, something like "blahX", where X is an incrementing number. So, for the fifth disc being imported, with the label "my files", the image name would be my_files.blah5.iso. If you are keeping the physical discs after importing them, you could also mark the disc with "blah5".
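The naming scheme can be captured in a small helper so that every image is named the same way. This is a sketch only: "blah" is the placeholder tag from the text, while the function name, the space-to-underscore rule and the "unlabelled" fallback are assumptions made for illustration.

```shell
# Sketch: build an image filename from a disc's written label and a running
# counter. The function name, the underscore rule and the "unlabelled"
# fallback are illustrative assumptions.
image_name() {
    label=${1:-unlabelled}
    number=$2
    # Replace spaces so the name is easy to type in a shell.
    safe=$(printf '%s' "$label" | tr ' ' '_')
    printf '%s.blah%s.iso\n' "$safe" "$number"
}

image_name "my files" 5    # prints my_files.blah5.iso
```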

Step 3: Figuring out what is on each disc

If you have managed to at least partially read data from the disc, you should hopefully be able to pull a list of files that are stored on the image. Collect and store this list alongside the image and the ddrescue log file.

Mount the disc image as a "loopback mount", and use summain to collect a list of the contents of the image, along with the corresponding metadata:

sudo mount -o loop ./my_files.iso /mnt
( cd /mnt && summain ./ ) > my_files.summain.list

alternatives, rationale

It's valuable to collect a list of the contents of ISO images so you can easily search through collections of lists (via grep etc.) when looking for a file, rather than having to access the ISO images themselves. Accessing the images is slower, and you may no longer have them: you might have deleted them after deciding there was nothing worth keeping on them, or sorted and moved their contents to other storage locations.
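Searching the collected lists could look something like the sketch below. It assumes the lists are kept together in one directory and end in ".summain.list" (the suffix used in the example above); the function name is made up. It only does plain substring matching, so it does not depend on the exact layout of summain's output.

```shell
# Sketch: print which stored content lists mention a given name.
# Assumptions: all lists live in one directory and end in ".summain.list";
# the function name is hypothetical.
search_lists() {
    pattern=$1
    dir=${2:-.}
    # -F: fixed string, -i: ignore case, -l: print matching list files only.
    grep -F -i -l -- "$pattern" "$dir"/*.summain.list
}
```

For example, `search_lists gold.zip ~/disc-lists` would print the names of the lists (and so the discs) that contain a file called gold.zip, where ~/disc-lists is a hypothetical location.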

Using a loopback mount is the most reliable way, in my experience, of ensuring that the disc's metadata are properly interpreted. Other tools exist to read the contents without mounting it, including isoinfo (from the Debian package genisoimage), iso-info (from the Debian package libcdio-utils) and 7z/p7zip, but they all have separate (and potentially serious) shortcomings.

xorriso is a very capable tool that can be used for this purpose and can interpret Joliet and RockRidge metadata extensions as well as lots of other things.

For capturing the contents of images, I think it's important to record files, their metadata (size, last modified date, etc.) and checksums of their contents. Approaches such as find -ls get the metadata but not the checksums; recursive sha1sum outputs the checksums but not the metadata. xorriso can output either, with e.g. -find . -exec lsdl or -find . -exec make_md5, but I could not convince it to do both at the same time: summain does.
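To illustrate what "metadata and checksums together" means, here is a sketch that prints one line per file with its size, modification time and SHA-1. It is not summain and does not reproduce summain's output format; it assumes GNU find, stat and sha1sum, and file names without newlines. The function name is hypothetical.

```shell
# Sketch: one line per file with size, mtime and SHA-1 of the contents.
# NOT summain's format; assumes GNU stat/sha1sum and no newlines in names.
list_with_checksums() {
    ( cd "$1" && find . -type f | LC_ALL=C sort | while IFS= read -r path; do
        # Size in bytes and modification time as seconds since the epoch.
        meta=$(stat -c '%s %Y' "$path")
        # Checksum of the file contents (read from stdin to get a bare hash).
        sum=$(sha1sum < "$path" | cut -d' ' -f1)
        printf '%s %s %s\n' "$meta" "$sum" "$path"
    done )
}
```

Run against a mounted image, for example `list_with_checksums /mnt > my_files.list`, it gives a single greppable record per file.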

Finding out what data is not extracted

If you have a partial disc image, it can sometimes be useful to know which files you have all the data for and which you do not, from which you can make a decision whether to continue trying to read the disc.

My badiso tool reads in an ISO image from the command line (e.g. image.iso) and a corresponding ddrescue log file (image.log) and prints out a file listing, indicating complete files with a green tick and incomplete files with a red cross.

 './joes/allstars.zip
 './joes/ban.gif
 './joes/eur-mgse.zip
 './joes/gold.zip
 './joes/graphhack.txt
 './joes/machines.zip
 './joes/md.zip
 './joes/midi.zip
…

badiso is currently written in Python and builds on top of xorriso.

Using badiso, I have managed to sort partially read disc images into two groups:

  • the files I care about are not damaged: I can extract them, delete the disc image and throw away the disc.
  • the files I care about are damaged: I can find out which, and by how much.
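The underlying idea of comparing a file's position in the image with the ddrescue log can be sketched in a few lines of shell. This is not badiso's code. It assumes you already know a file's start offset and size in bytes within the image (obtaining those from the ISO is the part a tool has to do for you), and that the log uses ddrescue's usual hexadecimal "position size status" lines. The function name is hypothetical.

```shell
# Sketch: is a byte range of the image free of unread blocks?
# Not badiso's implementation. Arguments: map file, start offset in bytes,
# size in bytes. Any overlapping block whose status is not "+" counts as damage.
extent_is_complete() {
    map=$1
    start=$2
    end=$(( $2 + $3 ))              # first byte after the file
    seen_status=0
    while read -r pos len status; do
        case $pos in ''|'#'*) continue ;; esac      # blank or comment line
        if [ "$seen_status" -eq 0 ]; then           # skip the status line
            seen_status=1
            continue
        fi
        [ "$status" = "+" ] && continue             # block was read fine
        block_start=$(( $pos ))                     # hex to decimal
        block_end=$(( $pos + $len ))
        if [ "$block_start" -lt "$end" ] && [ "$block_end" -gt "$start" ]; then
            return 1                                # overlaps unread data
        fi
    done < "$map"
    return 0
}
```

For example, `extent_is_complete image.log 524000 1000` succeeds only if none of those 1000 bytes fall in a block that ddrescue failed to read.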

Retrying damaged/degraded discs

This section is a draft.

coming soon

Organising the extracted files

Systems and software for organising specific types of files are beyond the scope of this article. For the most part, I'm simply using a filesystem for all types of data.

Once discs have been successfully imported and their contents identified, the next action for each disc can be identified.

In some cases, I had a place to put the contents of the discs. For example, some of my discs have FLAC copies of music. I have a place to put that now (/archive/Music/Lossless), so I decide whether I want to keep the data, check to see if I already have it in my archive, copy it into place, delete the disc image and (eventually) discard the disc.

In other cases there's a wide variety of stuff on the disc. I find it helpful to extract the disc image to a path on my NAS and then delete the disc image. As and when, I can then move or remove files from those extracted paths to the proper places. It also means the contents of the images get indexed by locate, which periodically runs on my system, so I can write queries to look for files across all extracted images quickly and easily.

I always keep the ddrescue log file and the summain index, because these tell me the provenance of the files, and give me confidence that I extracted them from a disc without damage.
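The "check to see if I already have it in my archive" step above can be done by comparing checksums rather than file names. The following is a sketch under assumptions: the function and variable names are hypothetical, it relies on GNU sha1sum, and it expects file names without newlines.

```shell
# Sketch: report which files from an extracted disc already exist, by
# content, somewhere under an archive directory. Names are hypothetical.
already_archived() {
    src=$1
    archive=$2
    sums=$(mktemp)
    # Checksums of everything already in the archive, one per line.
    find "$archive" -type f -exec sha1sum {} + | cut -d' ' -f1 | sort -u > "$sums"
    find "$src" -type f | LC_ALL=C sort | while IFS= read -r path; do
        sum=$(sha1sum < "$path" | cut -d' ' -f1)
        if grep -q -x -F -- "$sum" "$sums"; then
            echo "have: $path"
        else
            echo "new:  $path"
        fi
    done
    rm -f "$sums"
}
```

Files reported as "have" are already in the archive under some name, so only the "new" ones need copying into place.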

