What if you have a misbehaving electronic device which might corrupt the files you store on it? I'm interested in tools that inspect incremental backups, look at file changes and alert me if they are suspicious. For example, if an MP3 changes, did the MPEG data change, or the ID3 tag, or something else? An ID3 change is likely to be deliberate, an MPEG data change less so. The same goes for JPEGs and EXIF metadata. I decided to poke around in the backups of my music player, and I discovered some pretty weird stuff.

Firstly, quite a few MP3s seem to have changed on my player over time. I found 788 instances of an MP3 changing by a very small amount: the rdiff-format patch was just 11 bytes long. This seemed particularly strange, so I investigated a little:

01 Fool's Day.mp3.2012-05-31T22:40:09+01:00.diff
00000000  72 73 02 36 47 00 00 59  01 ad 00                 |rs.6G..Y...|
00000000  72 73 02 36 47 00 00 54  0f 58 00                 |rs.6G..T.X.|

In a nutshell, the patches appear to do nothing, or at least they copy the input file verbatim into the output. (My rdifffs source is a useful reference for the rdiff file format; failing that, there is the rdiff source itself.) This is probably just a behavioural quirk of my backup software.
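
To make that a bit more concrete, here is a rough Python sketch of how such a patch could be decoded. It is not taken from rdifffs or librsync: the function names parse_delta and is_noop are made up for illustration, and the opcode ranges are my reading of the librsync delta format (magic number, then literal and copy commands with big-endian operands, then an END byte), so treat them as assumptions to check against the rdiff source.

```python
DELTA_MAGIC = b"rs\x026"   # 72 73 02 36, the first four bytes of both dumps
WIDTHS = (1, 2, 4, 8)      # possible operand widths, in bytes


def parse_delta(data):
    """Decode an rdiff delta into ('literal', length) / ('copy', offset, length)."""
    if data[:4] != DELTA_MAGIC:
        raise ValueError("not an rdiff delta")
    pos = 4
    commands = []
    while pos < len(data):
        op = data[pos]
        pos += 1
        if op == 0x00:                      # END: stop decoding
            return commands
        if 0x01 <= op <= 0x40:              # short literal: the opcode is the length
            pos += op                       # skip over the literal bytes
            commands.append(("literal", op))
        elif 0x41 <= op <= 0x44:            # literal with an explicit length operand
            width = WIDTHS[op - 0x41]
            length = int.from_bytes(data[pos:pos + width], "big")
            pos += width + length
            commands.append(("literal", length))
        elif 0x45 <= op <= 0x54:            # copy a range out of the old file
            off_w = WIDTHS[(op - 0x45) // 4]
            len_w = WIDTHS[(op - 0x45) % 4]
            offset = int.from_bytes(data[pos:pos + off_w], "big")
            pos += off_w
            length = int.from_bytes(data[pos:pos + len_w], "big")
            pos += len_w
            commands.append(("copy", offset, length))
        else:
            raise ValueError(f"unknown opcode {op:#x}")
    raise ValueError("delta ended without an END command")


def is_noop(commands, old_size):
    """True if the delta only copies the whole old file, in order, from offset 0."""
    expected = 0
    for cmd in commands:
        if cmd[0] != "copy" or cmd[1] != expected:
            return False
        expected += cmd[2]
    return expected == old_size


first = bytes.fromhex("72 73 02 36 47 00 00 59 01 ad 00")
print(parse_delta(first))
```

Read this way, each 11 byte patch is a single copy command starting at offset 0. Whether it is truly a no-op depends on the length matching the size of the old file, which is why is_noop takes that size as an argument.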

That leaves 156 changed MP3s with rdiff patches ranging in size from 21 bytes to 26k. The smallest are almost certainly no-ops, just less efficiently stored ones. The largest looks like embedded album art in the ID3 tag being added, and I'm guessing the mid-size ones are ID3 textual changes (spelling corrections etc.), but ID3 changes are very hard to inspect by eye in the raw bytestream.
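
Rather than reading the bytestream by eye, one option is to split each version of a file into its tag and audio regions and compare those separately. The following is only a sketch under some assumptions: it looks for an ID3v2 tag at the start of the file and an ID3v1 tag in the last 128 bytes, ignores other tag formats (APE, Lyrics3 and so on), and the names split_mp3 and classify_change are hypothetical rather than part of any existing tool.

```python
def split_mp3(data):
    """Split MP3 bytes into (leading ID3v2 tag, audio, trailing ID3v1 tag)."""
    start = 0
    if len(data) >= 10 and data[:3] == b"ID3":
        # The ID3v2 tag size is a "syncsafe" integer: 7 useful bits per byte.
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size                   # 10 byte header plus tag body
        if data[5] & 0x10:                  # footer-present flag adds 10 bytes
            start += 10
        start = min(start, len(data))       # guard against a bogus size field
    end = len(data)
    # An ID3v1 tag is a fixed 128 byte block at the end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[:start], data[start:end], data[end:]


def classify_change(old, new):
    """Say whether two versions of an MP3 differ in the audio or only in tags."""
    if old == new:
        return "identical"
    old_audio = split_mp3(old)[1]
    new_audio = split_mp3(new)[1]
    if old_audio != new_audio:
        return "audio changed"
    return "tags only"
```

A "tags only" result would cover the spelling corrections and album art cases, while "audio changed" is the kind of thing I would want flagged for a closer look.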

For this reason, I think a tool that could sort through file changes and pick out the ones that need human investigation would be useful. Such tools could be run automatically after backups complete, or at scheduled times. I'll probably start writing some over the next few weeks, but if you know of any that already exist or might form part of a solution, please let me know!


comment 1
Hachoir might be a useful place to start with writing that.
Comment by Anonymous
comment 2
The backup tool bup integrates with par2, which can create parity data. This enables you to both verify and repair your backup sets in one operation, and the more space you allocate to the parity data, the more redundancy you have. Maybe par2 is not directly useful in this case, but the existing solution might be interesting to you.
Comment by ulrik
comment 3

Thanks for the suggestions!

I might consider bup again for this. My main concern about it is that you cannot throw away old increment data: that is, the backup repository can only grow.

comment 4

In the specific case of MP3 files, you could try http://snipplr.com/view/4025.5422/

It calculates an MD5 hash over only the music portion of the MP3, so if you (or a media player) change some ID3 info (or any other metadata), you will still be able to check whether your music stayed intact. It is also great for finding duplicates in your MP3 collection, where MP3 files with the same music stream came from different sources and thus have different ID3 tags.

Comment by Anonymous
comment 5

I haven't actually looked at or tested it yet, but a quick Google search came up with http://code.google.com/p/diffmp3/ - granted, it dates back to 2009, but this doesn't really mean that it's not at least partly functional or maybe salvageable.

Sorry if you've already looked at it and found it wanting.