file level deduplication (Re: rsync alternative? (too many files))

Michael Paoli Michael.Paoli at cal.berkeley.edu
Tue Mar 8 04:28:30 PST 2011


I wrote a utility some years back to do file-level deduplication via
multiple hard links within a filesystem.  At the time, the chief
objectives were:
o do it efficiently
   o never read more blocks of a file than necessary
   o never read a file more than once
   o drop no longer needed data from memory as soon as feasible
o do it accurately
   o compare actual data, not hashes
o ignore zero-length files (presume they're there for their metadata)
o compromises/"features"
   o don't care about most file metadata (ownerships, permissions)
   o preserve file with oldest mtime
   o if mtimes tie, break tie preserving file with most hard links
     (efficiency)
   o presume things aren't being changed under our feet (efficiency at
     cost of accuracy/safety)

The typical objective was to be able to point it at a filesystem
(or portion(s) thereof) containing a large archive of files, and to
replace all distinct occurrences of non-zero-length files (of type f)
having identical data with hard links to a single file.
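
In case it helps to see the shape of the thing, below is a rough
sketch in Python of the general approach described above - not the
cmpln code itself, just an illustration.  It groups candidates by
(device, size) so files of unique size are never read at all, compares
actual data block by block in lockstep (no hashes, and no file read
more than once), and replaces duplicates with hard links to the copy
with the oldest mtime, with link count as the tie-breaker.  The block
size and temporary-file suffix are arbitrary choices for the example,
and it only prints what it would do unless dry_run is switched off.

#!/usr/bin/env python3
# Sketch only, not cmpln itself.  Assumes POSIX, Python 3, and that
# nothing changes the tree under our feet (same compromise as above).

import os
import stat
import sys
from collections import defaultdict

BLOCK = 1 << 20   # compare in 1 MiB chunks (arbitrary choice)


def candidates(root):
    """Group regular, non-zero-length files by (device, size).

    Files whose size is unique are never read at all, and links are
    only ever attempted within a single filesystem (same st_dev)."""
    groups = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                groups[(st.st_dev, st.st_size)].append((path, st))
    return [g for g in groups.values() if len(g) > 1]


def identical_sets(group):
    """Split one same-size group into sets of files with identical
    data, comparing actual bytes (no hashes) and reading in lockstep,
    so no file is read more than once and reading stops as soon as a
    file is alone in its partition."""
    # Paths that are already the same inode count as one file here
    # (so this simple sketch relinks only one name per inode).
    by_inode = {}
    for path, st in group:
        by_inode.setdefault(st.st_ino, (path, st))
    members = list(by_inode.values())
    if len(members) < 2:
        return []
    parts = [[(path, st, open(path, 'rb')) for path, st in members]]
    results = []
    while parts:
        next_parts = []
        for part in parts:
            buckets = defaultdict(list)
            for path, st, fh in part:
                buckets[fh.read(BLOCK)].append((path, st, fh))
            for block, bucket in buckets.items():
                if len(bucket) < 2:
                    bucket[0][2].close()          # unique: done with it
                elif block == b'':                # all at EOF: identical
                    for _p, _s, fh in bucket:
                        fh.close()
                    results.append([(p, s) for p, s, _f in bucket])
                else:
                    next_parts.append(bucket)     # still tied: keep reading
        parts = next_parts
    return results


def dedupe(sets, dry_run=True):
    for files in sets:
        # Keep the oldest mtime; on a tie, the entry with most links.
        keep, keep_st = min(files,
                            key=lambda e: (e[1].st_mtime, -e[1].st_nlink))
        for path, st in files:
            if st.st_ino == keep_st.st_ino:
                continue
            print('link', keep, '->', path)
            if not dry_run:
                tmp = path + '.dedup.tmp'         # made-up temp name
                os.link(keep, tmp)
                os.rename(tmp, path)              # atomically replace dup


if __name__ == '__main__':
    top = sys.argv[1] if len(sys.argv) > 1 else '.'
    for group in candidates(top):
        dedupe(identical_sets(group))

Try it against a scratch tree first; as written it is a dry run and
only reports the links it would make.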

I suppose that with enough limits/constraints on memory(/swap) and a
sufficiently large archive - particularly if the archive also had
certain attributes (a sufficiently huge number of files and/or a
sufficiently large number of files of the same length) - the program
may choke for lack of resources.  It is, however, a pretty efficient
and scalable program and algorithm, so it may well handle most
archives without problem on systems that aren't too
resource-constrained.  I've not bumped into resource issues with it
yet ... but I also haven't pointed it at multi-TiB archives yet,
either.

Anyway, it's not exactly the rsync hard link "farm" solution one may
be looking for, but it might possibly be modified (or a similar
approach used) to create something like that.
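
For what it's worth, the "hardlink-a-tree-then-rsync" approach in the
quoted question below usually amounts to something like
"cp -al previous new && rsync -a --delete source/ new/".  Here's a
rough Python sketch of the hard-linking half plus the rsync call -
purely illustrative, with made-up path names, and it skips everything
that isn't a plain regular file:

#!/usr/bin/env python3
# Sketch of the hard link "farm" / snapshot idea: hard-link the
# previous snapshot into a new tree, then let rsync overwrite only
# what changed.  Illustrative only; paths are made up.

import os
import subprocess
import sys


def hardlink_tree(prev, new):
    """Recreate prev's directory layout under new, hard-linking every
    plain regular file instead of copying its data (roughly what
    "cp -al prev new" does, minus special files and symlinks)."""
    for dirpath, _dirs, names in os.walk(prev):
        target = os.path.join(new, os.path.relpath(dirpath, prev))
        os.makedirs(target, exist_ok=True)
        for name in names:
            src = os.path.join(dirpath, name)
            if os.path.isfile(src) and not os.path.islink(src):
                os.link(src, os.path.join(target, name))


if __name__ == '__main__':
    prev_snap, new_snap, source = sys.argv[1:4]
    hardlink_tree(prev_snap, new_snap)
    # Without --inplace, rsync writes a changed file to a temporary
    # name and renames it over the old one, which breaks the hard
    # link, so unchanged files stay shared with prev_snap and changed
    # ones get their own copy.
    subprocess.run(['rsync', '-a', '--delete', source + '/', new_snap + '/'],
                   check=True)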

references/excerpts:
http://www.rawbw.com/~mp/perl/cmpln.tar.gz
http://www.rawbw.com/~mp/perl/

> From: "Tony Godshall" <tony at godshall.org>
> Subject: rsync alternative? (too many files)
> Date: Fri, 4 Mar 2011 16:32:29 -0800

> Anyone know of an rsync alternative or workaround for huge
> batches of files?  In particular I'm looking for the ability to do
> the hardlink-a-tree-then-rsync way of making copies of a
> complete filesystem without duplicating files and without
> rsync crashing on me when the number of files to be transferred
> gets too big.
