file level deduplication (Re: rsync alternative? (too many files))

Tony Godshall tony at godshall.org
Tue Mar 8 11:08:15 PST 2011


Thank you.  I'll take a look.

On Tue, Mar 8, 2011 at 04:28, Michael Paoli
<Michael.Paoli at cal.berkeley.edu> wrote:
> I wrote a utility some years back to do file-level deduplication via
> multiple hard links within a filesystem.  At the time, the chief
> objectives were (a sketch of the scheme follows this list):
> o do it efficiently
>  o never read more blocks of a file than necessary
>  o never read a file more than once
>  o drop no longer needed data from memory as soon as feasible
> o do it accurately
>  o compare actual data, not hashes
> o ignore zero-length files (presume they're there for their metadata)
> o compromises/"features"
>  o don't care about most file metadata (ownerships, permissions)
>  o preserve file with oldest mtime
>  o if mtimes tie, break tie preserving file with most hard links
>    (efficiency)
>  o presume things aren't being changed under our feet (efficiency at
>    cost of accuracy/safety)
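>
> A minimal sketch of that comparison scheme in Python follows.  This
> is an illustration only - not the cmpln code - and the block size
> and all names in it are my own assumptions:
>
>   #!/usr/bin/env python3
>   # Illustrative sketch, not the cmpln implementation.
>   import os
>   import stat
>   import sys
>
>   BLOCK = 64 * 1024  # comparison granularity (an assumption)
>
>   def candidates(root):
>       """Collect non-zero-length regular files, grouped by size;
>       only files of equal size can possibly be identical."""
>       by_size = {}
>       for dirpath, _dirs, names in os.walk(root):
>           for name in names:
>               path = os.path.join(dirpath, name)
>               st = os.lstat(path)
>               if stat.S_ISREG(st.st_mode) and st.st_size > 0:
>                   by_size.setdefault(st.st_size, []).append(path)
>       return by_size
>
>   def identical_classes(paths):
>       """Partition same-size files into classes of identical data.
>       Files are read in lockstep, one block at a time, so no file
>       is read more than once, and a file is closed (its data
>       dropped) as soon as it diverges from every other file."""
>       handles = {p: open(p, 'rb') for p in paths}
>       live, finished = [list(paths)], []
>       try:
>           while live:
>               refined = []
>               for group in live:
>                   buckets = {}
>                   for p in group:
>                       block = handles[p].read(BLOCK)
>                       buckets.setdefault(block, []).append(p)
>                   for block, members in buckets.items():
>                       if len(members) == 1:
>                           handles[members[0]].close()  # unique: done
>                       elif block == b'':
>                           finished.append(members)  # EOF: identical
>                       else:
>                           refined.append(members)   # tied: keep going
>               live = refined
>       finally:
>           for f in handles.values():
>               f.close()
>       return finished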
>
> The typical objective was to be able to point it at a filesystem
> (or portion(s) thereof) containing a large archive of files, and to
> replace any distinct occurrences of non-zero-length files (of type f)
> having identical data with hard links to a single file; the
> continuation of the sketch below shows that replacement step.
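>
> Continuing the sketch above (and reusing its functions), the
> replacement step might look like the following; the tie-breaking
> follows the objectives listed earlier, and the temporary-file
> naming is hypothetical:
>
>   def relink(paths):
>       """Keep the file with the oldest mtime (ties broken by the
>       highest link count) and replace the rest with hard links."""
>       st = {p: os.lstat(p) for p in paths}
>       keep = min(paths,
>                  key=lambda p: (st[p].st_mtime, -st[p].st_nlink))
>       for p in paths:
>           if p == keep or st[p].st_dev != st[keep].st_dev:
>               continue              # hard links need one filesystem
>           if st[p].st_ino == st[keep].st_ino:
>               continue              # already linked to the keeper
>           tmp = p + '.dedup-tmp'    # hypothetical temporary name
>           os.link(keep, tmp)        # make the new link first ...
>           os.rename(tmp, p)         # ... then atomically replace
>
>   if __name__ == '__main__':
>       for size, paths in candidates(sys.argv[1]).items():
>           if len(paths) < 2:
>               continue              # unique length: no reads at all
>           for group in identical_classes(paths):
>               relink(group)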
>
> I suppose with enough limits/constraints on memory(/swap) and a
> sufficiently large archive - particularly if the archive also had
> certain attributes (a sufficiently huge number of files and/or a
> sufficiently large number of files of the same length) - the program
> may choke for lack of resources.  It is, however, a pretty efficient
> and scalable program and algorithm, so it may well handle most
> archives without problem on systems that aren't too resource
> constrained.  I've not bumped into resource issues with it yet ...
> but I also haven't pointed it at multi-TiB archives yet, either.
>
> Anyway, it's not exactly the rsync hard link "farm" solution one may
> be looking for, but it might be modified (or a similar approach
> used) to create something like that - see the sketch below.
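>
> For what it's worth, that hardlink-a-tree-then-rsync approach
> amounts to roughly the following - a rough Python sketch of the
> usual "cp -al prev new && rsync -a --delete src/ new/" idiom
> (rsync's own --link-dest option automates much the same thing):
>
>   #!/usr/bin/env python3
>   # Rough sketch of a hard link "farm" rotation; simplified - it
>   # ignores symlinks, directory permissions, and error handling.
>   import os
>   import subprocess
>   import sys
>
>   def hardlink_tree(prev, new):
>       """Recreate prev's directory tree under new, hard-linking
>       every regular file instead of copying its data."""
>       for dirpath, _dirs, names in os.walk(prev):
>           target = os.path.join(new, os.path.relpath(dirpath, prev))
>           os.makedirs(target, exist_ok=True)
>           for name in names:
>               src = os.path.join(dirpath, name)
>               if os.path.isfile(src) and not os.path.islink(src):
>                   os.link(src, os.path.join(target, name))
>
>   if __name__ == '__main__':
>       src, prev, new = sys.argv[1:4]
>       hardlink_tree(prev, new)
>       # rsync recreates (rather than rewrites) changed files by
>       # default, so unchanged files stay shared between snapshots.
>       subprocess.run(['rsync', '-a', '--delete',
>                       src + '/', new + '/'], check=True)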
>
> references/excerpts:
> http://www.rawbw.com/~mp/perl/cmpln.tar.gz
> http://www.rawbw.com/~mp/perl/
>
>> From: "Tony Godshall" <tony at godshall.org>
>> Subject: rsync alternative? (too many files)
>> Date: Fri, 4 Mar 2011 16:32:29 -0800
>
>> Anyone know of an rsync alternative or workaround for huge
>> batches of files?  In particular I'm looking for the ability to do
>> the hardlink-a-tree-then-rsync way of making copies of a
>> complete filesystem without duplicating files and without
>> rsync crashing on me when the number of files to be transferred
>> gets too big.

