rsync alternative? (too many files)

Seth David Schoen schoen at loyalty.org
Sun Mar 6 15:19:53 PST 2011


Tony Godshall writes:

> > find . -xdev -type f -print0 | xargs -0
> > ...running cp piped into ssh, or whatever.  'Ware slowness.
> 
> Yeah.  I looked at doing a find ... -type d -print0 | xargs -0 mkdir
> followed by one that does the rsyncs without the recursion[1] so
> that each rsync would have only one file to do, but that doesn't,
> unless I'm missing something, preserve the hardlinks, which is pretty
> important since I've got something like 2.5TB residing in about 1.5TB
> after file-level deduplication that I'm trying to copy to a 2TB
> removable volume.

If you're sure that the filenames contain no tabs (or spaces, which
would also confuse sort's and uniq's field handling), you can run
find . -type f -printf '%p\t%i\n' | sort -k 2 -n | uniq -f 1

to list all the unique ones and copy only those (one way to do the
copy is sketched below), and then

find . -type f -printf '%p\t%i\n' | sort -k 2 -n | uniq -D -f 1

to find out what the duplicated ones are.
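For the copy step, a minimal sketch, assuming GNU find and rsync and
whitespace-free filenames (/mnt/backup is just a stand-in for the
destination volume): strip the inode column with cut and feed the
surviving paths to rsync, which reads one path per line from stdin
with --files-from=-.

find . -type f -printf '%p\t%i\n' | sort -k 2 -n | uniq -f 1 \
  | cut -f 1 | rsync -a --files-from=- . /mnt/backup/

This copies only one path per inode, so the hardlink siblings are
missing on the destination; the ln-generating pipeline below is what
recreates them.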

You could perhaps use this information as part of something like

find . -type f -printf '%p\t%i\n' | sort -k 2 -n \
  | uniq --all-repeated=separate -f 1 \
  | awk '/./ &&  inside { print "ln " master " " $1 }
         /./ && !inside { master = $1; inside = 1 }
         !/./           { inside = 0 }'

This would break if filenames contain spaces, but it should be
possible to adjust it so it doesn't (one possible adjustment is
sketched below).
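One such adjustment, as a sketch (untested against the original tree):
keep the tab as the only field separator throughout, do the grouping
in awk itself rather than in uniq so that blanks in paths no longer
shift the fields, and quote the paths in the generated commands
(\047 is a single quote):

find . -type f -printf '%p\t%i\n' \
  | sort -t "$(printf '\t')" -k 2 -n \
  | awk -F '\t' '$2 == prev { printf "ln \047%s\047 \047%s\047\n", master, $1 }
                 $2 != prev { master = $1; prev = $2 }'

This still won't survive single quotes or newlines in names, but those
are rarer than spaces.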

If the find takes a long time, you could save its output into
/tmp/find.out and then replace the subsequent find commands with
cat /tmp/find.out, since the find output should never change
(and neither should the result of sort, for that matter).
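Concretely, something like (a sketch, reusing the /tmp/find.out name
from above):

find . -type f -printf '%p\t%i\n' | sort -k 2 -n > /tmp/find.out
uniq -f 1 /tmp/find.out                           # the unique files
uniq --all-repeated=separate -f 1 /tmp/find.out   # the duplicated ones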

-- 
Seth David Schoen <schoen at loyalty.org> | What an easy task, I reflected,
     http://www.loyalty.org/~schoen/   | not to think about a tiger.
     http://vitanuova.loyalty.org/     |            -- Borges, El Zahir