rsync alternative? (too many files)

Tony Godshall togo at of.net
Mon Mar 7 17:26:46 PST 2011


On Mon, Mar 7, 2011 at 15:57, Rick Moen <rick at linuxmafia.com> wrote:
> Quoting Tony Godshall (togo at of.net):
>
>> Hi Seth.
>>
>> I must not have expressed myself clearly.
>>
>> There are excessive unique files, not duplicate entries in a list of files.
>>
>> The files have already been deduplicated in the sense that entries to
>> files containing the same content are hardlinks.
>>
>> If I were to copy the files to new media without retaining the
>> hardlinks, they would take up way more space.
>
> I'm afraid I initially didn't quite understand your phrase 'copy the
> files without retaining the hardlinks' in this context -- though I now
> have a hunch about what you're talking about.  (It's entirely possible I
> need more caffeine.)

Or maybe I should just say "while retaining the hardlinks" when that's what
I mean, instead of describing what would happen "without" them.

...
> But you mean 'preserving the hard links as being multiple maps to shared
> inodes rather than maps to individual, hardlink-specific inodes', right?

Yes

> Apologies for having not grasped your meaning.

Apologies for not having expressed my meaning.

> Also, you never really clarified whether you were talking about copying
> files within a host or across a network between hosts.

> ... I now strongly
> suspect you meant the _former_ (and thus the hardlinks you wish to
> 'preserve' are between source and destination, i.e., not needing to
> create new inodes for the destination copy).

I can give you more details[1] but as to what I think you are getting at,
no, source and dest are not the same partition.  If you can make cross-
partition hardlinks, I'll buy you a beer.

> ... Most of us think about the
> copying problem, especially when hauling out rsync, within the context
> of inter-host file copying.

Ah.  I use rsync first, almost always.[2]  Of course, if I wanted to make
a hardlink shadow tree, I'd use cp -al .

Yes, what I need is for the destination to have the same hardlink structure
as the source.  It's a file-by-file backup of a bunch of machines; many of
the files were identical and will never be written, and we don't care about
mtimes, so they've been hardlinked.  That's what I mean when I say the amount
of storage will go way up if I rsync directory by directory instead of doing
one -r run over the whole tree: each directory's entry for a shared file ends
up as its own duplicate copy.
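
For what it's worth, the kind of invocation I mean is roughly this -- one
run over the whole tree so -H can see all the links (paths are just
placeholders for my setup):

  # single whole-tree run: -H lets rsync match hardlinks that span
  # directories, -x keeps it on one filesystem
  rsync -aHx /backups/ /mnt/external/backups/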

> Here's a page that says cpio is the right tool in this context:
> http://jeremy.zawodny.com/blog/archives/010037.html
>
> I'll bet GNU tar would also work.

Yes, perhaps gtar or cpio would be the right tool for the job on the initial
copy, and then maybe rsync won't run out of memory when I do the subsequent
copies.  Not resumable, but at least I wouldn't run out of memory.
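
If I try that, I'd guess something along these lines would do the initial
copy while keeping the hardlinks (untested sketch, placeholder paths):

  # GNU tar pipe: hardlinks are recorded in the archive and recreated
  # on extraction
  (cd /backups && tar -cf - .) | (cd /mnt/external/backups && tar -xpf -)

  # or cpio in pass-through mode, which also preserves hardlinks
  cd /backups && find . -depth -print0 | cpio -0pdm /mnt/external/backups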

> (You'll reportedly need to run it as root if you want to preserve access
> times.)

(I don't preserve access times.  Hell, I don't even record access times.)

> By the way, you _did_ attempt the rsync copy with the -H flag, right?
> You probably did:  I am not bothering to check the upthread posts.  It's
> necessary for the 'preserve hard links' behaviour you want, although
> running out of RAM for huge copies can still be a problem.

My initial point was that rsync runs out of memory, not that I'm having
trouble getting it to retain hardlinks.

Yes, I'm sorry I wasn't explicit.  I habitually use -aHx and am explicit
about copying mounts, so as not to be bitten by the various mount- and/or
symlink-related issues I've since blocked from my mind.

I'm also sorry I didn't respond directly to your early comment...

> > > I'm not sure I followed the first half of that sentence,  ...

Here's a writeup that explains what I refer to as the "cp -al and rsync trick":

  http://www.mikerubel.org/computers/rsync_snapshots/#Incremental

I didn't write that, and I'm not sure where I got it originally, but that's
the sort of disk-to-disk backup I've been doing, and it's what leaves me with
lots of hardlinked files that bloat if you don't use -H in your rsync.
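
In case it saves anyone a click, the gist of the trick is roughly this
(snapshot names and paths are just illustrative):

  # age the snapshots, then make the newest one a hardlink shadow tree
  rm -rf backup.3
  mv backup.2 backup.3
  mv backup.1 backup.2
  cp -al backup.0 backup.1

  # rsync then breaks the links only for files that actually changed
  rsync -a --delete /source/ backup.0/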

By the way, that new --link-dest= option looks very interesting.
I'll have to try it the next time I set one of these up, but I don't think
it'll help me with the out-of-memory thing.
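
As I understand it, --link-dest would replace the cp -al step: unchanged
files in the new snapshot become hardlinks into the previous one.  Something
like this, I'd guess (names made up again):

  # backup.1 is yesterday's snapshot; unchanged files get hardlinked to it
  rsync -aH --delete --link-dest=/backups/backup.1 /source/ /backups/backup.0/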

(Oh, and did I say "I don't preserve access times"?  Hell, with the
cp -al and rsync trick, I don't even preserve mtimes if they change
and the contents don't.)

Thanks Rick, thanks all.

Tony


[1] What I thought were irrelevant details: in this case I am copying from
one local partition (ext3 on LVM2 on PVs on SATA) to a different one (an
external 2TB hard drive).  But it could just as easily be across the network.

[2]   It is way more resumable[3] than cp and uses way less bandwidth too.
And it has great "don't you dare overwrite" options, --backup and --suffix.
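
Something like this, for instance (suffix and paths illustrative):

  # anything that would be overwritten gets kept under a new name instead
  rsync -a --backup --suffix=.old /source/ /dest/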

[3] And what's more likely to require resuming than copying terabytes...
well, copying petabytes, I guess, but I haven't had to deal with that yet.

