It all began with rsync by Andrew Tridgell. He created an innovative algorithm to transfer and update files efficiently, by calculating the needed difference between two files at different locations. The implementation of this algorithm is a great utility, and soon became an important part of any Unix toolkit.
Unfortunately, rsync is great as an standalone utility, but it's hard to integrate within another program. An attempt to make apply the rsync algorithm to a slightly different problem was rproxy, by Martil Pool. One of the first requirements for rproxy was a library to encapsulate the rsync alrogithm, so he first created librsync.
Since it's a library intended to be used in very different environments, librsync deals exclusively with data; it doesn't include any facility to manage files or network, and doesn't specify protocol between sender and receiver. Its only concern is analysing data to create and apply signatures and deltas.The rproxy project has been inactive since late 2002; but librsync continued to be developed, and is used in some other projects, like unison, rdiff-backup and superlifter.
The rsync algorithm has three phases: signature creation, delta creation, and delta application or patching. Here i'll describe the process of updating "fileA" to a new version called "fileB".
A signature is a set of block checksums. fileA is split in a series of blocks (by default 2048 bytes each) and for each one, two checksums are calculated. One is called the "rolling checksum", and the other is the "strong checksum". A rolling checksum is one with a property that makes it efficient to scan a big data stream generating the checksum at every offset. The strong checksum, on the other hand has to be calculated from scratch for each data block, but it's much less likely to generate collitions. By default, the strong checksum used is MD4.
Using the default 32bit rolling checksum and the 64bit MD4 for the strong, we have 12bytes of signature for each data block. With 2048 bytes on each block, that means the signature is 12/2048 (less than 0.6%) the size of fileA. In other words, 6KB for each Megabyte of fileA.
An rsync-like file update begins when the destination generates the signature of the file it already has, and transfers it to the sender.
The Delta is what is needed to convert fileA into fileB. It's calculated using fileB and the signature from fileA. Note that it's not needed to have fileA to calculate the Delta, only its signature. It's a set of 'commands', that tells how to reconstruct fileB using as much of fileA as possible. Any other data on fileB that isn't present anywhere on fileA is included in the delta. This library doesn't perform any additional compression to the data included in the delta.
When the sender has received the signature of fileA, it calculates the delta and transfers it back to the destination.
The process of applying the delta to fileA is called patching, and the result should be identical to fileB. Since the delta contains references to arbitrary blocks of fileA, it's important to be able to access randomly to fileA's data.
Since lua-rdiff doesn't do any file or network I/O, it uses the concept of data sources and data sinks. A data source can be simply a Lua string, or a Lua function that would be called repeatedly to get the data piece by piece. A data sink is an optional function that, if provided, is called with the produced data. If the data sink null or unspecified, the API call returns the data as a string.
The lua-rdiff package defines the table
rdiff, containing three
rdiff.signature (srcA [,sink])
Calculates the signature of the data given by
srcA. This function
only needs sequential access to the data, so if
srcA is a function,
it'll be called without parameters, and should return chunks of data as strings
nil to signal end of data.
sink is given, it should be a function and would be called with
a string to be written sequentially to the desired output. On sucess,
true, on error it
nil and an error code.
sink is not given,
rdiff.signature() returns the
signature as a Lua string, and
nil and an error code on
rdiff.delta (signature, srcB, [,sink])
Creates the delta needed to transform fileA into fileB. Both
srcB are data sources read sequentially. The delta is either
sink or returned as a Lua string.
rdiff.patch (delta, srcA [,sink])
Applies the given delta into the base file. Both
srcA parameters are data sources; but
provide random-access. If it's given as a function, it will be called with
two number parameters specifying a position and length, respectively. It
should read a block of data and return it as a string.
Assuming the original data is in
this code creates the signature for
strA, calculates the
strB, and recreates
sig_str = rdiff.signature (strA) delta_str = rdiff.delta (sig_str, strB) new_str = rdiff.patch (delta_str, strA)
This function creates a simple file-based data source from a given open file.
Note that it provides sequential data when called without parameters, and
random-access when called with
function fsrc (f) return function (pos, len) if pos == nil and len == nil then return f:read (4096) else f:seek ("set", pos) return f:read (len) end end end
This is the corresponding data sink creator: we don't need any random-access sink, so it's much simpler.
function fsink (f) return function (data) f:write (data) end end
fA = io.open (fileAname, "rb") fsig = io.open (signaturename, "wb") assert (rdiff.signature (fsrc (fA), fsink (fsig))) fA:close() fsig:close()
fsig = io.open (signaturename, "rb") fB = io.open (fileBname, "rb") fdelta = io.open (deltaname, "wb") assert (rdiff.delta (fsrc (fsig), fsrc (fB), fsink(fdelta))) fsig:close() fB:close () fdelta:close()
fdelta = io.open (deltaname, "rb") fA = io.open (fileAname, "rb") fnew = io.open (newfilename, "wb") assert (rdiff.patch (fsrc(fdelta), fsrc(fA), fsink(fnew))) fnew:close() fA:close() fdelta:close ()
Copyright © 2006 Javier Guerra G. All rights reserved.