NAME
nextsplit_ae, nextsplit_buz, nextsplit_rabin - find the next split for variable-length chunks in a data stream
SYNOPSIS
#include <limb/nextsplit.h>
size_t nextsplit_ae(size_t min, size_t avg, const void *data, size_t dlen)
size_t nextsplit_buz(size_t min, size_t avg, const void *data, size_t dlen)
size_t nextsplit_rabin(size_t min, size_t avg, const void *data, size_t dlen)
DESCRIPTION
These functions are used for content-based chunking: each finds the next breakpoint at which to split a data stream into chunks, which is useful for things such as data deduplication.
Each of them looks into data (up to dlen bytes) for a position at which to split off a chunk, which shall be at least min bytes long, with an average/targeted length of avg. Failing to find one, they return dlen for a full chunk - as such, dlen should be the maximum chunk size and not the total data length.
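The point about dlen being the maximum chunk size is easiest to see in a chunking loop. The following is only a minimal sketch: splitter_fn, demo_split() and chunk_buffer() are hypothetical names introduced here for illustration and are not part of the library. demo_split() is a trivial stand-in (it splits after the first zero byte found at or past min) so that the sketch compiles without the library; it merely shares the signature shown in the SYNOPSIS.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by nextsplit_ae(), nextsplit_buz() and nextsplit_rabin(). */
typedef size_t (*splitter_fn)(size_t min, size_t avg, const void *data, size_t dlen);

/* HYPOTHETICAL stand-in, not a real chunking algorithm: splits right after
 * the first zero byte found at or past min, and returns dlen otherwise. */
static size_t demo_split(size_t min, size_t avg, const void *data, size_t dlen)
{
    const unsigned char *p = data;
    (void) avg; /* unused by this toy splitter */
    for (size_t i = min; i < dlen; ++i)
        if (p[i] == 0)
            return i + 1;
    return dlen;
}

/* Walk buf chunk by chunk. max must be greater than 0. The first cap chunk
 * lengths are stored in lens; the total number of chunks is returned. */
static size_t chunk_buffer(splitter_fn split, size_t min, size_t avg, size_t max,
                           const void *buf, size_t total,
                           size_t *lens, size_t cap)
{
    const unsigned char *p = buf;
    size_t off = 0, n = 0;

    while (off < total) {
        size_t remaining = total - off;
        /* dlen is capped at the maximum chunk size, not the remaining length */
        size_t dlen = remaining < max ? remaining : max;
        size_t len = split(min, avg, p + off, dlen);

        /* a zero-length or oversized chunk would stall or overrun the loop */
        assert(len > 0 && len <= dlen);
        if (n < cap)
            lens[n] = len;
        ++n;
        off += len;
    }
    return n;
}
```

With <limb/nextsplit.h> included and the library linked, any of the three functions could presumably be passed in place of demo_split(), since they share this signature. Note that the last chunk of a buffer may be shorter than min simply because the data ran out.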
The nextsplit_ae() function uses the Asymmetric Extremum (AE) content-defined chunking algorithm, which is notably extremely fast.
The nextsplit_buz() function uses the buzhash rolling hash, which gets noticeably better deduplication results whilst remaining very fast (though not as fast as AE).
The nextsplit_rabin() function uses a rolling-hash algorithm based on the Rabin fingerprint, which tends to get slightly better deduplication results, albeit being slower.
Note that it isn't uncommon for nextsplit_buz() to lead to better deduplication results than both other alternatives.
RETURN VALUES
All these functions return the length of the next chunk from data, as determined by their algorithm, or dlen when no such split could be determined.