% limb manual
% nextsplit_ae(3)
% limb 0.1.0
% 2023-07-24

# NAME

nextsplit_ae, nextsplit_buz, nextsplit_rabin - find the next split for
variable-length chunks in a data stream


# SYNOPSIS

    #include <limb/nextsplit.h>

```pre hl
size_t nextsplit_ae(size_t <em>min</em>, size_t <em>avg</em>,
                    const void *<em>data</em>, size_t <em>dlen</em>)
size_t nextsplit_buz(size_t <em>min</em>, size_t <em>avg</em>,
                     const void *<em>data</em>, size_t <em>dlen</em>)
size_t nextsplit_rabin(size_t <em>min</em>, size_t <em>avg</em>,
                       const void *<em>data</em>, size_t <em>dlen</em>)
```

# DESCRIPTION

These functions implement content-defined chunking: each one finds the next
breakpoint at which to split a data stream into a chunk, which is useful for
tasks such as data deduplication.

Each of them scans `data` (up to `dlen` bytes) for a position at which to split
off a chunk that is at least `min` bytes long, with a targeted average length of
`avg`. If no such position is found, they return `dlen`, making the whole buffer
a single chunk. For this reason, `dlen` should be the maximum chunk size and not
the total data length.
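As an illustration of why `dlen` should be the maximum chunk size, here is a
minimal sketch of a chunking loop. It is not taken from limb: `splitter_fn`,
`demo_split`, and `chunk_all` are hypothetical names, and `demo_split` is a
trivial stand-in with the same signature as the functions documented here, so
that the sketch builds without the library. It also assumes the splitter never
returns 0 for a non-empty buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as nextsplit_ae(), nextsplit_buz() and nextsplit_rabin(). */
typedef size_t (*splitter_fn)(size_t min, size_t avg,
                              const void *data, size_t dlen);

/* Hypothetical stand-in, NOT one of limb's algorithms: cut right after
 * the first zero byte that yields a chunk of at least `min` bytes, or
 * return `dlen` when there is none. */
static size_t
demo_split(size_t min, size_t avg, const void *data, size_t dlen)
{
    const unsigned char *p = data;

    (void) avg; /* unused by this toy splitter */
    for (size_t i = min ? min - 1 : 0; i < dlen; ++i)
        if (p[i] == 0)
            return i + 1;
    return dlen;
}

/* Split `len` bytes of `buf` into chunks of at most `max` bytes.
 * Stores up to `cap` chunk lengths in `lens`; returns the chunk count. */
static size_t
chunk_all(splitter_fn split, size_t min, size_t avg, size_t max,
          const unsigned char *buf, size_t len,
          size_t *lens, size_t cap)
{
    size_t off = 0, n = 0;

    while (off < len) {
        size_t left = len - off;
        /* dlen is the maximum chunk size, clamped to what remains */
        size_t dlen = left < max ? left : max;
        size_t clen = split(min, avg, buf + off, dlen);

        /* assumption: a splitter returns a length in 1..dlen */
        assert(clen > 0 && clen <= dlen);
        if (n < cap)
            lens[n] = clen;
        ++n;
        off += clen;
    }
    return n;
}
```

With limb available, one would presumably include `<limb/nextsplit.h>` and pass
`nextsplit_ae`, `nextsplit_buz`, or `nextsplit_rabin` in place of `demo_split`.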

The `nextsplit_ae`() function uses the Asymmetric Extremum (AE) content-defined
chunking algorithm, which is notable for being extremely fast.

The `nextsplit_buz`() function uses the buzhash rolling hash, which gives
noticeably better deduplication results while remaining very fast (though not as
fast as AE).

The `nextsplit_rabin`() function uses a rolling-hash algorithm based on Rabin
fingerprints, which tends to give slightly better deduplication results, albeit
at a lower speed.

Note that it isn't uncommon for `nextsplit_buz`() to give better deduplication
results than both of the alternatives.

# RETURN VALUES

All of these functions return the length of the next chunk from `data` as
determined by their algorithm, or `dlen` when no split could be determined.