limb 0.2.0

2024-01-09

nextsplit_ae(3)
limb manual

NAME

nextsplit_ae, nextsplit_buz, nextsplit_rabin - find the next split for variable-length chunks in a data stream

SYNOPSIS

#include <limb/nextsplit.h>
size_t nextsplit_ae(size_t min, size_t avg,
                    const void *data, size_t dlen);
size_t nextsplit_buz(size_t min, size_t avg,
                     const void *data, size_t dlen);
size_t nextsplit_rabin(size_t min, size_t avg,
                       const void *data, size_t dlen);

DESCRIPTION

These functions perform content-defined chunking: they find the next breakpoint at which to split a data stream into a chunk, which is useful for tasks such as data deduplication.

Each of them looks into data (up to dlen bytes) for a position at which to split off a chunk, which shall be at least min bytes long, with an average (target) length of avg. If no such position is found, they return dlen, making the whole buffer one chunk; dlen should therefore be the maximum chunk size, not the total data length.
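
The calling pattern described above can be sketched as a loop that never offers the splitter more than the maximum chunk size. This is a minimal sketch, not code taken from limb: the nextsplit_fn typedef, count_chunks() and no_split() are hypothetical names introduced here for illustration. no_split() is a stand-in that never finds a breakpoint, so the sketch compiles without limb; with limb installed, nextsplit_ae(), nextsplit_buz() or nextsplit_rabin() should be usable in its place, since they share the signature shown in the SYNOPSIS. The assertion encodes an assumption that the returned length is non-zero and no larger than dlen.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by the three nextsplit_*() functions (see SYNOPSIS). */
typedef size_t (*nextsplit_fn)(size_t min, size_t avg,
                               const void *data, size_t dlen);

/* Hypothetical stand-in, NOT a limb function: it never finds a
 * breakpoint, so it always reports a full chunk of dlen bytes. */
static size_t
no_split(size_t min, size_t avg, const void *data, size_t dlen)
{
    (void) min;
    (void) avg;
    (void) data;
    return dlen;
}

/* Walk buf chunk by chunk and return the number of chunks found.
 * max is the maximum chunk size, passed to the splitter as dlen. */
static size_t
count_chunks(nextsplit_fn split, size_t min, size_t avg, size_t max,
             const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t off = 0, n = 0;

    while (off < len) {
        /* Offer at most max bytes, not everything that remains. */
        size_t window = len - off;
        if (window > max)
            window = max;

        size_t clen = split(min, avg, p + off, window);
        /* Assumed contract: 0 < clen <= window. */
        assert(clen > 0 && clen <= window);

        /* A real caller would hash or store p[off .. off + clen) here. */
        off += clen;
        ++n;
    }
    return n;
}
```

With the stand-in splitter, a call such as count_chunks(no_split, 1024, 4096, 4096, buf, 10000) cuts the buffer at every 4096 bytes; passing one of the limb functions instead would let the content decide where the cuts fall, with max acting only as an upper bound.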

The nextsplit_ae() function uses the Asymmetric Extremum (AE) content-defined chunking algorithm, which is notable for being extremely fast.

The nextsplit_buz() function uses the buzhash rolling hash, which gives noticeably better deduplication results while remaining very fast (though not as fast as AE).

The nextsplit_rabin() function uses a rolling-hash algorithm based on the Rabin fingerprint, which tends to give slightly better deduplication results, at the cost of being slower.

Note that it is not uncommon for nextsplit_buz() to give better deduplication results than both alternatives.

RETURN VALUES

All these functions return the length of the next chunk from data as determined by their algorithm, or dlen when no split could be determined.

limb 0.1.0
2023-07-24