FOSS Unleashed

Dealing with buffers in libflux

In C there are a few ways to handle buffers of data. The C API provides two such means.

  1. Terminal/sentinel values: This is most commonly seen in “c-strings”, where the string is terminated with a special NUL (0x00) character. This makes it impossible to use any function that expects such a value with binary data which might contain a NUL byte early or lack one entirely.
  2. Pointer+Size pairs: This is seen in the read()/write()/fread()/fwrite()/memcpy()/etc functions. You provide a pair of arguments: a pointer and a size/length of the pointer. Sometimes you need to provide a third value: the size of the data type (when providing a length of an array).

However, there are a few other possible means:

  1. “Pascal Strings”: So called pascal-strings provide a size type (usually a 16bit integer) inline with the data before the actual data. This can be seen throughout the 9p and other protocols, this allows for passing binary data with less worry (sanity checking possibly malicious/erroneous lengths is necessary). This is also used in certain dynamic array implimentations.
  2. Start+End Pointer pairs: This can be seen in certain parts of various STL implimentations (via the .begin() and .end()), the end pointer can either be a pointer to the final valid value, or one passed the final valid value.
  3. Start+End Index: Similar to the prior example, these are simply indexes into a buffer or array that will be referenced along side the indexes. This can be seen in the jsmn json parser library. Again the end index could either be one passed the final valid value or the final valid value.
  4. Complicated State Objects: Seen in some STL objects, the object or struct would be expected to be passed around instead of direct access to the array and or buffer. This is unavoidable in certain cases, but sometimes done needlessly.

Pros and Cons

Let’s look at what each option has that’s good and bad about them.

Sentinel Values

The sentinel value can be seen throughout the core C standard library (AKA stdlib). The argv and env values passed to main() are NULL-terminated arrays of NUL-terminated strings. Every string literal is NUL-terminated. All functions expected to deal with ASCII or UTF strings expect NUL-terminated strings. Using NUL-terminated strings are the path of least resistance, there is no easy means to use anything else, and there are nearly every kind of function you’d want to handle them provided in <string.h>.

On the otherhand it’s very easy to forget that you need to allocate space for the NUL-terminator, pass data which has an early terminator, or is missing one, and some of the provided functions can produce data that is potentially not NUL-terminated when they normally would (see: entire reason strlcpy exists when strncpy exists and is standard). What about when you’re recieving data that is a mix of “string” data and binary data (eg: network protocols)? You have a buffer of “string” data, that is very likely not NUL-terminated, and thus you either have to copy it out, destroy the source buffer’s integrity (by inserting a NUL, possibly corrupting something else in that packet), or forgo all the normal string functions. Finally, unless you cache this data, to calculate the size of the data in such a buffer or array, a loop must be performed.

Pointer+Size pairing

Also available often through-out the C standard library, and the Berkley socket API (IE: the one everyone uses, because it’s in POSIX). This is less acessible, but still very common throughout stdlib. Some very, very basic operations (memset, bzero, memcpy) are available, but not much in the means of parsing or constructing data meant to be in those buffers. This is the means of transfer for any binary data, or anything that could possibly produce binary data (eg: read()). So long as you keep both parts of the data together, and update the length when passing a further index into the buffer (eg: you’re using repeated memcpy calls to construct data in the buffer), this works very well. While this provides everything needed to properly avoid buffer over-runs, and using indexes (instead of advancing a pointer) you avoid it entirely.

Pascal Strings (inline size)

Very similar to the Pointer and Size pairing, this can trivially be passed to functions requiring those arguments (and trivial macros could be provided to assist). This requires construction of functions that make use of this technique if working with C code. The common means of removing elements of the start of such a buffer or array would require making a copy with the new size inserted before the first element once more. Which means additional copying, which might not otherwise be required. However, this does mean the user of such an API need only provide a single value representing a single buffer, which can make code more readable (not that this is a huge boon in most cases).

Start and End Pointers or Indexes

These are easily convertable between each other, (buffer + index -> pointer and pointer - buffer -> index), it takes care to note wether or not the end value is the final valid value or one beyond it (this affects if end checks are using > or >=). This can also easily be converted into a buffer+size pair (buffer = pointer, size = pointer - buffer) to make use of APIs which use that convention, and when doing so, advancing the pointer means one doesn’t need to continually update the size (as it’s a simple substraction calculation). One could also have a list of start/end pairs all pointing to the same backing buffer safely, and provided that one keeps operations withing those start/end bounds (and there are not overlaps) one could safely mutate data in the buffer for that pairing without corrupting another value within the buffer (eg: the list of start/end pairs are tokenized values from the input buffer). However, there aren’t convenient APIs to use start/end pairs, code has to be littered with conversions to appropriate arguments to functions. Likewise, providing arguments to functions that take such pairs can be very akward (such as when wanting to pass a string literal).

What I do in libflux

Because of dealing a large amount of binary data which contained textual data:

  • I need functions that can easily handle an ambiguous case (partially binary, fully binary, fully textual with optional NUL terminator)
  • I need to take a NUL-terminated string and copy it into a buffer without the NUL terminator (and sometimes with)
  • I need to be able to store textual data and pass it around, and cannot rely on the source having a NUL-terminator. If I need to add a NUL-terminator I can copy the string at that point.
  • I need to be able to know if a buffer was fully or partially consumed or range (eg: strcmp, atol, or other parsing/extraction methods)
  • I would like to reduce the amount of needless copies of all or part of a buffer (eg: for purposes of adding a NUL-terminator)
  • I would like to be able to parse and tokenize a buffer:
    • When I can’t guarantee a NUL terminator
    • Without making a copy of every token
    • Without needing to insert a NUL terminator for each token (there are situations where the tokens a fully adjacent to each other)

Because of this, I feel the best option is to have start and end pointers. I can create sub-ranges of the backing data trivially without needing to modify the data, possibly corrupting other segments. For the purposes of the API I wrote to solve those issues, I use a special struct:

1
2
3
struct FluxBuffer {
uint8_t *start, *end;
};

This struct uses the start and end pointer method, and I have rewritten versions of most of the normal <string.h> functions to work with these structures. There’s bufcpy to copy data (and because size checks are trivial, it checks the size of the destination is large enough for holding the data in the source), buflcpy (which adds a NUL-terminator to the destination after the copy), bufcmp (which can optionally provided a pointer to where the comparison checks ended), and bufindex (which returns a pointer to the first instance of a character in the buffer).

I also provide a (FLUX_)BUFLIT (Buffer Literal) macro which makes use of the fact that string literals with the same contents can share an address (this breaks if they don’t, but I have a unit-test to sanity check that they do). Due to how it is implimented, you can also provide a static buffer and call BUFLIT on it to get a valid buffer.

This means I can write code like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
uint8_t buffer[1 << 14], *p;
Buffer buf = BUFLIT(buffer);
ssize_t sz;

sz = recv(sock, buffer, sizeof(buffer));

// check sz for -1 error

buf.end = buffer + sz;

if (0 == bufeq(buf, BUFLIT("start"), &p)) {
// `bufeq` is a more lenient `bufcmp`, and will not fail on a comparison of the final NUL-terminator character in the second argument (arguments are otherwise exactly the same)
// `p` points to the point in `buf` after the last good match
// This means that if we recieved exactly five characters ('s', 't', 'a', 'r', 't') then `p == buf.end` will be true
// otherwise, we could use this behavior to parse out more of the recieved data by setting `buf.start` to `p`
}

With more functions provided, and some more helper macros (possibly one to fill both pointer and size arguments of common functions), we could keep code both safe and readable.

In the event one recieves a NUL-terminated string into a buffer:

1
2
3
4
5
6
uint8_t buffer[1 << 14];
Buffer buf = BUFLIT(buffer);

functionThatWritesNULTerminatedString(buffer);

buf.end = bufindex(buf, 0); // Note: this means that end points to the NUL rather than one-passed it. This is more usefull when wanting to then serialize this string for transmission or storage.

If we need to process multiple sub strings based on some kind of sentinel (eg: reading a buffer line-by-line):

1
2
3
4
5
6
7
8
9
10
11
12
Buffer buf = getBufferToProcess(), line;

line.start = buf.start;
line.end = bufindex(buf, '\n'); // Returns either buf.end or a pointer to a '\n'

while (line.start < buf.end) {
processLine(line);

line.start = line.end + 1; // skip the '\n'
line.end = buf.end;
line.end = bufindex(line, '\n');
}

… or using the bufadvance macro:

1
2
3
4
5
6
7
8
9
10
11
12
13
Buffer buf = getBufferToProcess(), line;

line.start = buf.start;
line.end = buf.start;

while (line.start < buf.end) {
bufadvance(line.end, buf.end, '\n' != *line.end);

processLine(line);

line.end++;
line.start = line.end;
}

Because the functions to deal with text data and binary data are the same, one doesn’t need to deal with conversions between the requirements of the two function requirements. A substring can easily be produced without needing to create a new copy of the data.