Data Structures¶
Zephyr provides a library of common general purpose data structures used within the kernel, but also useful to application code in general. These include list and balanced tree structures for storing ordered data, and a ring buffer for managing “byte stream” data in a clean way.
Note that in general, the collections are implemented as “intrusive” data structures. The “node” data is the only struct used by the library code, and it does not store a pointer or other metadata to indicate what user data is “owned” by that node. Instead, the expectation is that the node will be itself embedded within a user-defined struct. Macros are provided to retrieve a user struct address from the embedded node pointer in a clean way. The purpose behind this design is to allow the collections to be used in contexts where dynamic allocation is disallowed (i.e. there is no need to allocate node objects because the memory is provided by the user).
Note also that these libraries are uniformly unsynchronized; access to them is not threadsafe by default. These are data structures, not synchronization primitives. The expectation is that any locking needed will be provided by the user.
Single-linked List¶
Zephyr provides a sys_slist_t
type for storing simple
singly-linked list data (i.e. data where each list element stores a
pointer to the next element, but not the previous one). This supports
constant-time access to the first (head) and last (tail) elements of
the list, insertion before the head and after the tail of the list and
constant time removal of the head. Removal of subsequent nodes
requires access to the “previous” pointer and thus can only be
performed in linear time by searching the list.
The sys_slist_t
struct may be instantiated by the user in any
accessible memory. It should be initialized with either
sys_slist_init()
or by static assignment from SYS_SLIST_STATIC_INIT
before use. Its interior fields are opaque and should not be accessed
by user code.
The end nodes of a list may be retrieved with
sys_slist_peek_head()
and sys_slist_peek_tail()
, which will
return NULL if the list is empty, otherwise a pointer to a
sys_snode_t
struct.
The sys_snode_t
struct represents the data to be inserted. In
general, it is expected to be allocated/controlled by the user,
usually embedded within a struct which is to be added to the list.
The container struct pointer may be retrieved from a list node using
SYS_SLIST_CONTAINER()
, passing it the struct name of the
containing struct and the field name of the node. Internally, the
sys_snode_t
struct contains only a next pointer, which may be
accessed with sys_slist_peek_next()
.
Lists may be modified by adding a single node at the head or tail with
sys_slist_prepend()
and sys_slist_append()
. They may also
have a node added to an interior point with sys_slist_insert()
,
which inserts a new node after an existing one. Similarly
sys_slist_remove()
will remove a node given a pointer to its
predecessor. These operations are all constant time.
Convenience routines exist for more complicated modifications to a
list. sys_slist_merge_slist()
will append an entire list to an
existing one. sys_slist_append_list()
will append a bounded
subset of an existing list in constant time. And
sys_slist_find_and_remove()
will search a list (in linear time)
for a given node and remove it if present.
Finally the slist implementation provides a set of “for each” macros
that allows for iterating over a list in a natural way without needing
to manually traverse the next pointers. SYS_SLIST_FOR_EACH_NODE()
will enumerate every node in a list given a local variable to store
the node pointer. SYS_SLIST_FOR_EACH_NODE_SAFE()
behaves
similarly, but has a more complicated implementation that requires an
extra scratch variable for storage and allows the user to delete the
iterated node during the iteration. Each of those macros also exists
in a “container” variant (SYS_SLIST_FOR_EACH_CONTAINER()
and
SYS_SLIST_FOR_EACH_CONTAINER_SAFE()
) which assigns a local
variable of a type that matches the user’s container struct and not
the node struct, performing the required offsets internally. And
SYS_SLIST_ITERATE_FROM_NODE()
exists to allow for enumerating a
node and all its successors only, without inspecting the earlier part
of the list.
Slist Internals¶
The slist code is designed to be minimal and conventional.
Internally, a sys_slist_t
struct is nothing more than a pair of
“head” and “tail” pointer fields. And a sys_snode_t
stores only a
single “next” pointer.
The specific implementation of the list code, however, is done with an internal “Z_GENLIST” template API which allows for extracting those fields from arbitrary structures and emits an arbitrarily named set of functions. This allows for implementing more complicated single-linked list variants using the same basic primitives. The genlist implementor is responsible for a custom implementation of the primitive operations only: an “init” step for each struct, and a “get” and “set” primitives for each of head, tail and next pointers on their relevant structs. These inline functions are passed as parameters to the genlist macro expansion.
Only one such variant, sflist, exists in Zephyr at the moment.
Flagged List¶
The sys_sflist_t
is implemented using the described genlist
template API. With the exception of symbol naming (“sflist” instead
of “slist”), it operates in all ways identically to the slist API.
It includes the ability to associate exactly two bits of user defined
“flags” with each list node. These can be accessed and modified with
sys_sflist_flags_get()
and sys_sflist_flags_set()
.
Internally, the flags are stored unioned with the bottom bits of the
next pointer and incur no SRAM storage overhead when compared with the
simpler slist code.
Double-linked List¶
Similar to the single-linked list in many respects, Zephyr includes a double-linked implementation. This provides the same algorithmic behavior for all the existing slist operations, but also allows for constant-time removal and insertion (at all points: before or after the head, tail or any internal node). To do this, the list stores two pointers per node, and thus has somewhat higher runtime code and memory space needs.
A sys_dlist_t
struct may be instantiated by the user in any
accessible memory. It must be initialized with sys_dlist_init()
or SYS_DLIST_STATIC_INIT()
before use. The sys_dnode_t
struct
is expected to be provided by the user for any nodes added to the
list (typically embedded within the struct to be tracked, as described
above). It must be initialized in zeroed/bss memory or with
sys_dnode_init()
before use.
Primitive operations may retrieve the head/tail of a list and the
next/prev pointers of a node with sys_dlist_peek_head()
,
sys_dlist_peek_tail()
, sys_dlist_peek_next()
and
sys_dlist_peek_prev()
. These can all return NULL where
appropriate (i.e. for empty lists, or nodes at the endpoints of the
list).
A dlist can be modified in constant time by removing a node with
sys_dlist_remove()
, by adding a node to the head or tail of a list
with sys_dlist_prepend()
and sys_dlist_append()
, or by
inserting a node before an existing node with sys_dlist_insert()
.
As for slist, each node in a dlist can be processed in a natural code
block style using SYS_DLIST_FOR_EACH_NODE()
. This macro also
exists in a “FROM_NODE” form which allows for iterating from a known
starting point, a “SAFE” variant that allows for removing the node
being inspected within the code block, a “CONTAINER” style that
provides the pointer to a containing struct instead of the raw node,
and a “CONTAINER_SAFE” variant that provides both properties.
Convenience utilities provided by dlist include
sys_dlist_insert_at()
, which inserts a node at a position found by
linearly searching through the list, using a C callback function
pointer provided by the user to identify the insertion point, and
sys_dlist_is_linked()
, which returns whether or not a node is
currently linked into a dlist (via an implementation that has zero
overhead vs. the normal list processing).
Dlist Internals¶
Internally, the dlist implementation is minimal: the sys_dlist_t
struct contains “head” and “tail” pointer fields, the sys_dnode_t
contains “prev” and “next” pointers, and no other data is stored. But
in practice the two structs are internally identical, and the list
struct is inserted as a node into the list itself. This allows for a
very clean symmetry of operations:
An empty list has backpointers to itself in the list struct, which can be trivially detected.
The head and tail of the list can be detected by comparing the prev/next pointers of a node vs. the list struct address.
An insertion or deletion never needs to check for the special case of inserting at the head or tail. There are never any NULL pointers within the list to be avoided. Exactly the same operations are run, without tests or branches, for all list modification primitives.
Effectively, a dlist of N nodes can be thought of as a “ring” of “N+1” nodes, where one node represents the list tracking struct.
Balanced Red/Black Tree¶
For circumstances where sorted containers may become large at runtime, a list becomes problematic due to algorithmic costs of searching it. For these situations, Zephyr provides a balanced tree implementation which has runtimes on search and removal operations bounded at O(log2(N)) for a tree of size N. This is implemented using a conventional red/black tree as described by multiple academic sources.
The struct rbtree
tracking struct for an rbtree may be initialized
anywhere in user accessible memory. It should contain only zero bits
before first use. No specific initialization API is needed or
required.
Unlike a list, where position is explicit, the ordering of nodes
within an rbtree must be provided as a predicate function by the user.
A function of type rb_lessthan_t()
should be assigned to the
lessthan_fn
field of the struct rbtree
before any tree
operations are attempted. This function should, as its name suggests,
return a boolean True value if the first node argument is “less than”
the second in the ordering desired by the tree. Note that “equal” is
not allowed: nodes within a tree must have a single fixed order for
the algorithm to work correctly.
As with the slist and dlist containers, nodes within an rbtree are
represented as a struct rbnode
structure which exists in
user-managed memory, typically embedded within the data structure
being tracked in the tree. Unlike the list code, the data within an
rbnode is entirely opaque. It is not possible for the user to extract
the binary tree topology and “manually” traverse the tree as it is for
a list.
Nodes can be inserted into a tree with rb_insert()
and removed
with rb_remove()
. Access to the “first” and “last” nodes within a
tree (in the sense of the order defined by the comparison function) is
provided by rb_get_min()
and rb_get_max()
. There is also a
predicate, rb_contains()
, which returns a boolean True if the
provided node pointer exists as an element within the tree. As
described above, all of these routines are guaranteed to have at most
log time complexity in the size of the tree.
There are two mechanisms provided for enumerating all elements in an
rbtree. The first, rb_walk()
, is a simple callback implementation
where the caller specifies a C function pointer and an untyped
argument to be passed to it, and the tree code calls that function for
each node in order. This has the advantage of a very simple
implementation, at the cost of a somewhat more cumbersome API for the
user (not unlike ISO C’s bsearch()
routine). It is a recursive
implementation, however, and is thus not always available in
environments that forbid the use of unbounded stack techniques like
recursion.
There is also a RB_FOR_EACH()
iterator provided, which, like the
similar APIs for the lists, works to iterate over the tree in a more
natural way, using a nested code block instead of a callback. It is
also nonrecursive, though it requires log-sized space on the stack by
default (however, this can be configured to use a fixed maximum-size
buffer instead where needed to avoid the dynamic allocation). As with
the lists, this is also available in a RB_FOR_EACH_CONTAINER()
variant which enumerates using a pointer to a container field and not
the raw node pointer.
Tree Internals¶
As described, the Zephyr rbtree implementation is a conventional red/black tree as described pervasively in academic sources. Low level details about the algorithm are out of scope for this document, as they match existing conventions. This discussion will be limited to details notable or specific to the Zephyr implementation.
The core invariant guaranteed by the tree is that the path from the root of the tree to any leaf is no more than twice as long as the path to any other leaf. This is achieved by associating one bit of “color” with each node, either red or black, and enforcing a rule that no red child can be a child of another red child (i.e. that the number of black nodes on any path to the root must be the same, and that no more than that number of “extra” red nodes may be present). This rule is enforced by a set of rotation rules used to “fix” trees following modification.
These rotations are conceptually implemented on top of a primitive
that “swaps” the position of one node with another in the tree.
Typical implementations effect this by simply swapping the nodes’
internal “data” pointers, but because the Zephyr struct rbnode
is
intrusive, that cannot work. Zephyr must include somewhat more
elaborate code to handle the edge cases (for example, one swapped node
can be the root, or the two may already be parent/child).
The struct rbnode
struct for a Zephyr rbtree contains only two
pointers, representing the “left”, and “right” children of a node
within the binary tree. Traversal of a tree for rebalancing following
modification, however, routinely requires the ability to iterate
“upwards” from a node as well. It is very common for red/black trees
in the industry to store a third “parent” pointer for this purpose.
Zephyr avoids this requirement by building a “stack” of node pointers
locally as it traverses downward through the tree and updating it
appropriately as modifications are made. So a Zephyr rbtree can be
implemented with no more runtime storage overhead than a dlist.
These properties, of a balanced tree data structure that works with only two pointers of data per node and that works without any need for a memory allocation API, are quite rare in the industry and are somewhat unique to Zephyr.
Ring Buffer¶
For circumstances where an application needs to implement asynchronous
“streaming” copying of data, Zephyr provides a struct ring_buf
abstraction to manage copies of such data in and out of a shared
buffer of memory. Ring buffers may be used in either “bytes” mode,
where the data to be streamed is an uninterpreted array of bytes, or
“items” mode where the data must be an integral number of 32 bit
words. While the underlying data structure is the same, it is not
legal to mix these two modes on a single ring buffer instance. A ring
buffer initialized with a byte count must be used only with the
“bytes” API; one initialized with a word count must use the “items”
calls.
A struct ring_buf
may be placed anywhere in user-accessible
memory, and must be initialized with ring_buf_init()
before use.
This must be provided a region of user-controlled memory for use as
the buffer itself. Note carefully that the units of the size of the
buffer passed change (either bytes or words) depending on how the ring
buffer will be used later. Macros for combining these steps in a
single static declaration exist for convenience.
RING_BUF_DECLARE()
will declare and statically initialize a ring
buffer with a specified byte count, where
RING_BUF_ITEM_DECLARE_SIZE()
will declare and statically
initialize a buffer with a given count of 32 bit words.
RING_BUF_ITEM_DECLARE_POW2()
can be used to initialize an
items-mode buffer with a memory region guaranteed to be a power of
two, which enables various optimizations internal to the
implementation. No power-of-two initialization is available for
bytes-mode ring buffers.
“Bytes” data may be copied into the ring buffer using
ring_buf_put()
, passing a data pointer and byte count. These
bytes will be copied into the buffer in order, as many as will fit in
the allocated buffer. The total number of bytes copied (which may be
fewer than provided) will be returned. Likewise ring_buf_get()
will copy bytes out of the ring buffer in the order that they were
written, into a user-provided buffer, returning the number of bytes
that were transferred.
To avoid multiply-copied-data situations, a “claim” API exists for
byte mode. ring_buf_put_claim()
takes a byte size value from the
user and returns a pointer to memory internal to the ring buffer that
can be used to receive those bytes, along with a size of the
contiguous internal region (which may be smaller than requested). The
user can then copy data into that region at a later time without
assembling all the bytes in a single region first. When complete,
ring_buf_put_finish()
can be used to signal the buffer that the
transfer is complete, passing the number of bytes actually
transferred. At this point a new transfer can be initiated.
Similarly, ring_buf_get_claim()
returns a pointer to internal ring
buffer data from which the user can read without making a verbatim
copy, and ring_buf_get_finish()
signals the buffer with how many
bytes have been consumed and allows for a new transfer to begin.
“Items” mode works similarly to bytes mode, except that all transfers
are in units of 32 bit words and all memory is assumed to be aligned
on 32 bit boundaries. The write and read operations are
ring_buf_item_put()
and ring_buf_item_get()
, and work
otherwise identically to the bytes mode APIs. There is no “claim” API
provided for items mode. One important difference is that unlike
ring_buf_put()
, ring_buf_item_put()
will not do a partial
transfer; it will return an error in the case where the provided data
does not fit in its entirety.
The user can manage the capacity of a ring buffer without modifying it
using the ring_buf_space_get()
call (which returns a value of
either bytes or items depending on how the ring buffer has been used),
or by testing the ring_buf_is_empty()
predicate.
Finally, a ring_buf_reset()
call exists to immediately empty a
ring buffer, discarding the tracking of any bytes or items already
written to the buffer. It does not modify the memory contents of the
buffer itself, however.
Ring Buffer Internals¶
Data streamed through a ring buffer is always written to the next byte
within the buffer, wrapping around to the first element after reaching
the end, thus the “ring” structure. Internally, the struct
ring_buf
contains its own buffer pointer and its size, and also a
“head” and “tail” index representing where the next read and write
operations will occur.
This wrap boundary is invisible to the user of the normal put/get APIs, but becomes a barrier to the “claim” API, because obviously no contiguous region can be returned that crosses the end of the buffer. This can be surprising to application code, and produce performance artifacts when transfers need to alias closely to the size of the buffer, as the number of calls to claim/finish needs to double for such transfers.
When running in items mode (only), the ring buffer contains two implementations for the modular arithmetic required to compute “next element” offsets. One is used for arbitrary sized buffers, but the other is optimized for power of two sizes and can replace the compare and subtract steps with a simple bitmask in several places, at the cost of testing the “mask” value for each call.