When people take their first C programming courses, they are taught about the standard allocator, named ‘malloc()’, while when learning C++, they are first taught about its standard allocation operator, named ‘new’.
These allocators work on the assumption that a program is running in user-space, and they may not always be efficient at allocating smaller chunks of memory. They assume the standard method of managing the heap, in which the heap of any one process is a part of that process’s memory-image, and is partially managed by the kernel.
Not only that, but when we tell either of these standard allocators to allocate a chunk of memory, the allocator records the size of that chunk in a small header which it prepends to the chunk, and returns a pointer that points just past this header. Thus, the pointer returned by either of these allocators points directly to the memory which the programmer can use, even though the allocated chunk is slightly larger, and preceded by a binary representation of its own size. That way, when the command is given to deallocate, all the deallocation-function needs to receive, in principle, is a pointer to the allocated chunk; it can step back from there to the header, to derive how much memory to free.
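To make this concrete, below is a minimal sketch in C of a wrapper that behaves this way. The names ‘my_alloc()’ and ‘my_free()’ are hypothetical, and a real allocator keeps more bookkeeping (and stricter alignment) than a single size field:

#include <stdlib.h>

/* Hypothetical wrapper that mimics how a size header can precede
   the memory handed to the programmer. */
void *my_alloc(size_t size)
{
    /* Allocate room for the header plus the requested bytes. */
    size_t *chunk = malloc(sizeof(size_t) + size);
    if (chunk == NULL)
        return NULL;

    *chunk = size;        /* Record the requested size in the header. */
    return chunk + 1;     /* Return a pointer just past the header.   */
}

void my_free(void *ptr)
{
    if (ptr == NULL)
        return;

    size_t *chunk = (size_t *)ptr - 1;  /* Step back to the header. */
    /* chunk[0] now states how many bytes the caller had requested;
       a real allocator would use this to return the whole chunk,
       header included, to its free pool. */
    free(chunk);
}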
I suppose that one conclusion to draw from this is that, even though it looks like a good exercise to teach programming students, the exercise of allocating a single 32-bit or 64-bit object – i.e., a 4-byte or an 8-byte object, such as an integer – just to obtain an 8-byte pointer to that integer, is actually not a good one. In addition to the bytes requested, a header is always being allocated as well, which may add 4 bytes if the maximum allocated size (of one chunk) is a 32-bit number, or 8 bytes if it is a 64-bit number.
Additionally, these allocators assume the support of the kernel to a user-space process, the latter of which has a heap. On 64-bit systems with demand paging, the kernel will typically only reserve a range of virtual addresses for a new heap-page at first, so that when the user-space application first touches that page, a page-fault intentionally results, and stops the process. The kernel then needs to examine why the page-fault occurred, and since in this case the reason was legitimate, it sets up the virtual page-frame and restarts the user-space process, which continues as though the memory had been present all along.
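As a minimal illustration of the kernel’s side of the bargain, a Linux process can request one page of anonymous memory with ‘mmap()’; typically, the physical frame only gets assigned when the page is first touched, via exactly this page-fault mechanism:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* Ask the kernel for one 4KB page of anonymous memory.
       At this point, only the virtual address range is reserved. */
    char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The first write touches the page, triggers a page-fault,
       and the kernel transparently maps in a physical frame. */
    strcpy(page, "hello");
    printf("%s\n", page);

    munmap(page, 4096);
    return 0;
}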
And so means also needed to exist by which a kernel can manage memory more efficiently, even under the assumption that the kernel does not have the sort of heap that a user-space process does. One main mechanism for doing so is to use a slab allocator, which will allocate large numbers of small chunks without requiring as much overhead to do so as the standard user-space allocators did. In kernel-space, these slabs are the main replacement for a heap.
(Updated 06/20/2017 … )
(Edit 06/20/2017 :
I suppose that I should add that, when the standard, user-space allocator gets used by an application, there are essentially two types of allocations taking place.

The first is the allocation by the kernel of an entire 4KB page of virtual memory, to the process. The second would be the allocation of a smaller chunk of memory – such as even an 8-byte chunk – to the code which executed the ‘malloc()’ call or the ‘new’ instruction.
Because of this, a page of virtual memory cannot be deallocated again until every chunk of (programmer-visible) memory that resided within it has been deallocated first. In the case of smaller chunks, in practice this happens infrequently. Therefore, the heap tends to grow over time, because newer pages are still being allocated, while older pages, which now have deallocated chunks within them, still cannot be removed from the process’s address-space. )
But then applications also exist in user-space, which might want to use a slab-allocator, for some but not all of the reasons the kernel has one. Within the kernel itself, the C-function that provides slab-backed allocation under Linux is called ‘kmalloc()’; user-space programs that want the same benefits generally implement a slab-allocator of their own, or link against a library that provides one. If the number of small objects to be allocated truly becomes large, this starts to become memory-efficient. And using a slab-allocator is generally more time-efficient.
In user-space, memory corresponding to 1 page of the slab-allocator is first allocated to the heap of the process using the standard allocator, but it then usually remains allocated for as long as the application keeps running, so that the deallocation and reallocation of chunks belonging to the slabs no longer involves the standard allocator.
So there exist specialized applications, such as ‘memcached’, where the main constraint is time, i.e., where allocation should be extremely fast.
Now, there exist experts who will tell the reader that their implementation of the slab allocator is actually a ‘slub allocator’, but I consider this to be a refinement that improves performance, and one whose details I don’t need to know.
While on my computer, slab-classes 2 and 3 offer 120-byte and 152-byte chunk-sizes, it would have made sense for slab-class 1 to offer a much smaller chunk-size than its actual 96 bytes, to catch requests from certain programmers for the smallest-possible chunk.
In practical computing, a possible C++ class that one might like to implement could be a ‘cons object’, which contains two addresses: the head and the tail of a linked-list. This also represents the smallest chunk that I could think to allocate via ‘new’, and on 64-bit systems it would represent a 16-byte allocation.

Possibly, slab-class 1 is still larger than that. (Note: The chunk-sizes of slab-classes 2 and 3 seem to differ by 32 for some reason. )
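To make the size concrete, a cons cell can be sketched in C as nothing more than a pair of pointers; the name ‘cons_t’ is my own, and on an LP64 system the pair occupies exactly 16 bytes:

#include <stdio.h>

/* A hypothetical cons cell: two pointers, the head (the datum)
   and the tail (the rest of the linked-list). */
typedef struct cons {
    void        *head;
    struct cons *tail;
} cons_t;

int main(void)
{
    /* On a 64-bit (LP64) system, each pointer is 8 bytes,
       so the whole cell is a 16-byte allocation. */
    printf("sizeof(cons_t) = %zu\n", sizeof(cons_t));
    return 0;
}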
(Edit : Below are the real slab-sizes, in base 10. In many contexts, the same information would instead be quoted in units of 2^10 or 2^20 bytes, i.e., in KB or MB respectively. )
dirk@Phoenix:~$ memcached -vv
slab class 1: chunk size 96 perslab 10922
slab class 2: chunk size 120 perslab 8738
slab class 3: chunk size 152 perslab 6898
slab class 4: chunk size 192 perslab 5461
slab class 5: chunk size 240 perslab 4369
slab class 6: chunk size 304 perslab 3449
slab class 7: chunk size 384 perslab 2730
slab class 8: chunk size 480 perslab 2184
slab class 9: chunk size 600 perslab 1747
slab class 10: chunk size 752 perslab 1394
slab class 11: chunk size 944 perslab 1110
slab class 12: chunk size 1184 perslab 885
slab class 13: chunk size 1480 perslab 708
slab class 14: chunk size 1856 perslab 564
slab class 15: chunk size 2320 perslab 451
slab class 16: chunk size 2904 perslab 361
slab class 17: chunk size 3632 perslab 288
slab class 18: chunk size 4544 perslab 230
slab class 19: chunk size 5680 perslab 184
slab class 20: chunk size 7104 perslab 147
slab class 21: chunk size 8880 perslab 118
slab class 22: chunk size 11104 perslab 94
slab class 23: chunk size 13880 perslab 75
slab class 24: chunk size 17352 perslab 60
slab class 25: chunk size 21696 perslab 48
slab class 26: chunk size 27120 perslab 38
slab class 27: chunk size 33904 perslab 30
slab class 28: chunk size 42384 perslab 24
slab class 29: chunk size 52984 perslab 19
slab class 30: chunk size 66232 perslab 15
slab class 31: chunk size 82792 perslab 12
slab class 32: chunk size 103496 perslab 10
slab class 33: chunk size 129376 perslab 8
slab class 34: chunk size 161720 perslab 6
slab class 35: chunk size 202152 perslab 5
slab class 36: chunk size 252696 perslab 4
slab class 37: chunk size 315872 perslab 3
slab class 38: chunk size 394840 perslab 2
slab class 39: chunk size 493552 perslab 2
slab class 40: chunk size 616944 perslab 1
slab class 41: chunk size 771184 perslab 1
slab class 42: chunk size 1048576 perslab 1
Observation: Each chunk-size is divisible by 8 bytes, which fits the alignment-constraints of 64-bit PCs well. So the method applied seems to be (see the sketch after this list):

- Start with 96.
- For each successive slab-class, multiply by 1.25, and round up to the nearest multiple of 8.
- Except for the last slab-class, in which the chunk-size just equals the page-size. In this case, the next chunk-size in the sequence would still have been smaller. This suggests that in the case of this slab-class, an added layer of parent-pages needs to keep track of the data-pages.
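A small C program can check this rule against the listing above. The growth factor of 1.25 matches ‘memcached’s default, but the other constants are simply read off from the output, so this is only a sketch of how the table appears to be derived:

#include <stdio.h>

#define PAGE_SIZE    1048576  /* memcached's slab page: 1 MB         */
#define FIRST_CHUNK  96       /* chunk-size of slab class 1 (above)  */
#define GROWTH       1.25     /* memcached's default growth factor   */

int main(void)
{
    double size = FIRST_CHUNK;
    int class_no = 1;

    /* Stop before a grown chunk would no longer fit the page. */
    while (size <= PAGE_SIZE / GROWTH) {
        /* Round up to the nearest multiple of 8 bytes. */
        size_t chunk = (((size_t)size + 7) / 8) * 8;
        printf("slab class %2d: chunk size %7zu perslab %5zu\n",
               class_no, chunk, (size_t)(PAGE_SIZE / chunk));
        size = chunk * GROWTH;
        class_no++;
    }

    /* The last slab-class just uses the whole page as one chunk. */
    printf("slab class %2d: chunk size %7d perslab %5d\n",
           class_no, PAGE_SIZE, 1);
    return 0;
}

Run against the listing above, this reproduces all 42 classes, including the jump from 771184 directly to 1048576 at the end.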
But according to my own software, this slab-allocator has never been requested to allocate chunks from slab-class 1, because everything that the software was asked to cache was larger.
(Edit : )
Just as any other memory-management system does, a slab-allocator needs a fast methodology to find and identify free chunks, so that when the command is given to allocate one, this can be done in the fewest steps possible.
A plausible methodology could be that, each time a new page is allocated to a slab-class, every chunk in it is filled with a pointer that states the next available free chunk. The last chunk would start out containing the NULL pointer, and a header belonging to the page would point to the head of this linked list. (A sketch of this scheme, written in C, follows the list below. )
- If a chunk was to be allocated, the first free chunk, pointed to by this header, would be chosen, but the header would also be edited, so as no longer to point to the same chunk. Instead, the header would now be set to the address which the allocated chunk pointed to.
- If a chunk was to be deallocated, it would be inserted at the beginning of this list: the chunk would be set to contain the starting-address that the header previously contained, and the header would be set to point to the newly-deallocated chunk.
- If the corresponding header-field ever contained NULL, this would mean that the entire page has been allocated, and has zero free chunks remaining.
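Here is a minimal sketch of that free-list scheme, in C. All of the names are hypothetical, one ‘page’ is simulated with a plain ‘malloc()’ call, and a real slab- or slub-allocator keeps considerably more state than this:

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 96      /* one slab-class's chunk-size (assumed) */
#define PER_SLAB   10922   /* chunks per page, as reported above    */

typedef struct {
    void *free_head;       /* header: head of the free-chunk list */
    char *chunks;          /* the page of chunks itself           */
} slab_page_t;

/* Thread a free list through a freshly allocated page: each chunk
   starts out holding a pointer to the next one; the last holds NULL. */
static int slab_page_init(slab_page_t *page)
{
    page->chunks = malloc((size_t)CHUNK_SIZE * PER_SLAB);
    if (page->chunks == NULL)
        return -1;

    for (size_t i = 0; i + 1 < PER_SLAB; i++)
        *(void **)(page->chunks + i * CHUNK_SIZE) =
            page->chunks + (i + 1) * CHUNK_SIZE;

    *(void **)(page->chunks + (size_t)(PER_SLAB - 1) * CHUNK_SIZE) = NULL;
    page->free_head = page->chunks;
    return 0;
}

/* Allocate: take the chunk at the head of the list, and advance the
   header to whatever that chunk pointed to. NULL means the page is
   fully allocated. */
static void *slab_alloc(slab_page_t *page)
{
    void *chunk = page->free_head;
    if (chunk != NULL)
        page->free_head = *(void **)chunk;
    return chunk;
}

/* Deallocate: push the chunk back onto the head of the list. */
static void slab_free(slab_page_t *page, void *chunk)
{
    *(void **)chunk = page->free_head;
    page->free_head = chunk;
}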
It is my expectation that the header information actually holds more than one address, and what else it states probably also differs in a slub-allocator from what it stated in a 1st-generation slab-allocator. Since a slub-allocator has greater complexity, it might even hold its meta-data in special pages.
(Edit 06/30/2017 : )
According to information which I’ve read, it’s possible for an individual slab allocator to have a smallest slab-class 1, with chunks smaller than 96 bytes. Only, the ‘memcached’ version on my computer reserves 96 bytes as the chunk-size of this slab-class, because ‘memcached’ specifically requires that 48 bytes be used to store the key – for use in its hash-table. And then it starts to become efficient to store an additional 48 bytes as the value of the key.
Dirk