This post is a personal “pot-pourri” of some researches I have made on main memory (RAM) and OS initialization steps …
- CPU real mode
- CPU protected mode
- Kernel memory management
- Process memory management
- ELF extension
- Linux booting steps (for main memory)
CPU real mode
Most CPUs (Intel) start in “real mode”.
This mode can address 20 bits of memory.
Thus, a total of 1MB is available: from 00000H to FFFFFH.
This 1 MB is also known as “conventional memory” (it was considered as a large amount back at that time).
Memory is accessed by putting the right values in the right registers:
- One value for the segment into one “segment register” such as CS or SS or DS, …
- One value for the “offset” into the appropriate offset register (IP or SP or SI, …)
CS = Code Segment = Made to contain the instructions of the program
SS = Stack Segment = Made to keep track of the subroutine calls
DS = Data Segment = Made to contain static variables and global variables
Hence, the combination of a segment selector value and an offset value points to a specific address in RAM.
MS-DOS OS runs in real mode. Real mode has different limitations:
→ No memory protection, any address memory is accessible
→ No multitasking
→ No code privilege
→ Memory isn’t that big … (only 1 MB)
CPU protected mode
Protected mode enables:
- Multitasking thanks to segment protections
- More memory as it uses 32 bits
- Segments of different sizes (could be more than 64KB)
CPUs start in real mode but then enable protected mode by setting up a Global Descriptor Table (GDT) in the RAM and then turning the “protection enable” (PE) bit to 1 in the CR0 register.
GDTR is a register pointing to the beginning of the GDT table in the RAM.
The GDT which is setup in the RAM must have at least 3 entries:
→ a null descriptor
→ a code segment descriptor
→ a data segment descriptor
Each GDT entry (called descriptor) is 8 Bytes:
An entry in the GDT table contains informations on the segment such as:
→ segment base address
→ segment limit
→ access right (DPL value: 0 to 3)
→ descriptor type (read / execute only ?)
In protected mode, segment registers (such as CS, SS, DS, …) now point to an entry in the GDT.
Figure 3 represents what a register such as CS could contain under protected mode.
As you can see the privilege is stored in RPL.
This can be used to distinguish kernel and user code (0 and 3).
Therefore if a CS segment has a RPL value of 3 (user) and is trying to reach a segment described as 0 it would cause an error / exception as 3 being not enough permission (user space) to access kernel (0) segment.
x86 CPU protected mode allows the creation of segments with different sizes.
However Linux doesn’t use memory segmentation (considered as ‘legacy’).
Linux uses a “flat memory” model.
Indeed, in the Linux kernel code, all segments (KERNEL_CS, KERNEL_DS, USER_CS, USER_DS) overlap each other and have the same unlimited size !
Once flat addressing has been set up (as a ‘dumb’ and ‘useless’ segmentation), the Linux kernel will then build a ‘paging’ system on top to create and manage the virtual memory (for both kernel and user space).
Linux uses paging to create and manage a virtual memory.
To enable paging, the OS will first need to set up some structures:
- A page directory (PD)
- A page table (PT)
The page directory (PD) contains entries for each page table along some metadata:
A page directory consists of 1024 * 32-bit entries.
Each entry holds the address of a page table.
There is only one page directory in the computer / kernel.
Page table is an array of 1024 * 32-bit entries (conveniently fitting into a single 4KB page).
Each entry points to the physical address of a page (page is aka ‘frame’ for the hardware).
When a process is created, the Linux kernel gives it a page table (hence an entry in the page directory is created). The process will then be able to allocate memory with syscalls such as “malloc” which will create page entries in the page table.
Paging tables and directory illustrations
With paging enabled, each linear address will be translated thanks to PD, then PT, then Offset.
CR3 is a register holding the base address of the page directory (PD). This register never change and is used by the hardware to translate addresses.
Paging is also a hardware feature as special components are involved in the ‘address resolution’.
Indeed, once the page directory has been created by the OS and made available to the CPU thanks to registers (CR3 for instance). The hardware understands how paging work and can handle address translation thanks to the Memory Management Unit (MMU).
The Translation Look-aside Buffer (TLB) is a cache used to improve speed.
We can now understand why main memory operations are relatively slow for the CPU.
Number of CPU cycles for different accesses:
1 cycle to read a register
4 cycles to reach to L1 cache
10 cycles to reach L2 cache
75 cycles to reach L3 cache
100+ cycles to reach main memory.
Kernel memory management
The C standard library has a lot of functions related to memory such as “malloc, realloc, calloc, free”.
The Linux C kernel code is self contained and does not rely on the C standard library.
In other words, the kernel doesn’t use malloc, realloc, calloc.
→ So how does the kernel allocate memory (for itself) ? 🤔
The kernel contains implementations of many of the C library’s functions, or variants (for example, printk instead of printf); but they don’t all follow the standard exactly. In some circumstances, the implementations of C library functions in the compiler are used instead.
Interesting functions created and available for the kernel:
__get_free_page() = returns one free page (low level kernel function)
__get_free_pages() = give a number of consecutive pages (low level kernel function)
→ Can the Linux kernel use a ‘heap’ 🤔
With pure C, without any library, only automatic and static variables are available; no heap allocation.
However, heap allocation is still possible in the kernel space thanks to ‘kmalloc’:
kmalloc is the normal method of allocating memory for objects smaller than page size in the kernel.
The maximum size allocatable by kmalloc() is 1024 pages, or 4MB on x86.
I don’t know why but I guess there is a heap in the kernel space (the one used with kmalloc) in Figure 11 above.
kmalloc can be called with different parameters:
Process memory management
If a program uses recursion without any condition check, it could “overflow” the stack and cause the famous “stack overflow” error.
To avoid that, the developer has to check that:
→ Recursive functions are not “infinite”
→ Function uses “Tail recursion” (the recursive call is the last element executed by the function).
A program is stored as a binary file with a .elf extension (for Executable and Linkable Format) or .exe
ELF is generally the output of a compiler.
ELF binary is divided between:
- ELF header
- File data
The readelf command shows the details of an ELF file.
Before execution, the Kernel needs to load and to parse the binary. The function load_elf_binary is called, this function checks if the ELF header is valid, and it also sets several pointers: code, brk and stack.
Linux memory booting steps
When the kernel gets the control:
- It creates a GDT table with the 3 base entries overlapping (in order to get linear = physical address for flat addressing)
- Saves the GDT memory location into GDTR register
- Loads an IDT table in memory (for interrupts)
- Saves the IDT memory location into IDTR register
- Enables protected mode (it will use the flat addressing GDT table)
- Create necessary structures for paging: directory and table pages
How is paging enabled at boot time ?
During the boot process, the kernel builds page tables before enabling virtual memory in the CPU. The initial paging tables map two virtual ranges onto the first 8 megabytes of physical memory. The first range is called the identity mapping and it maps virtual addresses 0–8MB (not sure on the 8MB, but something like that) onto physical addresses 0–8MB. The second virtual range maps another ~8MB starting at virtual address 0xc0000000 onto the 0–8MB of physical memory.
That way, immediately after the paging is enabled, things are still working OK thanks to the identity mapping. The kernel can then jump to the 0xc0000000 addresses and it will also work thanks to the second range. It then calls a function called ‘zap_low_mappings’ or something like that to blow away the identity mapping. Everything is up and running as far as paging goes by then.
I hope you enjoyed this post, I do not guarantee it to be error free.
Thus, if you detect any error, don’t hesitate to let me know, it would be greatly appreciated.
You can also follow me on twitter and check out the references below if you want to dive deeper into the subject.
Modes of Memory Addressing on x86
Whole OS booting steps
Multisource and pretty complete post on paging
Assembly section explained
Processes and Kernel