Symmetric Multi-Processing — The Linux Kernel documentation (2024)

View slides

Lecture objectives:¶

Kernel Concurrency
Atomic operations
Spin locks
Cache thrashing
Optimized spin locks
Process and Interrupt Context Synchronization
Mutexes
Per CPU data
Memory Ordering and Barriers
Read-Copy Update

Synchronization basics¶

Because the Linux kernel supports symmetric multi-processing (SMP) itmust use a set of synchronization mechanisms to achieve predictableresults, free of race conditions.

Note

We will use the terms core, CPU and processor asinterchangeable for the purpose of this lecture.

Race conditions can occur when the following two conditions happensimultaneously:

there are at least two execution contexts that run in "parallel":
- truly run in parallel (e.g. two system calls running ondifferent processors)
- one of the contexts can arbitrary preempt the other (e.g. aninterrupt preempts a system call)
the execution contexts perform read-write accesses to sharedmemory

Race conditions can lead to erroneous results that are hard to debug,because they manifest only when the execution contexts are scheduledon the CPU cores in a very specific order.

A classical race condition example is an incorrect implementation fora release operation of a resource counter:

void release_resource(){ counter--; if (!counter) free_resource();}

A resource counter is used to keep a shared resource available untilthe last user releases it but the above implementation has a racecondition that can cause freeing the resource twice:

In most cases the release_resource() function will only free theresource once. However, in the scenario above, if thread A ispreempted right after decrementing counter and thread B callsrelease_resource() it will cause the resource to be freed. Whenresumed, thread A will also free the resource since the counter valueis 0.

To avoid race conditions the programmer must first identify thecritical section that can generate a race condition. The criticalsection is the part of the code that reads and writes shared memoryfrom multiple parallel contexts.

In the example above, the minimal critical section is starting withthe counter decrement and ending with checking the counter's value.

Once the critical section has been identified race conditions can beavoided by using one of the following approaches:

make the critical section atomic (e.g. use atomicinstructions)
disable preemption during the critical section (e.g. disableinterrupts, bottom-half handlers, or thread preemption)
serialize the access to the critical section (e.g. use spinlocks or mutexes to allow only one context or thread in thecritical section)

Linux kernel concurrency sources¶

There are multiple source of concurrency in the Linux kernel thatdepend on the kernel configuration as well as the type of system itruns on:

single core systems, non-preemptive kernel: the currentprocess can be preempted by interrupts
single core systems, preemptive kernel: above + thecurrent process can be preempted by other processes
multi-core systems: above + the current process can runin parallel with another process or with an interrupt running onanother processor

Note

We only discuss kernel concurrency and that is why anon-preemptive kernel running on an single core systemhas interrupts as the only source of concurrency.

Atomic operations¶

In certain circ*mstances we can avoid race conditions by using atomicoperations that are provided by hardware. Linux provides a unified APIto access atomic operations:

integer based:
- simple: atomic_inc(), atomic_dec(),atomic_add(), atomic_sub()
- conditional: atomic_dec_and_test(), atomic_sub_and_test()
bit based:
- simple: test_bit(), set_bit(),change_bit()
- conditional: test_and_set_bit(), test_and_clear_bit(),test_and_change_bit()

For example, we could use atomic_dec_and_test() to implementthe resource counter decrement and value checking atomic:

void release_resource(){ if (atomic_dec_and_test(&counter)) free_resource();}

One complication with atomic operations is encountered inmulti-core systems, where an atomic operation is not longeratomic at the system level (but still atomic at the core level).

To understand why, we need to decompose the atomic operation in memoryloads and stores. Then we can construct race condition scenarios wherethe load and store operations are interleaved across CPUs, like in theexample below where incrementing a value from two processors willproduce an unexpected result:

In order to provide atomic operations on SMP systems differentarchitectures use different techniques. For example, on x86 a LOCKprefix is used to lock the system bus while executing the prefixedoperation:

Disabling preemption (interrupts)¶

On single core systems and non preemptive kernels the only source ofconcurrency is the preemption of the current thread by aninterrupt. To prevent concurrency is thus sufficient to disableinterrupts.

This is done with architecture specific instructions, but Linux offersarchitecture independent APIs to disable and enable interrupts:

 #define local_irq_disable() \ asm volatile („cli” : : : „memory”)#define local_irq_enable() \ asm volatile („sti” : : : „memory”)#define local_irq_save(flags) \ asm volatile ("pushf ; pop %0" :"=g" (flags) : /* no input */: "memory") \ asm volatile("cli": : :"memory")#define local_irq_restore(flags) \ asm volatile ("push %0 ; popf" : /* no output */ : "g" (flags) :"memory", "cc");

Although the interrupts can be explicitly disabled and enable withlocal_irq_disable() and local_irq_enable() these APIsshould only be used when the current state and interrupts isknown. They are usually used in core kernel code (like interrupthandling).

For typical cases where we want to avoid interrupts due to concurrencyissues it is recommended to use the local_irq_save() andlocal_irq_restore() variants. They take care of saving andrestoring the interrupts states so they can be freely called fromoverlapping critical sections without the risk of accidentallyenabling interrupts while still in a critical section, as long as thecalls are balanced.

Spin Locks¶

Spin locks are used to serialize access to a critical section. Theyare necessary on multi-core systems where we can have true executionparallelism. This is a typical spin lock implementation:

spin_lock: lock bts [my_lock], 0 jc spin_lock/* critical section */spin_unlock: mov [my_lock], 0

bts dts, src - bit test and set; it copies the src bit from the dtsmemory address to the carry flag and then sets it:

CF <- dts[src]dts[src] <- 1

As it can be seen, the spin lock uses an atomic instruction to makesure that only one core can enter the critical section. If there aremultiple cores trying to enter they will continuously "spin" until thelock is released.

While the spin lock avoids race conditions, it can have a significantimpact on the system's performance due to "lock contention":

There is lock contention when at least one core spins trying toenter the critical section lock
Lock contention grows with the critical section size, time spentin the critical section and the number of cores in the system

Another negative side effect of spin locks is cache thrashing.

Cache thrashing occurs when multiple cores are trying to read andwrite to the same memory resulting in excessive cache misses.

Since spin locks continuously access memory during lock contention,cache thrashing is a common occurrence due to the way cachecoherency is implemented.

Cache coherency in multi-processor systems¶

The memory hierarchy in multi-processor systems is composed of localCPU caches (L1 caches), shared CPU caches (L2 caches) and the mainmemory. To explain cache coherency we will ignore the L2 cache andonly consider the L1 caches and main memory.

In the figure below we present a view of the memory hierarchy with twovariables A and B that fall into different cache lines and wherecaches and the main memory are synchronized:

In the absence of a synchronization mechanism between the caches andmain memory, when CPU 0 executes A = A + B and CPU 1 executes B =A + B we will have the following memory view:

In order to avoid the situation above multi-processor systems usecache coherency protocols. There are two main types of cache coherencyprotocols:

Bus snooping (sniffing) based: memory bus transactions aremonitored by caches and they take actions to preservecoherency
Directory based: there is a separate entity (directory) thatmaintains the state of caches; caches interact with directoryto preserve coherency

Bus snooping is simpler but it performs poorly when the number ofcores goes beyond 32-64.

Directory based cache coherence protocols scale much better (upto thousands of cores) and are usually used in NUMA systems.

A simple cache coherency protocol that is commonly used in practice isMESI (named after the acronym of the cache line states names:Modified, Exclusive, Shared and Invalid). It's maincharacteristics are:

Caching policy: write back
Cache line states
- Modified: owned by a single core and dirty
- Exclusive: owned by a single core and clean
- Shared: shared between multiple cores and clean
- Invalid : the line is not cached

Issuing read or write requests from CPU cores will trigger statetransitions, as exemplified below:

Invalid -> Exclusive: read request, all other cores have the linein Invalid; line loaded from memory
Invalid -> Shared: read request, at least one core has the linein Shared or Exclusive; line loaded from sibling cache
Invalid/Shared/Exclusive -> Modified: write request; allother cores invalidate the line
Modified -> Invalid: write request from other core; line isflushed to memory

Note

The most important characteristic of the MESI protocol isthat it is a write-invalidate cache protocol. When writing to ashared location all other caches are invalidated.

Optimized spin locks¶

As we have seen simple spin lock implementations can have poorperformance issues due to cache thrashing, especially as the number ofcores increase. To avoid this issue there are two possible strategies:

reduce the number of writes and thus reduce the number of cacheinvalidate operations
avoid the other processors spinning on the same cache line, and thusavoid the cache invalidate operations

An optimized spin lock implementation that uses the first approach ispresented below:

spin_lock: rep ; nop test lock_addr, 1 jnz spin_lock lock bts lock_addr jc spin_lock

we first test the lock read only, using a non atomicinstructions, to avoid writes and thus invalidate operationswhile we spin
only when the lock might be free, we try to acquire it

The implementation also use the PAUSE instruction to avoidpipeline flushes due to (false positive) memory order violations andto add a small delay (proportional with the memory bus frequency) toreduce power consumption.

A similar implementation with support for fairness (the CPU cores areallowed in the critical section based on the time of arrival) is usedin the Linux kernel (the ticket spin lock)for many architectures.

However, for the x86 architecture, the current spin lockimplementation uses a queued spin lock where the CPU cores spin ondifferent locks (hopefully distributed in different cache lines) toavoid cache invalidation operations:

Conceptually, when a new CPU core tries to acquire the lock and itfails it will add its private lock to the list of waiting CPUcores. When the lock owner exits the critical section it unlocks thenext lock in the list, if any.

While a read spin optimized spin lock reduces most of the cacheinvalidation operations, the lock owner can still generate cacheinvalidate operations due to writes to data structures close to thelock and thus part of the same cache line. This in turn generatesmemory traffic on subsequent reads on the spinning cores.

Hence, queued spin locks scale much better for large number of coresas is the case for NUMA systems. And since they have similar fairnessproperties as the ticket lock it is the preferred implementation onthe x86 architecture.

Process and Interrupt Context Synchronization¶

Accessing shared data from both process and interrupt context is arelatively common scenario. On single core systems we can do this bydisabling interrupts, but that won't work on multi-core systems,as we can have the process running on one CPU core and the interruptcontext running on a different CPU core.

Using a spin lock, which was designed for multi-processor systems,seems like the right solution, but doing so can cause commondeadlock conditions, as detailed by the following scenario:

In the process context we take the spin lock
An interrupt occurs and it is scheduled on the same CPU core
The interrupt handler runs and tries to take the spin lock
The current CPU will deadlock

To avoid this issue a two fold approach is used:

In process context: disable interrupts and acquire a spin lock;this will protect both against interrupt or other CPU cores raceconditions (spin_lock_irqsave() andspin_lock_restore() combine the two operations)
In interrupt context: take a spin lock; this will will protectagainst race conditions with other interrupt handlers or processcontext running on different processors

We have the same issue for other interrupt context handlers such assoftirqs, tasklets or timers and while disabling interrupts mightwork, it is recommended to use dedicated APIs:

In process context use spin_lock_bh() (which combineslocal_bh_disable() and spin_lock()) andspin_unlock_bh() (which combines spin_unlock() andlocal_bh_enable())
In bottom half context use: spin_lock() andspin_unlock() (or spin_lock_irqsave() andspin_lock_irqrestore() if sharing data with interrupthandlers)

As mentioned before, another source of concurrency in the Linux kernelcan be other processes, due to preemption.

Preemption is configurable: when active it provides better latencyand response time, while when deactivated it provides betterthroughput.

Preemption is disabled by spin locks and mutexes but it can bemanually disabled as well (by core kernel code).

As for local interrupt enabling and disabling APIs, the bottom halfand preemption APIs allows them to be used in overlapping criticalsections. A counter is used to track the state of bottom half andpreemption. In fact the same counter is used, with different incrementvalues:

#define PREEMPT_BITS 8#define SOFTIRQ_BITS 8#define HARDIRQ_BITS 4#define NMI_BITS 1#define preempt_disable() preempt_count_inc()#define local_bh_disable() add_preempt_count(SOFTIRQ_OFFSET)#define local_bh_enable() sub_preempt_count(SOFTIRQ_OFFSET)#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))#define in_interrupt() irq_count()asmlinkage void do_softirq(void){ if (in_interrupt()) return; ...

Mutexes¶

Mutexes are used to protect against race conditions from other CPUcores but they can only be used in process context. As opposed tospin locks, while a thread is waiting to enter the critical section itwill not use CPU time, but instead it will be added to a waiting queueuntil the critical section is vacated.

Since mutexes and spin locks usage intersect, it is useful to comparethe two:

They don't "waste" CPU cycles; system throughput is better thanspin locks if context switch overhead is lower than mediumspinning time
They can't be used in interrupt context
They have a higher latency than spin locks

Conceptually, the mutex_lock() operation is relatively simple:if the mutex is not acquired we can take the fast path via an atomicexchange operation:

void __sched mutex_lock(struct mutex *lock){ might_sleep(); if (!__mutex_trylock_fast(lock)) __mutex_lock_slowpath(lock);}static __always_inline bool __mutex_trylock_fast(struct mutex *lock){ unsigned long curr = (unsigned long)current; if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr)) return true; return false;}

otherwise we take the slow path where we add ourselves to the mutexwaiting list and put ourselves to sleep:

... spin_lock(&lock->wait_lock);... /* add waiting tasks to the end of the waitqueue (FIFO): */ list_add_tail(&waiter.list, &lock->wait_list);... waiter.task = current;... for (;;) { if (__mutex_trylock(lock)) goto acquired; ... spin_unlock(&lock->wait_lock); ... set_current_state(state); spin_lock(&lock->wait_lock); } spin_lock(&lock->wait_lock);acquired: __set_current_state(TASK_RUNNING); mutex_remove_waiter(lock, &waiter, current); spin_lock(&lock->wait_lock);...

The full implementation is a bit more complex: instead of going tosleep immediately it optimistic spinning if it detects that the lockowner is currently running on a different CPU as chances are the ownerwill release the lock soon. It also checks for signals and handlesmutex debugging for locking dependency engine debug feature.

The mutex_unlock() operation is symmetric: if there are nowaiters on the mutex then we can take the fast path via an atomic exchangeoperation:

void __sched mutex_unlock(struct mutex *lock){ if (__mutex_unlock_fast(lock)) return; __mutex_unlock_slowpath(lock, _RET_IP_);}static __always_inline bool __mutex_unlock_fast(struct mutex *lock){ unsigned long curr = (unsigned long)current; if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr) return true; return false;}void __mutex_lock_slowpath(struct mutex *lock){... if (__mutex_waiter_is_first(lock, &waiter)) __mutex_set_flag(lock, MUTEX_FLAG_WAITERS);...

Note

Because struct task_struct is cached aligned the 7lower bits of the owner field can be used for various flags,such as MUTEX_FLAG_WAITERS.

Otherwise we take the slow path where we pick up first waiter from thelist and wake it up:

...spin_lock(&lock->wait_lock);if (!list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter; waiter = list_first_entry(&lock->wait_list, struct mutex_waiter, list); next = waiter->task; wake_q_add(&wake_q, next);}...spin_unlock(&lock->wait_lock);...wake_up_q(&wake_q);

Per CPU data¶

Per CPU data avoids race conditions by avoiding to use shareddata. Instead, an array sized to the maximum possible CPU cores isused and each core will use its own array entry to read and writedata. This approach certainly has advantages:

No need to synchronize to access the data
No contention, no performance impact
Well suited for distributed processing where aggregation is onlyseldom necessary (e.g. statistics counters)

Memory Ordering and Barriers¶

Modern processors and compilers employ out-of-order execution toimprove performance. For example, processors can execute "future"instructions while waiting for current instruction data to be fetchedfrom memory.

Here is an example of out of order compiler generated code:

C code	Compiler generated code
a = 1;b = 2;	MOV R10, 1MOV R11, 2STORE R11, bSTORE R10, a

Note

When executing instructions out of order the processor makessure that data dependency is observed, i.e. it won't executeinstructions whose input depend on the output of a previousinstruction that has not been executed.

In most cases out of order execution is not an issue. However, incertain situations (e.g. communicating via shared memory betweenprocessors or between processors and hardware) we must issue someinstructions before others even without data dependency between them.

For this purpose we can use barriers to order memory operations:

A read barrier (rmb(), smp_rmb()) is used tomake sure that no read operation crosses the barrier; that is,all read operation before the barrier are complete beforeexecuting the first instruction after the barrier
A write barrier (wmb(), smp_wmb()) is used tomake sure that no write operation crosses the barrier
A simple barrier (mb(), smp_mb()) is usedto make sure that no write or read operation crosses the barrier

Read Copy Update (RCU)¶

Read Copy Update is a special synchronization mechanism similar withread-write locks but with significant improvements over it (and somelimitations):

Read-only lock-less access at the same time with write access
Write accesses still requires locks in order to avoid racesbetween writers
Requires unidirectional traversal by readers

In fact, the read-write locks in the Linux kernel have been deprecatedand then removed, in favor of RCU.

Implementing RCU for a new data structure is difficult, but a fewcommon data structures (lists, queues, trees) do have RCU APIs thatcan be used.

RCU splits removal updates to the data structures in two phases:

Removal: removes references to elements. Some old readers maystill see the old reference so we can't free the element.
Elimination: free the element. This action is postponed untilall existing readers finish traversal (quiescent cycle). Newreaders won't affect the quiescent cycle.

As an example, lets take a look on how to delete an element from alist using RCU:

In the first step it can be seen that while readers traverse the listall elements are referenced. In step two a writer removeselement B. Reclamation is postponed since there are still readers thathold references to it. In step three a quiescent cycle just expiredand it can be noticed that there are no more references toelement B. Other elements still have references from readers thatstarted the list traversal after the element was removed. In step 4 wefinally perform reclamation (free the element).

Now that we covered how RCU functions at the high level, lets looks atthe APIs for traversing the list as well as adding and removing anelement to the list:

/* list traversal */rcu_read_lock();list_for_each_entry_rcu(i, head) { /* no sleeping, blocking calls or context switch allowed */}rcu_read_unlock();/* list element delete */spin_lock(&lock);list_del_rcu(&node->list);spin_unlock(&lock);synchronize_rcu();kfree(node);/* list element add */spin_lock(&lock);list_add_rcu(head, &node->list);spin_unlock(&lock);

Symmetric Multi-Processing — The Linux Kernel documentation (2024)

FAQs

What is symmetric multiprocessing in Linux? ›

SMP (symmetric multiprocessing) is computer processing done by multiple processors that share a common operating system (OS) and memory. In symmetric multiprocessing, the processors share the same input/output (I/O) bus or data path. A single copy of the OS is in charge of all the processors.

Get More Info Here ›

What are the chief characteristics of an SMP? ›

Key characteristics of symmetric multiprocessing

Here are some key attributes: Uniform Memory Access: In SMP, every processor can access any memory location directly. This enables equal opportunity for all processors to fetch and store data. Processor Equality: All processors in an SMP system are peers.

Explore More ›

What is SMP in operating system? ›

Symmetric multiprocessing or shared-memory multiprocessing (SMP) involves a multiprocessor computer hardware and software architecture where two or more identical processors are connected to a single, shared main memory, have full access to all input and output devices, and are controlled by a single operating system ...

Show Me More ›

What is the full form of SMP? ›

SMP (Symmetric Multiprocessing)

Learn More ›

How does multiprocessing work in Linux? ›

Linux is a multiprocessing operating system, its objective is to have a process running on each CPU in the system at all times, to maximize CPU utilization. If there are more processes than CPUs (and there usually are), the rest of the processes must wait before a CPU becomes free until they can be run.

Discover More Details ›

What is multiprocessing with an example? ›

Multiprocessing is the ability for computers to complete multiple tasks at the same time without having to wait for one task to complete before the next task can be started. A dual-core processor is twice as fast as a single processor, and a quad-core processor is four times as fast.

Discover More Details ›

What are some of the key OS design issues for an SMP? ›

What are some of the key OS design issues for an SMP ?

Simultaneous concurrent processes.
Scheduling.
Synchronization.
Memory management.
Reliability and fault tolerance.

What is the key design issue with symmetric multiprocessor operating systems? ›

Characteristics of SMP

Complexity: Are complex in design, as all units share same memory and data bus. Expensive: They are costlier in nature. Unlike asymmetric where a task is done only by Master processor, here tasks of the operating system are handled individually by processors.

Learn More ›

What is the difference between SMP and NUMA? ›

NUMA is similar to SMP, in which multiple CPUs share a single memory. However, in SMP, all CPUs access a common memory at the same speed. In NUMA, memory on the same processor board as the CPU (local memory) is accessed faster than memory on other processor boards (shared memory), hence the "non-uniform" nomenclature.

Read On ›

What is SMP in the kernel? ›

Symmetric multiprocessing (also known as SMP) makes it possible for a computer system to have more than one CPU sharing all system resources. This means that, unlike a uniprocessor system, an SMP system may actually have more than one process running at the same time.

Find Out More ›

What does SMP stand for Linux? ›

Linux developer. · Updated 4y. SMP is an acronym for Symmetric Multi Processing (meaning multiple CPUs of the same kind - pretty much.)

Explore More ›

What is SMP used for? ›

Smp treatment is a non-invasive hair micropigmentation process and an effective and real alternative to hair trasplant or implants. Smp hair procedure is useful to improve hair transplants, to add density, to add definition to hairline, to camouflage scars; for alopecia, baldness and women's hair loss.

Get More Info ›

What is SMP slang for? ›

SMP is an acronym for the term sexual marketplace, and is often used in male-dominated conversations, or in the “manosphere.”

Discover More Details ›

What is the SMP protocol? ›

Session Multiplex Protocol (SMP) is an application protocol that facilitates session management by providing a mechanism to create multiple lightweight communication channels (sessions) over a lower-layer transport connection.

Find Out More ›

What is an SMP connector? ›

Sub miniature push-on (SMP) Connectors are plug-in connectors for applications up to 40 GHz. These connectors are easy to connect and disconnect as they can be pushed against each other to make a connection and pulled apart to open the connection. They do not need to be screwed on like most other connectors.

What is symmetric and asymmetric multiprocessing? ›

Definition. Symmetric multiprocessing occurs when many processors work together to process programs using the same OS and memory. The processing of programs by several processors in a master-slave arrangement is known as asymmetric multiprocessing. Basic. Each CPU executes the OS operations.

What is the difference between SMP and MMP? ›

The main difference between SMP and MPP is the system design. In an SMP system, each processor shares the same resources. In an MPP system, each processor has its own dedicated resources and shares nothing.

Explore More ›

What is Linux SMP? ›

Symmetrical multiprocessing (SMP ) denotes a multiprocessor architecture in which no CPU is selected as the Master CPU, but rather all of them cooperate on an equal basis, hence the name "symmetrical." As usual, we shall focus on Intel SMP architectures.

Get More Info Here ›

What are the two types of multiprocessing? ›

Answer A is correct; multiprocessing systems run multiple programs or processes per CPU. Two types are Symmetric Multiprocessing (SMP) and Asymmetric Multiprocessing (AMP).

Learn More Now ›