View slides
Lecture objectives:¶
- Kernel Concurrency
- Atomic operations
- Spin locks
- Cache thrashing
- Optimized spin locks
- Process and Interrupt Context Synchronization
- Mutexes
- Per CPU data
- Memory Ordering and Barriers
- Read-Copy Update
Synchronization basics¶
Because the Linux kernel supports symmetric multi-processing (SMP) itmust use a set of synchronization mechanisms to achieve predictableresults, free of race conditions.
Note
We will use the terms core, CPU and processor asinterchangeable for the purpose of this lecture.
Race conditions can occur when the following two conditions happensimultaneously:
- there are at least two execution contexts that run in "parallel":
- truly run in parallel (e.g. two system calls running ondifferent processors)
- one of the contexts can arbitrary preempt the other (e.g. aninterrupt preempts a system call)
- the execution contexts perform read-write accesses to sharedmemory
Race conditions can lead to erroneous results that are hard to debug,because they manifest only when the execution contexts are scheduledon the CPU cores in a very specific order.
A classical race condition example is an incorrect implementation fora release operation of a resource counter:
void release_resource(){ counter--; if (!counter) free_resource();}
A resource counter is used to keep a shared resource available untilthe last user releases it but the above implementation has a racecondition that can cause freeing the resource twice:
In most cases the release_resource() function will only free theresource once. However, in the scenario above, if thread A ispreempted right after decrementing counter and thread B callsrelease_resource() it will cause the resource to be freed. Whenresumed, thread A will also free the resource since the counter valueis 0.
To avoid race conditions the programmer must first identify thecritical section that can generate a race condition. The criticalsection is the part of the code that reads and writes shared memoryfrom multiple parallel contexts.
In the example above, the minimal critical section is starting withthe counter decrement and ending with checking the counter's value.
Once the critical section has been identified race conditions can beavoided by using one of the following approaches:
- make the critical section atomic (e.g. use atomicinstructions)
- disable preemption during the critical section (e.g. disableinterrupts, bottom-half handlers, or thread preemption)
- serialize the access to the critical section (e.g. use spinlocks or mutexes to allow only one context or thread in thecritical section)
Linux kernel concurrency sources¶
There are multiple source of concurrency in the Linux kernel thatdepend on the kernel configuration as well as the type of system itruns on:
- single core systems, non-preemptive kernel: the currentprocess can be preempted by interrupts
- single core systems, preemptive kernel: above + thecurrent process can be preempted by other processes
- multi-core systems: above + the current process can runin parallel with another process or with an interrupt running onanother processor
Note
We only discuss kernel concurrency and that is why anon-preemptive kernel running on an single core systemhas interrupts as the only source of concurrency.
Atomic operations¶
In certain circ*mstances we can avoid race conditions by using atomicoperations that are provided by hardware. Linux provides a unified APIto access atomic operations:
- integer based:
- simple:
atomic_inc()
,atomic_dec()
,atomic_add()
,atomic_sub()
- conditional:
atomic_dec_and_test()
,atomic_sub_and_test()
- simple:
- bit based:
- simple:
test_bit()
,set_bit()
,change_bit()
- conditional:
test_and_set_bit()
,test_and_clear_bit()
,test_and_change_bit()
- simple:
For example, we could use atomic_dec_and_test()
to implementthe resource counter decrement and value checking atomic:
void release_resource(){ if (atomic_dec_and_test(&counter)) free_resource();}
One complication with atomic operations is encountered inmulti-core systems, where an atomic operation is not longeratomic at the system level (but still atomic at the core level).
To understand why, we need to decompose the atomic operation in memoryloads and stores. Then we can construct race condition scenarios wherethe load and store operations are interleaved across CPUs, like in theexample below where incrementing a value from two processors willproduce an unexpected result:
In order to provide atomic operations on SMP systems differentarchitectures use different techniques. For example, on x86 a LOCKprefix is used to lock the system bus while executing the prefixedoperation:
On ARM the LDREX and STREX instructions are used together to guaranteeatomic access: LDREX loads a value and signals the exclusive monitorthat an atomic operation is in progress. The STREX attempts to store anew value but only succeeds if the exclusive monitor has not detectedother exclusive operations. So, to implement atomic operations theprogrammer must retry the operation (both LDREX and STREX) until theexclusive monitor signals a success.
Although they are often interpreted as "light" or "efficient"synchronization mechanisms (because they "don't require spinning orcontext switches", or because they "are implemented in hardware sothey must be more efficient", or because they "are just instructionsso they must have similar efficiency as other instructions"), as seenfrom the implementation details, atomic operations are actuallyexpensive.
Disabling preemption (interrupts)¶
On single core systems and non preemptive kernels the only source ofconcurrency is the preemption of the current thread by aninterrupt. To prevent concurrency is thus sufficient to disableinterrupts.
This is done with architecture specific instructions, but Linux offersarchitecture independent APIs to disable and enable interrupts:
#define local_irq_disable() \ asm volatile („cli” : : : „memory”)#define local_irq_enable() \ asm volatile („sti” : : : „memory”)#define local_irq_save(flags) \ asm volatile ("pushf ; pop %0" :"=g" (flags) : /* no input */: "memory") \ asm volatile("cli": : :"memory")#define local_irq_restore(flags) \ asm volatile ("push %0 ; popf" : /* no output */ : "g" (flags) :"memory", "cc");
Although the interrupts can be explicitly disabled and enable withlocal_irq_disable()
and local_irq_enable()
these APIsshould only be used when the current state and interrupts isknown. They are usually used in core kernel code (like interrupthandling).
For typical cases where we want to avoid interrupts due to concurrencyissues it is recommended to use the local_irq_save()
andlocal_irq_restore()
variants. They take care of saving andrestoring the interrupts states so they can be freely called fromoverlapping critical sections without the risk of accidentallyenabling interrupts while still in a critical section, as long as thecalls are balanced.
Spin Locks¶
Spin locks are used to serialize access to a critical section. Theyare necessary on multi-core systems where we can have true executionparallelism. This is a typical spin lock implementation:
spin_lock: lock bts [my_lock], 0 jc spin_lock/* critical section */spin_unlock: mov [my_lock], 0
bts dts, src - bit test and set; it copies the src bit from the dtsmemory address to the carry flag and then sets it:
CF <- dts[src]dts[src] <- 1
As it can be seen, the spin lock uses an atomic instruction to makesure that only one core can enter the critical section. If there aremultiple cores trying to enter they will continuously "spin" until thelock is released.
While the spin lock avoids race conditions, it can have a significantimpact on the system's performance due to "lock contention":
- There is lock contention when at least one core spins trying toenter the critical section lock
- Lock contention grows with the critical section size, time spentin the critical section and the number of cores in the system
Another negative side effect of spin locks is cache thrashing.
Cache thrashing occurs when multiple cores are trying to read andwrite to the same memory resulting in excessive cache misses.
Since spin locks continuously access memory during lock contention,cache thrashing is a common occurrence due to the way cachecoherency is implemented.
Cache coherency in multi-processor systems¶
The memory hierarchy in multi-processor systems is composed of localCPU caches (L1 caches), shared CPU caches (L2 caches) and the mainmemory. To explain cache coherency we will ignore the L2 cache andonly consider the L1 caches and main memory.
In the figure below we present a view of the memory hierarchy with twovariables A and B that fall into different cache lines and wherecaches and the main memory are synchronized:
In the absence of a synchronization mechanism between the caches andmain memory, when CPU 0 executes A = A + B and CPU 1 executes B =A + B we will have the following memory view:
In order to avoid the situation above multi-processor systems usecache coherency protocols. There are two main types of cache coherencyprotocols:
- Bus snooping (sniffing) based: memory bus transactions aremonitored by caches and they take actions to preservecoherency
- Directory based: there is a separate entity (directory) thatmaintains the state of caches; caches interact with directoryto preserve coherency
Bus snooping is simpler but it performs poorly when the number ofcores goes beyond 32-64.
Directory based cache coherence protocols scale much better (upto thousands of cores) and are usually used in NUMA systems.
A simple cache coherency protocol that is commonly used in practice isMESI (named after the acronym of the cache line states names:Modified, Exclusive, Shared and Invalid). It's maincharacteristics are:
- Caching policy: write back
- Cache line states
- Modified: owned by a single core and dirty
- Exclusive: owned by a single core and clean
- Shared: shared between multiple cores and clean
- Invalid : the line is not cached
Issuing read or write requests from CPU cores will trigger statetransitions, as exemplified below:
- Invalid -> Exclusive: read request, all other cores have the linein Invalid; line loaded from memory
- Invalid -> Shared: read request, at least one core has the linein Shared or Exclusive; line loaded from sibling cache
- Invalid/Shared/Exclusive -> Modified: write request; allother cores invalidate the line
- Modified -> Invalid: write request from other core; line isflushed to memory
Note
The most important characteristic of the MESI protocol isthat it is a write-invalidate cache protocol. When writing to ashared location all other caches are invalidated.
This has important performance impact in certain access patterns, andone such pattern is contention for a simple spin lock implementationlike we discussed above.
To exemplify this issue lets consider a system with three CPU cores,where the first has acquired the spin lock and it is running thecritical section while the other two are spinning waiting to enter thecritical section:
As it can be seen from the figure above due to the writes issued bythe cores spinning on the lock we see frequent cache line invalidateoperations which means that basically the two waiting cores will flushand load the cache line while waiting for the lock, creatingunnecessary traffic on the memory bus and slowing down memory accessesfor the first core.
Another issue is that most likely data accessed by the first CPUduring the critical section is stored in the same cache line with thelock (common optimization to have the data ready in the cache afterthe lock is acquired). Which means that the cache invalidationtriggered by the two other spinning cores will slow down the executionof the critical section which in turn triggers more cache invalidateactions.
Optimized spin locks¶
As we have seen simple spin lock implementations can have poorperformance issues due to cache thrashing, especially as the number ofcores increase. To avoid this issue there are two possible strategies:
- reduce the number of writes and thus reduce the number of cacheinvalidate operations
- avoid the other processors spinning on the same cache line, and thusavoid the cache invalidate operations
An optimized spin lock implementation that uses the first approach ispresented below:
spin_lock: rep ; nop test lock_addr, 1 jnz spin_lock lock bts lock_addr jc spin_lock
- we first test the lock read only, using a non atomicinstructions, to avoid writes and thus invalidate operationswhile we spin
- only when the lock might be free, we try to acquire it
The implementation also use the PAUSE instruction to avoidpipeline flushes due to (false positive) memory order violations andto add a small delay (proportional with the memory bus frequency) toreduce power consumption.
A similar implementation with support for fairness (the CPU cores areallowed in the critical section based on the time of arrival) is usedin the Linux kernel (the ticket spin lock)for many architectures.
However, for the x86 architecture, the current spin lockimplementation uses a queued spin lock where the CPU cores spin ondifferent locks (hopefully distributed in different cache lines) toavoid cache invalidation operations:
Conceptually, when a new CPU core tries to acquire the lock and itfails it will add its private lock to the list of waiting CPUcores. When the lock owner exits the critical section it unlocks thenext lock in the list, if any.
While a read spin optimized spin lock reduces most of the cacheinvalidation operations, the lock owner can still generate cacheinvalidate operations due to writes to data structures close to thelock and thus part of the same cache line. This in turn generatesmemory traffic on subsequent reads on the spinning cores.
Hence, queued spin locks scale much better for large number of coresas is the case for NUMA systems. And since they have similar fairnessproperties as the ticket lock it is the preferred implementation onthe x86 architecture.
Process and Interrupt Context Synchronization¶
Accessing shared data from both process and interrupt context is arelatively common scenario. On single core systems we can do this bydisabling interrupts, but that won't work on multi-core systems,as we can have the process running on one CPU core and the interruptcontext running on a different CPU core.
Using a spin lock, which was designed for multi-processor systems,seems like the right solution, but doing so can cause commondeadlock conditions, as detailed by the following scenario:
- In the process context we take the spin lock
- An interrupt occurs and it is scheduled on the same CPU core
- The interrupt handler runs and tries to take the spin lock
- The current CPU will deadlock
To avoid this issue a two fold approach is used:
- In process context: disable interrupts and acquire a spin lock;this will protect both against interrupt or other CPU cores raceconditions (
spin_lock_irqsave()
andspin_lock_restore()
combine the two operations) - In interrupt context: take a spin lock; this will will protectagainst race conditions with other interrupt handlers or processcontext running on different processors
We have the same issue for other interrupt context handlers such assoftirqs, tasklets or timers and while disabling interrupts mightwork, it is recommended to use dedicated APIs:
- In process context use
spin_lock_bh()
(which combineslocal_bh_disable()
andspin_lock()
) andspin_unlock_bh()
(which combinesspin_unlock()
andlocal_bh_enable()
) - In bottom half context use:
spin_lock()
andspin_unlock()
(orspin_lock_irqsave()
andspin_lock_irqrestore()
if sharing data with interrupthandlers)
As mentioned before, another source of concurrency in the Linux kernelcan be other processes, due to preemption.
Preemption is configurable: when active it provides better latencyand response time, while when deactivated it provides betterthroughput.
Preemption is disabled by spin locks and mutexes but it can bemanually disabled as well (by core kernel code).
As for local interrupt enabling and disabling APIs, the bottom halfand preemption APIs allows them to be used in overlapping criticalsections. A counter is used to track the state of bottom half andpreemption. In fact the same counter is used, with different incrementvalues:
#define PREEMPT_BITS 8#define SOFTIRQ_BITS 8#define HARDIRQ_BITS 4#define NMI_BITS 1#define preempt_disable() preempt_count_inc()#define local_bh_disable() add_preempt_count(SOFTIRQ_OFFSET)#define local_bh_enable() sub_preempt_count(SOFTIRQ_OFFSET)#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))#define in_interrupt() irq_count()asmlinkage void do_softirq(void){ if (in_interrupt()) return; ...
Mutexes¶
Mutexes are used to protect against race conditions from other CPUcores but they can only be used in process context. As opposed tospin locks, while a thread is waiting to enter the critical section itwill not use CPU time, but instead it will be added to a waiting queueuntil the critical section is vacated.
Since mutexes and spin locks usage intersect, it is useful to comparethe two:
- They don't "waste" CPU cycles; system throughput is better thanspin locks if context switch overhead is lower than mediumspinning time
- They can't be used in interrupt context
- They have a higher latency than spin locks
Conceptually, the mutex_lock()
operation is relatively simple:if the mutex is not acquired we can take the fast path via an atomicexchange operation:
void __sched mutex_lock(struct mutex *lock){ might_sleep(); if (!__mutex_trylock_fast(lock)) __mutex_lock_slowpath(lock);}static __always_inline bool __mutex_trylock_fast(struct mutex *lock){ unsigned long curr = (unsigned long)current; if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr)) return true; return false;}
otherwise we take the slow path where we add ourselves to the mutexwaiting list and put ourselves to sleep:
... spin_lock(&lock->wait_lock);... /* add waiting tasks to the end of the waitqueue (FIFO): */ list_add_tail(&waiter.list, &lock->wait_list);... waiter.task = current;... for (;;) { if (__mutex_trylock(lock)) goto acquired; ... spin_unlock(&lock->wait_lock); ... set_current_state(state); spin_lock(&lock->wait_lock); } spin_lock(&lock->wait_lock);acquired: __set_current_state(TASK_RUNNING); mutex_remove_waiter(lock, &waiter, current); spin_lock(&lock->wait_lock);...
The full implementation is a bit more complex: instead of going tosleep immediately it optimistic spinning if it detects that the lockowner is currently running on a different CPU as chances are the ownerwill release the lock soon. It also checks for signals and handlesmutex debugging for locking dependency engine debug feature.
The mutex_unlock()
operation is symmetric: if there are nowaiters on the mutex then we can take the fast path via an atomic exchangeoperation:
void __sched mutex_unlock(struct mutex *lock){ if (__mutex_unlock_fast(lock)) return; __mutex_unlock_slowpath(lock, _RET_IP_);}static __always_inline bool __mutex_unlock_fast(struct mutex *lock){ unsigned long curr = (unsigned long)current; if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr) return true; return false;}void __mutex_lock_slowpath(struct mutex *lock){... if (__mutex_waiter_is_first(lock, &waiter)) __mutex_set_flag(lock, MUTEX_FLAG_WAITERS);...
Note
Because struct task_struct
is cached aligned the 7lower bits of the owner field can be used for various flags,such as MUTEX_FLAG_WAITERS
.
Otherwise we take the slow path where we pick up first waiter from thelist and wake it up:
...spin_lock(&lock->wait_lock);if (!list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter; waiter = list_first_entry(&lock->wait_list, struct mutex_waiter, list); next = waiter->task; wake_q_add(&wake_q, next);}...spin_unlock(&lock->wait_lock);...wake_up_q(&wake_q);
Per CPU data¶
Per CPU data avoids race conditions by avoiding to use shareddata. Instead, an array sized to the maximum possible CPU cores isused and each core will use its own array entry to read and writedata. This approach certainly has advantages:
- No need to synchronize to access the data
- No contention, no performance impact
- Well suited for distributed processing where aggregation is onlyseldom necessary (e.g. statistics counters)
Memory Ordering and Barriers¶
Modern processors and compilers employ out-of-order execution toimprove performance. For example, processors can execute "future"instructions while waiting for current instruction data to be fetchedfrom memory.
Here is an example of out of order compiler generated code:
C code | Compiler generated code |
a = 1;b = 2; | MOV R10, 1MOV R11, 2STORE R11, bSTORE R10, a |
Note
When executing instructions out of order the processor makessure that data dependency is observed, i.e. it won't executeinstructions whose input depend on the output of a previousinstruction that has not been executed.
In most cases out of order execution is not an issue. However, incertain situations (e.g. communicating via shared memory betweenprocessors or between processors and hardware) we must issue someinstructions before others even without data dependency between them.
For this purpose we can use barriers to order memory operations:
- A read barrier (
rmb()
,smp_rmb()
) is used tomake sure that no read operation crosses the barrier; that is,all read operation before the barrier are complete beforeexecuting the first instruction after the barrier - A write barrier (
wmb()
,smp_wmb()
) is used tomake sure that no write operation crosses the barrier - A simple barrier (
mb()
,smp_mb()
) is usedto make sure that no write or read operation crosses the barrier
Read Copy Update (RCU)¶
Read Copy Update is a special synchronization mechanism similar withread-write locks but with significant improvements over it (and somelimitations):
- Read-only lock-less access at the same time with write access
- Write accesses still requires locks in order to avoid racesbetween writers
- Requires unidirectional traversal by readers
In fact, the read-write locks in the Linux kernel have been deprecatedand then removed, in favor of RCU.
Implementing RCU for a new data structure is difficult, but a fewcommon data structures (lists, queues, trees) do have RCU APIs thatcan be used.
RCU splits removal updates to the data structures in two phases:
- Removal: removes references to elements. Some old readers maystill see the old reference so we can't free the element.
- Elimination: free the element. This action is postponed untilall existing readers finish traversal (quiescent cycle). Newreaders won't affect the quiescent cycle.
As an example, lets take a look on how to delete an element from alist using RCU:
In the first step it can be seen that while readers traverse the listall elements are referenced. In step two a writer removeselement B. Reclamation is postponed since there are still readers thathold references to it. In step three a quiescent cycle just expiredand it can be noticed that there are no more references toelement B. Other elements still have references from readers thatstarted the list traversal after the element was removed. In step 4 wefinally perform reclamation (free the element).
Now that we covered how RCU functions at the high level, lets looks atthe APIs for traversing the list as well as adding and removing anelement to the list:
/* list traversal */rcu_read_lock();list_for_each_entry_rcu(i, head) { /* no sleeping, blocking calls or context switch allowed */}rcu_read_unlock();/* list element delete */spin_lock(&lock);list_del_rcu(&node->list);spin_unlock(&lock);synchronize_rcu();kfree(node);/* list element add */spin_lock(&lock);list_add_rcu(head, &node->list);spin_unlock(&lock);