Improving Context Switching Performance of Idle Tasks Under Linux

The Benchmark

Linux advocates seem to think that Linux offers better performance than Windows.  Gregory Travis (greg@sherrill.kiva.net) set out to test this assumption.  He measured the times needed to manage processes and threads.  His benchmark was a simple program that timed basic operations such as fork() and sched_yield().
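
A minimal sketch of how such a timing program might look is shown below.  This is only an illustration of the general approach, not Travis's actual benchmark code: it times a batch of fork()/wait() pairs and a batch of sched_yield() calls with gettimeofday() and reports the average cost per operation.  (To measure switching among several contexts, several copies of the program would be run at once.)

/* timeops.c -- rough sketch of a micro-benchmark for fork() and
 * sched_yield(); illustrative only. */
#include <stdio.h>
#include <sched.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static double elapsed_us(struct timeval a, struct timeval b)
{
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}

int main(void)
{
        struct timeval start, stop;
        int i, n = 1000;

        /* Time process creation: fork a child that exits at once. */
        gettimeofday(&start, NULL);
        for (i = 0; i < n; i++) {
                pid_t pid = fork();
                if (pid == 0)
                        _exit(0);               /* child */
                waitpid(pid, NULL, 0);          /* parent reaps it */
        }
        gettimeofday(&stop, NULL);
        printf("fork():        %.1f us/op\n", elapsed_us(start, stop) / n);

        /* Time context switching: repeatedly yield the CPU.  With other
         * copies of this loop running, each call forces the scheduler
         * to pick another runnable context. */
        gettimeofday(&start, NULL);
        for (i = 0; i < n; i++)
                sched_yield();
        gettimeofday(&stop, NULL);
        printf("sched_yield(): %.3f us/op\n", elapsed_us(start, stop) / n);
        return 0;
}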

Recall that a process is a program in execution, and a thread is just a CPU state stored within a process.  A CPU context is either a process or a thread.  Most processes have only one thread, but some processes, particularly servers, have more.  Generally, a server with more than one thread uses each thread to handle one request, and uses multiple threads when more than one request is outstanding at the same time.  Programs other than servers sometimes use multiple threads as well, often to allow multiple CPUs to work on the same problem at the same time.  A third use of threads is to keep the user interface responsive in one thread while other threads of the same program handle computationally expensive or I/O-bound tasks.  Netscape's Navigator is an example of such a program.
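
As a sketch of the thread-per-request pattern described above (purely illustrative; handle_request() and the connection numbers are made up for the example), a server might do something like this:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical request handler; a real server would read the request
 * from the connection and write back a reply. */
static void handle_request(int connection)
{
        printf("handling request on connection %d\n", connection);
}

static void *worker(void *arg)
{
        handle_request((int)(long)arg);
        return NULL;
}

/* One thread per outstanding request: the threads all share the
 * server's memory and open files, so creating them is cheap. */
static void serve(int connection)
{
        pthread_t tid;

        pthread_create(&tid, NULL, worker, (void *)(long)connection);
        pthread_detach(tid);    /* nothing joins it; it cleans up itself */
}

int main(void)
{
        int conn;

        for (conn = 0; conn < 3; conn++)        /* pretend three requests arrived */
                serve(conn);
        sleep(1);                               /* give the workers time to finish */
        return 0;
}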

Process creation time reflects how long it takes to start up a program.  Process switch time describes how long it takes for the computer to switch between two concurrently executing programs.  Process clone time reflects how long it takes to make an identical copy of the current program, something done often under UNIX but almost never under Windows.  Thread create and switch times reflect how long those operations take.  All times were measured on a 200 MHz Pentium MMX.
 

Benchmark                 Operating System    Operation           Time per Op
Spawn New Process         NT                  spawnl()            12.0 ms
                          Linux               fork()/exec()        6.0 ms
Clone Current Process     NT                  NONE                 N/A
                          Linux               fork()               1.0 ms
Spawn New Thread          NT                  pthread_create()     0.9 ms
                          Linux               pthread_create()     0.3 ms
Switch Current Process    NT                  Sleep(0)             0.010 ms
                          Linux               sched_yield()        0.019 ms
Switch Current Thread     NT                  Sleep(0)             0.006 ms
                          Linux               sched_yield()        0.019 ms
 

Note that Linux can create a process twice as fast as NT can, and can create a thread three times as fast.  Process creation takes much longer than thread creation (roughly 12x for NT, 20x for Linux) because a process carries so much more overhead: memory maps and file descriptor tables are just two of the things that must be created for a process but not for a thread.
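
One way to see where that overhead lives is the clone() system call, on which Linux's thread library builds pthread_create().  The sketch below is illustrative only: the stack size and the exact flag set are assumptions, roughly what a pthread_create() implementation would pass.  The clone() call asks the kernel to share the address space, file descriptor table, and signal handlers with the new context instead of copying them, while the fork() beside it makes a full private copy.

#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* The clone()d child starts here, already sharing the parent's
 * memory map, open files and signal handlers. */
static int child_fn(void *arg)
{
        return 0;
}

int main(void)
{
        size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);       /* stack grows down on x86 */
        pid_t thread_like, process;

        if (stack == NULL)
                return 1;

        /* "Thread": share the expensive parts instead of copying them. */
        thread_like = clone(child_fn, stack + stack_size,
                            CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND,
                            NULL);

        /* "Process": the child gets private copies of all of the above. */
        process = fork();
        if (process == 0)
                _exit(0);

        waitpid(process, NULL, 0);              /* reap the fork()ed child */
        waitpid(thread_like, NULL, __WCLONE);   /* reap the clone()d child */
        free(stack);
        return 0;
}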

Also note that on this benchmark NT is much faster than Linux at context switching, whether between processes (1.9x) or between threads (3.2x).  Since context switches occur much more frequently than context creations, it would seem that Linux has a problem.  But does it?
 

Understanding the Problem

So why does this simple benchmark make the Linux context switching code look so much slower than Windows NT's?  To answer that, one must understand the Linux scheduler.

The scheduler keeps a list, called the run queue, of all contexts that are ready to run.  These contexts are not sorted in any way.  Each context also has a goodness, which is a measure of its current priority.  There are two loops within the scheduler.  The first loop (the search loop) finds the context with the highest goodness on the run queue and selects it to run.  The second loop (the recalc loop) recalculates the goodnesses when all runnable contexts have used their entire time slice; it runs only occasionally.  The code below shows the structure of the schedule() function.
 

Code from sched.c, heavily edited for clarity and simplicity
static inline int goodness(struct task_struct * p,  
               struct task_struct * prev, int this_cpu) 
{ 
        int weight;  
        weight = p->counter; 
        /* Code to adjust weight */ 
        return weight; 
} 

asmlinkage void schedule(void) 
{ 
        struct task_struct *prev, *next, *p; 
        int this_cpu = smp_processor_id(); 
        int c; 

        prev = current; 
        c = -1000; 
        next = idle_task; 
        p = init_task.next_run;         /* head of the run queue */ 

        /* The Search Loop: scan every runnable context and remember 
         * the one with the highest goodness. */ 
        while (p != &init_task) { 
                int weight = goodness(p, prev, this_cpu); 
                if (weight > c) 
                        c = weight, next = p; 
                p = p->next_run; 
        } 

        /* The Recalc Loop */ 
        /* if all runnable processes have "counter == 0", */ 
        /* re-calculate counters */ 
        if (!c) { 
                for_each_task(p) 
                        p->counter = (p->counter >> 1) + p->priority; 
        } 
}

The search loop needs a small bit of CPU time for every runnable process, and it needs it on every context switch.  The benchmark described above shows context switch times with 40 contexts present.  But that corresponds to a load average of 40 on a single-CPU system, or 20 on a dual-CPU system.  These are very large load averages, much larger than is normally considered healthy.  Generally, there will be only a few (perhaps one to three) contexts to choose from.  Most processes and threads, even very active ones, will be waiting for some I/O event to occur, and are therefore not on the run queue and not considered for selection.

Even very heavily loaded machines will generally have most contexts waiting for I/O to complete, and therefore off the run queue.  Consider the site http://www.insite.com, a very heavily loaded Internet site with about 200 copies of httpd running as web servers.  The total number of processes, for all purposes, is about 420.

Measurements of this machine show 24,221,164 context switches over a 17-hour period, or about 400 switches per second.  Only 4% of those switches had more than 10 contexts ready to choose from at that moment, and only 0.1% had twenty or more.  The longest run queue during the 17-hour stretch was only 36 contexts, out of a possible 420, and the mean run queue length was only 2.5 contexts.  In some sense the benchmark represents a worst case for Linux; the average case, even for heavily loaded machines, is much better.
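
Statistics like these can be gathered by periodically sampling /proc/loadavg, whose fourth field reports the number of currently runnable contexts out of the total.  The sketch below is only a rough illustration; the one-second sampling interval, the number of samples, and the output format are arbitrary.

#include <stdio.h>
#include <unistd.h>

/* Sample the run queue length a few times.  The fourth field of
 * /proc/loadavg has the form "running/total". */
int main(void)
{
        int sample;

        for (sample = 0; sample < 10; sample++) {
                double a1, a5, a15;
                int running, total;
                FILE *f = fopen("/proc/loadavg", "r");

                if (f == NULL)
                        return 1;
                if (fscanf(f, "%lf %lf %lf %d/%d",
                           &a1, &a5, &a15, &running, &total) == 5)
                        printf("%d runnable of %d contexts\n",
                               running, total);
                fclose(f);
                sleep(1);
        }
        return 0;
}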

The second (recalc) loop is even more important to understanding these benchmark results.  The recalc loop runs only when every context on the run queue has used up its entire time slice (generally 20 ticks, or 0.2 seconds); it then recalculates the priority of each context.  Normally this happens only occasionally: most rescheduling is done in response to an interrupt or an I/O request, so most contexts do not use their whole time slice.  However, when a process or thread wants to yield the CPU, it calls sched_yield(), which treats the caller as if it had used its entire time slice.  In this way, any process that calls sched_yield() has its priority lowered and will not be scheduled again for a substantial period of time.  The code for sched_yield() is shown below.
 

asmlinkage int sys_sched_yield(void) { 
        cli(); 
        move_last_runqueue(current); 
        current->counter = 0;  // THIS IS THE LINE: act as if the whole time slice was used 
        need_resched = 1; 
        sti(); 
        return 0; 
}

When all contexts have used their entire time slice, the scheduler recalculates the priority of every context using the recalc loop.  Since during this benchmark every context does one cheap operation (sched_yield()) and then has its counter set to zero, the recalc loop runs far more often than normal.  Again, the benchmark is close to a worst case for Linux.

Again, consider the web site described above.  Measurements show that only 1 out of every 200 context switches required recalculating the priority of every context; the other 199 did not.
 

Solving the Problem

Half the problem can be solved easily; the other half would be more difficult.

The Search Loop

There is probably no way to make the search loop itself run faster; it is well-written code.  However, it could be eliminated entirely by keeping the contexts on the run queue in sorted order, so that the next context to run is simply the one at the head of the queue.  But keeping the run queue sorted would require code to take each context being added to the queue and place it in the appropriate position, as in the sketch below.  The time needed to find that position would add complexity to the scheduler.  For large load averages (implying many contexts on the run queue) there might be considerable time savings, but in the more normal case of a short run queue there is no significant saving.
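
Here is a sketch of what sorted insertion might look like.  This is hypothetical code, reusing the simplified task_struct fields from the excerpt above and sorting on the counter alone; the walk to find the insertion point is exactly the cost being traded against the search loop.

/* Hypothetical: insert a newly runnable context into the run queue in
 * descending order of its counter, so that schedule() could simply take
 * the head of the queue instead of scanning the whole list. */
static void add_to_runqueue_sorted(struct task_struct *p)
{
        struct task_struct **link = &init_task.next_run;

        /* Walk until we reach a context with a lower counter; this walk
         * is the extra work that sorted insertion introduces. */
        while (*link != &init_task && (*link)->counter >= p->counter)
                link = &(*link)->next_run;

        p->next_run = *link;
        *link = p;
}

A further complication, visible in the goodness() excerpt, is that goodness() also takes the previously running context and the current CPU as arguments, so a single sorted order cannot capture it exactly.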

The Recalc Loop

Minimizing the time spent in the recalc loop would be easy.  Again, the loop itself is well-written code not likely to be improved upon, but it need not run as often as it does.  By changing sched_yield(), the recalc loop can be made to run much less often.

Right now (Linux 2.0.30) sched_yield() acts as if the yielding context has used its entire time slice.  Instead, what if sched_yield() acted as if the yielding context had used only one tick of its time slice?  The most important effect would be that a context would have to call sched_yield() many times before exhausting its slice, so the recalc loop would run far less often, and a single yield would no longer drive the caller's priority to the bottom.

The new sched_yield() is shown below; compare it with the one above.  It differs by only one line, yet it yields a large increase in performance on this benchmark.
 
asmlinkage int sys_sched_yield(void) { 
        cli(); 
        move_last_runqueue(current); 
        if (current->counter) // CHANGE IS HERE: charge one tick, not the whole slice 
           current->counter--; 
        need_resched = 1; 
        sti(); 
        return 0; 
}
The table below summarizes the performance after this change.  Note that as the run queue length increases, both the NT and the Linux schedulers take longer to switch contexts.  Linux starts out as the faster context switcher, but NT does relatively better as the run queue grows.  For run queue lengths of 20 or less (almost always the case in real life), Linux is better.
 
 
                                      Number of Contexts in Run Queue
Operating System       Benchmark           2      4      8     10     20     40
Linux Before Change    Process Switch    19 us  13 us  13 us  14 us  16 us  27 us
                       Thread Switch     16 us  11 us  10 us  10 us  15 us  23 us
Linux After Change     Process Switch     4 us   6 us  11 us  12 us  15 us  28 us
                       Thread Switch      3 us   3 us   5 us   7 us  12 us  22 us
NT                     Process Switch    10 us  15 us  15 us  17 us  16 us  17 us
                       Thread Switch      5 us   8 us   8 us   9 us  10 us  11 us
 

Summary

A two-line change to the kernel source greatly improves the results of this benchmark.  However, the benchmark arguably does not reflect real-life usage.  Nevertheless, the change is only two lines, and it provides a significant benefit to the small number of users whose workloads resemble the benchmark.