System Duty Cycle scheduler class
Perhaps you’ve read about the System Duty Cycle scheduler class. Some people have reported on it, but only cursorily. So what is the story behind the System Duty Cycle scheduler, or SDC for short?
Before digging into the SDC, I should explain that the scheduler is a sophisticated little beast. You have several scheduling classes: TS (time sharing) is the normal scheduling class; there is IA (interactive), which prioritizes the process that’s in focus in your desktop environment, and it changes priorities based on some rules to ensure an even distribution of compute power across all threads. You have the FX (fixed) scheduler, which keeps the priority fixed during the complete lifetime of the process. There is the FSS (fair share scheduler), which distributes the compute time on the system by introducing a concept called shares: a thread gets compute time on the basis of the shares it owns. The kernel has a special scheduling class called SYS (system), and finally there is a scheduling class called RT (realtime). To understand the idea of the SDC you have to know a few things about Solaris scheduling:
- There isn't just one scheduler active in a Solaris system. You can run several of them in parallel and all of them can coexist (though perhaps not in the same processor set, as with FX and FSS, for example).
- Every thread in a Solaris system is preemptible: it can be stopped and thrown off the CPU to allow another thread to run on that CPU.
- There are 170 priorities in Solaris.
- 0 is the lowest.
- 169 is the highest.
- A process with a higher priority can preempt the process with a lower priority.
- When a thread with a higher priority becomes runnable, it will preempt a thread with a lower priority that is running on a CPU.
- The schedulers operate in assigned priority ranges. For example, the IA, TS, FX and FSS schedulers operate in the range 0-59. The scheduler behind the SYS class works in the range 60-99. The RT class works in the range 100-159. The rest is for interrupts.
When you look at the process table, you will recognize that a Solaris system already runs with multiple active scheduling classes. There are some processes running in the SYS scheduling class:
There are a lot of processes running in the IA and the TS scheduling classes:
The scheduler system keeps track of all threads and all the scheduling classes. Of course scheduling is much more complex than this, but for the purpose of this article this information is sufficient. If you want to learn more about the scheduler, I recommend chapter 3 of the book "Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture". The SDC class is missing from that book, as it's a really recent development; however, the book is the best description of the scheduling subsystem that I'm aware of, and the SDC is a variant of the SYS class.
You may already have recognized the problem. A thread running in the SYS class will always throw a thread in one of the user-mode scheduling classes off the CPU. This behaviour was introduced for a reason. Furthermore, the scheduler does some favors for threads in the SYS class. They don't have a time quantum, so they stay on a CPU as long as they don't give it up voluntarily. And when they are preempted by a higher-priority thread, they are put at the front of the queue, so it's unlikely that they migrate to a different CPU. So when a SYS thread is running on a CPU and needs to do a compute-intensive task, it's a little bit hard to get the thread off the CPU to execute userland threads.
However, the decision to implement it this way was a rational one. The assumption behind the SYS class was that everything running in the kernel has a high urgency and that most requests sit on the CPU for a really short time. That was correct for a long, long time. However, things have changed, especially when you look at ZFS. ZFS does a lot of computationally intensive work, like calculating checksums, compressing and deduplicating data. The assumption that a kernel thread occupies the CPU only for a relatively short timespan no longer holds true for all cases, although it's still correct for most. The exceptions to this assumption are the challenge.
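The priority ranges and the preemption rule described so far can be written down as a small Python sketch. This is only an illustration of the article's simplified model, not of the real dispatcher; the names `BANDS`, `band_of` and `preempts` are made up for this example, and the 160-169 interrupt band is inferred from "the rest is for interrupts" together with the 170 priority levels.

```python
# Priority bands as listed in the article. The interrupt band 160-169 is an
# inference from "170 priorities" and "the rest is for interrupts".
BANDS = [
    (0, 59, "user (TS/IA/FX/FSS)"),
    (60, 99, "SYS"),
    (100, 159, "RT"),
    (160, 169, "interrupts"),
]


def band_of(priority: int) -> str:
    """Return the name of the band a global priority falls into."""
    for low, high, name in BANDS:
        if low <= priority <= high:
            return name
    raise ValueError(f"priority {priority} is outside 0-169")


def preempts(runnable_priority: int, running_priority: int) -> bool:
    """Simplified rule: a runnable thread preempts a running thread
    only if its priority is strictly higher."""
    return runnable_priority > running_priority


if __name__ == "__main__":
    # Even the lowest SYS priority beats the highest user-mode priority.
    print(band_of(60), "preempts", band_of(59), ":", preempts(60, 59))
    # A user-mode thread never pushes a SYS thread off the CPU.
    print(band_of(59), "preempts", band_of(60), ":", preempts(59, 60))
```

The point of the sketch is the second call: in this model no user-mode priority can ever win against a SYS thread, which is exactly why a long-running kernel thread becomes a problem.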
As ZFS is part of kernel space, it runs in the SYS scheduling class. It has a higher priority than any user-mode thread short of real-time threads, so it gets CPU time whenever it wants it and preempts any user-mode thread. In the end it's able to starve the user-mode threads in corner cases when it has to do a lot of computationally intensive work, and in any case it can increase the latency experienced by the users of the system to an extent that's recognizable by mere mortals. So the kernel engineers had to find a way to keep such threads from hogging the CPU. They found a relatively simple way to do it. It's called the System Duty Cycle scheduler class, and it's a really simple but very elegant idea. The concept revolves around the idea of the duty cycle. The duty cycle is defined as
duty cycle = time on CPU / (time on CPU + time runnable but waiting for a CPU)
To put it simply, it's the ratio of the time the thread runs on a CPU to the time it either runs or is ready to run on a CPU. The time it just sleeps or waits for I/O doesn't count towards the ratio, but that's obvious: it doesn't use the CPU in such moments and we don't care about this time. The idea is now this: when the thread has a duty cycle higher than a configured value, its priority is set to zero. So almost all processes on the system have a higher priority, and the thread governed by this class only gets CPU time when there is nothing else to do. When the duty cycle is lower than this value, the thread gets priority 99. That's the highest SYS priority, so it gets CPU time almost whenever it wants some (only RT class threads, interrupts and other priority-99 SYS threads may get the CPU instead).
Let's clarify things with a short example. We assume that the system checks the duty cycle at intervals of, let's say, 0.1 ms. The switching threshold is 0.5. When the ratio is equal or lower, we switch to priority 99. When it's higher, we switch to priority 0. The thread governed by the SDC class is started and gets CPU time. The first check occurs at 0.1 ms. Remember the formula and insert the data:
duty cycle = 0.1 ms / (0.1 ms + 0 ms) = 1.0
The priority is set to 0. So the thread gets no CPU time in the next interval, but it's runnable in that interval. Okay. At 0.2 ms the next check is done. We insert the values again:
duty cycle = 0.1 ms / (0.1 ms + 0.1 ms) = 0.5
The thread gets priority 99 again. At 0.3 ms the calculation is done again:
duty cycle = 0.2 ms / (0.2 ms + 0.1 ms) ≈ 0.67
So we switch back to priority 0. Another 0.1 ms later:
duty cycle = 0.2 ms / (0.2 ms + 0.2 ms) = 0.5
So we switch to priority 99. At 0.5 ms we have a duty cycle of 0.6:
duty cycle = 0.3 ms / (0.3 ms + 0.2 ms) = 0.6
So back to priority 0. This processing is repeated, again and again, as long as the thread in the SDC class exists. With this simple mechanism you keep the thread in the SDC class from hogging the CPU, and you ensure that it runs for a certain fraction of the time it could run. While the priority is 0, all other threads in the system can execute without being held up by ZFS doing some computationally intensive task. Furthermore, the kernel engineers introduced a time quantum that an SDC thread can use before the scheduler forces it off the CPU. This ensures that other priority-99 threads can get the CPU as well, without depending on the running thread to give up the CPU. Of course this is a little bit simplified, but not that much; you can read the theory behind the SDC in a mail about the System Duty Cycle Scheduler class PSARC case. And that's the whole secret behind the need for the threads called zpool-poolname that do the ZFS processing, and behind the strange scheduling class they run in.
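The duty cycle example from this article can be replayed with a few lines of Python. This is a sketch of the simplified model used in the text, not of the actual kernel code: it assumes the thread always wants the CPU, runs for the whole interval while at priority 99, and waits in the runnable state for the whole interval while at priority 0 because the system is busy. The function names and parameters are made up for this illustration, and time is counted in check intervals rather than milliseconds.

```python
def duty_cycle(on_cpu: int, runnable: int) -> float:
    """Time on CPU divided by time on CPU plus time runnable but waiting."""
    total = on_cpu + runnable
    return on_cpu / total if total else 0.0


def simulate(checks: int, threshold: float = 0.5, high: int = 99, low: int = 0):
    """Replay the simplified model for a number of check intervals.

    Returns a list of (check number, duty cycle, new priority) tuples.
    """
    on_cpu = 0      # intervals spent running on a CPU
    runnable = 0    # intervals spent runnable but not running
    priority = high # the thread starts out with CPU time
    history = []
    for check in range(1, checks + 1):
        # Assumption: at priority 99 the thread runs the whole interval,
        # at priority 0 it only waits, since other threads keep the CPU busy.
        if priority == high:
            on_cpu += 1
        else:
            runnable += 1
        dc = duty_cycle(on_cpu, runnable)
        # Equal or lower than the threshold: priority 99, otherwise 0.
        priority = high if dc <= threshold else low
        history.append((check, round(dc, 2), priority))
    return history


if __name__ == "__main__":
    for check, dc, priority in simulate(5):
        print(f"check {check}: duty cycle {dc}, priority {priority}")
```

Running it for five checks gives the duty cycles 1.0, 0.5, 0.67, 0.5 and 0.6, with the priority alternating between 0 and 99 as in the walkthrough. Over a longer run the ratio settles ever closer to the threshold, which is the "certain fraction of the time it could run" mentioned in the text.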