System Duty Cycle scheduler class
Perhaps you’ve read about the System Duty Cycle scheduler class. Some people have reported on it, but only cursorily. So what is the story behind the System Duty Cycle scheduler, or SDC for short?
Before digging into the SDC, I should explain that the scheduler is a sophisticated little beast. You have several scheduling classes: TS (time sharing) is the normal scheduling class, and there is IA (interactive), which prioritizes the process that’s in focus in your desktop environment and changes priorities based on some rules to ensure an even distribution of compute power across all threads. You have the FX (fixed) scheduler, which keeps the priority fixed during the complete lifetime of the process. There is the FSS (fair share scheduler), which distributes the compute time on the system by introducing a concept called shares: a thread gets compute time on the basis of the shares it owns. The kernel has a special scheduling class called SYS (system), and finally there is a scheduling class called RT (realtime).
To understand the idea of the SDC you have to know a few things about the Solaris scheduling:
- There isn't just one scheduler active in a Solaris system. You can run several of them in parallel and all of them can coexist (though perhaps not in the same processor set, FX and FSS for example).
- Every thread in a Solaris system is preemptible. It can be stopped and taken off the CPU to allow another thread to run on that CPU.
- There are 170 priorities in Solaris: 0 is the lowest and 169 is the highest.
- A process with a higher priority can preempt the process with a lower priority.
- When a thread with a higher priority becomes runnable, it will preempt a thread with a lower priority that is running on a CPU.
- The schedulers operate in assigned priority ranges. For example, the IA, TS, FX and FSS schedulers operate in the range 0-59. The scheduler behind the SYS class works in the range 60-99. The RT class works in the range 100-159. The rest (160-169) is for interrupts.
When you look at the process table, you will recognize that a Solaris system already runs with multiple active scheduling classes. There are some processes running in the SYS scheduling class, alongside many more in the IA and TS classes:
jmoekamp@hivemind:~$ ps -ef -o pid,class,pri,args | grep "SYS" | grep -v "grep"
0 SYS 96 sched
2 SYS 98 pageout
3 SYS 60 fsflush
jmoekamp@hivemind:~$ ps -ef -o pid,class,pri,args | grep "IA" | grep -v "grep" | wc -l
49
jmoekamp@hivemind:~$ ps -ef -o pid,class,pri,args | grep "TS" | grep -v "grep" | wc -l
71
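As a small illustration of the priority ranges listed above, here is a sketch of a lookup that maps a global priority to the range it falls into. The table and the function name `range_for` are my own and purely illustrative; the 160-169 bound for interrupts is simply what remains of the 170 priorities. Nothing here comes from the Solaris source.

```python
# Global priority ranges as listed in the text (illustrative lookup only).
# The 160-169 interrupt range is inferred from "the rest" of 0-169.
PRIORITY_RANGES = [
    (0, 59, "TS / IA / FX / FSS"),
    (60, 99, "SYS"),
    (100, 159, "RT"),
    (160, 169, "interrupts"),
]


def range_for(priority):
    """Return the label of the range a global priority falls into."""
    for low, high, label in PRIORITY_RANGES:
        if low <= priority <= high:
            return label
    raise ValueError(f"priority {priority} is outside 0-169")


# Priorities taken from the ps output above.
for name, prio in [("sched", 96), ("pageout", 98), ("fsflush", 60)]:
    print(name, prio, range_for(prio))
```

All three kernel processes from the listing land in the SYS range, which matches the class column that ps reports for them.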

Put simply, the duty cycle is the ratio of the time a thread runs on the CPU to the total time it either runs or is ready to run on the CPU. The time it just sleeps or waits for I/O doesn't count towards the ratio, which is obvious: the thread doesn't use the CPU in such moments, so we don't care about that time.
The idea is this: when a thread has a duty cycle ratio higher than a configured value, its priority is set to 0. Almost all other processes on the system then have a higher priority, and the thread governed by this class only gets CPU time when there is nothing else to do. When the duty cycle ratio is not higher than this value, the thread gets priority 99. That's the highest SYS priority, so it gets CPU time almost whenever it wants some (only RT class threads, interrupts and other priority-99 SYS threads may get the CPU instead).
Let's clarify things with a short example. We assume that the system checks the duty cycle at fixed intervals, let's say every 0.1 ms. The switching threshold is 0.5: when the ratio is equal or lower, we switch to 99; when it's higher, we switch to priority 0. The thread governed by the SDC class is started and gets CPU time. The first check occurs at 0.1 ms. Remember the formula and insert the data: 0.1 ms on the CPU and 0 ms waiting in the runnable state gives 0.1 / (0.1 + 0) = 1.
The priority is set to 0, because the ratio is above the threshold. The thread gets no CPU time in the next interval, but it is runnable during that interval. At 0.2 ms the next check is done. We insert the values again: 0.1 / (0.1 + 0.1) = 0.5.
The thread gets priority 99 again and runs during the next interval. At 0.3 ms the calculation is done again: 0.2 / (0.2 + 0.1) ≈ 0.67.
So we switch back to priority 0. Another 0.1 ms later, at 0.4 ms: 0.2 / (0.2 + 0.2) = 0.5.
So we switch to priority 99. At 0.5 ms we have a DC ratio of 0.6: 0.3 / (0.3 + 0.2) = 0.6.
So back to priority 0. This processing is done for as long as the thread in the SDC class exists, again and again. With this simple mechanism you keep the thread in the SDC class from hogging the CPU, and you ensure that it runs for a certain fraction of the time it could run. While the priority is 0, all other threads in the system can execute without being held up by ZFS doing some computationally intensive task.
Furthermore, kernel engineering introduced a time quantum that an SDC thread can use before the scheduler forces it off the CPU. This ensures that other priority-99 threads can get the CPU as well, without depending on the running thread to stop processing on its own.
Of course this is a little bit simplified, but not that much. You can read the theory behind the SDC in a mail about the System Duty Cycle Scheduler class PSARC case. And that's the whole secret behind the need for the threads called
zpool-poolname
that do the ZFS processing and the strange scheduling class they run in:
jmoekamp@hivemind:~$ ps -ef -o pid,class,pri,args | grep "SDC" | grep -v "grep"
5 SDC 99 zpool-rpool
368 SDC 99 zpool-datapool
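To tie this back to the duty cycle example earlier in the post, here is a minimal sketch that replays the five checks. It is a toy model under loud assumptions: the thread always wants the CPU, it gets the whole interval at priority 99 and none of it at priority 0, and a check happens after every 0.1 ms interval. The names `simulate`, `THRESHOLD`, `HIGH_PRIO` and `LOW_PRIO` are mine, not taken from the Solaris kernel.

```python
# Replay of the worked example: one check per 0.1 ms interval, threshold 0.5.
# All names are illustrative; this is not how the kernel implements SDC.

HIGH_PRIO = 99   # highest SYS priority
LOW_PRIO = 0     # lowest priority
THRESHOLD = 0.5  # switching threshold from the example


def simulate(checks):
    """Return a list of (check_number, duty_cycle, new_priority).

    Simplifying assumption: the thread is always ready to run, runs for the
    whole interval at priority 99 and only waits (runnable) at priority 0.
    """
    on_cpu = 0      # number of intervals spent running
    runnable = 0    # number of intervals spent ready but not running
    prio = HIGH_PRIO
    history = []
    for n in range(1, checks + 1):
        if prio == HIGH_PRIO:
            on_cpu += 1
        else:
            runnable += 1
        dc = on_cpu / (on_cpu + runnable)
        # Higher than the threshold: drop to 0. Equal or lower: back to 99.
        prio = LOW_PRIO if dc > THRESHOLD else HIGH_PRIO
        history.append((n, dc, prio))
    return history


for n, dc, prio in simulate(5):
    print(f"check at {n / 10:.1f} ms: DC = {dc:.2f} -> priority {prio}")
```

In this model the priority alternates 0, 99, 0, 99, 0 over the first five checks, and the fifth check yields the ratio of 0.6 mentioned in the example. A real system also has sleep and I/O wait time and the time quantum described above, all of which this sketch leaves out.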