Chapter 11. Power Management

Contents

11.1. Power Management at CPU Level
11.2. The Linux Kernel CPUfreq Infrastructure
11.3. Viewing, Monitoring and Tuning Power-related Settings
11.4. Special Tuning Options
11.5. Creating and Using Power Management Profiles
11.6. Troubleshooting
11.7. For More Information

Abstract

Power management aims at reducing operating costs for energy and cooling systems while at the same time keeping the performance of a system at a level that matches the current requirements. Thus, power management is always a matter of balancing the actual performance needs and power saving options for a system. Power management can be implemented and used at different levels of the system. A set of specifications for power management functions of devices and the operating system interface to them has been defined in the Advanced Configuration and Power Interface (ACPI). As power savings in server environments can primarily be achieved on processor level, this chapter introduces some of the main concepts and highlights some tools for analyzing and influencing relevant parameters.

11.1. Power Management at CPU Level

At CPU level, you can control power usage in various ways: for example, by using idling power states (C-states), changing CPU frequency (P-states), and throttling the CPU (T-states). The following sections give a short introduction to each approach and its significance for power savings. Detailed specifications can be found at http://www.acpi.info/spec.htm.

11.1.1. C-States (Processor Operating States)

Modern processors have several power saving modes called C-states. They reflect the capability of an idle processor to turn off unused components in order to save power. Whereas C-states have been available for laptops for some time, they are a rather recent trend in the server market (for example, with Intel* processors, C-states have only been available since Nehalem).

When a processor runs in the C0 state, it is executing instructions. A processor running in any other C-state is idle. The higher the C number, the deeper the CPU sleep mode: more components are shut down to save power. Deeper sleep states save more power, but the downside is that they have higher latency (the time the CPU needs to go back to C0).

Some states also have submodes with different power saving and latency levels. Which C-states and submodes are supported depends on the respective processor. However, C1 is always available.

Table 11.1, “C-States” gives an overview of the most common C-states.

Table 11.1. C-States

Mode  Definition
----  ----------
C0    Operational state. CPU fully turned on.
C1    First idle state. Stops CPU main internal clocks via software. Bus interface unit and APIC are kept running at full speed.
C2    Stops CPU main internal clocks via hardware. State where the processor maintains all software-visible states, but may take longer to wake up through interrupts.
C3    Stops all CPU internal clocks. The processor does not need to keep its cache coherent, but maintains other states. Some processors have variations of the C3 state that differ in how long it takes to wake the processor through interrupts.


11.1.2. P-States (Processor Performance States)

While a processor operates (in C0 state), it can be in one of several CPU performance states (P-states). Whereas C-states are idle states (all but C0), P-states are operational states that relate to CPU frequency and voltage.

The higher the P-state, the lower the frequency and voltage at which the processor runs. The number of P-states is processor-specific and the implementation differs across the various types. However, P0 is always the highest-performance state. Higher P-state numbers represent slower processor speeds and lower power consumption. For example, a processor in P3 state runs more slowly and uses less power than a processor running at P1 state. To operate at any P-state, the processor must be in the C0 state where the processor is working and not idling. The CPU P-states are also defined in the Advanced Configuration and Power Interface (ACPI) specification; see http://www.acpi.info/spec.htm.

C-states and P-states can vary independently of one another.

11.1.3. T-States (Processor Throttling States)

T-states refer to throttling the processor clock to lower frequencies in order to reduce thermal effects. This means that the CPU is forced to be idle for a fixed percentage of its cycles per second. Throttling states range from T0 (the CPU has no forced idle cycles) to Tn, with the percentage of idle cycles increasing as n increases.

Note that throttling does not reduce voltage and since the CPU is forced to idle part of the time, processes will take longer to finish and will consume more power instead of saving any power.

T-states are only useful if reducing thermal effects is the primary goal. Since T-states can interfere with C-states (preventing the CPU from reaching higher C-states), they can even increase power consumption in a modern CPU capable of C-states.

11.1.4. Turbo Features

For quite some time, CPU power consumption and performance tuning has no longer been only about frequency scaling. In modern processors, a combination of different means is used to achieve the optimum balance between performance and power savings: deep sleep states, traditional dynamic frequency scaling and hidden boost frequencies. The turbo features (Turbo CORE* or Turbo Boost*) of the latest AMD* or Intel* processors make it possible to dynamically increase (boost) the clock speed of active CPU cores while other cores are in deep sleep states. This increases the performance of active threads while still complying with Thermal Design Power (TDP) limits.

However, the conditions under which a CPU core may use turbo frequencies are very architecture-specific. Learn how to evaluate the efficiency of those new features in Section 11.3.1, “Using the cpupower Tools”.

11.2. The Linux Kernel CPUfreq Infrastructure

Processor performance states (P-states) reflect the capability of a processor to switch between different supported operating frequencies and voltages to modulate power consumption, whereas processor operating states (C-states) reflect its capability to turn off unused components while idle.

To dynamically scale processor frequencies at runtime, you can use the CPUfreq infrastructure to set a static or dynamic power policy for the system. Its main components are the CPUfreq subsystem (providing a common interface to the various low-level technologies and high-level policies), the in-kernel governors (policy governors that can change the CPU frequency based on different criteria) and CPU-specific drivers that implement the technology for the specific processor.

The dynamic scaling of the clock speed helps to consume less power and generate less heat when not operating at full capacity.

11.2.1. In-Kernel Governors

You can think of the in-kernel governors as pre-configured power schemes for the CPU. The CPUfreq governors use P-states to change frequencies and lower power consumption. The dynamic governors can switch between CPU frequencies based on CPU utilization, allowing for power savings while not sacrificing performance. These governors also allow for some tuning so you can customize and change the frequency scaling behavior.

The following governors are available with the CPUfreq subsystem:

Performance Governor

The CPU frequency is statically set to the highest possible for maximum performance. Consequently, saving power is not the focus of this governor.

Tuning options: The range of maximum frequencies available to the governor can be adjusted (for example, with the cpupower command line tool).

Powersave Governor

The CPU frequency is statically set to the lowest possible. This can have severe impact on the performance, as the system will never rise above this frequency no matter how busy the processors are.

However, using this governor often does not lead to the expected power savings, as the highest savings can usually be achieved at idle through entering C-states. Because the powersave governor runs processes at the lowest frequency, they take longer to finish, which delays the point at which the system can enter any idle C-states.

Tuning options: The range of minimum frequencies available to the governor can be adjusted (for example, with the cpupower command line tool).

On-demand Governor

The kernel implementation of a dynamic CPU frequency policy: The governor monitors the processor utilization. As soon as it exceeds a certain threshold, the governor will set the frequency to the highest available. If the utilization is less than the threshold, the next lower frequency is used. If the system continues to be underutilized, the frequency is again reduced until the lowest available frequency is set.

For openSUSE, the on-demand governor is the default governor and the one that has the best test coverage.

Tuning options: The range of available frequencies, the rate at which the governor checks utilization, and the utilization threshold can be adjusted. Another parameter you might want to change for the on-demand governor is ignore_nice_load. For details, refer to Procedure 11.1, “Ignoring Nice Values in Processor Utilization”.

Conservative Governor

Similar to the on-demand implementation, this governor also dynamically adjusts frequencies based on processor utilization, except that it allows for a more gradual increase in power. If processor utilization exceeds a certain threshold, the governor does not immediately switch to the highest available frequency (as the on-demand governor does), but only to the next higher frequency available.

Tuning options: The range of available frequencies, the rate at which the governor checks utilization, the utilization thresholds, and the frequency step rate can be adjusted.
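The difference between the on-demand and the conservative governor can be illustrated with a small model. The following sketch is not kernel code; it only mirrors the description above. The function name next_freq, its argument order, and the assumption that the conservative governor also steps down one frequency at a time are made up for illustration:

```shell
# Simplified model of the frequency choice described above (not kernel code).
# Arguments: governor, utilization (%), threshold (%), 1-based index of the
# current frequency, then the available frequencies in ascending order.
# Prints the frequency that would be selected next.
next_freq() {
    governor=$1 util=$2 threshold=$3 cur=$4
    shift 4
    count=$#
    case $governor in
        ondemand|conservative) ;;
        *) echo "unknown governor: $governor" >&2; return 1 ;;
    esac
    if [ "$util" -gt "$threshold" ]; then
        if [ "$governor" = ondemand ]; then
            new=$count              # jump straight to the highest frequency
        else
            new=$((cur + 1))        # conservative: only one step up
        fi
    else
        new=$((cur - 1))            # underutilized: one step down
    fi
    # Stay within the range of available frequencies.
    if [ "$new" -lt 1 ]; then new=1; fi
    if [ "$new" -gt "$count" ]; then new=$count; fi
    eval "echo \${$new}"
}
```

For example, next_freq conservative 95 80 1 2000 2340 2830 models a CPU at the lowest of three frequencies (in MHz) with 95% utilization and an 80% threshold.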

11.2.2. Related Files and Directories

If the CPUfreq subsystem is enabled on your system (which it is by default with openSUSE), you can find the relevant files and directories under /sys/devices/system/cpu/. If you list the contents of this directory, you will find a cpu{0..x} subdirectory for each processor, and several other files and directories. A cpufreq subdirectory in each processor directory holds a number of files and directories that define the parameters for CPUfreq. Some of them are writable (for root), some of them are read-only. If your system currently uses the on-demand or conservative governor, you will see a separate subdirectory for those governors in cpufreq, containing the parameters for the governors.

[Note]Different Processor Settings

The settings under the cpufreq directory can be different for each processor. If you want to use the same policies across all processors, you need to adjust the parameters for each processor. Instead of looking up or modifying the current settings manually (in /sys/devices/system/cpu/cpu*/cpufreq), we advise using the tools provided by the cpupower package.
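If you nevertheless want a quick look at the per-processor settings in sysfs, a loop over the cpufreq directories is sufficient. The following is a minimal sketch; the function name show_cpufreq_policies and its optional root argument are illustrative, and it assumes that the files scaling_governor, scaling_min_freq and scaling_max_freq are present in each cpufreq directory:

```shell
# Print "cpuN governor min_kHz max_kHz" for every CPU with a cpufreq directory.
# The optional argument overrides the sysfs root (useful for trying it out).
show_cpufreq_policies() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        printf '%s %s %s %s\n' "$cpu" \
            "$(cat "$dir/scaling_governor")" \
            "$(cat "$dir/scaling_min_freq")" \
            "$(cat "$dir/scaling_max_freq")"
    done
}
```

Differing lines in the output indicate processors whose policies are not in sync.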

11.3. Viewing, Monitoring and Tuning Power-related Settings

The following command line tools are available for that purpose:

Using the cpupower Tools

The new cpupower tool was designed to give an overview of all CPU power-related parameters that are supported on a given machine, including turbo (or boost) states. Use the tool set to view and modify settings of the kernel-related CPUfreq and cpuidle systems as well as other settings not related to frequency scaling or idle states. The integrated monitoring framework can access both Kernel-related parameters and hardware statistics and is thus ideally suited for performance benchmarks. It also helps you to identify the dependencies between turbo and idle states.

Monitoring Power Consumption with powerTOP

powerTOP combines various sources of information (analysis of programs, device drivers, kernel options, amounts and sources of interrupts waking up processors from sleep states) and shows them in one screen. The tool helps you to identify the reasons for unnecessarily high power consumption (for example, processes that are mainly responsible for waking up a processor from its idle state) and to optimize your system settings to avoid these.

11.3.1. Using the cpupower Tools

After installing the cpupower package, view the available cpupower subcommands with cpupower --help. Access the general man page with man cpupower, and the man pages of the subcommands with man cpupower-subcommand.

The subcommands frequency-info and frequency-set are mostly equivalent to cpufreq-info and cpufreq-set, respectively. However, they provide extended output and there are small differences in syntax and behavior:

Syntax Differences Between cpufreq* and cpupower

  • To specify the number of the CPU to which the command is applied, both commands have the -c option. Due to the command-subcommand structure, the placement of the -c option is different for cpupower:

    cpupower -c 4 frequency-info (versus cpufreq-info -c 4)

    cpupower also lets you specify a list of CPUs with -c. For example, the following command would affect the CPUs 1, 2, 3, and 5:

    cpupower -c 1-3,5 frequency-set

  • If cpufreq* and cpupower are used without the -c option, the behavior differs:

    cpufreq-set automatically applies the command to CPU 0, whereas cpupower frequency-set applies the command to all CPUs in this case. Typically, cpupower *info subcommands access only CPU 0, whereas cpufreq-info accesses all CPUs, if not specified otherwise.
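To make the list syntax of the -c option more concrete, the following sketch expands such a list into individual CPU numbers. It only illustrates the notation and is not how cpupower itself parses the option; the function name expand_cpu_list is made up for this example:

```shell
# Expand a CPU list such as "1-3,5" into "1 2 3 5".
expand_cpu_list() {
    result=
    for part in $(echo "$1" | tr ',' ' '); do
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;   # range "first-last"
            *)   first=$part; last=$part ;;             # single CPU number
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            result="$result${result:+ }$i"
            i=$((i + 1))
        done
    done
    echo "$result"
}
```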

11.3.1.1. Viewing Current Settings with cpupower

Similar to cpufreq-info, cpupower frequency-info also shows the statistics of the cpufreq driver used in the Kernel. Additionally, it shows if turbo (boost) states are supported and enabled in the BIOS. Run without any options, it shows an output similar to the following:

Example 11.1. Example Output of cpupower frequency-info

analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0 1 2 3
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 2.00 GHz - 2.83 GHz
  available frequency steps: 2.83 GHz, 2.34 GHz, 2.00 GHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance
  current policy: frequency should be within 2.00 GHz and 2.83 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 2.00 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
    

To get the current values for all CPUs, use cpupower -c all frequency-info.

11.3.1.2. Viewing Kernel Idle Statistics with cpupower

The idle-info subcommand shows the statistics of the cpuidle driver used in the Kernel. It works on all architectures that use the cpuidle Kernel framework.

Example 11.2. Example Output of cpupower idle-info

CPUidle driver: acpi_idle
CPUidle governor: menu
     
Analyzing CPU 0:
Number of idle states: 3
Available idle states: C1 C2
C1:
Flags/Description: ACPI FFH INTEL MWAIT 0x0
Latency: 1
Usage: 3156464
Duration: 233680359
C2:
Flags/Description: ACPI FFH INTEL MWAIT 0x10
Latency: 1
Usage: 273007117
Duration: 103148860538
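
The cpuidle framework also exposes these statistics in sysfs. The following sketch reads them directly; it assumes the usual cpuidle layout with one stateN directory per idle state containing the files name, latency, usage and time. The function name list_idle_states is illustrative:

```shell
# Print "name latency_us usage time_us" for each idle state of one CPU.
# The optional argument is the CPU directory to inspect.
list_idle_states() {
    cpu_dir=${1:-/sys/devices/system/cpu/cpu0}
    for state in "$cpu_dir"/cpuidle/state[0-9]*; do
        [ -d "$state" ] || continue
        printf '%s %s %s %s\n' \
            "$(cat "$state/name")" \
            "$(cat "$state/latency")" \
            "$(cat "$state/usage")" \
            "$(cat "$state/time")"
    done
}
```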

11.3.1.3. Monitoring Kernel and Hardware Statistics with cpupower

The most powerful enhancement is the monitor subcommand. Use it to report processor topology, and monitor frequency and idle power state statistics over a certain period of time. The default interval is 1 second, but it can be changed with the -i option. Independent processor sleep states and frequency counters are implemented in the tool: some are retrieved from kernel statistics, others are read from hardware registers. The available monitors depend on the underlying hardware and the system. List them with cpupower monitor -l. For a description of the individual monitors, refer to the cpupower-monitor man page.

The monitor subcommand allows you to execute performance benchmarks and to compare Kernel statistics with hardware statistics for specific workloads.

Example 11.3. Example cpupower monitor Output

|Mperf               || Idle_Stats  
 1                      2       
CPU | C0   | Cx   | Freq || POLL | C1   | C2   | C3   
   0|  3.71| 96.29|  2833||  0.00|  0.00|  0.02| 96.32
   1| 100.0| -0.00|  2833||  0.00|  0.00|  0.00|  0.00
   2|  9.06| 90.94|  1983||  0.00|  7.69|  6.98| 76.45
   3|  7.43| 92.57|  2039||  0.00|  2.60| 12.62| 77.52
     

1

Mperf shows the average frequency of a CPU, including boost frequencies, over a period of time. Additionally, it shows the percentage of time the CPU has been active (C0) or in any sleep state (Cx). The default sampling rate is 1 second and the values are read directly from the hardware registers. As the turbo states are managed by the BIOS, it is impossible to get the frequency values at a given instant. On modern processors with turbo features the Mperf monitor is the only way to find out about the frequency a certain CPU has been running at.

2

Idle_Stats shows the statistics of the cpuidle kernel subsystem. The kernel updates these values every time an idle state is entered or left. Therefore there can be some inaccuracy when cores are in an idle state for some time when the measure starts or ends.

Apart from the (general) monitors in the example above, other architecture-specific monitors are available. For detailed information, refer to the cpupower-monitor man page.


By comparing the values of the individual monitors, you can find correlations and dependencies and evaluate how well the power saving mechanism works for a certain workload. In Example 11.3 you can see that CPU 0 is idle (the value of Cx is close to 100%), but runs at a very high frequency. Additionally, the CPUs 0 and 1 have the same frequency values, which means that there is a dependency between them.
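
Such a comparison can also be scripted. The following sketch filters cpupower monitor output for CPUs that were mostly idle but still ran at a high average frequency. It assumes the column layout of Example 11.3 (CPU, C0, Cx and Freq as the first four columns); the function name idle_but_fast and the default limits of 90 percent and 2500 MHz are arbitrary choices for illustration:

```shell
# Read `cpupower monitor` output on stdin and print the numbers of CPUs that
# were idle for more than min_idle percent of the interval while their
# average frequency was at least min_freq MHz.
# Usage: cpupower monitor | idle_but_fast [min_idle] [min_freq]
idle_but_fast() {
    awk -F'|' -v min_idle="${1:-90}" -v min_freq="${2:-2500}" '
        # Data rows start with a CPU number; header rows do not.
        NF >= 4 && $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            if ($3 + 0 > min_idle && $4 + 0 >= min_freq)
                print $1 + 0
        }'
}
```

Applied to the values in Example 11.3 with the default limits, only CPU 0 matches: CPUs 2 and 3 are also mostly idle, but run at lower frequencies.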

11.3.1.4. Modifying Current Settings with cpupower

Similar to cpufreq-set, you can use the cpupower frequency-set command as root to modify current settings. It allows you to set values for the minimum or maximum CPU frequency the governor may select or to switch to a different governor. With the -c option, you can also specify for which of the processors the settings should be modified. That makes it easy to use a consistent policy across all processors without adjusting the settings for each processor individually. For more details and the available options, refer to the cpupower-frequency-set man page or run cpupower frequency-set --help.

11.3.2. Monitoring Power Consumption with powerTOP

Another useful tool for monitoring system power consumption is powerTOP. It helps you to identify the reasons for unnecessarily high power consumption (for example, processes that are mainly responsible for waking up a processor from its idle state) and to optimize your system settings to avoid these. It supports both Intel and AMD processors.

powerTOP combines various sources of information (analysis of programs, device drivers, kernel options, amounts and sources of interrupts waking up processors from sleep states) and shows them in one screen. Example 11.4, “Example powerTOP Output” shows which information categories are available:

Example 11.4. Example powerTOP Output

Cn               Avg  residency       P-states   (frequencies) 
1                 2      3              4            5     
C0 (cpu running)        (11.6%)       2.00 Ghz       0.1%
polling         0.0ms   ( 0.0%)       2.00 Ghz       0.0%
C1              4.4ms   (57.3%)       1.87 Ghz       0.0%
C2             10.0ms   (31.1%)       1064 Mhz      99.9%
     
     
Wakeups-from-idle per second : 11.2     interval: 5.0s 6
no ACPI power usage estimate available 7


Top causes for wakeups: 8
96.2% (826.0)       <interrupt> : extra timer interrupt
 0.9% (  8.0)     <kernel core> : usb_hcd_poll_rh_status (rh_timer_func)
 0.3% (  2.4)       <interrupt> : megasas
 0.2% (  2.0)     <kernel core> : clocksource_watchdog (clocksource_watchdog)
 0.2% (  1.6)       <interrupt> : eth1-TxRx-0
 0.1% (  1.0)       <interrupt> : eth1-TxRx-4
     
[...]
    
Suggestion: 9 Enable SATA ALPM link power management via:
echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
or press the S key.

1

The column shows the C-states. When working, the CPU is in state 0, when resting it is in some state greater than 0, depending on which C-states are available and how deep the CPU is sleeping.

2

The column shows average time in milliseconds spent in the particular C-state.

3

The column shows the percentages of time spent in various C-states. For considerable power savings during idle, the CPU should be in deeper C-states most of the time. In addition, the longer the average time spent in these C-states, the more power is saved.

4

The column shows the frequencies the processor and kernel driver support on your system.

5

The column shows the amount of time the CPU cores stayed in different frequencies during the measuring period.

6

Shows how often the CPU is awoken per second (number of interrupts). The lower the number the better. The interval value is the powerTOP refresh interval which can be controlled with the -t option. The default time to gather data is 5 seconds.

7

When running powerTOP on a laptop, this line displays the ACPI information on how much power is currently being used and the estimated time until discharge of the battery. On servers, this information is not available.

8

Shows what is causing the system to be more active than needed. powerTOP displays the top items causing your CPU to awake during the sampling period.

9

Suggestions on how to improve power usage for this machine.


For more information, refer to the powerTOP project page at http://www.lesswatts.org/projects/powertop/. It also provides tips and tricks and an informative FAQ section.

11.4. Special Tuning Options

The following sections highlight some of the most relevant settings that you might want to touch.

11.4.1. Tuning Options for P-States

The CPUfreq subsystem offers several tuning options for P-states: You can switch between the different governors, influence minimum or maximum CPU frequency to be used or change individual governor parameters.

To switch to another governor at runtime, use cpupower frequency-set with the -g option. For example, running the following command (as root) will activate the on-demand governor:

cpupower frequency-set -g ondemand

If you want the change in governor to persist also after a reboot or shutdown, use the pm-profiler as described in Section 11.5, “Creating and Using Power Management Profiles”.

To set values for the minimum or maximum CPU frequency the governor may select, use the -d or -u option, respectively.
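
As a hedged illustration, the following wrapper restricts the frequency range with -d and -u. The function name set_freq_range and the CPUPOWER override variable are made up for this sketch, and the frequencies in the usage line are taken from Example 11.1; use values that your hardware supports:

```shell
# Limit the frequency range the governor may choose from (run as root).
# Arguments: minimum frequency, maximum frequency, optional CPU list.
# Setting CPUPOWER=echo prints the arguments instead of calling cpupower.
set_freq_range() {
    min=$1 max=$2 cpus=${3:-all}
    "${CPUPOWER:-cpupower}" -c "$cpus" frequency-set -d "$min" -u "$max"
}

# Usage: set_freq_range 2000MHz 2830MHz
```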

Apart from the governor settings that can be influenced with cpupower or cpufreq*, you can also tune further governor parameters manually, for example, as described in Procedure 11.1, “Ignoring Nice Values in Processor Utilization”.

Procedure 11.1. Ignoring Nice Values in Processor Utilization

One parameter you might want to change for the on-demand or conservative governor is ignore_nice_load.

Each process has a niceness value associated with it. This value is used by the kernel to determine which processes require more processor time than others. The higher the nice value, the lower the priority of the process. Or: the nicer a process, the less CPU it will try to take from other processes.

If the ignore_nice_load parameter for the on-demand or conservative governor is set to 1, any processes with a nice value will not be counted toward the overall processor utilization. When ignore_nice_load is set to 0 (default value), all processes are counted toward the utilization. Adjusting this parameter can be useful if you are running something that requires a lot of processor capacity but you do not care about the runtime.

  1. Change to the subdirectory of the governor whose settings you want to modify, for example:

    cd /sys/devices/system/cpu/cpu0/cpufreq/conservative/
  2. Show the current value of ignore_nice_load with:

    cat ignore_nice_load
  3. To set the value to 1, execute:

    echo 1 > ignore_nice_load
[Tip]Using the Same Value for All Cores

When setting the ignore_nice_load value for cpu0, the same value is automatically used for all cores. In this case, you do not need to repeat the steps above for each of the processors where you want to modify this governor parameter.
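
The steps of the procedure can be combined into a small helper. This is a sketch only: the function name set_ignore_nice_load is made up, and the default directory is the one used in the procedure, which may differ on your system:

```shell
# Set ignore_nice_load to VALUE (0 or 1) and report the change.
# The optional second argument is the governor directory (run as root).
set_ignore_nice_load() {
    value=$1
    dir=${2:-/sys/devices/system/cpu/cpu0/cpufreq/conservative}
    file=$dir/ignore_nice_load
    case $value in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 1 ;;
    esac
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    old=$(cat "$file")
    echo "$value" > "$file"
    echo "ignore_nice_load: $old -> $(cat "$file")"
}
```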

Another parameter that significantly impacts the performance loss caused by dynamic frequency scaling is the sampling rate (rate at which the governor checks the current CPU load and adjusts the processor's frequency accordingly). Its default value depends on a BIOS value and it should be as low as possible. However, in modern systems, an appropriate sampling rate is set by default and does not need manual intervention.

11.4.2. Tuning Options for C-states

By default, openSUSE uses C-states appropriately. The only parameter you might want to touch for optimization is the sched_mc_power_savings scheduler setting. Instead of distributing a workload across all cores with the effect that all cores are utilized only at a minimum level, the kernel can try to schedule processes on as few cores as possible so that the others can go idle. This helps to save power as it allows some processors to be idle for a longer time so they can reach a higher C-state. However, the actual savings depend on a number of factors, for example how many processors are available and which C-states are supported by them (especially deeper ones such as C3 to C6).

If sched_mc_power_savings is set to 0 (default value), no special scheduling is done. If it is set to 1, the scheduler tries to consolidate the work onto the fewest number of processors possible in the case that all processors are a little busy. To modify this parameter, proceed as follows:

Procedure 11.2. Scheduling Processes on Cores

  1. Become root on a command line.

  2. To view the current value of sched_mc_power_savings, use the following command:

    cpupower info -m
  3. To set sched_mc_power_savings to 1, execute:

    cpupower set -m 1

11.5. Creating and Using Power Management Profiles

openSUSE includes pm-profiler, intended for server use. It is a script infrastructure to enable or disable certain power management functions via configuration files. It allows you to define different profiles, each having a specific configuration file for defining different settings. A configuration template for new profiles can be found at /usr/share/doc/packages/pm-profiler/config.template. The template contains a number of parameters you can use for your profile, including comments on usage and links to further documentation. The individual profiles are stored in /etc/pm-profiler/. The profile that will be activated on system start is defined in /etc/pm-profiler.conf.

Procedure 11.3. Creating and Switching Power Profiles

To create a new profile, proceed as follows:

  1. Create a directory in /etc/pm-profiler/, containing the profile name, for example:

     mkdir /etc/pm-profiler/testprofile
  2. To create the configuration file for the new profile, copy the profile template to the newly created directory:

    cp /usr/share/doc/packages/pm-profiler/config.template \
         /etc/pm-profiler/testprofile/config
  3. Edit the settings in /etc/pm-profiler/testprofile/config and save the file. You can also remove variables that you do not need. They will be handled like empty variables, which means that the corresponding settings will not be touched at all.

  4. Edit /etc/pm-profiler.conf. The PM_PROFILER_PROFILE variable defines which profile will be activated on system start. If it has no value, the default system or kernel settings will be used. To set the newly created profile:

    PM_PROFILER_PROFILE="testprofile"
        

    The profile name you enter here must match the name you used in the path to the profile configuration file (/etc/pm-profiler/testprofile/config), not necessarily the NAME you used for the profile in the /etc/pm-profiler/testprofile/config.

  5. To activate the profile, run

    rcpm-profiler start

    or

    /usr/lib/pm-profiler/enable-profile testprofile 
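
Steps 1, 2 and 4 of the procedure can be scripted as follows. This is a minimal sketch under a few assumptions: the function name create_pm_profile is made up, the optional arguments for the configuration root and the template path exist only to make the function easy to try out, and profile names are expected to consist of plain letters, digits, dashes or underscores:

```shell
# Create a pm-profiler profile from the template and select it for the
# next system start. Arguments: profile name, optional configuration root
# (default /etc), optional template path.
create_pm_profile() {
    name=$1
    etc=${2:-/etc}
    template=${3:-/usr/share/doc/packages/pm-profiler/config.template}
    conf=$etc/pm-profiler.conf
    mkdir -p "$etc/pm-profiler/$name" || return 1
    cp "$template" "$etc/pm-profiler/$name/config" || return 1
    if grep -q '^PM_PROFILER_PROFILE=' "$conf" 2>/dev/null; then
        # Replace the existing assignment.
        sed "s|^PM_PROFILER_PROFILE=.*|PM_PROFILER_PROFILE=\"$name\"|" \
            "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
    else
        # No assignment yet: append one.
        echo "PM_PROFILER_PROFILE=\"$name\"" >> "$conf"
    fi
}
```

Editing the profile settings (step 3) and activating the profile (step 5) remain manual steps.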

Though you have to manually create or modify a profile by editing the respective profile configuration file, you can use YaST to switch between different profiles. Start YaST and select System+Power Management to open the Power Management Settings. Alternatively, become root and execute yast2 power-management on a command line. The drop-down list shows the available profiles. Default means that the system default settings will be kept. Select the profile to use and click Finish.

11.6. Troubleshooting

BIOS options enabled?

In order to make use of C-states or P-states, check your BIOS options:

  • To use C-states, make sure to enable CPU C State or similar options to benefit from power savings at idle.

  • To use P-states and the CPUfreq governors, make sure to enable Processor Performance States options or similar.

In case of a CPU upgrade, make sure to upgrade your BIOS, too. The BIOS needs to know the new CPU and its valid frequency steps in order to pass this information on to the operating system.

CPUfreq subsystem enabled?

In openSUSE, the CPUfreq subsystem is enabled by default. To find out if the subsystem is currently enabled, check for the following path in your system: /sys/devices/system/cpu/cpufreq (or /sys/devices/system/cpu/cpu*/cpufreq for machines with multiple cores). If the cpufreq subdirectory exists, the subsystem is enabled.
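
The check can be written as a short shell function, for example for use in scripts. The function name cpufreq_enabled and its optional root argument are illustrative:

```shell
# Return success if a cpufreq directory exists, either directly below the
# CPU root or in at least one per-CPU directory.
cpufreq_enabled() {
    root=${1:-/sys/devices/system/cpu}
    if [ -d "$root/cpufreq" ]; then
        return 0
    fi
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        if [ -d "$dir" ]; then
            return 0
        fi
    done
    return 1
}

# Usage: cpufreq_enabled && echo "CPUfreq subsystem is enabled"
```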

Log file information?

Check syslog (usually /var/log/messages) for any output regarding the CPUfreq subsystem. Only severe errors are reported there.

If you suspect problems with the CPUfreq subsystem on your machine, you can also enable additional debug output. To do so, either use cpufreq.debug=7 as boot parameter or execute the following command as root:

echo 7 > /sys/module/cpufreq/parameters/debug

This will cause CPUfreq to log more information to dmesg on state transitions, which is useful for diagnosis. But as this additional output of kernel messages can be rather comprehensive, use it only if you are fairly sure that a problem exists.

11.7. For More Information