Manage the PCIe module temperature
Coral products that integrate the Edge TPU over PCIe must be operated using the Coral PCIe driver. This driver handles all device communications, but it also allows you to respond to the Edge TPU temperature and configure dynamic frequency scaling (DFS) thresholds. This page describes how you can use these features to maintain an optimal operating temperature with a PCIe-based Edge TPU.
This document applies only to the following products:
- System-on-Module
- Mini PCIe Accelerator
- M.2 Accelerator A+E key
- M.2 Accelerator B+M key
- M.2 Accelerator with Dual Edge TPU
- Accelerator Module
PCIe parameters overview
The PCIe products listed above do not include a thermal solution to dissipate heat from the system. So in order to sustain maximum performance from the Edge TPU and avoid permanent damage, you must design your system so the Edge TPU always operates below the maximum operating temperature specified in the product datasheet.
To help you do so, the Coral PCIe driver includes some programmable parameters that help you manage the Edge TPU temperature in the following ways:
- Read the Edge TPU temperature and then, if necessary, activate a cooling solution (such as a fan) or load-balance your work across other Edge TPUs in the system.
- Use dynamic frequency scaling (DFS)—also known as throttling—to incrementally reduce the Edge TPU operating frequency as it heats up.
- Shut down the Edge TPU when it reaches a critical temperature (highly recommended).
To employ any combination of these strategies, you need to read or write the Coral PCIe driver parameters defined in the following tables.
Exactly how you can read and write these parameters depends on your operating system, and is explained in the following sections (see the instructions for Linux and for Windows).
Read the Edge TPU temperature
You can periodically read the Edge TPU temperature using the temp parameter, and then respond with your own strategies to cool the system or load-balance your work.
Table 1. Temperature parameter
Parameter | Description | Units |
---|---|---|
temp | The current Edge TPU junction temperature. On Linux, this is available via device-specific sysfs nodes only (not from the kernel module). On Windows, this is available via performance counters only (not from the Windows Registry). | Millidegree Celsius |
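As a rough illustration of reading the temp parameter on Linux, here is a minimal Python sketch. It assumes the sysfs node exposes the value as a plain decimal integer in a file named temp under a directory such as /sys/class/apex/apex_0. The needs_cooling helper and its 70 °C threshold are hypothetical and only show where your own cooling or load-balancing policy would go.

```python
from pathlib import Path


def read_temp_celsius(node_dir="/sys/class/apex/apex_0"):
    """Return the Edge TPU junction temperature in degrees Celsius.

    Assumes the sysfs file holds a decimal integer in millidegrees Celsius.
    """
    raw = Path(node_dir, "temp").read_text().strip()
    return int(raw) / 1000.0  # millidegrees Celsius -> degrees Celsius


def needs_cooling(temp_c, fan_on_c=70.0):
    """Hypothetical policy: ask for extra cooling at or above fan_on_c."""
    return temp_c >= fan_on_c
```

You might call read_temp_celsius() every few seconds and switch on a fan, or move work to another Edge TPU, whenever needs_cooling() returns True.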
Use dynamic frequency scaling
By default, the Coral PCIe driver runs the Edge TPU at the maximum frequency of 500 MHz. Under some circumstances, extended operation at this frequency can cause overheating. So the PCIe driver includes a power throttling mechanism (known as dynamic frequency scaling, or DFS) that's enabled by default. This system periodically checks the Edge TPU temperature, and as it reaches the "trip points" specified by parameters in table 2, it reduces the Edge TPU operating frequency in 50-percent increments.
Reducing the operating frequency slows the Edge TPU's inferencing speed, but it also lowers power consumption, which helps the Edge TPU avoid the higher temperatures at which it may shut down or become permanently damaged.
As long as the chip does not shut down and the Edge TPU returns to lower temperatures, the DFS system restores the operating frequency in the reverse manner—ultimately returning to the maximum operating frequency.
Table 2. Dynamic frequency scaling parameters
Parameter | Description | Default value | Units |
---|---|---|---|
trip_point0_temp | If the Edge TPU temperature reaches or exceeds this value, the system sets the operating frequency to "reduced" (250 MHz) | 85000 | Millidegree Celsius |
trip_point1_temp | If the Edge TPU temperature reaches or exceeds this value, the system sets the operating frequency to "low" (125 MHz) | 90000 | Millidegree Celsius |
trip_point2_temp | If the Edge TPU temperature reaches or exceeds this value, the system sets the operating frequency to "lowest" (62.5 MHz) | 95000 | Millidegree Celsius |
temp_poll_interval | The interval at which to read the temperature. Setting this to 0 disables DFS completely. This should be several seconds because the temperature reading doesn't change instantly. Yet, it also doesn't need to be much larger than the default because the overhead of switching the operating frequency is negligible, so it isn't necessary to implement hysteresis around the trip points. | 5000 | Milliseconds |
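To make the trip-point behavior concrete, the following Python sketch maps a temperature reading to the operating frequency you would expect from the defaults in table 2. It is only a model of the documented behavior, not driver code: the function name is hypothetical, and the real driver applies changes only when it polls the temperature.

```python
# Default trip_point0_temp, trip_point1_temp, trip_point2_temp (millidegrees C).
DEFAULT_TRIP_POINTS = (85000, 90000, 95000)

# Maximum, "reduced", "low", and "lowest" operating frequencies in MHz.
FREQUENCIES_MHZ = (500.0, 250.0, 125.0, 62.5)


def expected_frequency_mhz(temp_mc, trip_points=DEFAULT_TRIP_POINTS):
    """Return the frequency expected at temp_mc (millidegrees Celsius).

    Assumes trip_points is ordered from lowest to highest, so the number
    of trip points reached selects the frequency level.
    """
    level = sum(1 for tp in trip_points if temp_mc >= tp)
    return FREQUENCIES_MHZ[level]
```

For example, a reading of 92000 has reached the first two trip points, so the expected frequency is 125 MHz.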
Whatever values you set for the trip_point* parameters, they must evaluate as follows:
trip_point0_temp <= trip_point1_temp <= trip_point2_temp
If you set values that don't match this logic, the driver silently reverts to the default values in table 2.
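Because the driver reverts silently, it can help to check your values before you write them. The sketch below is a hypothetical pre-check in Python that mirrors the ordering rule above; the function names are illustrative and are not part of the driver.

```python
# Defaults for trip_point0_temp, trip_point1_temp, trip_point2_temp.
DEFAULT_TRIP_POINTS = (85000, 90000, 95000)


def trip_points_are_valid(t0, t1, t2):
    """Check the ordering rule: trip_point0 <= trip_point1 <= trip_point2."""
    return t0 <= t1 <= t2


def effective_trip_points(t0, t1, t2):
    """Model the documented fallback: invalid values revert to the defaults."""
    if trip_points_are_valid(t0, t1, t2):
        return (t0, t1, t2)
    return DEFAULT_TRIP_POINTS
```

Running the check in your own tooling lets you report a misconfiguration that the driver would otherwise hide.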
Configure the shutdown/warning temperatures
The parameters in table 3 have different behaviors depending on whether you're using the Accelerator Module (the solderable module) or one of the PCIe card modules (such as the Mini PCIe Accelerator or an M.2 Accelerator):
- Accelerator Module: You can specify temperatures at which certain pins assert to warn you that the Edge TPU has reached that temperature. You can respond in whatever way suits your system, such as enabling a fan or shutting down the module.
- PCIe card modules: You can specify the temperature at which the Edge TPU will shut down. You will not receive any warnings. If you want to manually respond to temperature changes, you can instead poll the temp parameter in table 1.
Table 3. Shutdown and warning temperature parameters
Parameter | Description (PCIe card modules) | Description (Accelerator Module) | Default value | Units |
---|---|---|---|---|
hw_temp_warn1 | Not available. | If the Edge TPU reaches or exceeds this temperature, the Edge TPU asserts the INTR line. | 100000 | Millidegree Celsius |
hw_temp_warn1_en | Not available. | Enables/disables hw_temp_warn1. | 1 | Boolean: 1 = enabled |
hw_temp_warn2 | If the Edge TPU reaches or exceeds this temperature, the Edge TPU shuts down. [1] When the Edge TPU shuts down, it enters an idle state. Generally, you must then restart your system to resume work with the Edge TPU. | If the Edge TPU reaches or exceeds this temperature, the Edge TPU asserts the SD_ALARM line. It's your responsibility to shut down the Accelerator Module. | 100000 | Millidegree Celsius |
hw_temp_warn2_en | Enables/disables hw_temp_warn2. | Enables/disables hw_temp_warn2. | 1 | Boolean: 1 = enabled |
1 This parameter is saved to a register in the Edge TPU (as are all parameters) and the shutdown mechanism is fully contained inside the PCIe card module. So even if the host system fails, the Edge TPU will safely shut down if it reaches this temperature.
Warning: Set hw_temp_warn2 to shut down the Edge TPU before it exceeds the maximum operating temperature specified in the product datasheet. Failure to do so can result in permanent damage to the Edge TPU and surrounding components, and can possibly cause fire and other serious damage, injury, or death.
Using the parameters on Linux
On Linux, you can access the Coral PCIe driver parameters with files that are accessible as either kernel module parameters or sysfs nodes:
- The kernel module parameters are located in this path:
  /sys/module/apex/parameters/
  These parameters are persistent and applied at boot time. This is useful if you have multiple modules for which you want to apply the same settings. For details about how to edit these, see how to specify kernel module parameters.
- The sysfs nodes for each module are located at paths such as this:
  /sys/class/apex/apex_0
  These sysfs nodes are created by the PCIe driver at boot time and allow you to set different settings for different PCIe modules. The file name includes a unique number for each Edge TPU connected via PCIe (such as apex_0, apex_1, apex_2, and so on).
Whether you decide to use the kernel module parameters or the individual sysfs node parameters, the files that specify each PCIe parameter are named the same as shown in tables 1, 2, and 3 (although the parameter to read the temperature is available only as a sysfs node).
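As one possible way to work with the per-device sysfs nodes, the Python sketch below lists the apex_N directories and reads or writes a parameter file inside one of them. The helper names are hypothetical, and the sketch assumes each parameter file holds a single decimal integer. Writing to sysfs typically requires root privileges.

```python
from pathlib import Path


def list_edge_tpu_nodes(class_dir="/sys/class/apex"):
    """Return the apex_N node directories, sorted by their numeric suffix."""
    nodes = [
        p for p in Path(class_dir).glob("apex_*")
        if p.is_dir() and p.name[len("apex_"):].isdigit()
    ]
    return sorted(nodes, key=lambda p: int(p.name[len("apex_"):]))


def read_param(node_dir, name):
    """Read an integer parameter such as hw_temp_warn2 from a node."""
    return int(Path(node_dir, name).read_text().strip())


def write_param(node_dir, name, value):
    """Write an integer parameter to a node (assumed plain-text format)."""
    Path(node_dir, name).write_text(f"{int(value)}\n")
```

For example, write_param(node, "hw_temp_warn2", 95000) would lower the shutdown or warning threshold for one module only, leaving the other modules at their current settings.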
Using the parameters on Windows
On Windows 10, you can access the Coral PCIe driver parameters using the Windows Registry as follows:
- Launch Registry Editor (type "regedit" from the Run window; you must be admin).
- Open the following path:
  HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\coral\Parameters
  You should see the PCIe parameters as registry keys, as shown in figure 1.
- Double-click to edit any of the parameters.
- Reboot your system to apply any changes.
However, note that the temp parameter is not available in the Windows Registry, because this parameter changes over time and is read-only. Instead, you can see the current temperature with the Windows Performance Monitor as follows:
- Launch Performance Monitor (type "perfmon" in the Run window).
- Select Performance Monitor in the left pane, and click Add in the toolbar.
- In the Add Counters dialog, select the Coral PCIe Accelerator counter, select which instances you want to view, and then click OK.
The activity chart then shows the Edge TPU temperature over time in degrees Celsius. Note, however, that the actual value below the chart is in millidegree Celsius (as indicated in table 1).
You can also get the Edge TPU temperature with the following PowerShell command:
Get-Counter -Counter '\Coral PCIE Accelerator(*)\Temperature'
Or, you can write your own tool to consume performance counter data.
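One way to build such a tool is to call the PowerShell command above from a script and parse its output. The Python sketch below is a rough example under several assumptions: that powershell is on the PATH, that the counter samples expose InstanceName and CookedValue as they do for other Get-Counter results, and that the counter value is in millidegree Celsius as described in table 1. The function names are hypothetical.

```python
import json
import subprocess

# Counter path taken from the PowerShell command in this section.
COUNTER_PATH = r"\Coral PCIE Accelerator(*)\Temperature"


def parse_counter_json(text):
    """Convert ConvertTo-Json output into {instance name: degrees Celsius}."""
    data = json.loads(text)
    if isinstance(data, dict):  # a single sample is emitted as one object
        data = [data]
    # Assumes the counter value is in millidegrees Celsius.
    return {s["InstanceName"]: s["CookedValue"] / 1000.0 for s in data}


def read_temps_celsius():
    """Query the counter through PowerShell (Windows only)."""
    script = (
        f"(Get-Counter -Counter '{COUNTER_PATH}').CounterSamples | "
        "Select-Object InstanceName, CookedValue | ConvertTo-Json"
    )
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", script],
        capture_output=True, text=True, check=True,
    )
    return parse_counter_json(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to test the parsing logic on any platform.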