How to create a Custom Metrics Collector Module to visualize a system’s health using IoT Dashboards
An Azure IoT solution often requires implementing essential portions of the solution directly on the devices using Azure IoT Edge. Edge modules are implemented as Docker-compatible containers. For the IoT solution to function properly, each edge module must function properly as well. This creates the need to monitor the performance of these edge modules, which can be understood through details like the CPU usage, memory usage, and GPU usage of individual edge modules, and the time they take to execute their fundamental functions.
We know that the most common errors are often a result of the overconsumption of CPU, memory, or GPU. When such an error happens, the container may not be able to run successfully. Here is a brief description of the parameters that we use to measure the health of the system:
- CPU Usage – Describes the percentage of CPU consumed by an edge module. CPU usage helps to check that no edge module is consuming more CPU than it requires.
- Memory Usage – Describes the percentage of memory consumed while an edge module is running. Memory usage helps to see how much memory each edge module has consumed. The bar chart will show whether a module is consuming excessive memory.
- GPU Usage – Describes the GPU utilized by all the modules together, as well as the load placed on the GPU by any individual edge module.
To facilitate the collection of performance metrics of edge modules, Microsoft supplies an IoT Edge module, known as the Azure IoT Edge Metrics Collector, that collects workload module metrics and transports them off-device. However, this module is not helpful if we want to visualize the performance of individual edge modules, as it collects module metrics from EdgeAgent and EdgeHub.
The importance of individual module metrics can be explained with the help of a scenario. Let’s say we have 9 modules that offer different functionality, and each module depends on the next one. In this kind of scenario, if ModuleA fails, all the modules that depend on ModuleA will fail as well. If the client needs to check which module has failed and when, we will require telemetry data for, say, the past 1 hour. But going through 1 hour of telemetry data for all the failed modules can be tiring. To make this easier, we have line charts like the one shown in the snippet below.
To get the data for any module, all you need to do is hover your cursor over the module’s line. This also shows the timestamp and the value of the telemetry data recorded at that time.
So, to get performance metrics for individual edge modules, we can create our own custom edge module that collects the metrics and transports them to IoT Hub/IoT Central, where they can be visualized in the form of charts and graphs in Dashboards.
Steps to create a Custom Metrics Collector Module
- Create a new IoT Edge module by following the standard steps. For reference, follow the steps up to ‘Select the target architecture’ in this tutorial: https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/iot-edge/tutorial-python-module.md
- Name the newly created IoT Edge module MetricsCollector.
- As discussed previously, edge modules are nothing but Docker containers running on devices or servers. We can get information about all the Docker containers using a simple command: ‘docker stats’. However, there is a catch: this command needs a little modification so that the metrics can be generated properly.
- The snippet of code shown above gives the list of all the Docker containers and their performance metrics. An example of the output of this snippet is shown in the screenshot. The screenshot shows the information about a single running container, in dictionary format. You can choose any of the keys in the dictionary as metrics.
- The important metrics can be collected and converted into JSON format so that it becomes easier to transfer them to IoT Central. This snippet of code demonstrates how to collect important metrics like CPU usage and memory usage from the dictionary created earlier.
- After collecting the important metrics from the dictionary, assign your desired names to the metrics that you require and form the JSON data.
- Convert the JSON data to Message format as shown in the code and send the data to the output endpoint.
- You need to update the deployment template that is being used. Add a route in the deployment template for MetricsCollector to IoTHub.
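The collection steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the original module's code. It assumes that the modification to ‘docker stats’ is to add the `--no-stream` and `--format "{{json .}}"` flags, so that the command prints one JSON object per container and then exits. The telemetry names (`moduleName`, `cpuUsage`, `memoryUsage`), the output name `output1`, and the helper functions are hypothetical. GPU usage is not reported by ‘docker stats’, so it is left out here.

```python
import json
import subprocess


def parse_docker_stats(raw_output):
    """Parse `docker stats --format "{{json .}}"` output into a list of dicts."""
    return [json.loads(line) for line in raw_output.splitlines() if line.strip()]


def collect_container_stats():
    """Run `docker stats` once (no live stream) and return one dict per container."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_stats(result.stdout)


def percent_to_float(value):
    """Convert a percentage string such as '12.34%' to the float 12.34."""
    return float(value.strip().rstrip("%"))


def build_payload(stats):
    """Pick the metrics of interest and rename them (names here are hypothetical)."""
    return json.dumps({
        "moduleName": stats["Name"],
        "cpuUsage": percent_to_float(stats["CPUPerc"]),
        "memoryUsage": percent_to_float(stats["MemPerc"]),
    })


def send_metrics(output_name="output1"):
    """Send one message per container to the module output (needs azure-iot-device)."""
    # Imported here so the parsing helpers above work without the SDK installed.
    from azure.iot.device import IoTHubModuleClient, Message

    client = IoTHubModuleClient.create_from_edge_environment()
    try:
        for stats in collect_container_stats():
            message = Message(build_payload(stats))
            message.content_type = "application/json"
            message.content_encoding = "utf-8"
            client.send_message_to_output(message, output_name)
    finally:
        client.shutdown()
```

Running this inside a module container assumes that the Docker CLI is available in the container and that the host's Docker socket is accessible to it. In a real module, `send_metrics` would typically be called in a loop at a fixed interval.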
The route will look like this:
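As a hedged sketch, a route of this kind would typically read `FROM /messages/modules/MetricsCollector/outputs/* INTO $upstream`, which forwards every output of the MetricsCollector module to IoT Hub. The small Python helper below adds such a route to a deployment template loaded as a dictionary. The route name `MetricsCollectorToIoTHub` and the helper `add_metrics_route` are hypothetical; only the route string follows the standard IoT Edge route syntax.

```python
import json

# Route name is a hypothetical label; the value uses IoT Edge route syntax.
ROUTE_NAME = "MetricsCollectorToIoTHub"
ROUTE_VALUE = "FROM /messages/modules/MetricsCollector/outputs/* INTO $upstream"


def add_metrics_route(template):
    """Add the MetricsCollector route to a deployment template dict, in place."""
    edge_hub = template["modulesContent"]["$edgeHub"]["properties.desired"]
    edge_hub.setdefault("routes", {})[ROUTE_NAME] = ROUTE_VALUE
    return template


if __name__ == "__main__":
    # Minimal stand-in for a real deployment.template.json.
    template = {"modulesContent": {"$edgeHub": {"properties.desired": {"routes": {}}}}}
    print(json.dumps(add_metrics_route(template), indent=2))
```

The same entry can also be added by hand under the `routes` section of `$edgeHub` in the deployment template.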
To create charts in the IoT Central Dashboards
1. Click on Dashboards as shown in the screenshot. You may see a few graphs that have already been created. Click on the ‘Edit’ option as shown.
2. After clicking on Edit, there are two things that can be done:
- Creating a new graph.
- Making changes to an already existing graph.
3. To make changes to any graph, click on the graph, then click on the pencil-shaped icon to begin editing it, as shown in the screenshot.
4. As can be seen, the following information needs to be provided to create a graph:
- Title – The name of the graph.
- Device group – It is the device template that is being used.
- Device – The name of the device to which the device template is assigned.
- Telemetry – It consists of the telemetry information that is being sent from the MetricsCollector module to IoT Central. You can add telemetry information by clicking on Capability.
- For telemetry data, a list of the available telemetry will come up. You can add the ones that you require.
5. Finally, click on the ‘Update’ button to save all the changes you have made, and then click on the ‘Save’ button at the top left corner.
6. After creating the graphs, the dashboard will look something like this.
Graphs and charts are a great way to understand the working state of the edge modules or the machine. They are very useful for the client, who can get information about an edge module without running many commands or even logging in to the machine. The client simply needs to observe the graphs and understand the trend they have followed over the last 30 minutes or 1 hour.
This information is beneficial to clients because:
- The information about the modules shows the client the health of the modules. If any module runs into any kind of error, it is easily visible to the client in the dashboards.
- If an error is specific to a particular module, the client can check the logs of that module directly from the telemetry data that the module sends to the dashboards. If there is a bug that needs to be debugged, the client can notify the technical team about it.
Example: The client gets a clear idea of the time taken by each module to process any given number of images. Suppose one module has taken an average of 3 seconds to process 5 images over the last 10 runs. If all of a sudden the module takes more than 3-5 seconds to process the same 5 images, then the client will know that there is some problem with that module.
- And finally, if we want to alert the client about the overconsumption of resources like memory, CPU, and GPU, we can easily send a notification to the client.
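The checks described in the points above can be sketched as simple threshold comparisons. This is an illustrative sketch only: the limit values, the `slowdown_factor`, the metric names, and the function names are all assumptions, and how the resulting alert is delivered to the client is left open.

```python
from statistics import mean

# Hypothetical limits, in percent; tune them for the actual device.
RESOURCE_LIMITS = {"cpuUsage": 80.0, "memoryUsage": 75.0, "gpuUsage": 90.0}


def resource_alerts(module_name, metrics, limits=RESOURCE_LIMITS):
    """Return one alert string per metric that exceeds its limit."""
    return [
        f"{module_name}: {name} at {value:.1f}% exceeds {limits[name]:.1f}%"
        for name, value in metrics.items()
        if name in limits and value > limits[name]
    ]


def is_processing_slow(recent_runs, latest, slowdown_factor=1.5):
    """Flag a run that is much slower than the average of recent runs."""
    return bool(recent_runs) and latest > slowdown_factor * mean(recent_runs)


if __name__ == "__main__":
    print(resource_alerts("ModuleA", {"cpuUsage": 91.0, "memoryUsage": 40.0}))
    # Ten earlier runs averaging 3 seconds, followed by a slower run.
    print(is_processing_slow([3.0] * 10, 5.2))
```

Comparing the latest run against the recent average, rather than against a fixed number of seconds, lets the same check work for modules with very different normal processing times.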
IoT Dashboards provide the client with a better visualization of their resources – Memory, CPU and GPU. The customized Metrics Collector module described in the blog caters to the additional need of having information about how the resources are being consumed by the individual modules as well as the health of the Edge Machine.
- GitHub: https://github.com/Azure-Samples/iotedge-module-prom-custom-metrics
- Collect and transport metrics – IoT Edge
- Tutorial: In-store analytics customize dashboard
- How to manage dashboards – IoT Central