Heavy computational workloads, such as ML inference, need careful profiling and optimization. Doing this well requires both the right approach and the right tools.
This tutorial demonstrates how to use Arm Streamline Performance Analyzer to profile an ML-based Android application.
Streamline lets you profile programs running on Arm-based mobile devices. It exposes both CPU and GPU counters. The GPU counters are especially useful for GPU ML inference and for applications that combine ML with graphics. The most important metrics we can measure with Streamline are:
- CPU usage (including activity for each core);
- GPU usage (including utilization for fragment and non-fragment queues); and
- GPU memory bandwidth
Streamline is included in Arm Mobile Studio, which can be downloaded from our developer portal.
In a previous blog post, we described an AR filter project. Let’s take a look at how Streamline helped us analyze that app’s performance.
In Streamline, we can see the list of devices connected through adb.
After the device is selected, we can see the Android packages available for profiling.
The package must be ‘debuggable’ to be profiled with Streamline.
Note for Unity developers: to make the application package ‘debuggable’, enable the “Development Build” option in the Build Settings.
Finally, we press “Configure Counters” to select the counters we want in the capture, and then start the capture itself (the application is launched automatically).
In our case, the application uses GPU ML inference, so we are interested in GPU counters. The easiest way to configure them is to select an existing template (in this example, the one for the Arm Mali-G78 GPU):
Once the capture is finished, we can select a region using callipers and see the data specific to this region. In the image below, the selected area corresponds to a single frame in our AR application.
The values for each counter (such as CPU Activity, Mali GPU Usage, Mali Memory Bandwidth) are calculated for the range we have selected. They are displayed to the left of the selection.
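To make the idea of a per-range value concrete, here is a minimal sketch that aggregates counter samples inside a calliper range. It assumes a plain arithmetic mean; Streamline's own aggregation may differ by counter type. `CounterSample` and `averageInRange` are hypothetical names used only for illustration, not part of Streamline.

```cpp
#include <vector>

// Hypothetical sample type: one counter reading taken at a timestamp (ms).
struct CounterSample {
    double timeMs;
    double value;
};

// Mean of all samples whose timestamp lies inside [startMs, endMs].
// Returns 0.0 when the range contains no samples.
// Assumption: a plain mean; the real tool may aggregate differently.
double averageInRange(const std::vector<CounterSample>& samples,
                      double startMs, double endMs) {
    double sum = 0.0;
    int count = 0;
    for (const CounterSample& s : samples) {
        if (s.timeMs >= startMs && s.timeMs <= endMs) {
            sum += s.value;
            ++count;
        }
    }
    return count > 0 ? sum / count : 0.0;
}
```

Narrowing the range to a single frame, as we do in the capture, gives a per-frame figure rather than one smeared across the whole run.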
In the AR Filter app, we execute three neural network models for each frame:
- Background Segmentation
- Face Detection
- Face Landmarks Recognition
Note how periods of high non-fragment queue activity correspond to these three networks.
Most ML frameworks use compute shaders or OpenCL kernels for GPU inference. On Mali, this kind of GPU workload is scheduled to the non-fragment queue. In combined graphics and ML applications, this is how we can distinguish inference from graphics rendering, which relies on both the fragment and non-fragment queues. Graphics workloads also show fragment shader activity straight after the vertex (non-fragment) stage.
Using Streamline with ArmNN
In our AR project, we used ArmNN for neural network inference. It provides good performance on Arm-based mobile devices, and we can get additional insight by using ArmNN and Streamline together. If the ArmNN runtime is configured for profiling, each neural network execution, and even each individual layer, appears on the timeline.
ArmNN must be built with the -DPROFILING_BACKEND_STREAMLINE=1 flag to support this functionality. You also need to enable it in the code when initializing the ArmNN runtime:
    // Enable profiling and the timeline before the runtime is created.
    armnn::IRuntime::CreationOptions options;
    options.m_ProfilingOptions.m_EnableProfiling = true;
    options.m_ProfilingOptions.m_TimelineEnabled = true;
    auto runtime = armnn::IRuntime::Create(options);
The capture is configured and recorded as usual. Afterwards, we can select “Arm NN timeline”:
In the following capture, you can see how the three models are executed one after another in each frame. The selected range represents the first model (Background Segmentation). Note how it matches high non-fragment queue usage on the Mali GPU.
We can expand each of the models and look at individual layer executions and links between them.
As you can see, the first model is the real bottleneck in our pipeline. We optimized the model by reducing the number of filters in the first decoder layer from 512 to 128.
The overall model execution time decreased from 37ms to 20ms:
The duration of the layer called “DepthwiseConv2D:0:17” also decreased, from 71 microseconds to 51 microseconds:
Another useful feature in Streamline is annotations. You can mark specific parts of your code and see them in the Streamline capture. For example, we can mark the start and end of each neural network execution:
    // Open a purple "NN Inference" annotation, run the network, then close it.
    ANNOTATE_COLOR(ANNOTATE_PURPLE, "NN Inference");
    runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
    ANNOTATE_END();
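If the annotated code can return early or throw, it is easy to miss the closing call. One possible way around this is a small RAII wrapper, sketched below. `ScopedAnnotation` and `runInference` are hypothetical helpers, not part of Streamline or ArmNN. The sketch uses the real macros when `streamline_annotate.h` is on the include path and otherwise falls back to no-op stubs so it still compiles. It also assumes any one-time annotation setup the header requires has been done at startup.

```cpp
// Use the real Streamline annotation macros when the header is available;
// otherwise fall back to no-op stubs so this sketch still compiles.
#if defined(__has_include)
#  if __has_include("streamline_annotate.h")
#    include "streamline_annotate.h"
#    define HAVE_STREAMLINE_ANNOTATE 1
#  endif
#endif

#ifndef HAVE_STREAMLINE_ANNOTATE
#  define ANNOTATE_PURPLE 0
#  define ANNOTATE_COLOR(color, str) ((void)(color), (void)(str))
#  define ANNOTATE_END() ((void)0)
#endif

// Hypothetical RAII helper: opens an annotation on construction and
// closes it on destruction, whichever way the scope is left.
class ScopedAnnotation {
public:
    explicit ScopedAnnotation(const char* label) {
        ANNOTATE_COLOR(ANNOTATE_PURPLE, label);
        ++openCounter();
    }
    ~ScopedAnnotation() {
        ANNOTATE_END();
        --openCounter();
    }
    ScopedAnnotation(const ScopedAnnotation&) = delete;
    ScopedAnnotation& operator=(const ScopedAnnotation&) = delete;

    // Number of annotations currently open; present only so the
    // behaviour of the sketch can be checked.
    static int openCount() { return openCounter(); }

private:
    static int& openCounter() {
        static int count = 0;
        return count;
    }
};

// Hypothetical wrapper around one inference call.
int runInference(bool inputReady) {
    ScopedAnnotation scope("NN Inference");
    if (!inputReady) {
        return -1;  // early exit: the destructor still emits ANNOTATE_END()
    }
    // runtime->EnqueueWorkload(networkId, inputTensors, outputTensors);
    return 0;
}
```

The trade-off is one extra type in exchange for annotations that cannot be left open by accident, which keeps the ranges in the capture aligned with the actual executions.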
If we expand the main thread entry in the Heat Map, we can see each execution on the timeline and select the corresponding range using callipers:
We have covered the process of configuring Streamline and getting a profiling capture of an Android application.
Profiling ML-based applications using Streamline allows you to:
- Find bottlenecks in ML workloads
- See if there are any gaps between neural network executions that can be reduced
- Check if CPUs or GPUs are being utilized efficiently
- See how much a certain model optimization has helped to reduce inference time or memory bandwidth
This will help you to find ways to optimize your application and get better performance.
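As an illustration of the second point in the list above, the gaps between neural network executions can also be computed from start and end times read off the timeline. This is a minimal sketch with hypothetical names (`Interval`, `idleGaps`). It assumes the executions are sorted by start time and do not overlap.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record of one network execution on the timeline (ms).
struct Interval {
    double startMs;
    double endMs;
};

// Idle time between each pair of consecutive executions.
// Assumes `runs` is sorted by start time and intervals do not overlap.
std::vector<double> idleGaps(const std::vector<Interval>& runs) {
    std::vector<double> gaps;
    for (std::size_t i = 1; i < runs.size(); ++i) {
        gaps.push_back(runs[i].startMs - runs[i - 1].endMs);
    }
    return gaps;
}
```

Large values in the result point at places where the GPU sits idle between models, which is often where scheduling or pre-processing work can be overlapped.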