What is TinyML?
Data transmission tends to dominate the energy consumption of an Internet of Things (IoT) device, yet a significant portion of the data transmitted is never used. The reality is that the vast majority of data is simply not interesting, but we have to keep monitoring so we do not miss the moments when something significant does happen. If IoT systems were smart enough to transmit only the interesting data, they could last far longer on battery power and would be less likely to flood the network with uninteresting traffic. Enter ‘TinyML’, which seeks to deploy Machine Learning (ML) algorithms on ultra-low-power systems so that we can intelligently select which data to transmit, improving energy efficiency.
Challenges of TinyML
Deploying ML inference tasks on TinyML devices comes with a unique set of challenges. Arm Cortex-M class microcontrollers (MCUs) often have severely limited Flash and static random-access memory (SRAM) in which to store model weights and activations. Even size-efficient models designed for mobile devices, like MobileNets, require an order of magnitude more memory than is typically available on MCUs. Furthermore, IoT applications often have to keep pace with the rate at which data is collected, despite MCUs having relatively modest computational resources. Finally, the faster we run the model, the more time we can spend in ‘sleep mode’, where we shut down the MCU to preserve energy and extend battery life. To perform ML effectively on MCUs, we need to design neural network architectures that fit the constraints of SRAM memory, Flash memory, and latency. MicroNets, our recent work published at MLSys 2021, tackles this challenge via neural architecture search.
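To make the memory gap concrete, here is a back-of-the-envelope comparison. The MobileNetV1 parameter count and the MCU memory sizes are approximate published figures, and the arithmetic is illustrative rather than a precise deployment calculation:

```python
# MobileNetV1 (1.0, 224) has roughly 4.2 M parameters. Even quantized
# to int8 (1 byte per weight), the weights alone are ~4.2 MB, while a
# typical Cortex-M7 part such as the STM32F746 offers 1 MB of Flash
# and 320 KB of SRAM.
mobilenet_params = 4_200_000
fp32_bytes = mobilenet_params * 4   # ~16.8 MB as float32
int8_bytes = mobilenet_params * 1   # ~4.2 MB after int8 quantization
mcu_flash_bytes = 1 * 1024 * 1024   # 1 MB of Flash

# Even after quantization, the weights are several times too large
# to fit in Flash, before counting code or quantization parameters.
flash_shortfall = int8_bytes / mcu_flash_bytes
```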
Differentiable neural architecture search
We use differentiable neural architecture search (DNAS) to discover accurate neural network architectures while satisfying SRAM, Flash, and latency constraints. DNAS uses gradient descent to rapidly produce optimized models, but prefers constraints to be closed form functions. To accurately model our constraints, we performed an in-depth characterization of neural network performance on MCUs.
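The key property DNAS relies on is that a softmax over architecture logits makes the expected cost of each choice differentiable, so a resource penalty (here, op count) can be minimized by gradient descent alongside the task loss. The following is a deliberately tiny sketch of that idea, not the MicroNets implementation; the candidate op counts and learning rate are made up, and the task-loss term is omitted to keep the gradient explicit:

```python
import math

# Each "layer" picks among candidate widths; a softmax over
# architecture logits gives a differentiable *expected* op count.
CANDIDATE_OPS = [1_000, 4_000, 16_000]  # hypothetical op counts per choice

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def expected_ops(logits):
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, CANDIDATE_OPS))

def grad_expected_ops(logits):
    # d/dlogit_i of sum_j p_j * c_j  =  p_i * (c_i - expected_ops)
    probs = softmax(logits)
    exp = sum(p * c for p, c in zip(probs, CANDIDATE_OPS))
    return [p * (c - exp) for p, c in zip(probs, CANDIDATE_OPS)]

# Gradient descent on the op-count penalty alone: the search is
# steered smoothly toward the cheapest candidate.
logits = [0.0, 0.0, 0.0]
lr = 1e-4
for _ in range(2000):
    g = grad_expected_ops(logits)
    logits = [l - lr * gi for l, gi in zip(logits, g)]

probs = softmax(logits)  # concentrates on the cheapest choice
```

In the real search, this penalty is added to the training loss, so accuracy and resource cost are traded off in a single gradient-based optimization.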
To characterize the performance of TinyML models, we tested hundreds of models and related their on-device metrics to their architecture. We deployed the neural networks to an STM32F746ZG MCU using the TensorFlow Lite for Microcontrollers (TFLite Micro) inference framework.
SRAM and Flash memory
We determined that the model's SRAM consumption is dominated by the buffer for intermediate tensors used during neural network execution, with some additional capacity consumed by persistent buffers that must be maintained between inferences. Both factors can be determined directly from the model's architecture. The SRAM overhead of the TFLite Micro framework and the bare-metal OS beneath it is typically minimal.
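A simplified sketch of this accounting for a sequential int8 model follows. Peak usage is roughly the largest input-plus-output activation pair, since the tensor arena can be reused across layers; the layer shapes and the persistent-buffer figure below are hypothetical, not measured values:

```python
def tensor_bytes(shape, bytes_per_elem=1):  # int8 activations: 1 byte each
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem

def peak_activation_bytes(layer_io_shapes):
    """layer_io_shapes: list of (input_shape, output_shape) tuples.

    For a sequential model, only one layer's input and output need to
    be live at once, so the peak is the largest such pair.
    """
    return max(tensor_bytes(i) + tensor_bytes(o) for i, o in layer_io_shapes)

# Hypothetical keyword-spotting-sized model (H, W, C shapes):
layers = [
    ((49, 10, 1),  (25, 5, 64)),   # conv, stride 2
    ((25, 5, 64),  (25, 5, 64)),   # depthwise conv block
    ((25, 5, 64),  (12,)),         # pooled features -> class logits
]
persistent_bytes = 2 * 1024        # assumed persistent/bookkeeping buffers

sram_estimate = peak_activation_bytes(layers) + persistent_bytes
```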
The Flash consumption is a similar story: the weights and biases of the model account for, on average, the majority of the Flash consumption, with a significant portion of the remainder coming from the quantization parameters. Both can be determined from the model architecture. TFLite Micro and the system code are, again, overheads that can be accounted for statically.
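A rough Flash estimate can be built the same way: int8 weight bytes, plus per-output-channel quantization parameters (a float32 scale and an int32 zero point per channel is one common layout), plus a fixed code overhead. The layer shapes and the code-size figure here are illustrative assumptions:

```python
def conv_flash_bytes(kh, kw, cin, cout):
    """Approximate Flash footprint of one int8 conv layer."""
    weights = kh * kw * cin * cout          # int8 weights, 1 byte each
    biases  = 4 * cout                      # int32 bias per output channel
    quant   = (4 + 4) * cout                # fp32 scale + int32 zero point
    return weights + biases + quant

# Hypothetical 3-layer model: (kernel_h, kernel_w, in_ch, out_ch)
model_layers = [(3, 3, 1, 64), (3, 3, 64, 64), (1, 1, 64, 12)]
weights_flash = sum(conv_flash_bytes(*l) for l in model_layers)
code_overhead = 40 * 1024                   # assumed framework + app code

flash_estimate = weights_flash + code_overhead
```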
Figure 1: Breakdown of SRAM and Flash memory usage of an example Keyword Spotting model.
We observed an interesting relationship between operation (Op) count and latency. Models with significantly different architectures can differ in Op throughput, which means measured on-device latency is the only way to accurately compare two distinct models. However, models sampled from the same backbone have very similar Op throughputs. Since DNAS already samples from a single backbone, we can use model Op count as a proxy for latency. Op count is simple to calculate from the model architecture, so, as with SRAM and Flash usage, we can predict latency during the architecture search.
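Because models from one backbone share a similar Op throughput, the latency predictor reduces to dividing op count by that throughput. A minimal sketch, where the throughput number is hypothetical (in practice it would be fit from on-device measurements of a few sampled models):

```python
# Assumed throughput for one (backbone, MCU) pair; in practice this is
# calibrated by timing a handful of models from the backbone on-device.
BACKBONE_OPS_PER_SEC = 50e6

def predict_latency_ms(op_count, ops_per_sec=BACKBONE_OPS_PER_SEC):
    """Predict latency in milliseconds from a model's op count."""
    return 1e3 * op_count / ops_per_sec

# e.g. a 10 M-op model from this backbone would take about 200 ms.
latency_ms = predict_latency_ms(10_000_000)
```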
Figure 2: The latency of models sampled from two different backbones, measured on two MCUs.
MLPerf Tiny use cases
Visual wake words is a binary image classification dataset where the task is to determine if a person is in the image or not. An example of this use case would be a smart doorbell, where the camera can notify you when a person is at the door.
Keyword spotting is widely used in commercial IoT devices, with households around the world saying “Alexa” or “OK Google” to wake their devices multiple times a day. The dataset we used has 12 labels: 10 target words, plus silence and unknown categories.
Anomaly detection is an unsupervised audio task where the goal is to identify faults or abnormal behavior in a machine through the sounds it produces. This use case has wide applicability in industrial settings, where an inexpensive microcontroller can be used to automatically detect faults in machinery for predictive maintenance.
Using DNAS, we produced a set of MicroNets that achieve state-of-the-art performance in all metrics on all three target use cases. Our models trade off accuracy and size to best meet the requirements of a given application, and all of them achieve real-time performance on their target MCU. The results are shown below, compared against a number of baseline models.
The models are open source and available here.
Figure 3: Keyword Spotting MicroNets Results
Figure 4: Visual Wake Words MicroNets Results. TFLM refers to the stock example model provided by TensorFlow Lite for Microcontrollers.
Figure 5: Anomaly Detection MicroNets Results
TinyML has the potential to revolutionize IoT and democratize AI, but the hardware constraints of microcontrollers make it difficult to deploy accurate models. The Arm ML Research Lab has been working on this topic for a number of years, developing compact and accurate models that run efficiently on MCUs and enabling distributed learning. If you’re interested in these topics, please get in touch!
MicroNets demonstrates that differentiable neural architecture search can be used to rapidly find state-of-the-art models that can be easily deployed to commodity MCUs. This technique can be extended to other TinyML use cases and enable more efficient deployment of IoT applications. We plan to conduct future research on efficient TinyML model and system design.
Want to reference the work?
“MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers”, C. Banbury, C. Zhou, I. Fedorov, R. M. Navarro, U. Thakker, D. Gope, V. J. Reddi, M. Mattina, P. Whatmough, MLSys, 2021.
References

1. Bouguera, Taoufik, et al. “Energy Consumption Model for Sensor Nodes Based on LoRa and LoRaWAN.” Sensors, vol. 18, no. 7, 2104, 30 Jun. 2018. doi:10.3390/s18072104
2. McKinsey Global Institute. “The Internet of Things: Mapping the Value Beyond the Hype.” mckinsey.com
3. Howard, Andrew G., et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint arXiv:1704.04861, 2017.
4. Banbury, Colby, et al. “MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers.” Proceedings of Machine Learning and Systems 3, 2021.
5. Chowdhery, A., Warden, P., Shlens, J., Howard, A., and Rhodes, R. “Visual Wake Words Dataset.” arXiv preprint arXiv:1906.05721, 2019.
6. Warden, P. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” arXiv preprint arXiv:1804.03209, 2018.
7. Purohit, H., Tanabe, R., Ichige, K., Endo, T., Nikaido, Y., Suefusa, K., and Kawaguchi, Y. “MIMII Dataset: Sound Dataset for Malfunctioning Industrial Machine Investigation and Inspection.” arXiv preprint arXiv:1909.09347, 2019.
8. Fedorov, I., Adams, R. P., Mattina, M., and Whatmough, P. N. “SpArSe: Sparse Architecture Search for CNNs on Resource Constrained Microcontrollers.” Advances in Neural Information Processing Systems (NeurIPS), 2019.
9. Fedorov, I., Stamenovic, M., Jensen, C., Yang, L.-C., Mandell, A., Gan, Y., Mattina, M., and Whatmough, P. N. “TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids.” InterSpeech, 2020.
10. Thakker, U., Whatmough, P. N., Liu, Z., Mattina, M., and Beu, J. “Doping: A Technique for Extreme Compression of LSTM Models using Sparse Structured Additive Matrices.” Machine Learning and Systems (MLSys), 2021.
11. Acar, D. A. E., Zhao, Y., Navarro, R. M., Mattina, M., Whatmough, P. N., and Saligrama, V. “Federated Learning Based on Dynamic Regularization.” International Conference on Learning Representations (ICLR), 2021.