What is edge AI inference doing for more devices?

Jeffrey Grosman

AI inference is a common term, but what is edge AI inference? EdgeCortix provides an answer in terms of workloads, efficiency, and applications.

Artificial intelligence (AI) is changing the rules for many applications. Teams train AI models to recognize objects or patterns, then run AI inference using those models against incoming data streams. When size, weight, power, and time are of little concern, data center or cloud-based AI inference may do. But in resource-constrained edge devices, different technology is needed. What is edge AI inference doing for more devices? Let’s look at differences in AI inference for the edge and how intellectual property (IP) addresses them.

Mismatched workloads highlight CPU and GPU inefficiency

Most AI inference relies on some form of neural network architecture. At a primitive level, neural networks are multiply-accumulate schemes, applying sets of weighting coefficients to data in a highly parallel structure organized in several layers.
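To make the multiply-accumulate idea concrete, here is a minimal sketch of one fully connected layer written as explicit loops. The function names, the ReLU activation, and the tiny weight values are illustrative assumptions, not anything taken from EdgeCortix IP.

```python
def dense_layer(inputs, weights, biases):
    """Compute one fully connected layer with explicit multiply-accumulate steps."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        acc = bias
        for x, w in zip(inputs, neuron_weights):
            acc += x * w  # one multiply-accumulate (MAC) operation
        outputs.append(max(0.0, acc))  # ReLU activation (an assumed choice)
    return outputs


def mac_count(n_inputs, n_outputs):
    """Number of MACs in one fully connected layer."""
    return n_inputs * n_outputs


# Illustrative values only: 3 inputs feeding 2 neurons.
inputs = [1.0, 2.0, 3.0]
weights = [[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]]
biases = [1.0, -2.0]

result = dense_layer(inputs, weights, biases)
print(result, "using", mac_count(len(inputs), len(weights)), "MACs")
```

Every neuron's inner loop is independent of the others, which is why this workload rewards hardware with many parallel MAC units.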

Anything beyond a trivial neural network model exposes a fundamental mismatch between the AI inference workload and traditional processor cores, memory, and interconnect. General-purpose CPUs are a poor fit: they offer too little parallel execution and carry overhead for operations that AI inference doesn’t need.

GPUs provide a better fit, with many small cores and configurable interconnects, but still have two significant efficiency problems for edge computing applications. First, high-performance GPUs are resource hungry, relying on ample AC power and forced air cooling, both often unavailable in edge platforms. Second, GPU hardware utilization in AI inference tasks is low, typically in the 30 to 40% range. It’s like throwing away more than half of the available operations.

With complex interactions between execution units, memory, and interconnects, operations expressed in tera or peta operations per second (TOPS or POPS) say little about AI inference efficiency. Scaling up GPUs for more operations may cover inference performance shortfalls but chews into already scarce resources.
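A rough sketch of why peak TOPS can mislead: useful throughput is approximately the peak figure multiplied by hardware utilization. The 100 TOPS peak and the helper names below are hypothetical, chosen only to illustrate the arithmetic.

```python
def effective_tops(peak_tops, utilization):
    """Return the share of peak TOPS doing useful inference work."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_tops * utilization


def peak_tops_needed(target_effective_tops, utilization):
    """Return the peak TOPS required to reach a target useful throughput."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be greater than 0 and at most 1")
    return target_effective_tops / utilization


HYPOTHETICAL_PEAK_TOPS = 100.0  # illustrative figure, not a real device spec

for utilization in (0.30, 0.40, 0.80):
    useful = effective_tops(HYPOTHETICAL_PEAK_TOPS, utilization)
    print(f"{utilization:.0%} utilization -> {useful:.0f} effective TOPS")
```

Under these assumptions, reaching a given useful throughput at 40% utilization takes twice the peak hardware that it would at 80%, and that extra hardware still consumes space and power.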

A better gauge of AI inference efficiency would capture throughput and energy consumption in a single metric. Inferences per second per watt (IPS/W) normalizes comparisons and shows how AI inference IP truly scales. In one edge AI inference scenario, EdgeCortix has demonstrated efficiency gains of as much as 16x IPS/W over GPU-based configurations.
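The IPS/W metric is simple to compute from a benchmark run. The sketch below assumes you have measured the number of inferences, the elapsed time, and the average power draw. Both platforms and all numbers are hypothetical, not EdgeCortix or GPU measurements.

```python
def ips_per_watt(inferences, elapsed_seconds, average_power_watts):
    """Return inferences per second per watt for one benchmark run."""
    if elapsed_seconds <= 0 or average_power_watts <= 0:
        raise ValueError("elapsed time and power must be positive")
    inferences_per_second = inferences / elapsed_seconds
    return inferences_per_second / average_power_watts


# Hypothetical runs: platform A is faster, platform B draws far less power.
platform_a = ips_per_watt(inferences=1000, elapsed_seconds=2.0, average_power_watts=50.0)
platform_b = ips_per_watt(inferences=600, elapsed_seconds=2.0, average_power_watts=10.0)

print(f"Platform A: {platform_a:.1f} IPS/W")
print(f"Platform B: {platform_b:.1f} IPS/W")
print(f"B is {platform_b / platform_a:.1f}x more efficient than A")
```

In this made-up comparison, the platform with lower raw throughput comes out ahead once power is taken into account, which is the kind of difference a TOPS figure alone would hide.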


Back to the drawing board for neural network IP for AI

How are these order-of-magnitude efficiency gains achieved? Efficiency is the outcome of dealing with three architectural parameters in conceiving neural network inference IP.

  • Low latency: Unlike most cloud-based applications, latency is usually a primary concern at the edge. Edge AI inference has a fixed window, bound by sample or frame rates, for performing recognition against a real-time data stream. Miss that window, and the device risks falling behind and missing changes. Inference decisions may also feed control algorithms that have a deterministic window with limited time to respond and stay in control.
  • High utilization: Streamlining hardware and pulling out unnecessary pieces is one step toward improved efficiency. Another is finding a way to keep execution units busy. Scaling execution units for a maximum workload may result in idle units still consuming resources like space and power in less loaded cases. A more efficient approach right-sizes execution units to stay at work under any workload with any AI inference model.
  • Run-time reconfigurability: Optimizing hardware execution units and interconnects for particular AI inference models often results in inefficiency when models change. Instead, a co-design approach can look at the AI inference models in the application, compile them with knowledge of resources and utilization in the IP, and reconfigure hardware accordingly at run-time.
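The latency requirement in the list above can be sketched as a per-frame deadline check: at a given frame rate, each inference must finish within one frame period. The 30 frames-per-second rate, the function names, and the latency samples are assumptions for illustration.

```python
def frame_budget_ms(frames_per_second):
    """Return the time available per frame, in milliseconds."""
    if frames_per_second <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / frames_per_second


def missed_deadlines(latencies_ms, frames_per_second):
    """Count inferences that took longer than one frame period."""
    budget = frame_budget_ms(frames_per_second)
    return sum(1 for latency in latencies_ms if latency > budget)


# Hypothetical per-frame inference latencies in milliseconds.
latencies_ms = [20.0, 31.5, 35.0, 40.2]
fps = 30

print(f"Budget per frame at {fps} fps: {frame_budget_ms(fps):.2f} ms")
print(f"Missed deadlines: {missed_deadlines(latencies_ms, fps)} of {len(latencies_ms)}")
```

A real-time system cares about the worst case rather than the average, so a check like this would normally run over a long trace of measured latencies.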

By accounting for all these parameters, neural network IP can go farther at the edge. Less power consumption translates to better battery life for more range and extended use. Determinism paves the way for real-time applications. And, by packing more inferences per second per watt into a given space, an edge device can take on more complex AI models and deliver features not possible with less efficient approaches.

Edge AI inference in more form factors for more applications

With more efficient, deterministic neural network IP in place, we can now return to the question: what is edge AI inference doing in more devices? Loosely defined, edge computing places more processing power close to where work happens. How much work is involved and what SWaP – size, weight, and power – is available helps determine the form factor of choice.

Scalability and run-time reconfigurability enable EdgeCortix’s neural network IP to assume various forms, ranging from high-end microcontrollers through system-on-chip designs to FPGA accelerator cards. Two components make up the EdgeCortix solution. MERA is the compiler and software framework, and the Dynamic Neural Accelerator IP (DNA IP) is the run-time reconfigurable AI processing core. EdgeCortix implements those two components into the SAKURA SoC, an edge AI chip ready for device use.

With the help of ecosystem partners, many implementations of edge AI inference are possible. An FPGA accelerator card can host a flexible implementation when more space and power are available, such as in a smart manufacturing or smart city application. A SAKURA SoC can deliver inference in a smaller package ready for a custom board design if size and weight are concerns, as in defense, 5G telecommunications, or robotics and drone applications. More customization is also an option, such as custom SoCs designed for automotive sensing applications.

EdgeCortix edge AI inference products and technology include hardware, IP, and software in one workflow for AI developers

One more advantage: using EdgeCortix technology, AI model experts don’t need to understand the details of hardware implementations to get efficient, high-performance edge AI inference. Researchers who have worked only with GPU-based implementations, and who assume that size locks them out of many edge devices without extensive redesign, will be pleasantly surprised.

Getting started with EdgeCortix technology

For those on less efficient AI inference platforms, it’s easy to get started with EdgeCortix technology. MERA is downloadable from a GitHub repository. Ready-to-run PCIe cards are available, one from EdgeCortix with a SAKURA SoC and one from BittWare, an Inference Pack with a bitstream loaded on an Intel Agilex FPGA.

What is edge AI inference doing for more devices? It’s time to find out more.
