Artificial intelligence (AI) is changing the rules for many applications. Teams train AI models to recognize objects or patterns, then run AI inference using those models against incoming data streams. When size, weight, power, and time are of little concern, data center or cloud-based AI inference may do. But in resource-constrained edge devices, different technology is needed. What is edge AI inference doing for more devices? Let’s look at differences in AI inference for the edge and how intellectual property (IP) addresses them.
Most AI inference relies on some form of neural network architecture. At a primitive level, neural networks are multiply-accumulate schemes, applying sets of weighting coefficients to data in a highly parallel structure organized in several layers.
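As a concrete (and deliberately tiny) illustration of that multiply-accumulate structure, the NumPy sketch below pushes one sample through a two-layer fully connected network. The layer sizes and ReLU activation are arbitrary choices for the example, not a reference to any particular model.

```python
import numpy as np

def dense_layer(x, weights, bias):
    # Each output is a multiply-accumulate: inputs times weights, summed, plus a bias.
    return x @ weights + bias

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)

# Illustrative sizes: 128 inputs -> 64 hidden units -> 10 outputs.
w1, b1 = rng.standard_normal((128, 64)), np.zeros(64)
w2, b2 = rng.standard_normal((64, 10)), np.zeros(10)

x = rng.standard_normal(128)           # one incoming data sample
hidden = relu(dense_layer(x, w1, b1))  # layer 1: 128 x 64 multiply-accumulates
output = dense_layer(hidden, w2, b2)   # layer 2: 64 x 10 multiply-accumulates
print(output.shape)                    # (10,)
```

Even this toy network performs thousands of multiply-accumulates per sample, which is why highly parallel hardware matters.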
Anything beyond a trivial neural network model exposes a fundamental mismatch between the AI inference workload and traditional processor cores, memory, and interconnect. General-purpose CPUs are a poor fit, offering too little parallel execution and carrying too much overhead from operations that are unnecessary in an AI inference context.
GPUs provide a better fit, with many small cores and configurable interconnects, but they still have two significant efficiency problems for edge computing applications. First, high-performance GPUs are resource hungry, relying on ample AC power and forced-air cooling that are often unavailable in edge platforms. Second, GPU hardware utilization in AI inference tasks is low – typically in the 30 to 40% range. It’s like throwing away more than half of the available operations.
With complex interactions between execution units, memory, and interconnects, raw throughput ratings expressed in tera or peta operations per second (TOPS or POPS) say little about AI inference efficiency. Scaling up GPUs for more operations may cover inference performance shortfalls but chews into already scarce resources.
A better gauge of AI inference efficiency would capture throughput and energy consumption in a single metric. Inferences per second per watt (IPS/W) normalizes comparisons and shows how AI inference IP truly scales. In one edge AI inference scenario, EdgeCortix has demonstrated that efficiency gains of as much as 16x in IPS/W over GPU-based configurations are possible.
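To see how the metric works, consider the back-of-the-envelope arithmetic below. The throughput and power numbers are hypothetical, chosen only to show how IPS/W normalizes a comparison between a large GPU card and a small dedicated accelerator; they are not measured benchmark results.

```python
def ips_per_watt(inferences_per_second, watts):
    # Normalize throughput by power so platforms of very different sizes can be compared.
    return inferences_per_second / watts

# Hypothetical figures for illustration only, picked to land on a 16x ratio
# like the one cited above; they are not measured results.
gpu_ips, gpu_watts = 2000, 250        # a discrete GPU card
edge_ips, edge_watts = 1600, 12.5     # a dedicated edge inference accelerator

gpu_eff = ips_per_watt(gpu_ips, gpu_watts)     # 8 IPS/W
edge_eff = ips_per_watt(edge_ips, edge_watts)  # 128 IPS/W
print(f"Efficiency gain: {edge_eff / gpu_eff:.0f}x")
```

The point of the metric is that a platform with lower raw throughput can still win decisively once power is part of the denominator.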
How are these order-of-magnitude efficiency gains achieved? Efficiency is the outcome of addressing three architectural parameters when conceiving neural network inference IP.
By accounting for all these parameters, neural network IP can go farther at the edge. Less power consumption translates to better battery life for more range and extended use. Determinism paves the way for real-time applications. And, by packing more inferences per second per watt into a given space, an edge device can take on more complex AI models and deliver features not possible with less efficient approaches.
With more efficient, deterministic neural network IP in place, we can now return to the question: what is edge AI inference doing in more devices? Loosely defined, edge computing places more processing power close to where work happens. How much work is involved and what SWaP – size, weight, and power – is available help determine the form factor of choice.
Scalability and run-time reconfigurability enable EdgeCortix’s neural network IP to assume various forms, ranging from high-end microcontrollers through system-on-chip designs to FPGA accelerator cards. Two components make up the EdgeCortix solution: MERA is the compiler and software framework, and the Dynamic Neural Accelerator IP (DNA IP) is the run-time reconfigurable AI processing core. EdgeCortix implements these two components in the SAKURA SoC, an edge AI chip ready for device use.
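In practice, that division of labor looks roughly like the sketch below: a compiler step maps a trained model onto the reconfigurable core, and a device runtime executes inferences against incoming data. The function and class names are hypothetical placeholders for illustration, not the actual MERA or DNA IP interfaces; consult the MERA documentation for the real APIs.

```python
import numpy as np

# Hypothetical compile/runtime interfaces for illustration only;
# these are not the actual MERA or DNA IP APIs.

def compile_for_accelerator(model_path, target="edge-soc"):
    """Stand-in for a compiler pass that quantizes a trained model and
    schedules its layers onto the accelerator's execution units."""
    return {"model": model_path, "target": target}  # stub deployment artifact

class AcceleratorRuntime:
    """Stand-in for a device runtime that loads a compiled artifact and
    dispatches inference requests to the reconfigurable AI core."""
    def __init__(self, artifact):
        self.artifact = artifact

    def infer(self, frame):
        # A real runtime would execute the model on hardware; this stub
        # just returns a placeholder result of the expected shape.
        return np.zeros(10)

artifact = compile_for_accelerator("model.onnx")
runtime = AcceleratorRuntime(artifact)
scores = runtime.infer(np.zeros((224, 224, 3)))  # one incoming camera frame
```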
With the help of ecosystem partners, many implementations of edge AI inference are possible. An FPGA accelerator card can host a flexible implementation when more space and power are available, such as in a smart manufacturing or smart city application. A SAKURA SoC can deliver inference in a smaller package ready for a custom board design if size and weight are concerns, as in defense, 5G telecommunications, or robotics and drone applications. More customization is also an option, such as custom SoCs designed for automotive sensing applications.
One more advantage: using EdgeCortix technology, AI model experts don’t need to understand the details of hardware implementations to get efficient, high-performance edge AI inference. Researchers who have worked only with GPU-based implementations, assuming that size locks them out of many edge devices without extensive redesign, will be pleasantly surprised.
For those on less efficient AI inference platforms, it’s easy to get started with EdgeCortix technology. MERA is downloadable from a GitHub repository. Ready-to-run PCIe cards are also available: one from EdgeCortix with a SAKURA SoC, and an Inference Pack from BittWare with a bitstream loaded on an Intel Agilex FPGA.
Jeffrey is EdgeCortix’s Executive VP of Marketing & US Operations. He brings more than three decades of experience in marketing, branding, management, and operations leadership roles with world-class companies, leading them through rapid growth and large-scale transformation. Jeffrey holds an MBA from Emory University and an undergraduate degree with a dual major in Computer Science and Sociology from Rutgers University.