FPGA Self-Healing
Abstract
Approved for Public Release
Conventional FPGA computing designs and design-mapping flows assume identical computational elements (transistors, LUTs, ALUs, memories) whose characteristics (e.g., delay, energy) do not change over time. They impose this model despite the increasing impact of variation and aging in highly scaled technologies and harsh environments. Conventional solutions conservatively margin for the worst-case compute element at end of life, giving up the performance and energy efficiency available from most elements over most of their lifetime. The problem is exacerbated for long-lived military platforms with decade-long lifetimes and operation in hostile environments. When a single element falls out of specification, the entire IC may stop working. When ICs do fail, they do so without advance warning.
We aim to provide introspective techniques that allow FPGAs to self-tune to their unique set of element characteristics, continuously monitor their own health, and perform repairs in-system to restore and optimize operation as elements fail and characteristics change. FPGAs are composed of a large number of nearly identical computational elements: LUTs (programmable gates), embedded memories, multipliers, and routing switches. This provides a choice of which elements to use for a computation and an opportunity to avoid elements whose characteristics are undesirable (non-functional, too slow, or requiring too high a voltage to operate). This choice can be exercised during off-line mapping to tolerate defects and high variation and to optimize performance and energy efficiency, and it can be exercised during operation to replace aged or damaged elements with undamaged ones.
We provide a suite of techniques for self-diagnostics, continuous monitoring, and rapid in-system repair. Operational self-tests can verify functionality and at-speed operation. Shadow latches can continuously monitor switching times to identify slow elements and to detect changes in element speed during operation and aging [COSMIC TRIP]. Virtualized pages can allow simple, coarse-grained relocation of computations to portions of the FPGA with different defect, variation, and aging profiles, allowing wear-leveling and rapid repair in low-defect scenarios. Pre-computed alternative mappings can support fast (tens of milliseconds), in-place repair at the finer-grained gate and switch level, supporting higher variation and aging levels [CYA]. Fast FPGA placement and routing can handle even higher defect, variation, and aging rates by fully remapping functionality to the current state of the FPGA elements [PLD]. Virtualized pages can also allow migration of functionality among physical, perhaps heterogeneous, FPGAs as component degradation becomes more severe [ViTAL]. The introspective self-diagnostics can provide early indicators of diminishing reserve capacity to signal the need for physical servicing of an IC.
These self-healing FPGAs operate with less energy, function longer, provide continuity of operations, and prevent surprise failure of ICs. Avoiding the worst-case and aged elements allows us to achieve target performance at lower voltage. In-system monitoring allows the system to notice and respond to changes that would otherwise disrupt operations. Repair allows automatic restoration and continuity of operation with little downtime (seconds). Overall health monitoring can give early warning of an IC's remaining lifetime.
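The idea of choosing among nearly identical elements, replacing aged ones, and tracking reserve capacity can be illustrated with a minimal Python sketch. It is not the project's implementation: the Fabric class, its method names, the element names, and the delay values are all hypothetical, and per-element delays are assumed to come from some in-system measurement such as the self-tests or shadow latches described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Fabric:
    """Hypothetical model of an FPGA region.

    delay_ns maps each physical element to its measured delay in
    nanoseconds, or None if the element is non-functional.
    """
    delay_ns: Dict[str, Optional[float]]
    in_use: Set[str] = field(default_factory=set)

    def healthy(self, limit_ns: float) -> List[str]:
        """Elements that work and meet the delay limit, fastest first."""
        ok = [e for e, d in self.delay_ns.items()
              if d is not None and d <= limit_ns]
        return sorted(ok, key=lambda e: (self.delay_ns[e], e))

    def map_design(self, needed: int, limit_ns: float) -> Set[str]:
        """Off-line mapping: use the fastest healthy elements only."""
        candidates = self.healthy(limit_ns)
        if len(candidates) < needed:
            raise RuntimeError("not enough healthy elements")
        self.in_use = set(candidates[:needed])
        return self.in_use

    def reserve(self, limit_ns: float) -> int:
        """Healthy spares left; a falling value is an early warning."""
        return len([e for e in self.healthy(limit_ns)
                    if e not in self.in_use])

    def repair(self, limit_ns: float) -> List[Tuple[str, str]]:
        """In-system repair: swap out-of-spec elements for spares."""
        bad = sorted(e for e in self.in_use
                     if self.delay_ns[e] is None
                     or self.delay_ns[e] > limit_ns)
        spares = [e for e in self.healthy(limit_ns)
                  if e not in self.in_use]
        if len(spares) < len(bad):
            raise RuntimeError("reserve exhausted; physical service needed")
        swaps = list(zip(bad, spares))
        for old, new in swaps:
            self.in_use.remove(old)
            self.in_use.add(new)
        return swaps


if __name__ == "__main__":
    # Illustrative, made-up measurements (ns); lut3 is non-functional.
    fabric = Fabric({"lut0": 1.0, "lut1": 1.2, "lut2": 1.1,
                     "lut3": None, "lut4": 1.4})
    print("mapped:", sorted(fabric.map_design(needed=2, limit_ns=1.3)))
    print("reserve:", fabric.reserve(1.3))
    fabric.delay_ns["lut2"] = 1.5  # simulated aging of an in-use element
    print("swaps:", fabric.repair(1.3))
    print("reserve:", fabric.reserve(1.3))
```

In this toy model the design never depends on the slowest or dead elements, so the delay limit can be set by the elements actually used rather than by the worst case, and a shrinking reserve count plays the role of the early indicator that service will be needed.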
Document Details
- Document Type: DoD Grant Award
- Publication Date: Dec 15, 2023
- Source ID: N000142412053
Entities
People
- André DeHon
Organizations
- Office of Naval Research
- United States Navy
- University of Pennsylvania