Standalone Section Drafts for IMC System-Level Subthemes

Scope: non-DRAM Technical_Article papers, with emphasis on neural-network and LLM/Transformer workloads. These passages are written as standalone section seeds for later integration. They follow the review style of the NREE role model: problem motivation first, then taxonomy of approaches, then design insights and unresolved issues.

Computing Precision

Computing precision is one of the central design variables that turns in-memory computing from a macro-level demonstration into a usable accelerator substrate. In conventional digital accelerators, the numerical format is mostly an implementation choice constrained by arithmetic units and memory bandwidth. In CIM/PIM systems, however, precision determines not only model accuracy, but also how many cells are consumed per weight, how many cycles are required per input, how much partial-sum range must be preserved, and how much area and energy are spent in ADCs, DACs, sense amplifiers, accumulators and digital correction logic. Thus, precision is not a peripheral detail of IMC. It is a cross-layer contract between model representation, array physics, peripheral circuits and system scheduling.

The literature has evolved from low-bit integer and bit-serial computation toward more flexible mixed-precision and floating-point support. Early and mid-stage works on mixed-precision quantization for ReRAM-based DNN accelerators established that layer-wise sensitivity can be exploited to reduce ADC and cell overhead without uniformly sacrificing accuracy. More recent silicon works, including integer/floating-point dual-mode gain-cell CIM, FP/INT digital CIM processors, FP8-aware macros and hierarchical-hybrid floating-point CIM with FP-DAC and FP-ADC, move the field toward formats that are closer to the arithmetic demands of modern AI models. This chronological transition is important. CNN inference could often tolerate fixed low-bit formats; Transformer and GenAI workloads increasingly expose the limits of a single precision regime because attention, normalization, activation outliers and accumulation depth stress different parts of the numerical pipeline.

Three technical insights emerge. First, the energy-optimal precision is rarely the model-level bit-width alone. A nominal 4-bit or 8-bit model may still require higher internal accumulation precision, multi-cycle bit slicing, exponent handling or partial-sum correction. Second, analog and mixed-signal IMC moves the numerical burden into converters: ADC resolution, reference generation, sampling noise and calibration overhead often consume much of the energy that the array was meant to save. Third, flexible precision is valuable only when the surrounding mapping and compiler stack can exploit it. A macro that supports FP8, INT4 and mixed modes may remain underutilized if operators cannot be assigned to the appropriate mode at layer or token-phase granularity.
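
The gap between nominal bit-width and internal cost can be made concrete with a small accounting sketch. It assumes unsigned operands, one memory cell per weight slice, one array cycle per input slice and all rows active at once; the function name and parameters are hypothetical and are not taken from any cited work.

```python
def bit_sliced_mvm_cost(weight_bits, input_bits, cell_bits, dac_bits, rows):
    """Toy cost model for one unsigned, bit-sliced crossbar output (illustrative only)."""
    # Each weight is split into slices of cell_bits, one memory cell per slice.
    cells_per_weight = -(-weight_bits // cell_bits)  # ceiling division
    # Each input is split into slices of dac_bits, one array cycle per slice.
    cycles_per_input = -(-input_bits // dac_bits)
    # Largest value one physical column can produce in one cycle.
    max_column_sum = rows * (2**dac_bits - 1) * (2**cell_bits - 1)
    # ADC resolution needed to digitize that sum without clipping or rounding.
    lossless_adc_bits = max_column_sum.bit_length()
    # Width of the digital accumulator after shift-and-add over all slices.
    max_dot_product = rows * (2**input_bits - 1) * (2**weight_bits - 1)
    accumulator_bits = max_dot_product.bit_length()
    return {
        "cells_per_weight": cells_per_weight,
        "cycles_per_input": cycles_per_input,
        "adc_conversions_per_output": cells_per_weight * cycles_per_input,
        "lossless_adc_bits": lossless_adc_bits,
        "accumulator_bits": accumulator_bits,
    }


# Hypothetical operating point: 8-bit weights and inputs, 2-bit cells, 1-bit DAC, 128 rows.
cost = bit_sliced_mvm_cost(weight_bits=8, input_bits=8, cell_bits=2, dac_bits=1, rows=128)
print(cost)
```

Under these assumptions a nominal 8-bit dot product costs 32 ADC conversions per output, needs a 9-bit ADC for lossless readout and a 23-bit accumulator, which illustrates why the model-level bit-width alone says little about macro energy.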

The major open challenge is therefore not simply to build higher-precision IMC. It is to develop precision-adaptive IMC systems whose numerical choices are co-optimized with workload structure. For CNNs, this implies layer-wise or channel-wise policies that balance convolution reuse, ADC count and array occupancy. For LLMs, the problem broadens to operator-wise and phase-wise precision: prefill and decode, attention and MLP, dense and sparse regions, and normal tokens and outlier tokens may require different formats. Future work should report not only peak TOPS/W, but also the precision schedule, internal accumulation path, calibration cost and accuracy sensitivity under realistic model families. A useful direction is a precision design stack that couples quantization search, converter-aware simulation, and runtime selection of numerical modes, so that precision becomes an adaptive system resource rather than a fixed macro specification.

Dataflow

Dataflow determines whether the theoretical advantage of IMC survives beyond a single matrix-vector multiplication. The central promise of IMC is to reduce data movement by storing operands where computation occurs. Yet deep-learning execution is dominated by tensors that move through layers, tiles, heads, channels and batches. If activations, partial sums or weights must repeatedly leave the local array hierarchy, the system reintroduces a memory wall inside the accelerator itself. For this reason, dataflow is the bridge between circuit efficiency and application efficiency.

The ranked literature shows a progression from conventional weight-stationary mapping toward hybrid and capacity-aware schedules. HiSo, which co-optimizes intra-layer and inter-layer scheduling with hybrid dataflow, is especially instructive because it treats scheduling as a two-level problem rather than a local choice inside one layer. Work on intra- and inter-layer scheduling exploration for ReRAM DNN accelerators, together with TriCIM, FDCA, SAL, PIMCOMP, TetrisG-SDK and LEAP, shows that the important design question is no longer whether weight-stationary, input-stationary or output-stationary dataflow is best in isolation. The question is when to switch stationarity, how much intermediate state can be retained locally, and how to map operators to tiles or macros without losing utilization.

The workload distinction is essential. CNN-style neural networks benefit from spatial reuse across convolution windows and channels, so techniques such as adaptive windows, grouped convolutions, layer pipelining and tile-stationary dataflows can improve utilization. LLMs and Transformers bring a different pressure. Prefill has large matrix operations and attention blocks with sequence-level parallelism, whereas autoregressive decoding has small-batch, latency-sensitive, token-by-token behavior. Dataflow for LLMs must therefore handle both bulk throughput and fine-grained synchronization. LEAP and sparse Transformer accelerators point toward balanced dataflow and fine-grained parallelism, but the broader field still lacks a unified dataflow model for prefill, decode, KV movement and mixed dense/sparse execution.

A mature IMC dataflow methodology should expose three quantities that are often hidden in macro-centric reporting: operand residency, partial-sum lifetime and synchronization granularity. Operand residency asks how long weights, activations and KV-like states remain close to compute. Partial-sum lifetime asks where accumulation occurs and how often precision conversion is paid. Synchronization granularity asks whether arrays, tiles, chiplets or hosts must wait for each other. Future dataflow work should also be benchmarked across operator classes, not only networks: convolution, GEMM, attention, normalization, activation, embedding lookup and graph-like gather/scatter have different movement signatures. The most promising direction is a dataflow compiler that can select stationarity dynamically across layers and LLM phases while obeying capacity, converter and interconnect constraints.
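
Operand residency and partial-sum lifetime can be illustrated with a toy traffic count for a tiled GEMM Y[M,N] = X[M,K] W[K,N]. The sketch below is an assumption-laden accounting model, not a description of any cited scheduler: the weight-stationary case keeps every weight tile resident and ships one partial sum per K-tile, while the output-stationary case accumulates locally for up to `acc_rows` input rows and rewrites the weights for each such group. All names are hypothetical.

```python
from math import ceil


def gemm_traffic(M, K, N, xbar_rows, acc_rows):
    """Count weight writes and partial-sum moves for two stationarity choices (toy model)."""
    k_tiles = ceil(K / xbar_rows)  # tiles along the reduction dimension
    weight_stationary = {
        "weight_writes": K * N,                # programmed once, then resident
        "partial_sum_moves": M * N * k_tiles,  # one partial sum leaves per K-tile
    }
    output_stationary = {
        # Weights are re-streamed for every group of acc_rows input rows.
        "weight_writes": K * N * ceil(M / acc_rows),
        "partial_sum_moves": M * N,            # only finished outputs leave
    }
    return {"weight_stationary": weight_stationary,
            "output_stationary": output_stationary}


# Hypothetical 4096 x 4096 layer on 256-row arrays with a 16-row local accumulator.
decode = gemm_traffic(M=1, K=4096, N=4096, xbar_rows=256, acc_rows=16)
prefill = gemm_traffic(M=512, K=4096, N=4096, xbar_rows=256, acc_rows=16)
print(decode)
print(prefill)
```

In this model the single-row decode case pays the same weight writes under both choices but moves 16 times fewer partial sums when output-stationary, whereas the 512-row prefill case multiplies output-stationary weight writes by 32. The preferred stationarity therefore depends on the phase, which is the motivation for switching it dynamically.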

General Compiler Stacks

The compiler stack is the point at which IMC becomes programmable. Many IMC demonstrations show excellent macro-level energy efficiency, but a practical accelerator requires a path from software graphs to hardware operations, including operator lowering, tensor partitioning, instruction generation, data placement, scheduling, runtime control and performance estimation. Without this layer, every new model or hardware macro risks becoming a bespoke manual mapping exercise. The role of compilers and simulators is therefore not auxiliary; they define the portability and reproducibility of the field.

The literature has moved through several layers of abstraction. NeuroSim and DNN+NeuroSim established device-to-architecture evaluation and benchmarking as a common design language. PIMulator-NN and PIMSIM-NN extend simulation toward ISA-level and event-driven modeling. More recent compiler works, including PIMCOMP, COMPASS, CIMFlow, CIMWise and CIM-MLC, aim to connect DNN operators to resource-constrained crossbar or CIM arrays through explicit hardware abstractions. CIM-MLC is particularly important because it recognizes architectural diversity: CIM hardware differs in device precision, crossbar size, number of crossbars, programming interface, memory hierarchy and NoC structure. A useful compiler must represent that diversity without forcing every hardware team to rebuild a software stack from scratch.

The main insight is that IMC compilation is not merely another backend for a standard accelerator. IMC violates several assumptions embedded in conventional compiler flows. Operations may be bound to physical storage locations; weights may be expensive to rewrite; array dimensions constrain tensor shapes; analog non-idealities influence mapping decisions; partial sums may require explicit movement and conversion; and some memories can operate as both storage and compute depending on mode. Consequently, the compiler must reason about hardware state, not only operator scheduling. This is why resource-constrained frameworks such as COMPASS and end-to-end stacks such as PIMCOMP and CIM-MLC are crucial stepping stones.
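
A minimal sketch of what reasoning about hardware state can mean in a lowering pass is given below. It tiles a K x N weight matrix onto a fixed pool of crossbars and records, for each tile, the physical crossbar it is bound to and the programming round in which it is written. The `TileOp` record and `lower_matmul` function are hypothetical illustrations and do not reproduce the interfaces of COMPASS, PIMCOMP or CIM-MLC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileOp:
    write_round: int   # 0 = initial programming; > 0 means a crossbar rewrite
    crossbar_id: int   # physical array the tile is bound to
    rows: tuple        # (start, stop) slice of the K dimension
    cols: tuple        # (start, stop) slice of the N dimension


def lower_matmul(K, N, xbar_rows, xbar_cols, num_crossbars):
    """Split a K x N weight matrix into crossbar-sized tiles bound to physical arrays."""
    ops = []
    for r0 in range(0, K, xbar_rows):
        for c0 in range(0, N, xbar_cols):
            index = len(ops)
            ops.append(TileOp(
                write_round=index // num_crossbars,
                crossbar_id=index % num_crossbars,
                rows=(r0, min(r0 + xbar_rows, K)),   # edge tiles may be partial
                cols=(c0, min(c0 + xbar_cols, N)),
            ))
    return ops


# Hypothetical case: a 300 x 200 layer on four 128 x 128 crossbars.
ops = lower_matmul(K=300, N=200, xbar_rows=128, xbar_cols=128, num_crossbars=4)
rewrites = sum(op.write_round > 0 for op in ops)
print(len(ops), "tiles,", rewrites, "of them require rewriting a crossbar")
```

Because each operation carries its storage binding and write round, a later pass could price rewrites and partial edge tiles explicitly instead of treating the layer as a stateless kernel call.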

The unresolved challenge is a common intermediate representation for IMC. The field needs abstractions that can express crossbar-level operations, digital bit-serial CIM, mixed-signal MVM, near-memory vector operations and hybrid CPU/IMC execution without collapsing them into a single unrealistic model. For GenAI-era workloads, the stack must also support dynamic shapes, autoregressive decoding, sparsity, KV-cache behavior and heterogeneous operator placement. Future work should prioritize open benchmarks and compiler interfaces that report not only latency and energy, but also compilation time, supported operator coverage, remapping overhead, non-ideality assumptions and the boundary between compile-time and runtime decisions. The long-term goal is a toolchain in which algorithm designers can explore model structure, precision and mapping while hardware designers can expose device and architecture constraints in a principled way.

Inter-Chip Architecture and Interconnects

Scaling IMC beyond a single die changes the design problem from local compute efficiency to distributed memory-compute orchestration. As models grow, especially in the LLM era, a single chip or macro cannot hold all parameters, activations and intermediate states. Multi-chiplet modules, 2.5D/3D integration, heterogeneous chiplets and package-level networks are therefore becoming natural extensions of IMC. The motivation is clear: if data movement dominates energy, then large-scale IMC must reduce movement not only between memory and compute within a macro, but also across chiplets, packages and host interfaces.

Earlier multi-chiplet DNN works such as SIAM, Big-Little chiplets and COMB-MCM explored scalable in-memory acceleration through mesh or package-level organization and heterogeneous chiplet allocation. These papers mainly addressed CNN/DNN workloads, where the partitioning problem is dominated by layer placement, feature-map movement and balancing memory capacity against compute throughput. Newer LLM-oriented works such as H3D-LLM and FLARE make the issue more acute. FLARE explicitly frames LLM deployment as a fine-grained hardware-software co-design problem across cores, chiplets and network-based accelerator systems. Its bottom-up mapping perspective is useful because LLM performance depends not only on aggregate compute, but also on where tensor-parallel shards, attention state and feed-forward blocks are placed.

The principal design insight is that inter-chip scaling cannot be treated as a simple replication of IMC tiles. Chiplet systems introduce new constraints: network bandwidth, packet latency, synchronization, power delivery, thermal density, die-to-die energy and memory consistency. In CNNs, poor inter-chip mapping may reduce throughput. In LLM decoding, poor inter-chip mapping may directly increase token latency because synchronization occurs repeatedly across generation steps. Thus, chiplet-level IMC must jointly optimize model partitioning, dataflow, interconnect topology and hardware heterogeneity.

Future work should move from architecture proposals toward workload-calibrated scaling laws. The field needs to know when adding chiplets improves energy efficiency, when it only hides capacity limits, and when communication dominates the compute saved by IMC. Evaluation should separate prefill and decode for LLMs, include realistic sequence lengths and batch sizes, and report die-to-die traffic per token. A particularly important direction is heterogeneity-aware orchestration: dense MLP, sparse attention, embedding, normalization and control tasks may not belong on identical chiplets. IMC chiplet systems will be convincing when they show not only high throughput, but a clear methodology for assigning each workload phase to the right compute-memory substrate.
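
As one way to report die-to-die traffic per token, the sketch below estimates decode-phase traffic for a tensor-parallel Transformer. It assumes batch size 1, two all-reduces per layer (one after attention and one after the feed-forward block, as in common tensor-parallel schemes) and a ring all-reduce in which each of p chiplets sends 2(p-1)/p of the message. These are modeling assumptions for illustration, not properties reported for FLARE or H3D-LLM, and the function name is hypothetical.

```python
def d2d_bytes_per_token(num_layers, hidden_dim, bytes_per_elem, num_chiplets,
                        allreduces_per_layer=2):
    """Total die-to-die bytes moved to decode one token (toy tensor-parallel model)."""
    if num_chiplets < 2:
        return 0  # a single chiplet needs no die-to-die exchange
    # One activation vector of hidden_dim elements is reduced per all-reduce.
    message_bytes = hidden_dim * bytes_per_elem
    # Ring all-reduce: each chiplet sends 2*(p-1)/p of the message, so the
    # p chiplets together send 2*(p-1) messages' worth of bytes.
    bytes_per_allreduce = 2 * (num_chiplets - 1) * message_bytes
    return num_layers * allreduces_per_layer * bytes_per_allreduce


# Hypothetical model: 32 layers, hidden size 4096, 2-byte activations.
for p in (1, 4, 8):
    print(p, "chiplets:", d2d_bytes_per_token(32, 4096, 2, p), "bytes per token")
```

In this model traffic grows almost linearly with chiplet count while the compute per token stays fixed, which is the kind of trend a workload-calibrated scaling law would need to capture separately for prefill and decode.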

Intra-Chip Architecture and Interconnects

Within a chip, interconnect determines how effectively many small IMC engines behave as one accelerator. IMC macros are typically efficient at local operations, but full models require movement of inputs, outputs, partial sums, control signals and sometimes weights across arrays, banks, tiles and buffers. A chip with excellent macros can still underperform if the NoC, broadcast fabric, reduction network or buffer hierarchy cannot sustain operand delivery and accumulation. This is why intra-chip interconnect should be considered part of the compute architecture rather than a passive wiring problem.

The literature includes direct studies of on-chip interconnect impact, latency-optimized reconfigurable NoCs, mesh-based programmable analog-AI accelerators, macro-level accumulation schemes, broadcast-alignment floating-point CIM and local attention reuse engines. These works collectively show that different communication patterns dominate at different scales. At the array level, partial-sum reduction and ADC sharing matter. At the macro level, operand broadcast, accumulation and register movement dominate. At the tile level, NoC topology, routing and synchronization decide whether parallel arrays remain busy. For Transformer workloads, the intra-chip fabric also has to support attention-specific reuse, irregular sparsity and phase-dependent parallelism.

The important design lesson is that intra-chip communication must be co-designed with mapping and precision. A high-precision floating-point CIM macro may require wider accumulation paths and more expensive reduction. A sparse accelerator may reduce arithmetic but increase control and indexing traffic. A flexible compiler may expose many mapping choices, but only a subset will match the physical interconnect. Therefore, the right metric is not peak macro utilization alone, but utilization after communication, reduction and conversion overheads are included.
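
The metric suggested above can be written down as a one-line model. The sketch assumes cycle counts are available per category and that a fraction `overlap` of the overhead is hidden behind compute; both the function name and the overlap parameter are hypothetical simplifications.

```python
def effective_utilization(compute_cycles, comm_cycles, reduction_cycles,
                          conversion_cycles, overlap=0.0):
    """Fraction of total time spent on useful array compute (toy model)."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    overhead = comm_cycles + reduction_cycles + conversion_cycles
    exposed = overhead * (1.0 - overlap)  # overhead not hidden behind compute
    return compute_cycles / (compute_cycles + exposed)


# Hypothetical macro: 1000 compute cycles plus 1000 cycles of overhead.
print(effective_utilization(1000, 300, 200, 500))               # no overlap
print(effective_utilization(1000, 300, 200, 500, overlap=0.5))  # half hidden
```

With no overlap, a macro that is fully busy during its compute phase is only 50% utilized at the chip level in this example; hiding half of the overhead raises that to about 67%.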

The open research question is how to build interconnects that are both workload-aware and general enough for evolving models. CNNs can often exploit predictable broadcast and reduction structures. LLMs combine dense matrix multiplications with attention, dynamic sequence lengths and sparsity patterns that may change between models or even between tokens. Future IMC chips may need reconfigurable local fabrics: multicast for activations, reduction trees for partial sums, sparse-routing support for pruned or attention-selected data, and low-latency paths for decode. Reports should include traffic breakdowns at array, macro, tile and chip level. The field will benefit from treating intra-chip interconnect as a first-class design space, with standardized benchmarks that expose whether gains come from local compute or from genuinely improved communication efficiency.

Mapping

Mapping is the practical act of translating model tensors into the finite geometry of IMC hardware. It determines how weights are partitioned across arrays, how activations are streamed, how partial sums are merged, how non-idealities are tolerated and how scarce resources such as ADCs, buffers and macros are shared. In the NREE role model, mapping is presented as an unresolved limitation of HW-NAS because model search can be incomplete if it ignores how layers are actually placed on hardware. The same point applies broadly: mapping is where algorithmic elegance meets the physical constraints of arrays.

The ranked papers show several generations of mapping research. Early works optimized weight mapping and dataflow for CNNs on RRAM-based PIM and considered mixed-size or overlapped crossbars. Xbar-Partitioning highlighted parasitic and noise tolerance, showing that partitioning is not only a capacity issue but also an analog reliability issue. AERO and MAESTRO-style data-centric analysis brought more systematic design-space exploration. Recent works such as COMPASS, studies of efficient weight mapping and resource scheduling, TriCIM and Fast-OverlaPIM move toward compiler-level or framework-level mapping under resource constraints. TetrisG-SDK is a particularly clear recent example: it improves convolutional layer mapping by using adaptive windows, multi-macro exploration and grouped convolutions to increase utilization and reduce system-level latency and energy.

The main insight is that mapping must be judged at multiple granularities. At the array level, the key questions are crossbar dimensions, bit slicing, parasitics and input reuse. At the macro level, the question becomes whether parallelism is exposed or stranded. At the chip level, mapping interacts with NoC traffic and buffer capacity. At the workload level, CNNs, Transformers, graph workloads and general CIM kernels require different decomposition strategies. CNN mapping often revolves around convolution windows, channel grouping and feature reuse. LLM mapping must additionally handle attention heads, tensor parallelism, KV-cache movement, outliers and prefill/decode imbalance.

Future mapping research should move toward unified, workload-sensitive formulations. A useful mapping framework should report not only a final latency estimate, but also why a mapping is good: array occupancy, number of remaps, ADC utilization, partial-sum movement, inter-tile traffic and sensitivity to device errors. For GenAI workloads, mapping should be phase-aware and operator-aware. It may be inefficient to force all Transformer operations through the same array mapping that was optimized for CNNs. The long-term challenge is to integrate mapping with NAS, quantization and compiler stacks so that model structure, precision and placement are searched together rather than sequentially patched.
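
Array occupancy, the first diagnostic listed above, is straightforward to compute for a convolution layer. The sketch assumes a plain im2col-style unrolling in which each kernel position and input channel takes one crossbar row and each output channel takes `cells_per_weight` columns; it does not model the adaptive-window or grouped-convolution mappings of TetrisG-SDK, and the function name is hypothetical.

```python
from math import ceil


def conv_array_occupancy(kh, kw, c_in, c_out, cells_per_weight, xbar_rows, xbar_cols):
    """Arrays needed and fraction of their cells holding weights (im2col-style mapping)."""
    rows_needed = kh * kw * c_in            # one row per unrolled kernel element
    cols_needed = c_out * cells_per_weight  # bit slices of one weight sit side by side
    arrays = ceil(rows_needed / xbar_rows) * ceil(cols_needed / xbar_cols)
    used_cells = rows_needed * cols_needed
    occupancy = used_cells / (arrays * xbar_rows * xbar_cols)
    return arrays, occupancy


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 4 cells per weight.
print(conv_array_occupancy(3, 3, 64, 128, 4, 256, 256))
print(conv_array_occupancy(3, 3, 64, 128, 4, 128, 128))
```

For this layer the 256 x 256 arrays reach 75% occupancy with 6 arrays, while 128 x 128 arrays reach 90% but need 20 arrays, so occupancy has to be read together with array count, ADC sharing and inter-tile traffic rather than optimized alone.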

Neural Architecture Search

Neural architecture search for IMC addresses a design-space problem that is too large for manual exploration. A neural model can vary in layer type, width, depth, kernel size, attention structure, pruning policy and quantization. An IMC accelerator can vary in crossbar size, cell precision, ADC/DAC resolution, buffer size, tile count, dataflow and non-ideality mitigation. When these spaces are combined, a hand-designed model mapped to a fixed accelerator is unlikely to be optimal. HW-NAS offers a method for making the model aware of the hardware and, in more ambitious formulations, for co-optimizing the model and hardware together.

The role-model NREE paper provides the strongest conceptual frame for this section. It categorizes HW-NAS by search space, problem formulation, search strategy and hardware-cost estimation, and then asks what is missing when the target is IMC. The current corpus extends that discussion with works such as NAS4RRAM, NAX, Gibbon, hardware-aware Pareto exploration, CIMNAS, Efficient Neural Architecture Search with CIM-based architecture, NeuroSim Agent and H4H. The progression is from optimizing neural structures for fixed IMC assumptions toward joint model-quantization-hardware optimization. CIMNAS is a recent example that searches software parameters, quantization policies and device/circuit/architecture-level hardware parameters over a very large design space, while H4H points toward hybrid convolution-Transformer systems.

The key design insight is that IMC-aware NAS cannot be reduced to adding energy and latency terms to a conventional NAS objective. Hardware cost estimation is itself uncertain: analytical models, lookup tables, simulators and real measurements differ in accuracy, speed and transferability. Moreover, IMC non-idealities, converter precision and mapping constraints can change the ranking of candidate architectures. A network that is efficient on a GPU may be poor on IMC because its operators do not map cleanly to arrays or because it induces excessive partial-sum movement. Conversely, an IMC-friendly architecture may use structures that conventional NAS spaces rarely include.

Future directions should focus on three gaps. First, IMC NAS benchmarks are still insufficient, making cross-paper comparison difficult. Second, NAS should expand beyond CNN-centric models toward Transformers, hybrid CNN-Transformer models, recommender systems and multimodal GenAI workloads. Third, NAS must become mapping- and compiler-aware. Searching an architecture without modeling placement, dataflow and runtime scheduling risks optimizing a model that cannot realize its predicted efficiency. The most compelling direction is a closed-loop design flow in which NAS, quantization, mapping and hardware evaluation share a common intermediate representation and produce Pareto fronts that are reproducible across devices and workloads.
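
The Pareto fronts mentioned above can be extracted with a simple dominance filter once each candidate has been scored. The sketch assumes two objectives that are both minimized, here error and energy, and uses made-up candidate values purely for illustration; it is not the search procedure of CIMNAS or any other cited framework.

```python
def pareto_front(candidates):
    """Return candidates not dominated on (error, energy); lower is better for both."""
    front = []
    for name, err, energy in candidates:
        dominated = any(
            o_err <= err and o_energy <= energy and (o_err < err or o_energy < energy)
            for _, o_err, o_energy in candidates
        )
        if not dominated:
            front.append((name, err, energy))
    return front


# Hypothetical (architecture, top-1 error, energy per inference in uJ) triples.
candidates = [
    ("A", 0.08, 5.0),
    ("B", 0.06, 9.0),
    ("C", 0.09, 6.0),   # worse than A on both objectives
    ("D", 0.05, 14.0),
    ("E", 0.06, 11.0),  # same error as B but more energy
]
print(pareto_front(candidates))
```

The filter is quadratic in the number of candidates, which is acceptable for small searches. The harder problem the text points to is that the energy column changes with the estimator, the mapping and the non-ideality model, so the same candidates can yield different fronts.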

Quantization, Sparsification and Pruning

Quantization, sparsification and pruning are natural companions to IMC because they reduce the number of stored bits, active operations and data transfers. However, compression in IMC is not the same as compression for GPUs or CPUs. In an array-based accelerator, reducing model size helps only if the hardware can exploit the resulting structure. Unstructured sparsity may preserve accuracy but fail to reduce cycles if the array must still activate full rows or columns. Structured pruning may simplify hardware but damage accuracy or restrict the model. Quantization may reduce memory footprint but increase sensitivity to device variation and converter precision. Thus, the central question is how to make compression hardware-visible.

The ranked literature shows a clear evolution. Earlier works considered dynamic sparsity control, structured pruning of RRAM crossbars, model-compression-enabled all-weights-on-chip acceleration and bit-level sparsity. More recent works focus on co-design: DB-PIM exploits unstructured bit-level sparsity through dyadic block patterns and tailored SRAM-PIM macros; its 2026 extension jointly explores value-level and bit-level sparsity. DANCE extends the theme to compound AI by supporting N:M sparse compression and outlier-aware quantization for systems combining LLMs and expert models. Sparse Transformer accelerators add attention-specific ideas such as zero skipping and reusable local attention engines.

The main insight is that compression must be aligned with the granularity at which IMC hardware can skip work. Value-level zeros, bit-level zeros, channel pruning, block sparsity, N:M sparsity and attention sparsity each expose different opportunities and costs. Fine-grained sparsity can provide high theoretical savings, but requires indexing, selection circuits or irregular routing. Coarse-grained sparsity maps more easily but may leave accuracy on the table. LLMs make the trade-off sharper because outliers, activation sparsity, expert routing and attention patterns are more dynamic than many CNN pruning scenarios.
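
Two of these granularities can be compared on a toy weight vector. The sketch applies magnitude-based N:M pruning (keep the n largest-magnitude values in every group of m) and then measures value-level and bit-level zero fractions on the magnitudes. It assumes integer weights stored as sign plus `bits`-wide magnitude; it is a generic illustration and does not implement the dyadic block patterns of DB-PIM or the compression scheme of DANCE. Function names are hypothetical.

```python
def prune_n_of_m(weights, n, m):
    """Keep the n largest-magnitude weights in each group of m, zero the rest."""
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0 for i, w in enumerate(group))
    return pruned


def value_zero_fraction(weights):
    """Fraction of weights that are exactly zero."""
    return sum(w == 0 for w in weights) / len(weights)


def bit_zero_fraction(weights, bits):
    """Fraction of zero bits across the magnitudes, each stored in `bits` bits."""
    ones = sum(bin(abs(w)).count("1") for w in weights)
    return 1.0 - ones / (len(weights) * bits)


# Hypothetical signed 4-bit-magnitude weights pruned to 2:4 sparsity.
w = prune_n_of_m([1, -4, 2, 3, 0, 5, -6, 1], n=2, m=4)
print(w)
print(value_zero_fraction(w), bit_zero_fraction(w, bits=4))
```

In this example half of the values are zero but about 78% of the stored bits are zero, so a bit-serial macro that can skip zero bits sees more opportunity than a value-level skip, at the price of finer-grained selection logic.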

Future compression research should report the full cost of exploiting sparsity: metadata, selection logic, load imbalance, routing, retraining and accuracy recovery. For GenAI workloads, it should separate compression for weights, activations, the KV cache and attention. The next step is compression-aware IMC design that treats sparsity and quantization as runtime-varying properties rather than static preprocessing. A promising path is to combine model compression, precision adaptation and mapping so that the hardware can select the cheapest faithful representation for each operator, layer or token phase.

Yimin Wang