CI-Lab Intelligent Computing Research Group

Focus-dLLM accepted by ACL 2026: confidence-guided sparse attention for long-context diffusion LLM inference

Tue, 07 Apr 2026 00:00:00 +0000

👏 Paper title: Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing.

Focus-dLLM accelerates long-context diffusion large language model inference by reducing redundant bidirectional attention computation. Diffusion LLMs can process long contexts in a non-autoregressive decoding paradigm, but full attention over long sequences creates a major computational bottleneck.

The framework is training-free and uses past token confidence to predict the regions that should remain in focus during diffusion decoding. It then applies sink-aware pruning to remove redundant attention computation while preserving influential attention sinks. This design improves long-context dLLM efficiency without requiring model retraining or architectural changes.
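
The selection step described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_focus_tokens`, the `keep_ratio` budget, the choice to rank tokens by raw past confidence, and the treatment of the first `num_sinks` positions as attention sinks are all assumptions made for the example.

```python
def select_focus_tokens(confidence, num_sinks=2, keep_ratio=0.5):
    """Return sorted indices of context tokens kept for attention.

    confidence: per-token confidence from an earlier decoding step (hypothetical input).
    num_sinks:  leading positions always kept as attention sinks (assumption).
    keep_ratio: fraction of the non-sink tokens that stay in focus (assumption).
    """
    n = len(confidence)
    sinks = list(range(min(num_sinks, n)))
    rest = list(range(len(sinks), n))
    budget = int(len(rest) * keep_ratio)
    # Assumption: higher past confidence means the token stays in focus.
    ranked = sorted(rest, key=lambda i: confidence[i], reverse=True)
    return sorted(sinks + ranked[:budget])


conf = [0.1, 0.2, 0.9, 0.3, 0.8, 0.05, 0.7, 0.4]
kept = select_focus_tokens(conf, num_sinks=2, keep_ratio=0.5)
print(kept)  # sink positions 0 and 1 survive despite their low confidence
```

Attention would then be computed only over the kept indices. Positions 0 and 1 are retained regardless of their scores, which is the point of making the pruning sink-aware.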

Focus-dLLM is motivated by the observation that not every token contributes equally during every diffusion decoding step. As generation progresses, confidence signals can help identify which context regions are more important for subsequent denoising and which attention interactions are likely redundant.

By using these signals at inference time, the method offers a lightweight acceleration path for long-context dLLMs: the original parameters stay untouched, and no new training data is needed.

GitHub: Longxmas/Focus-dLLM

GCoDE accepted by IEEE TC: architecture-mapping co-search for efficient device-edge GNN co-inference

Thu, 01 Jan 2026 00:00:00 +0000

👏 Paper title: GCoDE: Efficient Device-Edge Co-Inference for GNNs via Architecture-Mapping Co-Search.

GCoDE targets efficient GNN inference across device-edge systems. GNN workloads are challenging for co-inference because graph partitions, message passing, model architecture, and communication overhead all interact with one another.

The framework jointly searches neural architectures and deployment mappings instead of optimizing them separately. By modeling computation, communication, and graph placement together, GCoDE avoids designs that are accurate but communication-heavy or efficient but accuracy-limited, improving the practicality of GNN serving across constrained devices and edge servers.
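
A toy version of such a joint search is sketched below. Everything in it is hypothetical: the two candidate architectures, their accuracies and per-layer costs, the linear latency model, and the exhaustive search itself. The announcement does not describe GCoDE's actual search algorithm, so this only shows why the two dimensions have to be considered together.

```python
# Hypothetical candidates: accuracy, per-layer work units, per-layer output size (KB).
ARCHS = {
    "wide":   {"acc": 0.92, "work": [4, 8, 8], "out_kb": [64, 64, 4]},
    "narrow": {"acc": 0.90, "work": [2, 4, 4], "out_kb": [16, 16, 4]},
}
INPUT_KB = 32
DEVICE_MS_PER_WORK, EDGE_MS_PER_WORK, LINK_MS_PER_KB = 2.0, 0.5, 0.25


def latency_ms(arch, split):
    """Layers [0, split) run on the device, the remaining layers on the edge server."""
    work = arch["work"]
    device = sum(work[:split]) * DEVICE_MS_PER_WORK
    edge = sum(work[split:]) * EDGE_MS_PER_WORK
    if split == len(work):
        sent_kb = 0  # everything runs locally, nothing is transferred
    elif split == 0:
        sent_kb = INPUT_KB  # the raw input is offloaded
    else:
        sent_kb = arch["out_kb"][split - 1]  # intermediate features are offloaded
    return device + edge + sent_kb * LINK_MS_PER_KB


def co_search(budget_ms):
    """Pick the most accurate (architecture, split) pair that meets the latency budget."""
    best = None
    for name, arch in ARCHS.items():
        for split in range(len(arch["work"]) + 1):
            lat = latency_ms(arch, split)
            if lat > budget_ms:
                continue
            key = (arch["acc"], -lat)  # prefer accuracy, then lower latency
            if best is None or key > best[0]:
                best = (key, (name, split, lat))
    return None if best is None else best[1]


print(co_search(20.0))
print(co_search(15.0))
```

With these made-up numbers, a 20 ms budget admits the more accurate model only when it is fully offloaded, while a 15 ms budget forces the smaller model with a split after its first layer. Neither choice would be found by fixing the architecture first and tuning the mapping afterwards.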

GCoDE is motivated by the close coupling between graph neural network structure and distributed execution cost. Changing the architecture affects intermediate features and computation patterns, while changing the mapping affects communication and latency.

By co-searching both dimensions, the framework can discover designs that are better suited to the actual device-edge system. This makes GCoDE a more holistic approach to GNN co-inference than methods that only tune the model or only tune the deployment plan.

MIREDO accepted by ASP-DAC 2026: MIP-driven dataflow optimization for CIM accelerators

Thu, 01 Jan 2026 00:00:00 +0000

👏 Paper title: MIREDO: MIP-Driven Resource-Efficient Dataflow Optimization for Computing-in-Memory Accelerator.

MIREDO focuses on dataflow optimization for computing-in-memory accelerators. CIM architectures can reduce data movement for DNN workloads, but their practical efficiency depends strongly on how workloads are mapped under strict array, transfer, and architectural constraints.

The framework formulates dataflow optimization as a mixed-integer programming problem. It combines a hierarchical hardware abstraction with an analytical latency model so that workload characteristics, dataflow choices, and CIM-specific constraints can be optimized together. This systematic search helps close the gap between theoretical CIM capability and actual system-level performance.

MIREDO is useful because CIM accelerators often have many hidden constraints: array capacity, memory hierarchy, interconnect cost, operand placement, and data reuse all affect performance. A dataflow that looks efficient at the algorithm level may be inefficient once these hardware limits are considered.

By expressing the optimization problem formally, MIREDO can search this design space more systematically than hand-tuned heuristics. The framework helps identify mappings that use CIM resources effectively and provides a clearer methodology for comparing accelerator configurations.

SlimInfer accepted by AAAI 2026: dynamic token pruning for faster long-context LLM inference

Thu, 01 Jan 2026 00:00:00 +0000

👏 Paper title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning.

SlimInfer accelerates long-context LLM inference by pruning less critical prompt tokens during the forward pass. Long-context serving is often limited by prefill computation, hidden-state processing, and KV cache memory pressure, so reducing only attention cost is not enough for full-system acceleration.

The framework uses a layer-wise, block-wise pruning strategy motivated by the information diffusion phenomenon: as important context information propagates through the network, later layers can preserve semantic behavior with fewer active hidden states. SlimInfer pairs this pruning mechanism with a predictor-free asynchronous KV cache manager, reducing computation, memory use, and I/O overhead while maintaining long-context task quality.
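
The layer-wise, block-wise idea can be illustrated with a small sketch. It is not SlimInfer's implementation: the static per-token `scores`, the mean-score block ranking, and the `schedule` mapping from layer index to number of kept blocks are assumptions chosen to keep the example short. A real system would derive importance from the model's own activations at each pruning layer.

```python
def prune_blocks(token_ids, scores, block_size, keep_blocks):
    """Keep the `keep_blocks` highest-scoring blocks, preserving token order."""
    blocks = [token_ids[i:i + block_size]
              for i in range(0, len(token_ids), block_size)]

    def block_score(block):
        return sum(scores[t] for t in block) / len(block)

    top = sorted(range(len(blocks)),
                 key=lambda b: block_score(blocks[b]),
                 reverse=True)[:keep_blocks]
    return [t for b in sorted(top) for t in blocks[b]]


def run_layers(num_tokens, scores, schedule, block_size=2):
    """schedule maps a layer index to the number of blocks kept from that layer on."""
    active = list(range(num_tokens))
    trace = []  # number of active hidden states entering each layer's computation
    for layer in range(max(schedule) + 1):
        if layer in schedule:
            active = prune_blocks(active, scores, block_size, schedule[layer])
        trace.append(len(active))
    return active, trace


scores = [0.9, 0.8, 0.1, 0.2, 0.7, 0.6, 0.05, 0.3]
active, trace = run_layers(8, scores, schedule={1: 3, 3: 2})
print(active, trace)
```

The trace shows the active set shrinking as depth increases, which is where the savings in hidden-state processing would come from under these assumptions.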

SlimInfer is designed to improve the full inference pipeline rather than a single isolated kernel. By pruning hidden states during prefill and coordinating KV cache updates asynchronously, it targets both computation and system-level memory movement.

This makes the framework valuable for serving long-context applications, where latency and memory footprint grow quickly with sequence length. It offers a practical acceleration strategy that does not require retraining the LLM or adding a separate prediction model.

GitHub: Longxmas/SlimInfer

CIMinus accepted by IEEE TC: sparse DNN workload modeling and exploration for SRAM-based CIM

Wed, 31 Dec 2025 00:00:00 +0000

👏 Paper title: CIMinus: Empowering Sparse DNN Workloads Modeling and Exploration on SRAM-based CIM Architectures.

CIMinus provides a systematic modeling framework for sparse DNN workloads on SRAM-based compute-in-memory architectures. Although sparsity is a major opportunity for reducing neural network cost, CIM arrays impose rigid mapping and dataflow constraints that make sparse execution difficult to evaluate.

The framework models workload latency and component-level energy consumption under diverse sparsity patterns and multi-macro CIM mappings. With its sparsity abstraction and pruning-to-evaluation workflow, CIMinus helps designers understand when sparsity leads to real system benefits and how mapping strategies should be chosen for practical CIM accelerators.

CIMinus is motivated by the gap between sparse model compression and hardware-realistic evaluation. A sparse network may contain fewer operations on paper, but the actual benefit depends on whether the CIM architecture can exploit that sparsity efficiently.

By connecting sparsity patterns, pruning strategies, and architecture-level cost models, CIMinus gives designers a way to evaluate sparse DNN workloads before committing to a hardware design. This supports more informed accelerator exploration and avoids overly optimistic sparsity assumptions.

TinyFormer accepted by IEEE TCAS-I: efficient sparse transformer design and deployment on tiny devices

Wed, 31 Dec 2025 00:00:00 +0000

👏 Paper title: TinyFormer: Efficient Sparse Transformer Design and Deployment on Tiny Devices.

TinyFormer brings transformer models into tiny-device scenarios such as MCU-based embedded and IoT systems. These platforms have severe storage and memory constraints, making it challenging to design and deploy modern transformer architectures directly.

The framework combines SuperNAS for supernet search, SparseNAS for sparse single-path model selection, and SparseEngine for efficient deployment. By co-optimizing architecture, sparsity, and inference execution, TinyFormer enables transformer inference under strict MCU budgets and improves sparse inference speed while preserving accuracy.

TinyFormer is built around the full deployment path rather than only model compression. It searches for architectures that fit tiny devices, selects sparse structures that reduce inference cost, and provides an engine that can actually execute the resulting model efficiently.

This matters because transformers are increasingly useful for sensing and sequence tasks, but their memory and compute demands often exceed what microcontrollers can support. TinyFormer helps bridge that gap by treating model design and hardware-aware deployment as a single problem.

A3GNN accepted by ICCD 2025: affordable, adaptive, and automatic GNN training on CPU-GPU platforms

Mon, 10 Nov 2025 00:00:00 +0000

👏 Paper title: Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms.

A3GNN targets practical GNN training on CPU-GPU heterogeneous platforms, where performance depends on how graph sampling, feature access, and computation are divided across devices. Static training recipes can underuse available hardware or exceed memory limits as graph and model characteristics change.

The work introduces an adaptive training flow that coordinates locality-aware sampling with fine-grained scheduling. By balancing throughput, memory footprint, and accuracy, A3GNN aims to make high-performance GNN training more affordable and automatic on commodity heterogeneous platforms.

A3GNN is designed for the reality that many labs and deployment environments rely on mixed CPU-GPU resources rather than large specialized clusters. The framework adapts training decisions to the platform so that available compute and memory can be used more effectively.

The contribution is not only faster execution but also a more automatic training workflow. By reducing manual tuning pressure, A3GNN helps make GNN training more accessible on practical heterogeneous hardware setups.

GitHub: BUAA-CI-LAB/A3GNN

ACE-GNN accepted by IEEE TCAD: adaptive GNN co-inference scheduling for dynamic edge environments

Sun, 28 Sep 2025 00:00:00 +0000

👏 Paper title: ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments.

ACE-GNN improves GNN co-inference in dynamic edge environments, where bandwidth, device load, and multi-device access patterns can change during deployment. Static partitioning and fixed pipeline strategies may work well under one condition but become inefficient when the system state shifts.

The framework builds system-level awareness into runtime optimization. It predicts performance under changing edge conditions, searches for efficient execution schemes, and adaptively schedules between pipeline parallelism and data parallelism. Together with batch inference and communication middleware, ACE-GNN improves stability, latency, and energy efficiency for device-edge GNN serving.
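
The switch between pipeline and data parallelism can be pictured with a toy predictor. The profile numbers, the two analytical formulas, and the function names below are all hypothetical. ACE-GNN's actual performance predictor and search procedure are not described in this announcement.

```python
# Hypothetical per-sample profile (times in ms, sizes in MB).
PROFILE = {
    "device_full_ms": 10.0,   # whole model on the device
    "edge_full_ms": 2.0,      # whole model on the edge server
    "device_front_ms": 3.0,   # first pipeline stage on the device
    "edge_back_ms": 1.5,      # second pipeline stage on the edge server
    "input_mb": 0.5,          # raw input shipped to the edge in data parallelism
    "feat_mb": 0.1,           # intermediate features shipped in pipeline mode
}


def predict_pipeline_ms(bw_mb_per_s, p=PROFILE):
    """Steady-state time per sample is bounded by the slowest pipeline stage."""
    transfer = p["feat_mb"] / bw_mb_per_s * 1000.0
    return max(p["device_front_ms"], transfer, p["edge_back_ms"])


def predict_data_parallel_ms(bw_mb_per_s, p=PROFILE):
    """Samples are split so that device and edge finish at the same time."""
    edge_cost = p["input_mb"] / bw_mb_per_s * 1000.0 + p["edge_full_ms"]
    dev_cost = p["device_full_ms"]
    return dev_cost * edge_cost / (dev_cost + edge_cost)


def choose_scheme(bw_mb_per_s):
    pp = predict_pipeline_ms(bw_mb_per_s)
    dp = predict_data_parallel_ms(bw_mb_per_s)
    return ("pipeline", pp) if pp <= dp else ("data_parallel", dp)


for bw in (100.0, 10.0):
    print(bw, choose_scheme(bw))
```

Under this made-up profile the pipeline wins on a fast link, and data parallelism takes over once the feature transfer becomes the slowest stage. A runtime scheduler would re-evaluate the choice whenever measured bandwidth or load changes.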

ACE-GNN is designed for environments where static deployment choices quickly become suboptimal. Edge bandwidth can fluctuate, device load can change, and the graph structure itself can create unpredictable communication patterns.

By adapting scheduling decisions at runtime, the framework improves robustness for real-world GNN serving. It also demonstrates that efficient co-inference requires both model-level awareness and system-level scheduling, especially when multiple devices and edge resources collaborate.

CIMFlow accepted by DAC 2025: an integrated framework for systematic digital CIM design and evaluation

Mon, 23 Jun 2025 00:00:00 +0000

👏 Paper title: CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures.

CIMFlow is an integrated framework for systematic design and evaluation of digital compute-in-memory architectures. CIM research often requires coordination between software workloads, architecture definitions, compilation, and simulation, but these pieces are frequently developed in isolation.

CIMFlow provides a full-stack infrastructure that includes an instruction set architecture, an MLIR-based compiler, and a SystemC-based simulator. Its modular design supports flexible architectural exploration and helps researchers rapidly prototype, validate, and compare digital CIM concepts on DNN workloads.

The framework reduces the friction of evaluating CIM ideas because it links the software stack and architecture model in one flow. Researchers can express workloads, compile them toward CIM execution, and analyze performance with a simulator that reflects architectural choices.

This makes CIMFlow useful not only as a tool but also as a common methodology for comparing digital CIM designs. It helps move CIM research from isolated prototypes toward reproducible system-level exploration.

“Yingjie Qi presented the CIMFlow work at the DAC 2025 conference in San Francisco.”

Finesse accepted by ISCA 2025: agile software-hardware co-design for pairing-based cryptography

Fri, 20 Jun 2025 00:00:00 +0000

👏 Paper title: Finesse: An Agile Design Framework for Pairing-based Cryptography via Software/Hardware Co-Design.

Finesse addresses the long design cycle and limited flexibility of accelerators for pairing-based cryptography. Pairing workloads are important for modern cryptographic applications, but changing algorithms, curve parameters, and system requirements can make fixed accelerator designs difficult to maintain.

The framework combines a specialized compiler, multi-granularity simulation, a unified IR/ISA abstraction, and parameterized pipelined hardware. This software-hardware co-design flow enables rapid design-space exploration and faster iteration while delivering high-throughput pairing acceleration across different curve families and hardware configurations.

Finesse is valuable because pairing-based cryptography has complex arithmetic kernels and diverse parameter choices. A fixed-function accelerator may be fast for one configuration but difficult to adapt as cryptographic requirements evolve.

By providing a reusable design framework, Finesse lets software and hardware decisions be explored together. This shortens the path from algorithm specification to efficient accelerator implementation and helps make high-performance cryptographic hardware more agile.

GitHub: BUAA-CI-LAB/Finesse

“Tianwei Pan presented the Finesse framework at the ISCA 2025 conference in Tokyo.”

IEEE TCAD paper on SRAM-PIM co-design with joint value-level and bit-level sparsity

Mon, 16 Jun 2025 00:00:00 +0000

👏 Paper title: Efficient SRAM-PIM Co-Design by Joint Exploration of Value-Level and Bit-Level Sparsity.

This work explores sparsity in SRAM-based processing-in-memory at two complementary levels. Value-level sparsity can skip zero operands, while bit-level sparsity can remove unnecessary operations inside nonzero numerical values. Treating only one level leaves useful efficiency opportunities unused.

The proposed co-design jointly exploits structured zero values and unstructured zero bits in digital SRAM-PIM arrays. By aligning sparsity-aware algorithms with hardware execution, the framework reduces redundant computation and improves accelerator efficiency beyond what single-level sparsity optimization can provide.
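
The difference between the two sparsity levels is easy to measure on a small example. The sketch below assumes unsigned 8-bit weight magnitudes. The weight values and the function name `sparsity_stats` are illustrative and not taken from the paper.

```python
def sparsity_stats(weights, bits=8):
    """weights: unsigned integers below 2**bits, e.g. quantized magnitudes."""
    zero_values = sum(1 for w in weights if w == 0)
    nonzero = [w for w in weights if w != 0]
    # Zero bits that remain inside the nonzero values.
    zero_bits = sum(bits - bin(w).count("1") for w in nonzero)
    return {
        "value_sparsity": zero_values / len(weights),
        "bit_sparsity_in_nonzero": (
            zero_bits / (bits * len(nonzero)) if nonzero else 0.0
        ),
    }


weights = [0, 0, 1, 16, 3, 0, 128, 5]
stats = sparsity_stats(weights)
print(stats)
```

Here only three of the eight values are zero, yet most bits inside the remaining five values are zero as well. Skipping zero operands alone would leave that second source of redundancy untouched.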

This work is motivated by the observation that the two kinds of sparsity are complementary because they arise at different levels of the computation. Value-level sparsity removes entire operands, while bit-level sparsity still reduces work inside the nonzero values that remain.

By combining both views, the co-design exposes more optimization opportunities for SRAM-PIM arrays. It also provides a more complete framework for sparse DNN acceleration, where model-level pruning and hardware-level bit operations need to be optimized together.

DAC 2024 paper on automated GNN design and deployment for device-edge co-inference systems

Sun, 23 Jun 2024 00:00:00 +0000

👏 Paper title: Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems.

This work studies how to design and deploy GNNs across a device-edge co-inference system. Running all computation on the device can exceed local resources, while offloading too much work can create communication bottlenecks, especially for graph workloads with irregular data dependencies.

The proposed framework treats model architecture and deployment mapping as a joint design problem. By modeling communication together with computation, it searches for GNN architectures and partitioning schemes that improve end-to-end efficiency rather than optimizing model accuracy in isolation.

Joint treatment matters because GNN co-inference is shaped by both model structure and graph data movement. A design with high accuracy can still perform poorly if intermediate graph features or messages create too much device-edge communication.

By searching the architecture and mapping space together, the framework can find deployment choices that match the system constraints. It provides an early foundation for later device-edge GNN co-inference methods that optimize model design and execution placement as one problem.

DB-PIM accepted by DAC 2024: exploiting unstructured bit-level sparsity for efficient SRAM-PIM design

Sun, 23 Jun 2024 00:00:00 +0000

👏 Paper title: Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity.

DB-PIM improves digital SRAM-PIM by looking beneath value-level sparsity and exploiting sparsity at the bit level. Many neural network operands contain redundant zero bits even when the values themselves are not zero, creating an opportunity for finer-grained acceleration.

The framework combines sparsity-preserving algorithm choices with hardware support such as dyadic block multiplication units and CSD-based adder trees. This co-design reduces unnecessary bit-level computation and improves performance and energy efficiency for SRAM-PIM accelerators.
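
Assuming CSD here refers to canonical signed-digit encoding, the sketch below shows why such a representation helps: it rewrites an integer with digits from {-1, 0, +1} so that fewer digits are nonzero. This is the standard non-adjacent-form conversion, shown for illustration only. How DB-PIM wires the encoding into its adder trees is not covered here.

```python
def to_csd(n):
    """Canonical signed-digit digits of a non-negative int, least significant first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def nonzero_digits(n):
    return sum(1 for d in to_csd(n) if d != 0)


value = 119  # binary 1110111 has six nonzero bits
csd = to_csd(value)
print(csd, bin(value).count("1"), nonzero_digits(value))
```

For 119 the digits encode 128 - 8 - 1, so three nonzero digits replace six nonzero bits, and no two adjacent digits are nonzero.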

This work expands the notion of sparsity from zero values to sparse bit patterns inside numerical representations. That distinction matters for PIM arrays because bit-serial or bit-parallel operations can waste energy on redundant zero bits even when the stored value is nonzero.

DB-PIM therefore connects algorithmic sparsity preservation with hardware mechanisms that can benefit from it. The result is a more fine-grained optimization path for SRAM-PIM architectures, especially for neural network workloads with rich bit-level redundancy.

GNNavigator accepted by DAC 2024: automatic guideline exploration for adaptive GNN training

Sun, 23 Jun 2024 00:00:00 +0000

👏 Paper title: GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline Exploration.

GNNavigator addresses the difficulty of optimizing GNN training across diverse graphs, models, and hardware platforms. GNN training performance depends on sampling, aggregation, memory behavior, and device characteristics, making fixed optimization rules brittle.

The framework introduces software-hardware co-abstraction and performance modeling to automatically explore training guidelines. It helps select configurations that balance runtime, memory usage, and accuracy, enabling more adaptive GNN training optimization across graph learning workloads.

The work is motivated by the fact that GNN training behavior varies dramatically across datasets and platforms. Sampling choices, graph topology, feature dimensions, and accelerator characteristics can all change the best training strategy.

GNNavigator turns this tuning process into an automated exploration problem. Instead of relying on fixed heuristics, it searches for guidelines that match the current workload and hardware, making GNN training more robust across diverse scenarios.

DDC-PIM accepted by IEEE TCAD: doubling effective data capacity for SRAM-based processing-in-memory

Tue, 07 Nov 2023 00:00:00 +0000

👏 Paper title: DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory.

DDC-PIM targets a central limitation of SRAM-based processing-in-memory: the tight relationship between array capacity and computable data placement. While SRAM-PIM can reduce data movement, practical designs must still fit operands and intermediate data into limited array resources.

The paper exploits the cross-coupled structure of 6T SRAM cells to store and use complementary bit pairs more efficiently. Through algorithm and architecture co-design, DDC-PIM increases effective data capacity for SRAM-PIM arrays and improves the practicality of in-memory DNN acceleration.

The work is especially relevant because many SRAM-PIM optimizations are constrained by how data must be placed inside memory arrays. Improving effective capacity can reduce data remapping, array pressure, and unnecessary movement between compute and storage regions.

DDC-PIM shows how circuit-level properties and algorithm-level mapping can reinforce one another. This makes it a representative example of why PIM design often needs joint reasoning across devices, arrays, data representation, and neural network computation.

IEEE CAL paper on architectural implications of GNN aggregation programming abstractions

Wed, 01 Nov 2023 00:00:00 +0000

👏 Paper title: Architectural Implications of GNN Aggregation Programming Abstractions.

GNN aggregation is often expressed through high-level programming abstractions, but different abstractions can imply very different data movement, parallelism, and memory-access behavior. This paper studies the architectural consequences of these abstraction choices.

The work builds a taxonomy around data organization and propagation patterns, then characterizes performance across graph properties and hardware platforms. The resulting analysis helps clarify when an aggregation abstraction is friendly to acceleration and when it may introduce hidden inefficiencies.
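
To make the idea of differing abstractions concrete, the sketch below writes the same sum aggregation in two ways. The labels "scatter" and "gather" and the plain-list data layout are assumptions for illustration and may not match the paper's taxonomy.

```python
def aggregate_scatter(num_nodes, edges, feats):
    """Edge-centric: each edge (src, dst) pushes the source feature to the destination."""
    out = [0.0] * num_nodes
    for src, dst in edges:
        out[dst] += feats[src]  # sequential over edges, irregular writes to out
    return out


def aggregate_gather(num_nodes, edges, feats):
    """Vertex-centric: each node pulls features from its in-neighbors."""
    in_nbrs = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        in_nbrs[dst].append(src)
    # Sequential over nodes, irregular reads from feats.
    return [sum((feats[s] for s in nbrs), 0.0) for nbrs in in_nbrs]


edges = [(0, 1), (0, 2), (1, 2), (3, 2)]
feats = [1.0, 2.0, 3.0, 4.0]
print(aggregate_scatter(4, edges, feats))
print(aggregate_gather(4, edges, feats))
```

Both functions return the same result, but the first concentrates irregular accesses on writes and the second on reads. That difference affects write conflicts under parallel execution and how well each form maps onto a given memory system.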

This is useful because GNN software abstractions are often selected for programming convenience, but they can strongly influence memory traffic, scheduling opportunities, and hardware utilization. The paper makes these implications visible and measurable.

For accelerator and framework designers, the study provides guidance on matching aggregation APIs to hardware behavior. It also helps identify which abstraction patterns are likely to scale across graph datasets and which may require architecture-specific optimization.

L2 compression accepted by ICCV 2023: unified lossy and lossless post-training model size compression

Mon, 02 Oct 2023 00:00:00 +0000

👏 Paper title: Lossy and Lossless (L2) Post-training Model Size Compression.

The L2 compression framework addresses the storage and transmission cost of deep neural networks after training. Instead of treating lossy and lossless compression as separate steps, the paper combines them into a unified post-training workflow.

The method introduces a parametric weight transformation to coordinate different lossy compression choices and a differentiable counter to guide optimization toward a compression-friendly representation. It can target a desired global compression ratio, allocate adaptive compression across layers, and preserve model accuracy while substantially reducing model size.

Unlike approaches that rely only on quantization, pruning, or generic entropy coding, L2 treats the compression pipeline as a coupled optimization problem. This lets the framework reason about how weight transformation affects downstream lossless encoding.

The result is useful for model deployment scenarios where storage, transmission, or on-device memory is constrained. By operating after training, L2 also provides a practical compression option for existing models without requiring a full retraining pipeline.

DAC 2023 paper on hardware-aware automated GNN design for edge computing platforms

Sun, 09 Jul 2023 00:00:00 +0000

👏 Paper title: Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms.

This paper focuses on automated GNN design under edge deployment constraints. GNN architecture choices affect not only accuracy but also latency, memory behavior, and suitability for different edge devices, so model search must account for hardware behavior from the beginning.

The proposed hardware-aware design flow evaluates candidate GNN architectures with deployment efficiency in mind and incorporates device heterogeneity into the search process. By connecting architecture search with edge-platform constraints, the method improves the balance between model quality and practical execution cost.

The work recognizes that a GNN architecture that performs well in isolation may be unsuitable for real edge deployment if it causes excessive latency, memory pressure, or energy use. Hardware-aware search therefore becomes a necessary part of model design rather than an afterthought.

By integrating deployment feedback into the automated design loop, the approach can discover architectures that better match specific edge devices. This provides a more realistic path for moving GNN models from research benchmarks to constrained computing platforms.

IEEE TCAS-I paper on reconfigurable in-cache MPUF systems using SOT-MRAM true randomness

Fri, 01 Apr 2022 00:00:00 +0000

👏 Paper title: Reconfigurable and Dynamically Transformable In-Cache-MPUF System With True Randomness Based on the SOT-MRAM.

This work explores secure hardware primitives built directly inside cache memory. Physical unclonable functions benefit from device-level variation and randomness, while in-cache integration reduces the need for separate security blocks and keeps sensitive operations close to memory.

The proposed in-cache MPUF system uses spin-orbit-torque MRAM and exploits thermal noise as a true randomness source. Its reconfigurable and dynamically transformable design supports flexible challenge-response behavior, strengthening security primitives while reusing memory structures already present in computing systems.

By integrating security functionality into cache-like memory structures, the design reduces the need for separate cryptographic hardware blocks. It also benefits from the non-volatility and stochastic properties of SOT-MRAM, which can be turned into useful entropy sources.

The result is a security-oriented memory design that combines storage, randomness, and configurable identity generation. This is especially relevant for lightweight authentication and hardware security in systems where area and energy overhead must remain low.

SCIS paper on NAND-SPIN processing-in-MRAM architecture for CNN acceleration

Fri, 01 Apr 2022 00:00:00 +0000

👏 Paper title: NAND-SPIN-Based Processing-in-MRAM Architecture for Convolutional Neural Network Acceleration.

This paper investigates how spintronic memory can be used not only for storage but also for computation. CNN inference repeatedly moves weights and activations between memory and compute units, so a processing-in-MRAM design can directly target one of the largest efficiency bottlenecks.

The proposed NAND-SPIN-based architecture uses in-memory logic and data-local execution to accelerate convolution-heavy neural network workloads. By embedding useful computation into MRAM structures, the design reduces data movement and opens a path toward more energy-efficient neural network accelerators.

The architecture is motivated by the high cost of repeatedly fetching CNN weights and activations from memory. By taking advantage of spintronic memory behavior, computation can be placed closer to stored data and performed with less communication overhead.

This makes the work part of a broader shift from processor-centric acceleration to memory-centric acceleration. For CNN workloads, where convolution dominates both data reuse and data movement, such PIM designs can substantially improve energy efficiency.

IEEE TCAD paper on accelerating graph connected components with emerging processing-in-memory

Tue, 01 Mar 2022 00:00:00 +0000

👏 Paper title: Accelerating Graph Connected Component Computation with Emerging Processing-In-Memory Architecture.

Connected component computation is a fundamental graph primitive, but its irregular memory accesses and large working sets make it difficult to accelerate using conventional processor-memory organization. This paper explores how emerging processing-in-memory architectures can reduce the data movement that dominates graph analytics workloads.

The work combines algorithmic adaptation with architectural support. By reorganizing graph traversal, data placement, and update operations around PIM-friendly execution, the proposed co-design improves locality and reduces off-chip traffic for connected component analysis.

The study highlights that accelerating graph algorithms requires more than placing simple arithmetic near memory. Graph connected component computation involves irregular propagation and repeated updates, so the algorithm itself must be shaped to match the strengths and limits of PIM hardware.
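
The propagation-and-update pattern mentioned here shows up even in a textbook baseline. The sketch below is plain minimum-label propagation, not the paper's PIM-oriented algorithm. It only illustrates where the irregular accesses and repeated passes come from.

```python
def connected_components(num_nodes, edges):
    """Min-label propagation: nodes in the same component end with the same label."""
    labels = list(range(num_nodes))
    changed = True
    while changed:  # repeat until a full pass makes no update
        changed = False
        for u, v in edges:
            low = min(labels[u], labels[v])  # data-dependent reads
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low  # data-dependent writes
                changed = True
    return labels


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)
print(connected_components(4, [(2, 3), (1, 2), (0, 1)]))
```

The second call needs several passes because the edge order works against the direction in which the smallest label has to travel. Each pass touches every edge again, which is the kind of repeated data movement that makes the workload memory-bound.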

By aligning the computation pattern with emerging memory-side execution, the work offers a path toward more efficient graph analytics systems. It also provides design lessons for other graph workloads that suffer from similar data movement and locality challenges.

Eventor accepted by DAC 2022: FPGA acceleration for event-based monocular multi-view stereo

Tue, 01 Feb 2022 00:00:00 +0000

👏 Paper title: Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform.

Eventor accelerates event-based monocular multi-view stereo, a 3D vision workload built around asynchronous event camera streams. Event cameras offer high temporal resolution and sparse output, but the EMVS pipeline still contains intensive stages such as event back-projection and volumetric ray-counting.

The paper maps these critical kernels to an FPGA-based heterogeneous platform with parallel and pipelined processing elements. It also reformulates parts of the EMVS algorithm through scheduling, approximation, and hybrid quantization, improving throughput and energy efficiency for real-time event-based 3D perception on embedded systems.

Eventor is designed around the sparse and asynchronous nature of event data. Instead of treating event streams like conventional video frames, the accelerator uses hardware parallelism to process event-driven geometry operations directly.

This makes the work relevant for robotics, autonomous systems, and low-latency vision, where event cameras can provide fast response under challenging motion or lighting conditions. The FPGA implementation demonstrates how algorithm reformulation and architecture mapping can make event-based 3D reconstruction more practical.

Triangle counting acceleration accepted by IEEE TC: from graph algorithms to in-memory architecture

Mon, 01 Nov 2021 00:00:00 +0000

👏 Paper title: Triangle Counting Accelerations: From Algorithm to In-Memory Computing Architecture.

This work studies triangle counting, a core graph analytics primitive used to measure graph clustering and local connectivity. Because triangle counting repeatedly intersects neighbor sets and moves large graph data structures through the memory hierarchy, conventional CPU-centric execution can be dominated by memory traffic rather than arithmetic.
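
The neighbor-set intersection at the heart of triangle counting can be written compactly. The sketch below is a standard software baseline on an undirected edge list, shown for illustration. It is not the paper's in-memory design.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as a list of (u, v) pairs."""
    higher = {}  # node -> set of neighbors with a larger id
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = min(u, v), max(u, v)
        higher.setdefault(lo, set()).add(hi)
        higher.setdefault(hi, set())
    total = 0
    for u, nbrs in higher.items():
        for v in nbrs:
            # Each triangle a < b < c is found once, at u = a and v = b.
            total += len(nbrs & higher[v])
    return total


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))
```

Orienting every edge from the smaller to the larger node id ensures each triangle is counted exactly once. The work is dominated by set intersections over adjacency data, with very little arithmetic per byte moved.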

The paper develops an acceleration path from algorithm design to processing-in-memory architecture. By moving key operations closer to memory and reorganizing graph computation around data locality, the proposed design reduces unnecessary data movement and improves the efficiency of triangle counting for large graph workloads.

The work is positioned as an end-to-end acceleration study rather than a single hardware tweak. It considers how graph representation, intersection behavior, and memory access patterns interact, then maps the dominant operations onto an in-memory computing substrate.

This kind of co-design matters for graph workloads because data movement, not arithmetic, is often the dominant cost. By reducing traffic across the memory hierarchy, the architecture can improve throughput and energy efficiency for triangle counting and related graph mining tasks.

FedSkel accepted by CIKM 2021: efficient federated learning with skeleton gradient updates

Fri, 01 Oct 2021 00:00:00 +0000

👏 Paper title: FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update.

FedSkel addresses the efficiency bottleneck of federated learning on heterogeneous edge devices, where clients often have very different compute capability, network bandwidth, and data distributions. Instead of updating the full model on every device and transmitting all gradients, FedSkel identifies compact skeleton networks that preserve the most essential model updates.

The framework updates only these skeleton gradients, reducing local back-propagation cost and communication traffic while keeping the learning process effective. This makes federated learning more practical for resource-constrained and imbalanced edge environments, where privacy-preserving training must also respect device-level limitations.
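The idea of updating only a subset of the model can be sketched as follows. This is an illustrative simplification, not FedSkel's actual procedure: the selection rule (keep the channels with the largest gradient L2 norm), the `ratio` parameter, and the function names `select_skeleton` and `skeleton_update` are all assumptions introduced here.

```python
import math


def select_skeleton(channel_grads, ratio):
    """Pick the channels with the largest gradient L2 norm (assumed criterion).

    `channel_grads` holds one list of gradient values per channel.
    Returns the sorted indices of the selected channels.
    """
    keep = max(1, int(len(channel_grads) * ratio))
    norms = [math.sqrt(sum(g * g for g in grad)) for grad in channel_grads]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:keep])


def skeleton_update(weights, channel_grads, skeleton, lr):
    """Apply one SGD step to the skeleton channels only.

    Channels outside `skeleton` keep their weights, so their gradients
    would neither need to be computed nor transmitted.
    """
    updated = [list(w) for w in weights]
    for c in skeleton:
        updated[c] = [w - lr * g for w, g in zip(weights[c], channel_grads[c])]
    return updated
```

In this sketch a client with `ratio=0.5` sends updates for half of its channels. The selection step here looks at all gradients once; in a real system the skeleton would presumably be reused across many local steps so that the saving applies to back-propagation as well as communication.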

From a system perspective, FedSkel is useful because it attacks both major costs in federated learning: the amount of local work performed by each client and the volume of updates exchanged with the server. This is especially important when edge clients differ widely in hardware capability or network quality, since the slowest or weakest devices can otherwise limit the overall training process.

The result is a more deployment-oriented federated learning strategy. Rather than assuming all clients can afford full-gradient training, FedSkel adapts the update workload to the essential structure of the model, helping heterogeneous clients participate more efficiently without abandoning collaborative learning.

S2Engine accepted by IEEE TC: a systolic architecture for sparse convolutional neural networks

Tue, 01 Jun 2021 00:00:00 +0000

👏 Paper title: S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks.

S2Engine targets the challenge of accelerating sparse convolutional neural networks while retaining the regular dataflow advantages of systolic arrays. Sparse CNNs can reduce arithmetic work, but irregular nonzero patterns often make hardware utilization and data reuse difficult.

The architecture coordinates sparse computation, data reuse, and array-level scheduling so that sparsity can be exploited without sacrificing the scalability of systolic execution. This design improves inference efficiency for sparse CNN workloads and offers a hardware-friendly path for deploying compact neural networks.

S2Engine is important because sparse neural networks often create irregular execution patterns that conventional accelerators handle poorly. The design keeps computation structured enough for systolic processing while still skipping redundant work introduced by sparsity.
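What "skipping redundant work" means can be shown with a small software analogy. The sketch below is not S2Engine's dataflow or hardware design; it only illustrates, under an assumed index/value compression format, how a dot product can multiply only where both the weight and the activation are nonzero. The names `compress` and `sparse_dot` are hypothetical.

```python
def compress(vector):
    """Keep only nonzero entries as (index, value) pairs, in index order."""
    return [(i, v) for i, v in enumerate(vector) if v != 0]


def sparse_dot(weights, activations):
    """Dot product that multiplies only where both operands are nonzero.

    Returns (result, number_of_multiplications_performed).
    """
    w, a = compress(weights), compress(activations)
    i = j = 0
    total = 0
    macs = 0
    # Merge-style walk over the two sorted index lists.
    while i < len(w) and j < len(a):
        w_idx, w_val = w[i]
        a_idx, a_val = a[j]
        if w_idx == a_idx:
            total += w_val * a_val
            macs += 1
            i += 1
            j += 1
        elif w_idx < a_idx:
            i += 1  # weight has no matching nonzero activation
        else:
            j += 1  # activation has no matching nonzero weight
    return total, macs
```

With `weights = [0, 2, 0, 0, 3, 0, 1, 0]` and `activations = [1, 4, 0, 0, 0, 0, 5, 2]`, `sparse_dot` returns `(13, 2)`: two multiplications instead of the eight a dense pass would perform. The index matching is data dependent, which is the kind of irregularity that makes regular array execution hard.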

By bridging sparsity and regular array execution, the architecture supports efficient CNN inference for compressed models. This makes it relevant to accelerator designs that need high utilization, predictable data movement, and support for increasingly sparse DNN workloads.

RTAS 2021 paper on memory-efficient graph neural network execution for edge platforms

Sat, 01 May 2021 00:00:00 +0000

👏 Paper title: Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms.

Graph neural networks are attractive for edge intelligence, but their feature tensors and neighborhood aggregation patterns can exceed the limited memory budget of embedded and edge platforms. This paper focuses on reducing the peak memory footprint of GNN execution so that graph workloads can run more reliably on constrained devices.

The proposed feature decomposition method divides feature processing into smaller, manageable pieces while preserving the semantics of GNN computation. By lowering transient memory pressure during inference, the technique enables resource-limited platforms to execute larger graph models or larger graph inputs without relying on expensive memory expansion.

The key insight is that memory efficiency can be improved without changing the high-level GNN task. By decomposing feature computation, the system avoids materializing large intermediate tensors all at once and can schedule memory use more carefully.
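One way to picture this is to split the feature dimension into chunks, so that the aggregated intermediate is only one chunk wide at any time. The sketch below assumes a simple sum-aggregation layer and a split along feature columns; the paper's actual decomposition scheme may differ, and all function names are hypothetical.

```python
def aggregate(neighbors, features):
    """Sum the feature rows of each node's neighbors (sum reducer assumed)."""
    width = len(features[0])
    out = []
    for nbrs in neighbors:
        row = [0.0] * width
        for n in nbrs:
            for k in range(width):
                row[k] += features[n][k]
        out.append(row)
    return out


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gnn_layer_full(neighbors, features, weight):
    """Reference layer: materializes the full N x F aggregated tensor."""
    return matmul(aggregate(neighbors, features), weight)


def gnn_layer_chunked(neighbors, features, weight, chunk):
    """Same layer, but the intermediate is only N x chunk at any time."""
    n, f, h = len(features), len(features[0]), len(weight[0])
    out = [[0.0] * h for _ in range(n)]
    for start in range(0, f, chunk):
        stop = min(start + chunk, f)
        # Aggregate only a slice of the feature columns...
        part = aggregate(neighbors, [row[start:stop] for row in features])
        # ...and fold it into the output with the matching weight rows.
        partial = matmul(part, weight[start:stop])
        for i in range(n):
            for j in range(h):
                out[i][j] += partial[i][j]
    return out
```

Because aggregation and the weight multiplication are both linear, summing the per-chunk partial products gives the same output as the full layer, while the peak size of the aggregated intermediate drops from N x F to N x chunk.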

This is particularly relevant for edge platforms, where memory capacity is often a harder constraint than raw compute. The work gives GNN deployment a more practical path on embedded and mobile devices that cannot simply scale memory with model size.
