MIREDO paper is accepted by ASP-DAC 2026!

👏 Paper title: MIREDO: MIP-Driven Resource-Efficient Dataflow Optimization for Computing-in-Memory Accelerator. We propose MIREDO, a framework that formulates dataflow optimization for Computing-in-Memory (CIM) accelerators as a Mixed-Integer Programming problem. By jointly modeling workload characteristics, dataflow strategies, and CIM-specific constraints with an analytical latency model, MIREDO navigates the vast design space to find optimal configurations, achieving up to 3.2× performance improvement across various DNN models. [related project]

SlimInfer paper is accepted by AAAI 2026!

👏 Paper title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning. We identify an information diffusion phenomenon in LLMs, where critical token information spreads across the sequence, enabling aggressive pruning in later layers. Based on this, we propose SlimInfer, which performs dynamic block-wise pruning with a predictor-free asynchronous KV cache manager, achieving up to 2.53× TTFT speedup and 1.88× latency reduction on LLaMA-3.1-8B-Instruct. [related project]

CIMinus paper is accepted by TC 2025!

👏 Paper title: CIMinus: Empowering Sparse DNN Workloads Modeling and Exploration on SRAM-based CIM Architectures. We propose CIMinus, a cost modeling framework for efficient design space exploration of sparse DNN workloads on SRAM-based compute-in-memory architectures. It introduces FlexBlock, an expressive sparsity abstraction, and provides an integrated workflow from model pruning to system-level evaluation, accurately estimating speedups and energy savings within 5.27% error. [related project]

TinyFormer paper is accepted by IEEE TCAS-I 2025!

👏 Paper title: TinyFormer: Efficient Sparse Transformer Design and Deployment on Tiny Devices. We propose TinyFormer, a framework for developing and deploying resource-efficient transformer models on Microcontrollers (MCUs). Integrating architecture search, sparse model optimization, and automated deployment, it achieves 96.1% accuracy on CIFAR-10 under strict hardware constraints, delivering up to 12.2× inference speedup compared to CMSIS-NN. [related project]

Distributed GNN training paper is accepted by ICCD 2025!

👏 Paper title: Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms. In this paper, we introduce A3GNN, a framework for Affordable, Adaptive, and Automatic GNN training on heterogeneous CPU-GPU platforms. It improves resource usage through locality-aware sampling and fine-grained parallelism scheduling. Moreover, it leverages reinforcement learning to explore the design space and achieve Pareto-optimal trade-offs among throughput, memory footprint, and accuracy. [related project]

GNN adaptive co-inference paper is accepted by TCAD 2025!

👏 Paper title: ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments. We present ACE-GNN, the first adaptive GNN co-inference framework for dynamic edge environments. It enables rapid runtime scheme optimization and adaptive scheduling between pipeline and data parallelism, coupled with efficient batching and communication middleware, achieving up to 12.7× speedup and 82.3% energy savings. [related project]

CIMFlow paper is accepted by DAC 2025!

👏 Paper title: CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures. We introduce CIMFlow, an integrated framework that bridges compilation and simulation with a flexible ISA for digital Compute-in-Memory architectures. It addresses SRAM capacity limits through advanced partitioning and parallelism, achieving up to 2.8× speedup and 61.7% energy reduction across diverse deep learning workloads. [related project]

Agile PBC co-design framework paper is accepted by ISCA 2025!

👏 Paper title: Finesse: An Agile Design Framework for Pairing-based Cryptography via Software/Hardware Co-Design. Finesse introduces a software/hardware co-design framework for pairing-based cryptography, featuring a unified IR/ISA/hardware abstraction, a parameterized pipelined architecture, and an optimizing compiler. It achieves up to 6.2× higher iso-area throughput than prior flexible designs and outperforms specialized ASICs by up to 3.2×. [related project]

SRAM-PIM sparsity co-design paper is accepted by TCAD 2025!

👏 Paper title: Efficient SRAM-PIM Co-Design by Joint Exploration of Value-Level and Bit-Level Sparsity. We propose Dyadic Block PIM (DB-PIM), an algorithm-architecture co-design framework harnessing both value-level and bit-level sparsity in digital SRAM-PIM. It circumvents structured zero values in weights and bypasses unstructured zero bits, skipping a majority of unnecessary computations for significant efficiency gains. [related project]

GNN adaptive training paper is accepted by DAC 2024!

👏 Paper title: GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline Exploration. GNNavigator introduces an adaptive GNN training configuration optimization framework that balances runtime, memory, and accuracy. By leveraging a unified software-hardware co-abstraction and a novel training performance model, it meets diverse application requirements through effective design space exploration. [related project]

GNN co-inference paper is accepted by DAC 2024!

👏 Paper title: Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems. In this paper, we abstract the communication process in device-edge co-inference into a specific operation, creating a unified design space for GNN architecture and co-inference schemes. Using random search, we achieve joint optimization, leading to a GNN architecture that integrates partitioning schemes, enabling a trade-off between communication and computation, and outperforming SOTA methods. [related project]

SRAM-PIM architecture design paper is accepted by DAC 2024!

👏 Paper title: Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity. We propose Dyadic Block PIM (DB-PIM), an algorithm-architecture co-design framework exploiting unstructured bit-level sparsity in SRAM-PIM. It combines a novel sparsity-preserving algorithm with dyadic block multiplication units and CSD-based adder trees, achieving up to 6.53× speedup and 77.50% energy savings. [related project]

PIM algorithm/architecture co-design paper is accepted by IEEE TCAD!

👏 Paper title: DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory. We propose DDC-PIM, an algorithm/architecture co-design methodology that doubles the equivalent data capacity of SRAM. By exploiting the cross-coupled structure of 6T SRAM to store bitwise complementary pairs in their complementary states, it maximizes data capacity and integration density of each SRAM cell. [related project]

GNN programming abstraction paper is accepted by IEEE Computer Architecture Letters!

👏 Paper title: Architectural Implications of GNN Aggregation Programming Abstractions. This paper evaluates the architectural implications of programming abstractions for Graph Neural Network (GNN) aggregation. It introduces a taxonomy based on data organization and propagation methods and performs a comprehensive performance characterization across platforms and graph properties. Key findings include insights into abstraction selection, hardware adaptability, and the structural impact of graphs, providing valuable guidance for GNN acceleration research. [related project]

Model compression paper is accepted by ICCV 2023!

👏 Paper title: Lossy and Lossless (L2) Post-training Model Size Compression. We propose a unified post-training model size compression method combining lossy and lossless techniques with parametric weight transformation and a differentiable counter. It achieves a stable 10× compression ratio without accuracy loss and 20× with minimal degradation, with controllable global compression and layer-wise adaptation. [related project]

Hardware-aware GNN NAS paper is accepted by DAC 2023!

👏 Paper title: Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms. We explore hardware-aware GNN architecture design for edge devices, leveraging “predicting GNNs with GNNs” to efficiently estimate candidate architecture performance during NAS. By integrating device heterogeneity analysis into exploration, our method achieves significant improvements in both accuracy and efficiency. [related project]

Reconfigurable In-Cache-MPUF system paper is accepted by IEEE TCAS-I!

👏 Paper title: Reconfigurable and Dynamically Transformable In-Cache-MPUF System With True Randomness Based on the SOT-MRAM. In this paper, we present a reconfigurable Physically Unclonable Function (PUF) based on Spin-Orbit-Torque Magnetic Random-Access Memory (SOT-MRAM), which exploits thermal noise as a true dynamic entropy source. [related project]

Graph CC PIM architecture paper is accepted by IEEE TCAD!

👏 Paper title: Accelerating Graph Connected Component Computation with Emerging Processing-In-Memory Architecture. In this article, we propose to accelerate connected component (CC) computation with the emerging processing-in-memory (PIM) architecture through algorithm-architecture co-design. [related project]

EMVS accelerator paper is accepted by ACM/IEEE DAC 2022!

👏 Paper title: Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform. In this paper, we propose Eventor, a fast and efficient EMVS accelerator that implements the most critical and time-consuming stages, including event back-projection and volumetric ray-counting, on FPGA. [related project]

PIM architecture paper is published in IEEE Transactions on Computers!

👏 Paper title: Triangle Counting Accelerations: From Algorithm to In-Memory Computing Architecture. In this paper, we propose to accelerate triangle counting (TC) with the emerging processing-in-memory (PIM) architecture through algorithm-architecture co-optimization. [related project]

Federated learning paper is published in ACM International Conference on Information & Knowledge Management (CIKM’21)!

👏 Paper title: FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by only updating the model’s essential parts, named skeleton networks. [related project]

Systolic architecture paper is published in IEEE Transactions on Computers!

👏 Paper title: S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks. In this work, we propose S2Engine – a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. [related project]