👏 Paper title: MIREDO: MIP-Driven Resource-Efficient Dataflow Optimization for Computing-in-Memory Accelerator. We propose MIREDO, a framework that formulates dataflow optimization for Computing-in-Memory (CIM) accelerators as a Mixed-Integer Programming problem. By jointly modeling workload characteristics, dataflow strategies, and CIM-specific constraints with an analytical latency model, MIREDO navigates the vast design space to find optimal configurations, achieving up to 3.2× performance improvement across various DNN models. [related project]
👏 Paper title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning. We identify an information diffusion phenomenon in LLMs, where critical token information spreads across the sequence, enabling aggressive pruning in later layers. Based on this, we propose SlimInfer, which performs dynamic block-wise pruning with a predictor-free asynchronous KV cache manager, achieving up to 2.53× TTFT speedup and 1.88× latency reduction on LLaMA-3.1-8B-Instruct. [related project]
👏 Paper title: CIMinus: Empowering Sparse DNN Workloads Modeling and Exploration on SRAM-based CIM Architectures. We propose CIMinus, a cost modeling framework for efficient design space exploration of sparse DNN workloads on SRAM-based compute-in-memory architectures. It introduces FlexBlock, an expressive sparsity abstraction, and provides an integrated workflow from model pruning to system-level evaluation, accurately estimating speedups and energy savings within 5.27% error. [related project]
👏 Paper title: TinyFormer: Efficient Sparse Transformer Design and Deployment on Tiny Devices. We propose TinyFormer, a framework for developing and deploying resource-efficient transformer models on Microcontrollers (MCUs). Integrating architecture search, sparse model optimization, and automated deployment, it achieves 96.1% accuracy on CIFAR-10 under strict hardware constraints, delivering up to 12.2× inference speedup compared to CMSIS-NN. [related project]
👏 Paper title: Towards Affordable, Adaptive and Automatic GNN Training on CPU-GPU Heterogeneous Platforms. In this paper, we introduce A3GNN, a framework for Affordable, Adaptive, and Automatic GNN training on heterogeneous CPU-GPU platforms. It improves resource usage through locality-aware sampling and fine-grained parallelism scheduling. Moreover, it leverages reinforcement learning to explore the design space and achieve Pareto-optimal trade-offs among throughput, memory footprint, and accuracy. [related project]
👏 Paper title: ACE-GNN: Adaptive GNN Co-Inference with System-Aware Scheduling in Dynamic Edge Environments. We present ACE-GNN, the first adaptive GNN co-inference framework for dynamic edge environments. It enables rapid runtime scheme optimization and adaptive scheduling between pipeline and data parallelism, coupled with efficient batching and communication middleware, achieving up to 12.7× speedup and 82.3% energy savings. [related project]
👏 Paper title: CIMFlow: An Integrated Framework for Systematic Design and Evaluation of Digital CIM Architectures. We introduce CIMFlow, an integrated framework that bridges compilation and simulation with a flexible ISA for digital Compute-in-Memory architectures. It addresses SRAM capacity limits through advanced partitioning and parallelism, achieving up to 2.8× speedup and 61.7% energy reduction across diverse deep learning workloads. [related project]
👏 Paper title: Finesse: An Agile Design Framework for Pairing-based Cryptography via Software/Hardware Co-Design. Finesse introduces a software/hardware co-design framework for pairing-based cryptography, featuring a unified IR/ISA/hardware abstraction, a parameterized pipelined architecture, and an optimizing compiler. It achieves up to 6.2× higher iso-area throughput than prior flexible designs and outperforms specialized ASICs by up to 3.2×. [related project]
👏 Paper title: Efficient SRAM-PIM Co-Design by Joint Exploration of Value-Level and Bit-Level Sparsity. We propose Dyadic Block PIM (DB-PIM), an algorithm-architecture co-design framework harnessing both value-level and bit-level sparsity in digital SRAM-PIM. It circumvents structured zero values in weights and bypasses unstructured zero bits, skipping a majority of unnecessary computations for significant efficiency gains. [related project]
👏 Paper title: GNNavigator: Towards Adaptive Training of Graph Neural Networks via Automatic Guideline Exploration. GNNavigator introduces an adaptive GNN training configuration optimization framework that balances runtime, memory, and accuracy. By leveraging a unified software-hardware co-abstraction and a novel training performance model, it meets diverse application requirements through effective design space exploration. [related project]
👏 Paper title: Graph Neural Networks Automated Design and Deployment on Device-Edge Co-Inference Systems. In this paper, we abstract the communication process in device-edge co-inference into a specific operation, creating a unified design space for GNN architecture and co-inference schemes. Using random search, we jointly optimize the two, yielding GNN architectures with integrated partitioning schemes that trade off communication against computation and outperform state-of-the-art methods. [related project]
👏 Paper title: Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity. We propose Dyadic Block PIM (DB-PIM), an algorithm-architecture co-design framework exploiting unstructured bit-level sparsity in SRAM-PIM. It combines a novel sparsity-preserving algorithm with dyadic block multiplication units and CSD-based adder trees, achieving up to 6.53× speedup and 77.50% energy savings. [related project]
👏 Paper title: DDC-PIM: Efficient Algorithm/Architecture Co-Design for Doubling Data Capacity of SRAM-Based Processing-in-Memory. We propose DDC-PIM, an algorithm/architecture co-design methodology that doubles the equivalent data capacity of SRAM. By exploiting the cross-coupled structure of 6T SRAM to store bitwise complementary pairs in their complementary states, it maximizes data capacity and integration density of each SRAM cell. [related project]
👏 Paper title: Architectural Implications of GNN Aggregation Programming Abstractions. This paper evaluates the architectural implications of programming abstractions for Graph Neural Network (GNN) aggregation. It introduces a taxonomy based on data organization and propagation methods and performs a comprehensive performance characterization across platforms and graph properties. Key findings include insights into abstraction selection, hardware adaptability, and the structural impact of graphs, providing valuable guidance for GNN acceleration research. [related project]
👏 Paper title: Lossy and Lossless (L2) Post-training Model Size Compression. We propose a unified post-training model size compression method combining lossy and lossless techniques with parametric weight transformation and a differentiable counter. It achieves a stable 10× compression ratio without accuracy loss and 20× with minimal degradation, with controllable global compression and layer-wise adaptation. [related project]
👏 Paper title: Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms. We explore hardware-aware GNN architecture design for edge devices, leveraging “predicting GNNs with GNNs” to efficiently estimate candidate architecture performance during NAS. By integrating device heterogeneity analysis into exploration, our method achieves significant improvements in both accuracy and efficiency. [related project]
👏 Paper title: Reconfigurable and Dynamically Transformable In-Cache-MPUF System With True Randomness Based on the SOT-MRAM. In this paper, we present a reconfigurable Physically Unclonable Function (PUF) based on Spin-Orbit-Torque Magnetic Random-Access Memory (SOT-MRAM), which exploits thermal noise as the true dynamic entropy source. [related project]
👏 Paper title: Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform. In this paper, we propose Eventor, a fast and efficient event-based monocular multi-view stereo (EMVS) accelerator that implements the most critical and time-consuming stages, including event back-projection and volumetric ray-counting, on FPGA. [related project]
👏 Paper title: FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by only updating the model’s essential parts, named skeleton networks. [related project]