# SPINBIS: Spintronics-Based Bayesian Inference System With Stochastic Computing

Xiaotao Jia, *Member, IEEE*, Jianlei Yang, *Member, IEEE*, Pengcheng Dai, Runze Liu, Yiran Chen, *Fellow, IEEE*, and Weisheng Zhao, *Fellow, IEEE*

Abstract-Bayesian inference is an effective approach for solving statistical learning problems, especially with uncertainty and incompleteness. However, Bayesian inference is a computing-intensive task whose efficiency is physically limited by the bottlenecks of conventional computing platforms. In this paper, a spintronics-based stochastic computing (SC) approach is proposed for efficient Bayesian inference. The inherent stochastic switching behaviors of spintronic devices are exploited to build a stochastic bitstream generator (SBG) for SC with a hybrid CMOS/magnetic tunnel junction (MTJ) circuit design. Aiming to improve the inference efficiency, an SBG sharing strategy is leveraged to reduce the required SBG array scale by integrating a switch network between the SBG array and the SC logic. A device-to-architecture level framework is proposed to evaluate the performance of the spintronics-based Bayesian inference system (SPINBIS). Experimental results on data fusion applications show that SPINBIS improves energy efficiency by about 12× over the MTJ-based approach, with a 45% design area overhead, and by about 26× over the FPGA-based approach.

*Index Terms*—Bayesian inference, magnetic tunnel junction (MTJ), spintronics, stochastic computing (SC).

#### I. INTRODUCTION

The rise of deep learning has greatly promoted the development of artificial intelligence. However, most deep learning approaches require large-scale training data and are prone to overfitting. Meanwhile,

Manuscript received May 28, 2018; revised September 2, 2018, October 26, 2018, and December 19, 2018; accepted January 25, 2019. Date of publication February 4, 2019; date of current version March 18, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61602022, Grant 61501013, Grant 61571023, Grant 61521091, and Grant 1157040329, in part by the State Key Laboratory of Software Development Environment under Grant SKLSDE-2018ZX-07, in part by the National Key Technology Program of China under Grant 2017ZX01032101, in part by CCF-Tencent under Grant IAGR20180101, and in part by 111 Talent Program under Grant B16001. This paper was recommended by Associate Editor L. P. Carloni. (*Xiaotao Jia and Jianlei Yang Weisheng Zhao.*)

X. Jia, P. Dai, and W. Zhao are with the Beijing Advanced Innovation Center for Big Data and Brain Computing, Fert Beijing Research Institute, School of Microelectronics, Beihang University, Beijing 100191, China (e-mail: weisheng.zhao@buaa.edu.cn).

J. Yang and R. Liu are with the Beijing Advanced Innovation Center for Big Data and Brain Computing, Fert Beijing Research Institute, School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: jianlei@buaa.edu.cn).

Y. Chen is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA.

Digital Object Identifier 10.1109/TCAD.2019.2897631

Fig. 1. Traditional SBG circuits. If x > y, then outputs 1; otherwise outputs 0.

they can neither represent the uncertainty and incompleteness of the world nor take advantage of well-studied human experience. To overcome these limitations, some research utilizes Bayesian inference or combines Bayesian approaches with deep learning. Bayesian inference provides a powerful approach for information fusion, reasoning, and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification, and robust model composition. It is widely used in applications of artificial intelligence and expert systems, such as multisensor fusion [1] and Bayesian belief networks [2]. Recently, Bayesian learning has drawn great attention from the deep learning community and has been combined with many deep neural networks [3].

The foundation of Bayesian inference is Bayes' rule, which can be implemented by probabilistic computing. Probabilistic computing is a computation-intensive task that is inefficient in a deterministic computation mode [4]. Stochastic computing (SC) is an unconventional computing mode that has been observed to be suitable for efficient probabilistic computing [5], with high error tolerance and low-cost implementations of arithmetic operations. However, it is difficult to leverage the parallelism of SC algorithms on traditional von Neumann architectures [6]. Hence, reconfigurable approaches [7] and analog computing [8], [9] are utilized to realize SC in order to improve the Bayesian inference efficiency. SC is usually realized by bit-wise operations on stochastic bitstreams (SBs), which are created by pseudorandom number generators (RNGs) and comparators as shown in Fig. 1. The stochastic bitstream generator (SBG), which is critical for performing SC, is still expensive to implement on von Neumann architectures with CMOS technologies.

Recently, spintronic devices [such as the magnetic tunnel junction (MTJ)] have shown promising advantages for generating random numbers because of their stochastic switching behaviors [11]. As shown in Fig. 2, an MTJ device usually switches in a nondeterministic manner according to the applied bias voltage and duration time due to the inherent

0278-0070 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.



Fig. 2. Experimental measurements of the switching probability with respect to the duration of the applied programming pulse, for different programming voltages [10].

thermal fluctuation of magnetization. Such stochastic switching behavior has been exploited for generating random numbers efficiently [12]–[14]. Consequently, the inherent randomness of spintronic devices can serve as the stochastic resource for performing SC.

In this paper, a spintronic device-based SC approach is proposed for an efficient Bayesian inference system (SPINBIS). The main contributions of this paper are as follows.

- 1) An efficient SBG is proposed by leveraging the stochastic switching behaviors of the MTJ device. The generated bitstreams have very low correlation, which is critical for SC accuracy. A state-aware self-control strategy is adopted to improve the SBG efficiency.
- 2) An SBG sharing strategy is leveraged to reduce the required SBG array scale by integrating a switch matrix between SBG array and SC logic. The power consumption of SPINBIS is greatly reduced benefiting from this strategy.
- 3) A device-to-architecture level framework is built to evaluate the performance of SPINBIS with the data fusion applications. Experimental results indicate that it could achieve significant improvement on inference efficiency in terms of power, area and speed.

The remainder of this paper is organized as follows. Section II states some preliminaries and related works. The architecture of SPINBIS is presented in Section III, together with the SBG sharing techniques. Section IV describes the SBG circuit design and the state-aware self-control strategy. A device-to-architecture evaluation framework and experimental results on typical applications are illustrated in Section V. Concluding remarks are given in Section VI.

#### II. BACKGROUND

#### A. Stochastic Computing

SC was first introduced in the 1960s by Von Neumann [15]. The basic idea of SC is to represent probability value p by the ratio of "1" in the binary bitstreams. It is obvious that the representation of p by stochastic bitstream is not unique. The value of the bitstream is only related to its length and the count of 1, but has nothing to do with the position of 1. There are two encoding formats for stochastic bitstream: 1) unipolar and 2) bipolar format. The value range of unipolar is [0, 1] while



Fig. 3. Stochastic circuit that realizes the arithmetic function  $y = x_1x_2x_4 + x_3(1 - x_4)$ .



Fig. 4. (a) Typical structure of PMA-MTJ. (b) Circuit schematic view of 1T1MTJ structure.

bipolar is [-1, 1]. If the bitstream length is *n*, out of which *k* bits are 1s, then the probability value *p* is represented by p = k/n in the unipolar encoding format or p = (2k-n)/n in the bipolar encoding format. In this paper, the unipolar encoding format is adopted because  $p \in [0, 1]$  in the Bayesian inference problem. Arithmetic operations in SC are realized by simple logic gates. For example, multiplication is achieved by an AND gate and scaled addition is achieved by a MUX, as shown in Fig. 3. Even though there is a slight loss in computation accuracy, the advantage of SC is that it can significantly improve the computation energy efficiency compared with conventional methods [16]–[19]. SC is therefore well suited to inherently error-resilient applications, where it trades accuracy against energy efficiency [20]–[22].
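The encoding and gate-level arithmetic described above can be illustrated with a small software sketch. It assumes that the circuit of Fig. 3 is an AND gate feeding a MUX whose select input is $x_4$; the function names, the probability values, and the use of Python's `random` module as the bit source are illustrative assumptions, not part of the original design.

```python
import random

def to_bitstream(p, n, rng):
    """Encode probability p as a unipolar stochastic bitstream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def unipolar_value(bits):
    """Decode with the unipolar format: p = k / n."""
    return sum(bits) / len(bits)

def bipolar_value(bits):
    """Decode the same bits with the bipolar format: p = (2k - n) / n."""
    n = len(bits)
    return (2 * sum(bits) - n) / n

def sc_and(a, b):
    """Multiplication: bitwise AND of two independent bitstreams."""
    return [x & y for x, y in zip(a, b)]

def sc_mux(select, a, b):
    """Scaled addition: pass a when select is 1, otherwise pass b."""
    return [x if s else y for s, x, y in zip(select, a, b)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
n = 100_000
x1, x2, x3, x4 = (to_bitstream(p, n, rng) for p in (0.8, 0.5, 0.6, 0.3))

# Fig. 3: y = x1*x2*x4 + x3*(1 - x4), with x4 assumed to drive the MUX select.
y = sc_mux(x4, sc_and(x1, x2), x3)
exact = 0.8 * 0.5 * 0.3 + 0.6 * (1 - 0.3)
print(f"stochastic estimate {unipolar_value(y):.4f}, exact {exact:.4f}")
```

The estimate approaches the exact value as the bitstream length grows, which is the accuracy and cost tradeoff mentioned in the text.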

SC is not an exact computing method, and its accuracy loss arises from several sources. The first is that the probability values *p* are usually converted to stochastic bitstreams with a lower quantization accuracy than fixed- or floating-point methods. The second is that correlations between different bitstreams degrade the computation accuracy, since these bitstreams are usually obtained from pseudo RNGs. Aiming to improve the quality of SBGs, researchers have proposed several SBG models, such as linear feedback shift registers (LFSRs) [23]–[25] and the weighted binary SNG [26]. However, such CMOS-based approaches usually suffer from bottlenecks in power consumption and chip area efficiency. Consequently, emerging device-based approaches are investigated in this paper.
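The effect of correlation can be seen with a comparator-based SBG in the style of Fig. 1. The sketch below uses a 16-bit Fibonacci LFSR with taps 16, 14, 13, and 11 as the pseudo RNG; the choice of LFSR, the seed, and the function names are assumptions made for illustration and are not taken from [23]–[26].

```python
def lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def comparator_sbg(p, n, seed=0xACE1):
    """Fig. 1 style SBG: output 1 when the binary number x exceeds the random number y."""
    x = round(p * 65536)   # binary number that encodes the probability
    state, bits = seed, []
    for _ in range(n):
        state = lfsr16(state)
        bits.append(1 if x > state else 0)
    return bits

def value(bits):
    return sum(bits) / len(bits)

n = 65535
a = comparator_sbg(0.5, n)
b = comparator_sbg(0.5, n)   # same seed as a, so the two streams are fully correlated
product = [u & v for u, v in zip(a, b)]

# An AND gate should give 0.5 * 0.5 = 0.25, but identical streams return p itself.
print(value(a), value(product))
```

Because the two streams are bit-for-bit identical, the AND gate returns the value of `a` instead of the product, which is why independent bitstream sources matter for SC accuracy.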

#### B. Magnetic Tunnel Junction Device

Fig. 4(a) shows a typical structure of the MTJ device with perpendicular magnetic anisotropy (PMA) [27]. MTJ is a

sandwich structure consisting of two ferromagnetic (FM) layers and a tunneling barrier layer. One FM layer is defined as the reference layer (PL) with a fixed magnetization direction. The other FM layer is the free layer (FL), whose magnetization direction can be parallel or anti-parallel to that of the PL. The MTJ resistance is determined by the relative magnetization directions of the PL and FL: parallel (P) magnetization behaves as the low-resistance  $(R_P)$  state (logic 0) and anti-parallel (AP) magnetization behaves as the high-resistance  $(R_{AP})$  state (logic 1). The tunnel magnetoresistance ratio TMR =  $(R_{AP} - R_P)/R_P$ is defined to characterize the relationship between  $R_P$  and  $R_{AP}$ . Fig. 4(b) shows the circuit schematic view of a popular 1T1MTJ memory cell. The MTJ state can be flipped by injecting a polarized current through the spin transfer torque mechanism. The switching current is controlled by the voltage between the bit-line (BL) and the source-line (SL). The nMOS transistor serves as a switch and is controlled by the word-line. As shown in Fig. 4(b), the MTJ state is switched from the P state to the AP state if the injected current  $(I_{P \rightarrow AP})$  flows through the MTJ from FL to PL. Conversely, the MTJ state is switched from the AP state to the P state if  $I_{AP \rightarrow P}$  is injected. The MTJ state can be flipped only if the current driven by the applied bias voltage exceeds a critical current  $I_{c0}$  for a long enough duration, as shown in Fig. 2.

The stochastic behavior of MTJ switching was revealed in [11] and results from the unavoidable thermal fluctuations of magnetization [28]. The MTJ device switches in a stochastic manner according to the applied voltage magnitude and duration time, as shown in Fig. 2, so each switching attempt can be represented as a random event with probability *p*. Such stochastic behaviors have been evaluated in MRAM designs [29], neural networks [30], [31], in-memory computing [32], and RNGs [33]. This inherent probabilistic switching property is therefore a very promising way to generate SBs for SC.
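To make the dependence on pulse voltage and duration concrete, the sketch below uses a commonly quoted thermal-activation form, $P = 1 - \exp(-t/\tau)$ with $\tau = \tau_0 \exp(\Delta (1 - V/V_{c0}))$. This model and all parameter values (`vc0`, `delta`, `tau0`) are assumptions for illustration only; the paper itself relies on the measured curves of Fig. 2.

```python
import math

def switching_probability(t_pulse, voltage, vc0=1.0, delta=40.0, tau0=1e-9):
    """Assumed thermal-activation model of the MTJ switching probability.

    vc0 (critical voltage), delta (thermal stability factor), and tau0
    (attempt time) are hypothetical values, not measured device data.
    """
    tau = tau0 * math.exp(delta * (1.0 - voltage / vc0))
    return 1.0 - math.exp(-t_pulse / tau)

def pulse_for_probability(p, voltage, vc0=1.0, delta=40.0, tau0=1e-9):
    """Invert the model: pulse duration that gives switching probability p < 1."""
    tau = tau0 * math.exp(delta * (1.0 - voltage / vc0))
    return -tau * math.log(1.0 - p)

# Probability rises with both pulse duration and voltage, as in Fig. 2.
for v in (0.85, 0.90, 0.95):
    row = [round(switching_probability(t, v), 3) for t in (10e-9, 50e-9, 200e-9)]
    print(v, row)
```

Under this assumed model, fixing the voltage and varying the duration (or the reverse) selects the probability of one stochastic bit.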

Recently, a simple MTJ-based SBG was proposed in [12], but it lacks many circuit details. In [34], an MTJ-based analog-to-stochastic converter is proposed for stochastic computation in vision chips. In [17], MTJ-based SC is integrated into artificial neural network applications. However, the energy efficiency of their SBGs is relatively low. Furthermore, they have not considered the correlation between different SBGs, which significantly degrades the computation accuracy. Voltage-controlled MTJs are introduced for SC to reduce the power consumption in [14], but each SBG involves too many MTJs. Bitstream correlation is discussed in [14]; however, the proposed shuffle operation cannot fundamentally remove the correlation and may still lead to unexpected results.

#### III. SPINBIS ARCHITECTURE

#### A. Motivation

A typical Bayesian inference system (BIS) [35] is shown in Fig. 5. The input of BIS is a set of bias voltages corresponding to evidence or likelihood. SBG array is utilized to generate SBs according to the input voltages. The bitstreams are processed by the following SC logics which are determined by the given application. There are two major concerns to realize Bayesian



Fig. 5. Typical BIS [35].



Fig. 6. Proposed SPINBIS architecture.

inference applications on such a system. One concern is that it usually requires a large number of SBGs because each evidence is represented by one SBG. As we have observed in many applications, especially large-scale ones, many evidences have the same probability and may share the same SBG to reduce the required SBG array scale. The other concern is that it usually requires digital-to-analog converters (DACs) to convert the digital input sources into analog bias voltages [35]. Meanwhile, the bias voltage margin is usually very small, so high-accuracy DACs are required to resolve the inputs within this margin, which makes the design overhead difficult to tolerate. In this paper, SPINBIS is proposed to overcome these two disadvantages.

#### B. Overview of SPINBIS

SPINBIS is a spin-based BIS. As shown in Fig. 6, and unlike the previous approach [35], SPINBIS exploits an SBG sharing strategy to significantly reduce the required array size. The SBG sharing strategy allows inputs with the same evidence to be represented by the bitstream generated from the same SBG. However, inputs that are connected together by one or more logic gates are regarded as *conflicting* with each other. Conflicting inputs are not allowed to share the same SBG. As shown in Fig. 6, the conflict sets, which contain the conflicting relationships, are extracted from the SC logic. The SC logic block is determined by the specified application. The SBG array is prebuilt according to the specified application, and the generated bitstreams are assigned to the SC logic by a switch matrix that is controlled by the digital inputs [36]. The input of the switch matrix is the set of bitstreams generated by the SBG array, and the output is connected to the



Fig. 7. SC (a) logic diagram and its (b) conflict set. Terminals of  $T_1 \sim T_9$  are supposed to have the probabilities  $\{p_1, p_2, p_1, p_3, p_1, p_4, p_5, p_3, p_3\}$ .

SC logics. The switch matrix is a crossbar structure in which each cross point is realized by a transistor controlled by the switch controller. In summary, the bitstreams from the SBG array are assigned to the SC logics through the switch matrix, which is controlled by the switch controller.

#### C. Stochastic Computing Logic and Conflict Set

One of the most attractive advantages of SC is that the involved arithmetic operations can be efficiently realized by simple logic gates, including AND, MUX, etc. The SC logics are determined by the specified application. A computing logic with N inputs requires N bitstreams from the switch matrix. In the example SC logic of Fig. 7(a), there are two independent subcircuits with nine inputs ( $T_1, T_2, \ldots, T_9$ ) and two outputs,  $R_1$  and  $R_2$ . A naive BIS [35] requires nine bitstreams from an SBG array with nine SBG circuits.

Suppose that  $bs_i$  means the input bitstream of terminal  $T_i$ , and  $p_i$  is the input probability of terminal  $T_i$ . p(bs) means the corresponding probability of bitstream bs. The SC logics in Fig. 7(a) are built to realize

$$\begin{aligned} p_{R_1} &= p_1 \cdot p_2 \cdot p_5 + p_3 \cdot p_4 \cdot (1 - p_5) \\ &= p(bs_1 \& bs_2 \& bs_5) + p(bs_3 \& bs_4 \& \overline{bs_5}) \\ p_{R_2} &= p_6 \cdot p_7 \cdot p_8 \cdot p_9 \\ &= p(bs_6 \& bs_7 \& bs_8 \& bs_9). \end{aligned} \tag{1}$$

As seen from (1), AND operations are executed among  $\{bs_1, bs_2, bs_5\}$, so these bitstreams are defined as conflicting with each other according to the SC principle. Accordingly,  $T_1, T_2$ , and  $T_5$  are formulated as a conflict set as shown in Fig. 7(b). Similarly,  $T_3, T_4$ , and  $T_5$  are formulated as another conflict set, as are  $T_6, T_7, T_8$ , and  $T_9$ . Input terminals in the same conflict set are not allowed to share the same bitstream source from the SBG array even if they have the same input evidence. Input terminals that do not appear together in any conflict set are allowed to share the same bitstream when they have the same input evidence.
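The sharing rule can be written down as a small check. The sketch below takes the three product terms of (1) as the conflict sets and the terminal probabilities from the caption of Fig. 7; the data layout and the function names are illustrative assumptions.

```python
from itertools import combinations

# Product terms of Eq. (1): terminals that are ANDed together (indices of T_i).
CONFLICT_SETS = [{1, 2, 5}, {3, 4, 5}, {6, 7, 8, 9}]

# Input probabilities of T_1..T_9 as given in the caption of Fig. 7.
PROB = {1: "p1", 2: "p2", 3: "p1", 4: "p3", 5: "p1",
        6: "p4", 7: "p5", 8: "p3", 9: "p3"}

def conflict_pairs(conflict_sets):
    """All unordered terminal pairs that appear together in some conflict set."""
    pairs = set()
    for cs in conflict_sets:
        pairs.update(combinations(sorted(cs), 2))
    return pairs

def can_share(a, b, conflict_sets=CONFLICT_SETS, prob=PROB):
    """Two terminals may share a bitstream if they have the same evidence and do not conflict."""
    same_evidence = prob[a] == prob[b]
    in_conflict = (min(a, b), max(a, b)) in conflict_pairs(conflict_sets)
    return same_evidence and not in_conflict

print(can_share(1, 3), can_share(1, 5), can_share(4, 8), can_share(8, 9))
```

Terminals $T_1$ and $T_3$ pass the check, while $T_1$ and $T_5$ do not, because they are ANDed together in the first product term.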

#### D. SBG Array and SBG Sharing Strategy

As shown in Fig. 2, MTJ switching probability is associated with bias voltage and duration time. Generally, either bias voltage or duration time is fixed and the other one is varied for random switching. In the previous approach [35], bitstreams are directly fed into SC logic from SBG array so that it usually requires many SBGs, as well as DACs. Furthermore, the output probability of SBG is highly sensitive to the input bias voltage whose margin is very small as reported in [35]. Accurate mapping from digital probabilities to voltages requires DACs with high precision, and it is difficult to tackle the nonlinear relationship between probabilities and voltages. More importantly, a slight noise or process variation may map a probability to an unexpected voltage. Aiming to overcome these limitations, the bitstreams in SPINBIS are provided with a prebuilt SBG array and assigned to SC logic through a switch matrix.

Different from the SBG array in [35], a prebuilt SBG array based on the SBG sharing strategy is utilized in SPINBIS to improve the stability of the SBGs and reduce the required number of SBGs. The BL/SL of each SBG in the array is supplied by an internal voltage source that can provide a more stable bias voltage than DACs. In this manner, the generated probability of each SBG is predetermined, and its bitstream is multiplexed by the switch matrix. According to the SBG sharing strategy, the required number of SBGs can be much smaller than the number of input terminals of the SC logics because nonconflicting terminals are allowed to share the same bitstream. Since the SBG array is prebuilt, it has to provide enough kinds of bitstreams to satisfy the required SC accuracy, which will be discussed later.

Assuming that a specified application requires L kinds of probabilities, we define  $p_1, p_2, \ldots, p_L$  as the required probabilities. Each kind of probability corresponds to one SBG set, denoted as  $\text{SBG}_{i,\phi(i)}$, where  $i = 1, 2, \dots, L$  is the index of the probability and  $\phi(i)$  is the required number of SBGs in the set. All SBGs in the set  $\text{SBG}_{i,\phi(i)}$  generate the same probability  $p_i$, but their bitstreams are different from each other. Let  $M = \phi(1) + \phi(2) + \dots + \phi(L)$  denote the total number of SBGs in the SBG array. The SBG array is constructed based on the conflict sets and input probabilities. The conflict sets are pre-extracted from the SC logics according to the specified application. For a particular application, the input probabilities can be evaluated and usually follow a certain distribution, which is used together with the pre-extracted conflict sets to determine the probability set.

Taking the SC logics in Fig. 7 as an example, input terminals  $T_1 \sim T_9$  are supposed to have the probabilities  $\{p_1, p_2, p_1, p_3, p_1, p_4, p_5, p_3, p_3\}$ , where  $T_1, T_3$ , and  $T_5$  have the same probability  $p_1$, and  $T_4, T_8$ , and  $T_9$  have the same probability  $p_3$ . Since  $T_1$  and  $T_3$  do not belong to the same conflict set, they can share the same bitstream from the SBG array. But  $T_5$  has to adopt the bitstream from a different SBG because it conflicts with both  $T_1$  and  $T_3$ . Similarly,  $T_4$  and  $T_8$  can share the same bitstream, but  $T_9$  cannot. In this case, only seven SBGs are required in SPINBIS, while nine SBGs are required in [35]. Hence, the SBG sharing strategy could significantly



Fig. 8. Switch matrix block of SPINBIS.

reduce the required SBG array scale, especially for applications with a large number of input probabilities.
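The count of seven SBGs in the Fig. 7 example can be reproduced with a short calculation. The sketch below takes $\phi(i)$ as the largest number of terminals with probability $p_i$ inside any single conflict set. That rule is an assumption that matches this example; terminals that belong to several conflict sets may need more care in general.

```python
from collections import Counter

CONFLICT_SETS = [{1, 2, 5}, {3, 4, 5}, {6, 7, 8, 9}]
PROB = {1: "p1", 2: "p2", 3: "p1", 4: "p3", 5: "p1",
        6: "p4", 7: "p5", 8: "p3", 9: "p3"}

def sbgs_per_probability(conflict_sets, prob):
    """phi(i): most terminals with probability p_i found in one conflict set (assumed rule)."""
    phi = Counter()
    for cs in conflict_sets:
        for p, count in Counter(prob[t] for t in cs).items():
            phi[p] = max(phi[p], count)
    return dict(phi)

phi = sbgs_per_probability(CONFLICT_SETS, PROB)
M = sum(phi.values())   # total SBGs with sharing
N = len(PROB)           # one SBG per terminal without sharing, as in [35]
print(phi, M, N)
```

Here $p_1$ and $p_3$ each need two SBGs and the remaining probabilities need one, so M = 7 compared with N = 9.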

#### E. Switch Matrix and Switch Controller

The SBG sharing strategy is realized by a multiplexing network between the SBG array and the SC logics. As shown in Fig. 8, the switch matrix receives the bitstreams from the SBG array and assigns them to the SC logics. The assignment is determined by the switch controller. There are M bitstream sources  $bs_j$, each linked to a left-side terminal  $I_j$  of the switch matrix, where j = 1, 2, ..., M. There are N outputs  $(O_1, O_2, \ldots, O_N)$  of the switch matrix, each linked to an input terminal  $T_k$  of the SC logics, where k = 1, 2, ..., N. The switch matrix is built as a crossbar structure with an nMOS transistor located at each cross-point as a selector. The selection operations of these transistors are carried out by the switch controller according to the digital inputs and conflict sets. In each column of the switch matrix, exactly one selector is switched ON because each input terminal of the SC logics accepts only one bitstream. In each row of the switch matrix, zero, one, or more selectors may be switched ON because a bitstream from the SBG array may be shared by different input terminals of the SC logics.
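The row and column constraints of the crossbar can be captured in a few lines. In the sketch below, `switch[j][k]` stands for the selector at row j (source $I_j$) and column k (output $O_k$); the boolean-matrix representation and the function name `route` are assumptions for illustration.

```python
def route(switch, sources):
    """Forward SBG bitstreams to SC logic inputs through an M x N crossbar.

    switch[j][k] is True when the selector at row j, column k is ON.
    Each column must have exactly one selector ON; rows may have any number.
    """
    m, n = len(switch), len(switch[0])
    outputs = []
    for k in range(n):
        on_rows = [j for j in range(m) if switch[j][k]]
        if len(on_rows) != 1:
            raise ValueError(f"column {k} must have exactly one selector ON")
        outputs.append(sources[on_rows[0]])
    return outputs

# Three sources (M = 3) feeding four outputs (N = 4); source 2 is unused.
sources = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
switch = [
    [True,  False, True,  False],   # source 0 is shared by outputs 0 and 2
    [False, True,  False, True],    # source 1 is shared by outputs 1 and 3
    [False, False, False, False],   # a row with no selector ON is allowed
]
print(route(switch, sources))
```

A column with zero or two selectors ON is rejected, which mirrors the rule that each SC logic input accepts only one bitstream.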

The switching procedures are illustrated in Algorithm 1. In lines 1–4, the vector bs[i] is set to the first available bitstream index of probability  $p_i$ . Lines 5–11 generate control signals for all terminals in the given conflict set. For the terminals in one conflict set, the digital inputs (line 7) are obtained by the terminal index (line 6). Then the probability index in vector P is calculated by line 8. The control signal is generated by line 9. In line 10, the first available bitstream index of the probability is updated. In this way, it is guaranteed that no bitstream is allocated to two terminals that belong to the same conflict set.
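The allocation described for Algorithm 1 can be sketched as follows. This is a reconstruction from the prose description only, not the original listing. It assumes that the SBGs of each probability occupy consecutive indices in the array, and that a terminal appearing in several conflict sets keeps its first allocation; both points, and all names, are assumptions of this sketch.

```python
def allocate_bitstreams(conflict_sets, digital_input, P, phi):
    """Map each terminal to an SBG index so that conflicting terminals never share one."""
    # First bitstream index of each probability (the role of bs[i] in lines 1-4).
    first, offset = {}, 0
    for p in P:
        first[p] = offset
        offset += phi[p]

    assignment = {}
    for cs in conflict_sets:
        bs = dict(first)                       # restart the indices for this set
        taken = {assignment[t] for t in cs if t in assignment}
        for t in sorted(cs):
            if t in assignment:                # assumed: keep an earlier allocation
                continue
            p = digital_input[t]               # digital input of terminal t
            while bs[p] in taken:              # skip indices already used in this set
                bs[p] += 1
            if bs[p] >= first[p] + phi[p]:
                raise ValueError(f"not enough SBGs for {p}")
            assignment[t] = bs[p]              # control signal: connect SBG bs[p] to T_t
            taken.add(bs[p])
            bs[p] += 1                         # update the first available index
    return assignment

CONFLICT_SETS = [{1, 2, 5}, {3, 4, 5}, {6, 7, 8, 9}]
PROB = {1: "p1", 2: "p2", 3: "p1", 4: "p3", 5: "p1",
        6: "p4", 7: "p5", 8: "p3", 9: "p3"}
P = ["p1", "p2", "p3", "p4", "p5"]
PHI = {"p1": 2, "p2": 1, "p3": 2, "p4": 1, "p5": 1}

assignment = allocate_bitstreams(CONFLICT_SETS, PROB, P, PHI)
print(assignment)
```

For the Fig. 7 example, $T_1$ and $T_3$ end up on the same SBG, $T_5$ on a different one, and seven SBGs are used in total.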

Even though the SBG sharing strategy reduces the required scale of the SBG array, the scale of the switch matrix is still too large because the SC logics usually have too many input terminals. In this paper, a terminal clustering strategy is further proposed to reduce the scale of the switch matrix. Input terminals of the SC logics that always have the same digital input are clustered as a single terminal if they are in different conflict sets. As shown in Fig. 7, terminals  $T_1$  and  $T_3$  belong to different conflict sets; if they always have the same input probability, they are clustered as the same input terminal.

#### F. Discussion

The switch matrix and SBG sharing strategy are proposed in SPINBIS to reduce the required number of SBGs at the cost of a certain design overhead. We compare the design complexity of SPINBIS with that of the work in [35]. The SC logics of SPINBIS are the same as in [35]. The scale of the SBG array is reduced from N to M by the SBG sharing strategy, where  $M \ll N$ . Since the SBG array accounts for a substantial part of the energy consumption in SPINBIS, the energy consumption of the SBG array is reduced by a fraction of (N-M)/N when its scale is reduced from N to M. Assuming that there are T transistors in each SBG circuit, the number of transistors in the SBG array is reduced from T \* N to T \* M. With the terminal clustering strategy, the number of switch matrix outputs is reduced from N to  $N' = \alpha N$ , where  $\alpha \in (0, 1)$ , and the number of transistors in the switch matrix is  $M * \alpha N$ . In summary, the number of transistors in the SBG array is reduced from T \* N to T \* M, but with an overhead of  $M * \alpha N$  from the switch matrix. Since  $M \ll N$ , the total area of SPINBIS  $(T * M + M * \alpha N)$  is mainly determined by  $M * \alpha N$ . Based on the above discussion, the advantages of SPINBIS are most pronounced for large-scale applications with regular structure and input patterns.
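The transistor and energy expressions above can be evaluated directly. The sketch below only restates the formulas of this subsection; the numbers passed in (N, M, T, and $\alpha$) are hypothetical and are not taken from the experiments.

```python
def transistor_counts(N, M, T, alpha):
    """Return (baseline, spinbis) transistor counts from the expressions in the text."""
    baseline = T * N                 # one SBG per input terminal, as in [35]
    sbg_array = T * M                # shared SBG array
    switch_matrix = M * alpha * N    # crossbar with M rows and alpha*N columns
    return baseline, sbg_array + switch_matrix

def sbg_energy_saving(N, M):
    """Fraction of SBG array energy saved when the array shrinks from N to M."""
    return (N - M) / N

# Hypothetical sizes chosen only to exercise the formulas.
baseline, spinbis = transistor_counts(N=1000, M=50, T=30, alpha=0.5)
print(baseline, spinbis, sbg_energy_saving(1000, 50))
```

The same expressions show that sharing reduces the transistor count only when $M \cdot \alpha N < T \cdot (N - M)$, so the benefit depends on how small M and $\alpha$ are for a given application.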

#### IV. SPINTRONIC DEVICE-BASED ENERGY EFFICIENT SBG

The performance of the SBG is critical for SPINBIS in terms of inference accuracy, inference speed, and power consumption. A high-quality SBG should have at least the following two properties.

1) The generated bitstream should represent the given probability as accurately as possible. If the deviation between the probability value and the bitstream is too large, the SC results will be unpredictable.

2) The correlations among different SBs should be as small as possible because high correlation usually degrades the accuracy of SC significantly.

Fig. 9. State transition diagram of (a) simple SBG and (b) self-control SBG.

TABLE I

ENABLE SIGNAL CONFIGURATION FOR reset, write, AND read OPERATIONS

|       | *Write En.* | *Read En.* | *Rst. 0* | *Wrt. 1* |
|-------|-------------|------------|----------|----------|
| reset | High        | Low        | High     | Low      |
| write | High        | Low        | Low      | High     |
| read  | Low         | High       | -        | -        |

In this section, an efficient SBG circuit is proposed by utilizing the inherent random behaviors of MTJ for Bayesian inference.

#### A. Schematic of SBG

The SBs are generated by reading the MTJ states which have been prewritten as shown in Fig. 2. If the readout of MTJ is with high resistance, i.e., AP state, 1 will be generated as one stochastic bit; otherwise, 0 will be generated. Generally, each bit generation is accomplished by three stages: 1) *reset*; 2) *write*; and 3) *read*. Bitstreams are obtained by performing these three stages continuously. The state transition diagram of simple SBG is illustrated in Fig. 9(a).

Both the *reset* procedure and the *write* procedure are programming operations on the MTJ device. The *reset* operation programs the MTJ with a bias voltage and duration time large enough that the switching probability is close to 100%. In contrast, the *write* operation programs the MTJ according to the required switching probability ( $p \in [0, 1]$ ) as shown in Fig. 2. Assuming that the initial MTJ state is unknown, the *reset* operation [*Write 0* in Fig. 9(a)] switches it to the P state with probability p = 100%, while the *write* operation [*Write 1* in Fig. 9(a)] switches it with probability  $p \in [0, 1]$ .
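The three-stage cycle of the simple SBG can be mimicked behaviorally. The sketch below treats the *reset* as always successful and the *write* as a Bernoulli trial with probability p; it is a software model under those assumptions, with illustrative names, and it ignores all circuit-level effects.

```python
import random

P_STATE, AP_STATE = 0, 1   # P reads as 0, AP reads as 1

def simple_sbg(p, n, rng):
    """Behavioral model of the simple SBG of Fig. 9(a): reset, write, read per bit."""
    bits, pulses = [], 0
    state = rng.choice([P_STATE, AP_STATE])   # unknown initial state
    for _ in range(n):
        state = P_STATE                       # reset: Write 0, assumed to always succeed
        pulses += 1
        if rng.random() < p:                  # write: Write 1 succeeds with probability p
            state = AP_STATE
        pulses += 1
        bits.append(state)                    # read: AP gives 1, P gives 0
    return bits, pulses

rng = random.Random(7)   # arbitrary seed for repeatability
bits, pulses = simple_sbg(0.7, 50_000, rng)
print(sum(bits) / len(bits), pulses / len(bits))
```

Each generated bit costs two programming pulses in this model, one for the *reset* and one for the *write*.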

The enable signal configuration has been illustrated in Table I. Both the *write* and *reset* operations are accomplished by the write circuit as shown in Fig. 10(a) while the *read* operation is finished by the read circuit as shown in Fig. 10(b). The multiplexers  $MUX_2$  and  $MUX_3$  are adopted to switch the write current or read current flowing through the MTJ.

During *write* and *reset* operations, *Write En.* is set as high, thus terminal 1 of MUX<sub>2</sub> and MUX<sub>3</sub> are connected with corresponding terminal Y. For *reset* operation, *Wrt. 1* is set as low and *Rst. 0* is set as high so that terminal 0 of MUX<sub>1</sub> and terminal 1 of MUX<sub>4</sub> are connected with terminal Y. By applying a bias voltage between SL and *GND*, write current flows through the MTJ from bottom to top as the blue arrow shows. For *write* operation, *Wrt. 1* is set as high and *Rst. 0* is set as low so that terminal 1 of MUX<sub>1</sub> and terminal 0 of MUX<sub>4</sub> are connected with terminal Y. By applying a bias voltage between BL and *GND*, current flows through the MTJ from top to bottom as the red arrow shows.

During the *read* stage, *Read En.* is set as high while *Write En.* is set as low so that terminal 0 of MUX<sub>2</sub> and MUX<sub>3</sub> are connected with terminal Y. A precharging sense amplifier is adopted to compare the MTJ state of the data cell with that of the reference cell as shown in Fig. 10(b). The resistance of the reference cell [Fig. 10(d)] is usually set as  $(R_P + R_{AP})/2$, which lies between  $R_P$  and  $R_{AP}$, so that both the AP state and the P state of the data cell can be identified correctly. The read circuit consists of a two-branch sensing circuit with equalizing transistors [37] and a voltage sense amplifier with a dynamic latched comparator [38] for digital output. Each branch of the read circuit is composed of a load pMOS, a read enable nMOS, and a clamped nMOS [39], [40]. The read operation is enabled by setting *Read En.* as high so that nMOS  $N_1$ and  $N_2$  are turned on. The clamped nMOS is utilized to prevent read disturbance by applying a proper bias voltage  $V_{\text{clamp}}$ . During the *read* stage, the resistance difference between the data cell and the reference cell is converted to the difference between  $V_{data}$  and  $V_{\rm ref}$, which is sensed by the dynamic latched voltage comparator with clock enabled. The state of the data cell is read out at each rising edge of  $C_{clk}$ .

#### B. Energy Efficient SBG Using Self-Control Strategy

Energy efficiency is a primary concern for applications on embedded computing platforms. Several research works have been proposed toward efficient implementation of MTJ-based SC. The work in [17] indicated that the energy consumption required for switching  $P \rightarrow AP$  with 99.9% probability is less than that of switching AP  $\rightarrow$  P. Hence, they reset the MTJ to the AP state every time and then attempt to switch it to the P state to generate one stochastic bit. However, the energy consumed by resetting  $P \rightarrow AP$  is still wasted because no bit is generated during the reset procedure. In the simple SBG illustrated in Fig. 9(a), the MTJ is first reset to the P state after each stochastic bit is generated. The stochastic bits are generated by reading out the MTJ state after the write procedure. In effect, each stochastic bit reflects whether the MTJ state was switched successfully in the write procedure. In this paper, we propose an efficient SBG in which the *reset* procedure is also utilized for generating stochastic bits.

A self-control strategy is proposed by storing the MTJ state of previous cycle in a register and then comparing it with the state of current cycle to determine the stochastic bit as



Fig. 10. Schematic of SBG circuit. (a) Write circuit. (b) Read circuit. (c) Self-control circuit. (d) Reference cell. (e) SBG symbol.

output. That is, the stochastic bit is generated according to whether the MTJ state has changed or not. The state transition diagram of the SBG with the self-control strategy is illustrated in Fig. 9(b). If the current state is different from the last state, the output bit is 1; otherwise, the output bit is 0. Meanwhile, the direction of the *write* operation is determined by the stored state of the last cycle. With the self-control strategy, the *reset* $\rightarrow$ *write* $\rightarrow$ *read* procedures are compressed into *write* $\rightarrow$ *read* procedures. In the *write* procedure, the bias voltage between BL and SL is carefully set within a certain range to guarantee that both *write* 0 and *write* 1 operations have the same switching probability. The speed and energy efficiency of bitstream generation can theoretically be improved by 2×.

The self-control circuit is shown in Fig. 10(c). The transmission gate (TG) and D-flip-flop (DFF) are clocked by  $T_{\rm clk}$  and  $D_{\rm clk}$ , respectively. The output of the comparator in Fig. 10(b) (i.e., MTJ State) is highly sensitive to its load. Hence, the TG is inserted to isolate the comparator from the load. The rising edge of  $T_{\rm clk}$  is slightly delayed after the rising edge of  $C_{\rm clk}$  to guarantee that the comparator output is stable. The DFF latches the MTJ state, which is compared with the state of the next cycle to produce one stochastic bit. If the current MTJ state is different from the latched state, the output of SBG is 1; otherwise, it is 0. Meanwhile, the latched MTJ state also determines the direction of the write operation in the next cycle. If the latched MTJ state is P, the write current flows through the MTJ from top to bottom, which attempts to switch the MTJ state from P to AP. Otherwise, the write current flows in the opposite direction. The rising edge of  $D_{\rm clk}$  is likewise slightly delayed after the rising edge of  $T_{\rm clk}$ . When  $T_{\rm clk}$  is high and  $D_{\rm clk}$  is low, the TG output is the current state of the MTJ and the DFF output is the last state of the MTJ. During this period, the XOR of the current state and the last state is taken as the one-bit output. If the current state differs from the state latched in the last cycle, the MTJ has switched successfully, the XOR output is high, and a bit of 1 is generated. Otherwise, a bit of 0 is generated. After the rising edge of  $D_{\rm clk}$ , the current state is latched in the DFF until the next *read* stage. In the next *write* stage, the DFF output *Rst. 0* is utilized to control MUX<sub>4</sub> and *Wrt. 1* is utilized to control MUX<sub>1</sub>.

The aforementioned SPINBIS architecture requires that each SBG be equipped with two internal voltage sources with fixed voltage values. In this paper, this is achieved by a voltage divider [41] that consists of one resistor and one nMOS transistor, as shown in Fig. 11(a).  $V_{ref}$  is adopted as the writing voltage for each SBG and is set by varying the resistance value *R*. As shown in Fig. 11(b),  $V_{ref}$  varies smoothly with *R* (black solid line), so the desired writing voltage between BL and SL [red circles in Fig. 11(b)] can be achieved by adjusting the resistor value.

For convenience, the SBG with self-control strategy is denoted as self-control SBG and the one described in Section IV-A is denoted as simple SBG.
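The difference between the two generators can be sketched behaviorally. The following Python model is only an illustration under stated assumptions, not the circuit itself: it assumes an ideal switching probability `p_switch` that is identical for both write directions, and the names `simple_sbg` and `self_control_sbg` are hypothetical.

```python
import random

P, AP = 0, 1  # logic encoding of the two MTJ states (P = logic 0)


def simple_sbg(n_bits, p_switch, rng):
    """reset -> write -> read: two write pulses are spent per stochastic bit."""
    bits, writes = [], 0
    for _ in range(n_bits):
        state = P                      # deterministic reset to P
        writes += 1
        if rng.random() < p_switch:    # stochastic P -> AP write attempt
            state = AP
        writes += 1
        bits.append(state)             # read: AP means the write succeeded
    return bits, writes


def self_control_sbg(n_bits, p_switch, rng):
    """write -> read: the latched state selects the write direction."""
    last_state = P                     # initialization cycle, its bit is discarded
    writes = 1
    bits = []
    for _ in range(n_bits):
        current = last_state
        if rng.random() < p_switch:    # attempt to flip away from the latched state
            current = AP if last_state == P else P
        writes += 1
        bits.append(current ^ last_state)  # XOR: 1 if the state changed
        last_state = current           # the DFF latches the current state
    return bits, writes


if __name__ == "__main__":
    rng = random.Random(0)
    for name, gen in (("simple", simple_sbg), ("self-control", self_control_sbg)):
        bits, writes = gen(256, 0.5, rng)
        print(name, "ones fraction:", sum(bits) / len(bits), "writes:", writes)
```

In this model both generators emit a 1 with probability `p_switch`, but the self-control variant spends one write pulse per bit plus one initialization pulse.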

#### C. Evaluation of SBG Circuits

The proposed SBG circuits are evaluated to explore their performance in this section.

1) Simulation Setup: The SBG circuit is composed of hybrid CMOS/MTJ structures with 45-nm CMOS and 45-nm MTJ technologies. A behavioral model of the MTJ is described in the Verilog-A language [42], and the stochastic switching behaviors are also included. However, the original MTJ model



Fig. 11. (a) Schematic of voltage divider. (b) Black solid line represents the fitting curve of R and  $V_{ref}$ , the red circles represent the desired writing voltages between BL and SL of SBGs.

in [42] only provides stochastic switching behaviors for a single device. That is, the bitstreams obtained from different SBG circuits are always the same if they have the same bias voltage and duration. Since many MTJs are utilized in the SBG array, the bitstreams generated with the original MTJ model [42] will be strongly correlated with each other, which leads to inaccurate SC results. Hence, a new compact MTJ model is proposed for stochastic switching, with the property that the switching behaviors of different MTJ instances differ from one another.

As described in [42], MTJ switching time is obtained from the critical current and other electrical and physical parameters. The stochastic behavior is independent of the critical switching current, and is implemented by random functions with uniform or normal distributions. The basic switching time dt is determined according to the applied bias voltage for each cycle. Then a random number that obeys the normal distribution  $\sim N(seed, dt, \sigma)$  is generated per cycle as the final switching time, where seed is the random seed of the random number generation function and  $\sigma$  is the user-specified standard deviation. The parameter seed is set to a constant value in the model published in [42], which means that the switching time is the same if two MTJs have the same dt and  $\sigma$ . To obtain different random behaviors, the MTJ model is revised by setting seed to different values for different MTJ instances, which is denoted as the instance-vary model. Fortunately, the Verilog-A language of version 13.1 and above supports the grammar of arandom[param]. The param argument is optional and can be set as global or instance. If param is set as instance for each MTJ's required randomness, different seed values will be generated for different MTJ instances. This feature satisfies the MTJ's instance-vary randomness requirement well. The parameter definitions and default values of the MTJ model are provided in Table II for experimental configurations. The simulation results of MTJ switching with the instance-vary model are shown in Fig. 12. With the same write operations, MTJ1 and MTJ2 have different switching results, which is critical to generating uncorrelated bitstreams for SC.
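The effect of a shared versus per-instance seed can be illustrated outside of Verilog-A. The Python sketch below is not the MTJ compact model: it assumes that a write succeeds when the sampled switching time does not exceed the pulse width, and the function name `mtj_bitstream` and the value of `sigma` are hypothetical choices for illustration.

```python
import random


def mtj_bitstream(seed, dt, sigma, pulse_width, n_cycles):
    """Return one bit per cycle: 1 if the sampled switching time fits in the pulse."""
    rng = random.Random(seed)  # one random stream per MTJ instance
    bits = []
    for _ in range(n_cycles):
        t_switch = rng.gauss(dt, sigma)  # final switching time ~ N(dt, sigma)
        bits.append(1 if t_switch <= pulse_width else 0)
    return bits


if __name__ == "__main__":
    # Illustrative numbers only: dt equal to the pulse width gives p close to 50%.
    args = dict(dt=5.4, sigma=1.0, pulse_width=5.4, n_cycles=256)
    shared_a = mtj_bitstream(seed=7, **args)
    shared_b = mtj_bitstream(seed=7, **args)   # constant seed: identical streams
    inst_a = mtj_bitstream(seed=1, **args)
    inst_b = mtj_bitstream(seed=2, **args)     # per-instance seeds: streams differ
    print("shared seed identical:", shared_a == shared_b)
    print("instance seeds identical:", inst_a == inst_b)
```

With a constant seed, two instances with the same dt and sigma reproduce the same bitstream, which is the correlation problem described above.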

2) SBG Simulation Results: The simulation results of simple SBG circuit are illustrated in Fig. 13. The

 TABLE II

 PARAMETERS DEFINITION AND DEFAULT VALUE OF MTJ MODEL



Fig. 12. Simulation results of the proposed instance-vary MTJ model. Two MTJs are simulated simultaneously with the same bias voltage and pulse width.



Fig. 13. Simulation results of simple SBG circuit.

 $reset \rightarrow write \rightarrow read$  operations are performed iteratively for 3 cycles from 5 ns to 65 ns. The reset and write operations are enabled when Write En. is high. For the reset operation  $(AP \rightarrow P \text{ or } P \rightarrow P)$, the bias voltage between SL and *GND* is about 1.8 V and the duration is about 7 ns to guarantee the switching probability  $p \rightarrow 100\%$. For the write operation  $(P \rightarrow AP)$, if the bias voltage between BL and GND is set to about 1.166 V and the duration is about 5.4 ns, the switching probability p is about 50%. For the read operation, Read En. is set high and the MTJ state is read out while  $V_{load}$  and  $V_{\text{clamp}}$  are about 0.8 V. For each cycle of reset  $\rightarrow$  write  $\rightarrow$  read operations, the MTJ is first reset to P state with the switching probability  $p \rightarrow 100\%$  when Write En. and Wrt. 1 is set as low. The write current flows through the MTJ from bottom to top, as shown by the blue arrow in Fig. 10(a). The MTJ then attempts the  $P \rightarrow AP$  switching with the provided switching probability when Wrt. 1 is set high. The write current flows through the MTJ from top to bottom, as shown by the red arrow in Fig. 10(a). Finally, the read operation is performed by setting Read En. high and Write En. low. For the 3 cycles of *reset* $\rightarrow$ *write* $\rightarrow$ *read* operations shown in Fig. 13, writing  $P \rightarrow AP$  fails in the first cycle but succeeds



Fig. 14. Simulation wave of self-control SBG circuit.

in the following two cycles. And consequently, the bitstream is generated as "011."

The simulation results of the self-control SBG circuit are shown in Fig. 14 for 4 cycles from 5 ns to 45 ns. The first cycle initializes the MTJ to P state. The MTJ is read out as P state at 13 ns and the TG is turned on with a delay of 0.5 ns. Since the *last\_state* is meaningless for the first cycle, the XOR result is discarded in this cycle. The current state P is then latched in the DFF from 14 ns to 24 ns. For the second cycle, Wrt. 1 is enabled for a write operation because *last\_state* is P (logic 0). The *write* operation succeeds, so the MTJ is in AP state. From 23 ns to 24 ns, the AP state of the MTJ is passed through the TG and denoted as *current\_state*. By performing the XOR operation on the *last\_state* (latched P) and *current\_state* (AP), one bit of 1 is generated for the second cycle. The *current\_state* (AP) is then latched in the DFF and becomes the *last\_state* for the next cycle. For the third cycle, Rst. 0 is enabled for writing the MTJ from AP to P state since the latched *last\_state* is AP. However, the AP to P write fails in the third cycle, so the MTJ state is unchanged compared with that in the second cycle. Hence, one bit of 0 is generated for the third cycle. For the fourth cycle, the MTJ again attempts to switch from AP to P state and succeeds. Hence, one bit of 1 is generated. Finally, a bitstream "101" is obtained from these four cycles.

For generating a bitstream with *n* bits, the simple SBG circuit requires 2n write operations (including reset and write) and *n* read operations. The self-control SBG circuit requires only n + 1 write operations (including initialization and write) and n + 1 read operations. Hence, for large *n*, the self-control SBG circuit could improve the speed and energy efficiency by about  $2 \times$  compared with the simple SBG circuit.
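As a worked check of the operation counts above, and under the simplifying assumption that write operations dominate the cost while read operations are neglected, the ratio of write operations is

$$\frac{2n}{n+1} = 2 - \frac{2}{n+1}, \qquad \left.\frac{2n}{n+1}\right|_{n=64} = \frac{128}{65} \approx 1.97, \qquad \left.\frac{2n}{n+1}\right|_{n=256} = \frac{512}{257} \approx 1.99$$

so the saving approaches, but never reaches, $2\times$ as the bitstream length grows.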

*3)* SBG Performance: The proposed SBG circuits are evaluated to analyze the performance of the generated SBs both on representation accuracy and correlation.

For evaluating the accuracy, n-bit SBs are generated with bitstream lengths n of 64, 128, 256, and 1000. The



Fig. 15. MTJ switching probability with different applied BL voltage. SL voltage is set as 1.8 V.



Fig. 16. MTJ switching probability with different applied BL/SL voltage combination.

bitstream with length n = 1000 is taken as the ground truth. The MTJ switching probability of the simple SBG is shown in Fig. 15 for different BL voltages while the SL voltage is set to 1.8 V. Compared with the ground truth of length n = 1000, the generated bitstreams with length n = 64, 128, and 256 have average errors of 1.5%, 0.7%, and 0.6%, respectively. Meanwhile, the relationship between switching probability and different BL/SL voltage combinations is shown in Fig. 16 for the self-control SBG circuit. Compared with the ground truth of length n = 1000, the generated bitstreams with length n = 64, 128, and 256 have average errors of 1.8%, 0.9%, and 0.5%, respectively.

As described above, SC usually requires a low correlation among different bitstreams. Many evaluation metrics of statistical correlation between bitstreams have been proposed [43]. The SC correlation (SCC) measurement [44], which was proposed specifically for SC, is adopted in this paper

$$SCC(X_1, X_2) = \begin{cases} \frac{ad-bc}{n \times \min(a+b, a+c) - (a+b)(a+c)} & \text{if } ad > bc\\ \frac{ad-bc}{(a+b)(a+c) - n \times \max(a-d, 0)} & \text{otherwise} \end{cases}$$
(2)

where  $X_1$  and  $X_2$  are two SBs for measurement, *a* is the number of bit positions where both  $X_1$  and  $X_2$  are 1, *b* is the number of bit positions where  $X_1$  is 1 and  $X_2$  is 0, *c* is the number of bit positions where  $X_1$  is 0 and  $X_2$  is 1, *d* is the number of bit positions where both  $X_1$  and  $X_2$  are 0, and  $n = a + b + c + d$  is the bitstream length. From (2), SCC  $\rightarrow +1$  if  $X_1$  and  $X_2$  have a maximum similarity; otherwise, SCC  $\rightarrow -1$  if  $X_1$  and  $X_2$  are totally different. Consequently, we have SCC  $\in [-1, 1]$. For a certain probability  $p \in [0, 1]$, many bitstreams are generated to compute the SCC absolute value, which is regarded as self-SCC measurement. A self-SCC measurement sample is illustrated in Fig. 17 with the bit length n = 64, 128, 256, and 512. For measuring SCC between different probabilities, two groups of bitstreams are generated to compute the SCC absolute value, which is







Fig. 18. Cross-SCC measurement for probability combinations, i.e., evaluating the generated bitstreams between different probabilities while only the combinations of (19%, 41%), (12%, 48%), (49%, 25%), (23%, 44%), and (18%, 58%) are illustrated.

regarded as cross-SCC measurement. A cross-SCC measurement sample is shown in Fig. 18 with the bit length n = 64, 128, 256, and 512. As can be seen from Figs. 17 and 18, the SCC values are relatively small, so they satisfy the requirements of SC, and the SCC value decreases as the bitstream length increases.
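The SCC definition in (2) can be transcribed directly into code. The following Python function is a minimal sketch of that formula; the function name `scc` is hypothetical, and returning 0 when the denominator vanishes (for example, when one stream is constant) is an assumption made here, since (2) does not cover that case.

```python
def scc(x1, x2):
    """SC correlation of two equal-length bitstreams (lists of 0/1), following (2)."""
    if len(x1) != len(x2) or not x1:
        raise ValueError("bitstreams must be non-empty and of equal length")
    n = len(x1)
    a = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 1)  # overlapping 1s
    b = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 0)  # 1 in x1, 0 in x2
    c = sum(1 for u, v in zip(x1, x2) if u == 0 and v == 1)  # 0 in x1, 1 in x2
    d = n - a - b - c                                        # overlapping 0s
    num = a * d - b * c
    if num > 0:
        den = n * min(a + b, a + c) - (a + b) * (a + c)
    else:
        den = (a + b) * (a + c) - n * max(a - d, 0)
    if den == 0:
        return 0.0  # assumption: treat the degenerate case as uncorrelated
    return num / den


if __name__ == "__main__":
    print(scc([1, 0, 1, 0], [1, 0, 1, 0]))  # identical streams
    print(scc([1, 0, 1, 0], [0, 1, 0, 1]))  # complementary streams
    print(scc([1, 1, 0, 0], [1, 0, 1, 0]))  # independent-looking overlap
```

Identical streams give +1, complementary streams give -1, and streams whose overlap equals the product of their densities give 0, consistent with SCC $\in [-1, 1]$.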

The proposed self-control SBG circuit contains 93 CMOS transistors, 1 resistor, and 5 MTJs, and these counts are used to analyze the occupied chip area. The energy consumption for generating bitstreams of probability  $p \in [0, 1]$  is shown in Fig. 19, from which we can see that a larger probability usually requires more energy.

4) Process Variation: MTJ switching behavior is strongly affected by process variation, such as MTJ geometric variation and initial magnetization angle variation [45]. Variations in surface area (A) and tunneling oxide thickness ( $t_{ox}$ ) are the main causes of resistance change in the MTJ because  $R_{\text{MTJ}} \propto (1/A) \cdot e^{t_{ox}}$ . Assuming that the variations of A and  $t_{ox}$  follow Gaussian distributions with standard deviations of 5% and 2% of their mean values, respectively [46], the sensitivity of the relationship between required probability and applied voltage is evaluated and shown in Table III. The accuracy could be improved by increasing the bitstream length.



Fig. 19. Energy consumption for generating bitstreams of probability  $p \in [0, 1]$ .

TABLE III SIMULATION RESULTS OF PROBABILITY-VOLTAGE RELATIONSHIP UNDER THE CERTAIN PROCESS VARIATION

| Bitstream Length | 64     | 128    | 256    |
|------------------|--------|--------|--------|
| Max Error        | 0.1295 | 0.0949 | 0.0733 |
| Avg. Error       | 0.0460 | 0.0336 | 0.0269 |

#### V. APPLICATIONS

As shown in Fig. 6, the SC architectures are determined according to the specified applications. A device-to-architecture level evaluation framework is illustrated for SPINBIS and a typical application is demonstrated as a case study in this section.

#### A. Evaluation Framework

SPINBIS is implemented in hybrid CMOS/MTJ technologies with three design hierarchies: 1) device; 2) circuit; and 3) architecture levels, as shown in Fig. 20. The hybrid CMOS/MTJ circuits are simulated by the Cadence Spectre simulator while the MTJ model is written in the Verilog-A language. The dynamic switching of the MTJ device is realized with the two regimes of the Sun model [47] and the Neel-Brown model [48]. The stochastic MTJ switching behaviors are modeled following [11]. In order to reduce the correlation of bitstreams generated by different SBG circuits, the random seed is configured differently for different MTJ instances as described in Section IV-C. With the circuit simulation results, the SBG array, switch matrix, and SC logic are abstracted as behavioral blocks through characterization. Meanwhile, the RTL implementation of the switch controller is synthesized by Synopsys Design Compiler with the 45-nm FreePDK library. After the switch controller is characterized, an architecture-level simulation is carried out according to the specified application trace. Finally, the evaluation results of SPINBIS are obtained in terms of inference accuracy, energy efficiency, inference speed, and design area.

#### B. Case Study: Data Fusion for Target Locating

Data fusion aims to achieve more consistent, accurate, and useful information by integrating multiple data sources instead



Fig. 20. Evaluation framework of SPINBIS.

of relying on any individual data source. In this section, a simple data fusion example is demonstrated and the corresponding Bayesian inference procedures are also studied.

1) Problem Definition and Bayesian Inference Algorithm: Sensor fusion aims to determine a target location from multiple sensors [49]. Assume that there are 3 sensors on a 2-D plane whose width and length are both 64, with the sensors located at (0, 0), (0, 32), and (32, 0). Each sensor has two data channels: 1) distance (d) and 2) bearing (b). The measured data  $(d_1, b_1, d_2, b_2, d_3, b_3)$  from the three sensors with two channels are utilized to calculate the target location  $(x^{\star}, y^{\star})$. In the data fusion application, the probability that the target object is located at each position of the plane is calculated based on the sensor data. The position with the largest probability value is taken as the position of the target object.

Based on the observed data  $(d_1, b_1, d_2, b_2, d_3, b_3)$ , the probability of the target object being located at (x, y) is denoted as  $p(x, y|d_1, b_1, d_2, b_2, d_3, b_3)$  and could be calculated based on Bayes' theorem

$$p(x, y|d_1, b_1, d_2, b_2, d_3, b_3) \propto p(x, y) * \prod_i p(d_i|x, y)p(b_i|x, y)$$
  
(3)

where p(x, y) is the prior probability, and  $p(d_i|x, y)$  and  $p(b_i|x, y)$  are known as evidence or likelihood information. Since the target may be located at any position, the prior probability p(x, y) is the same for every position. Hence, p(x, y) is ignored in the following BIS.  $p(d_i|x, y)$  is the probability that the *i*th sensor returns the distance value  $d_i$  if the target object is located at position (x, y). The meaning of  $p(b_i|x, y)$  is similar to that of  $p(d_i|x, y)$ . The values of  $p(d_i|x, y)$  and  $p(b_i|x, y)$  are calculated by

$$p(d_i|x, y) = \frac{1}{\sqrt{2\pi}\sigma_i^d} \cdot e^{-\frac{\left(d(x, y) - \mu_i^d\right)^2}{2\left(\sigma_i^d\right)^2}}$$

TABLE IV TRANSISTOR UTILIZATIONS OF SBG ARRAY AND SWITCH MATRIX FOR DIFFERENT GRID SIZE, WHERE  $\mathcal{K}_{energy} = (M/N)$  AND  $\mathcal{K}_{cmos} = [(T * M + M * N')/(T * N)]$  INDICATE THE IMPROVEMENT ON ENERGY AND AREA EFFICIENCY, RESPECTIVELY

| Grid Size      | T  | N     | M   | N'   | $\mathcal{K}_{energy}$ | $\mathcal{K}_{cmos}$ |
|----------------|----|-------|-----|------|------------------------|----------------------|
| $32 \times 32$ | 92 | 6144  | 320 | 2817 | 0.052                  | 1.64                 |
| $64 \times 64$ | 92 | 24576 | 320 | 5557 | 0.013                  | 0.79                 |

$$p(b_i|x, y) = \frac{1}{\sqrt{2\pi}\sigma_i^b} \cdot e^{-\frac{\left(b(x,y)-\mu_i^b\right)^2}{2\left(\sigma_i^b\right)^2}}$$
(4)

where d(x, y) is the Euclidean distance between position (x, y) and the *i*th sensor,  $\mu_i^d$  is the distance data provided by the *i*th sensor, and  $\sigma_i^d = 5 + \mu_i^d/10$. Similarly, b(x, y) is the viewing angle from the *i*th sensor to position (x, y),  $\mu_i^b$  is the bearing data provided by the *i*th sensor, and  $\sigma_i^b$  is set to 14.0626 degrees.
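The inference in (3) and (4) can be sketched as an exact floating-point reference computation over the grid. The Python code below is an illustration under stated assumptions rather than the SPINBIS implementation: the bearing is assumed to be measured in degrees from the +x axis via `atan2`, grid cells are assumed to sit at integer coordinates, and the helper names (`posterior_grid`, `locate`, and so on) are hypothetical. The uniform prior is dropped, as in the text.

```python
import math

SENSORS = [(0, 0), (0, 32), (32, 0)]  # sensor positions given in the text
SIGMA_B = 14.0626                     # bearing standard deviation in degrees


def gaussian(value, mu, sigma):
    """Gaussian likelihood as in (4)."""
    norm = math.sqrt(2 * math.pi) * sigma
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2)) / norm


def distance(sensor, x, y):
    return math.hypot(x - sensor[0], y - sensor[1])


def bearing(sensor, x, y):
    # Assumed convention: angle in degrees measured from the +x axis.
    return math.degrees(math.atan2(y - sensor[1], x - sensor[0]))


def posterior_grid(measurements, size=64):
    """Unnormalized p(x, y | d_1, b_1, ..., d_3, b_3) with a uniform prior."""
    grid = [[1.0] * size for _ in range(size)]
    for sensor, (mu_d, mu_b) in zip(SENSORS, measurements):
        sigma_d = 5 + mu_d / 10
        for x in range(size):
            for y in range(size):
                # Wrap the angular difference into [-180, 180) degrees.
                diff_b = (bearing(sensor, x, y) - mu_b + 180) % 360 - 180
                grid[x][y] *= (gaussian(distance(sensor, x, y), mu_d, sigma_d)
                               * gaussian(diff_b, 0.0, SIGMA_B))
    return grid


def locate(grid):
    """Return the (x, y) cell with the largest posterior value."""
    size = len(grid)
    cells = ((x, y) for x in range(size) for y in range(size))
    return max(cells, key=lambda xy: grid[xy[0]][xy[1]])


if __name__ == "__main__":
    target = (20, 40)  # hypothetical target used to synthesize noiseless readings
    readings = [(distance(s, *target), bearing(s, *target)) for s in SENSORS]
    print(locate(posterior_grid(readings)))
```

With noiseless readings every likelihood factor peaks at the true cell, so the maximum of the product recovers the target position.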

2) Bayesian Inference System: From (3), the Bayesian inference is calculated as the product of a series of conditional probabilities, which could be realized by performing SC with AND gates and SBs. Given any two positions  $(x_1, y_1)$  and  $(x_2, y_2)$, the calculations of  $p(x_1, y_1|d_1, b_1, d_2, b_2, d_3, b_3)$  and  $p(x_2, y_2|d_1, b_1, d_2, b_2, d_3, b_3)$  could be performed in parallel since they are independent of each other.
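The multiplication of probabilities by AND gates can be illustrated in software. The sketch below assumes ideal, mutually independent Bernoulli bitstreams produced by a pseudorandom generator rather than by MTJ-based SBGs, and the names `bitstream` and `sc_product` are hypothetical.

```python
import random


def bitstream(p, n, rng):
    """Encode probability p as an n-bit stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def sc_product(probabilities, n, rng):
    """AND the streams bit by bit; the fraction of 1s estimates the product."""
    streams = [bitstream(p, n, rng) for p in probabilities]
    # all() over one bit position is equivalent to a chain of 2-input AND gates.
    ones = sum(1 for bits in zip(*streams) if all(bits))
    return ones / n


if __name__ == "__main__":
    rng = random.Random(1)
    likelihoods = [0.9, 0.8, 0.7, 0.9, 0.8, 0.95]  # illustrative values only
    exact = 1.0
    for p in likelihoods:
        exact *= p
    for n in (64, 128, 256):
        print(n, sc_product(likelihoods, n, rng), "exact:", round(exact, 4))
```

Because the AND output is 1 only when every input bit is 1, its expected density equals the product of the input probabilities as long as the input streams are uncorrelated, which is why low SCC matters for this architecture.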

The SPINBIS architecture is reformulated as Fig. 21 for sensor fusion applications. Each probability calculation requires five AND gates to multiply six conditional probabilities. The SBG sharing and terminal clustering strategies are utilized to reduce the required scale of the SBG array and switch matrix. The 2-D plane is partitioned into  $64 \times 64$  and  $32 \times 32$  grids for the target locating problem. A finer grid partition usually achieves more accurate locating results. Table IV shows the scale of the SBG array and switch matrix for different grid sizes. The meanings of the symbols T, N, M, and N' in Table IV are described in Section III-F. For the  $64 \times 64$  grid size problem, the energy consumption and transistor utilization of the SBG array and switch matrix are 1.3% and 80% of those in [35], respectively. For the  $32 \times 32$  grid size problem, the energy consumption is 5.2% of that in [35] while the transistor utilization is  $1.64 \times$  that in [35].
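The two factors reported in Table IV follow directly from the formulas in its caption. The short Python check below recomputes them from the tabulated T, N, M, and N'; the function and dictionary names are hypothetical.

```python
def k_energy(m, n):
    """Energy factor M / N from the Table IV caption."""
    return m / n


def k_cmos(t, m, n, n_prime):
    """Transistor factor (T*M + M*N') / (T*N) from the Table IV caption."""
    return (t * m + m * n_prime) / (t * n)


# Values copied from Table IV.
TABLE_IV = {
    "32x32": dict(t=92, n=6144, m=320, n_prime=2817),
    "64x64": dict(t=92, n=24576, m=320, n_prime=5557),
}

if __name__ == "__main__":
    for grid, row in TABLE_IV.items():
        print(grid, k_energy(row["m"], row["n"]), k_cmos(**row))
```

The recomputed energy factors are about 0.052 and 0.013, and the transistor factors are about 1.65 and 0.80, which agree with the tabulated 1.64 and 0.79 up to the last digit.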

3) Simulation Results: The data obtained from the sensors are represented as bitstreams with lengths of 64, 128, and 256 for SC. The fusion results on the  $64 \times 64$  grid are shown as heat maps in Fig. 22 and compared with the exact result. A longer bitstream usually gives a better inference accuracy. Meanwhile, Kullback-Leibler (KL) divergence is introduced to measure the differences between stochastic inference results and exact solutions, as shown in Fig. 23. The dashed yellow line and blue line represent the KL divergence under the specified process variations for the  $64 \times 64$  grid and  $32 \times 32$  grid, respectively, while the solid yellow line and blue line represent the KL divergence without process variations. For the same KL divergence value, the bitstream length required with process variation is usually larger than that without process variation, but still smaller than that required in [49]. As reported in previous work [49], the sensor fusion on the  $32 \times 32$  grid for  $10^4$  cycles could obtain a KL divergence of 0.029. However, SPINBIS only needs about 128 cycles to achieve



Fig. 21. SPINBIS diagram of target locating problem.



Fig. 22. Sensor fusion results of SPINBIS for target locating problem on  $64 \times 64$  grid compared with exact solutions.



Fig. 23. KL divergence analysis of SPINBIS for target locating problem.

such a KL divergence as shown in Fig. 23 even with the consideration of process variation. In summary, these advantages benefit from the high accuracy and low correlation bitstreams generated by the MTJ-based SBG array.
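For reference, the KL divergence between an exact posterior grid and a stochastic estimate can be computed as follows. This is a sketch under assumptions that the text does not fix: both grids are flattened and normalized to sum to 1, the divergence is taken from the exact distribution to the approximation with the natural logarithm, and zero entries of the approximation are clamped to a small `eps`. The function names are hypothetical.

```python
import math


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("distribution must have positive mass")
    return [v / total for v in values]


def kl_divergence(exact, approx, eps=1e-12):
    """KL(exact || approx) in nats for two flattened probability grids."""
    p = normalize(exact)
    q = normalize(approx)
    # Terms with p_i = 0 contribute nothing; q_i is clamped to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    exact = [0.5, 0.5]
    approx = [0.25, 0.75]
    print(kl_divergence(exact, exact))
    print(kl_divergence(exact, approx))
```

A value of 0 means the stochastic result matches the exact solution, and larger values indicate a larger mismatch.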

4) *Performance:* The performance of SPINBIS with the considerations of process variations is compared with FPGA [49] and MTJ [35] based approaches. The sensor fusion problem is evaluated by these approaches on the  $32 \times 32$  grid.

TABLE V COMPARISON BETWEEN SC METHOD AND 8-BIT FIXED POINT BINARY IMPLEMENTATION ON FPGA

| Method | Bitstream Length | KL Divergence | Energy       | Utilization (LUT) | Utilization (FF) |
|--------|------------------|---------------|--------------|-------------------|------------------|
| SC     | 64               | 0.051         | $0.66 \mu J$ |                   |                  |
|        | 128              | 0.037         | $1.32 \mu J$ | 9316              | 17608            |
|        | 200              | 0.031         | $2.06 \mu J$ | 9510              | 17008            |
|        | 256              | 0.021         | $2.64 \mu J$ |                   |                  |
|        | 512              | 0.014         | $5.28 \mu J$ |                   |                  |
| Binary | -                | -             | $1.99 \mu J$ | 234496            | 253952           |

TABLE VI SPINBIS PERFORMANCE COMPARISON WITH OTHER METHODS WITH THE REQUIREMENT OF KL DIVERGENCE LESS THAN 0.029 ON 32 × 32 GRID, WHERE  $E_{cyc}$  IS ENERGY CONSUMPTION OF EACH CYCLE,  $T_{cyc}$  IS THE DURATION TIME OF EACH CYCLE,  $N_{cyc}$  IS THE TOTAL CYCLE COUNT,  $E_{tot}$  IS THE TOTAL ENERGY CONSUMPTION FOR ALL CYCLES,  $T_{tot}$  IS THE TOTAL INFERENCE TIME,  $N_{cmos}$  IS THE NUMBER OF UTILIZED CMOS TRANSISTORS

| Method   | $E_{cyc}$ (nJ) | $T_{cyc}$ (ns) | $N_{cyc}$ | $E_{tot}$ ($\mu$J) | $T_{tot}$ ($\mu$s) | $N_{cmos}$ ($\times 10^3$) |
|----------|----------------|----------------|-----------|--------------------|--------------------|----------------------------|
| FPGA     | 10.3           | 10             | 256       | 2.64               | 2.56               | -                          |
| MTJ [35] | 4.58           | 40             | 256       | 1.17               | 10.24              | $\approx 830$              |
| SPINBIS  | 0.78           | 10             | 128       | 0.10               | 1.28               | $\approx 1200$             |

First, the SC method is compared with an 8-bit fixed-point binary implementation on the same FPGA platform, a Xilinx Zynq 7020. The SC method follows [49] but is reimplemented by ourselves for the sake of fairness and clarity. Table V lists the SC results for different bitstream lengths. A longer bitstream obtains a lower KL divergence (better accuracy) but requires more energy. Once the bitstream length exceeds about 200, the SC method consumes more energy than the 8-bit fixed-point binary implementation. Additionally, the resource utilization of the SC approach is much lower than that of the binary implementation. In effect, the SC method provides a tradeoff between energy consumption and inference accuracy. Hence, SC is very promising for fault-tolerant embedded applications which require higher area efficiency. The SC results of the different approaches are then compared in Table VI. All of the inference approaches are required to reach a KL divergence less than 0.029. As seen from Table VI, the energy efficiency of the MTJ-based approach [35] is significantly better than that of the FPGA approach [49]. Furthermore, SPINBIS achieves better energy efficiency and inference speed

compared with the MTJ [35] and FPGA [49] approaches, at the cost of about 45% design area overhead relative to the MTJ-based approach [35].
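The headline ratios can be rederived from the per-cycle figures in Table VI. The Python check below uses only the tabulated values; the variable and function names are hypothetical.

```python
# (E_cyc in nJ, T_cyc in ns, N_cyc, N_cmos in thousands) copied from Table VI.
TABLE_VI = {
    "FPGA": (10.3, 10, 256, None),
    "MTJ [35]": (4.58, 40, 256, 830),
    "SPINBIS": (0.78, 10, 128, 1200),
}


def totals(e_cyc_nj, t_cyc_ns, n_cyc):
    """Return (E_tot in uJ, T_tot in us) for a given cycle count."""
    return e_cyc_nj * n_cyc / 1000.0, t_cyc_ns * n_cyc / 1000.0


def energy_gain(baseline, proposed="SPINBIS"):
    """Ratio of total energy of a baseline method to that of the proposed one."""
    e_base, _ = totals(*TABLE_VI[baseline][:3])
    e_prop, _ = totals(*TABLE_VI[proposed][:3])
    return e_base / e_prop


if __name__ == "__main__":
    for name, row in TABLE_VI.items():
        print(name, totals(*row[:3]))
    print("gain over MTJ [35]:", energy_gain("MTJ [35]"))
    print("gain over FPGA:", energy_gain("FPGA"))
    print("area ratio vs MTJ [35]:", TABLE_VI["SPINBIS"][3] / TABLE_VI["MTJ [35]"][3])
```

The ratios come out at roughly 11.7 and 26.4 for energy and 1.45 for transistor count, consistent with the quoted 12×, 26×, and 45% overhead.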

#### VI. CONCLUSION

Spintronic devices are a promising technology because of their low power, high speed, infinite endurance, and easy integration with CMOS circuits. In this paper, the inherent stochastic behavior of the MTJ is utilized to build the SBG, which is critical for BIS. A state-aware self-control strategy is proposed to improve the energy efficiency and speed of the SBG circuit. The SBG sharing strategy and terminal clustering strategy are further proposed in SPINBIS to reduce the energy consumption and design area overhead. A device-to-architecture level framework is demonstrated to evaluate the performance of SPINBIS, and a typical application is presented as a case study. Experimental results on data fusion applications demonstrate that SPINBIS could improve the energy efficiency by about  $12 \times$  over the MTJ-based approach, with 45% design area overhead, and by about  $26 \times$  over the FPGA-based approach.

In the future, we will continue our research in the following directions. First, the relationship between probability and voltage is not very smooth, so it is necessary to improve the stability of the proposed SBG. Second, the adopted switch matrix could still have a congestion problem even though its scale is reduced from  $M \times N$  to  $M \times N'$. Further reduction of the scale of SPINBIS is also a desirable research direction.

#### REFERENCES

- [1] P. Pinheiro and P. Lima, "Bayesian sensor fusion for cooperative object localization and world modeling," in *Proc. CIAS*, 2004, pp. 1–8.
- [2] N. Cruz-Ramírez *et al.*, "Diagnosis of breast cancer using Bayesian networks: A case study," *Comput. Biol. Med.*, vol. 37, no. 11, pp. 1553–1564, 2007.
- [3] Y. Gal, R. Islam, and Z. Ghahramani, "Deep Bayesian active learning with image data," arXiv preprint arXiv:1703.02910, 2017.
- [4] C. S. Thakur *et al.*, "Bayesian estimation and inference using stochastic electronics," *Front. Neurosci.*, vol. 10, p. 104, Mar. 2016.
- [5] A. Alaghi and J. P. Hayes, "Survey of stochastic computing," ACM Trans. Embedded Comput. Syst., vol. 12, no. 2S, p. 92, 2013.
- [6] J. Grollier, D. Querlioz, and M. D. Stiles, "Spintronic nanodevices for bioinspired computing," *Proc. IEEE*, vol. 104, no. 10, pp. 2024–2039, Oct. 2016.
- [7] M. Lin, I. Lebedev, and J. Wawrzynek, "High-throughput Bayesian computing machine with reconfigurable hardware," in *Proc. FPGA*, 2010, pp. 73–82.
- [8] P. Mroszczyk and P. Dudek, "The accuracy and scalability of continuous-time Bayesian inference in analogue CMOS circuits," in *Proc. ISCAS*, 2014, pp. 1576–1579.
- [9] J. S. Friedman, L. E. Calvet, P. Bessière, J. Droulez, and D. Querlioz, "Bayesian inference with Muller C-elements," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 63, no. 6, pp. 895–904, Jun. 2016.
- [10] A. F. Vincent *et al.*, "Analytical macrospin modeling of the stochastic switching time of spin-transfer torque devices," *IEEE Trans. Electron Devices*, vol. 62, no. 1, pp. 164–170, Jan. 2015.
- [11] T. Devolder *et al.*, "Single-shot time-resolved measurements of nanosecond-scale spin-transfer induced switching: Stochastic versus deterministic aspects," *Phys. Rev. Lett.*, vol. 100, no. 5, 2008, Art. no. 057206.
- [12] L. A. de Barros Naviner, H. Cai, Y. Wang, W. Zhao, and A. B. Dhia, "Stochastic computation with spin torque transfer magnetic tunnel junction," in *Proc. NEWCAS*, 2015, pp. 1–4.
- [13] Y. Wang *et al.*, "A novel circuit design of true random number generator using magnetic tunnel junction," in *Proc. NANOARCH*, 2016, pp. 123–128.

- [14] S. Wang et al., "Hybrid VC-MTJ/CMOS non-volatile stochastic logic for efficient computing," in Proc. DATE, 2017, pp. 1438–1443.
- [15] J. von Neumann, "Probabilistic logics and synthesis of reliable organisms from unreliable components," in *Automata Studies*, C. Shannon and J. McCarthy, Eds., Princeton Univ. Press, 1956, pp. 43–98.
- [16] R. Venkatesan, S. Venkataramani, X. Fong, K. Roy, and A. Raghunathan, "Spintastic: Spin-based stochastic logic for energy-efficient computing," in *Proc. DATE*, 2015, pp. 1575–1578.
- [17] A. Mondal and A. Srivastava, "Power optimizations in MTJ-based neural networks through stochastic computing," in *Proc. ISLPED*, 2017, pp. 1–6.
- [18] J. S. Friedman, J. Droulez, P. Bessière, J. Lobo, and D. Querlioz, "Approximation enhancement for stochastic Bayesian inference," *Int. J. Approx. Reasoning*, vol. 85, pp. 139–158, Jun. 2017.
- [19] Y. Liu, S. Liu, Y. Wang, F. Lombardi, and J. Han, "A stochastic computational multi-layer perceptron with backward propagation," *IEEE Trans. Comput.*, vol. 67, no. 9, pp. 1273–1286, Sep. 2018.
- [20] A. Ren et al., "SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing," ACM SIGOPS Oper. Syst. Rev., vol. 51, no. 2, pp. 405–418, 2017.
- [21] K. Kim et al., "Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks," in Proc. DAC, 2016, pp. 1–6.
- [22] J. Li *et al.*, "Towards acceleration of deep convolutional neural networks using stochastic computing," in *Proc. ASPDAC*, 2017, pp. 115–120.
- [23] P. Jeavons, D. A. Cohen, and J. Shawe-Taylor, "Generating binary sequences for stochastic computing," *IEEE Trans. Inf. Theory*, vol. 40, no. 3, pp. 716–720, May 1994.
- [24] R. Cai et al., "VIBNN: Hardware acceleration of Bayesian neural networks," in Proc. ASPLOS, 2018, pp. 476–488.
- [25] K. Kim, J. Lee, and K. Choi, "An energy-efficient random number generator for stochastic circuits," in *Proc. ASPDAC*, 2016, pp. 256–261.
- [26] P. K. Gupta and R. Kumaresan, "Binary multiplication with PN sequences," *IEEE Trans. Acoust. Speech Signal Process.*, vol. 36, no. 4, pp. 603–606, Apr. 1988.
- [27] M. Wang *et al.*, "Current-induced magnetization switching in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel magnetoresistance," *Nat. Commun.*, vol. 9, no. 1, p. 671, 2018.
- [28] M. Marins de Castro *et al.*, "Precessional spin-transfer switching in a magnetic tunnel junction with a synthetic antiferromagnetic perpendicular polarizer," *J. Appl. Phys.*, vol. 111, no. 7, 2012, Art. no. 07C912.
- [29] Y. Zhang *et al.*, "Multi-level cell spin transfer torque MRAM based on stochastic switching," in *Proc. IEEE NANO*, 2013, pp. 233–236.
- [30] G. Srinivasan, A. Sengupta, and K. Roy, "Magnetic tunnel junction enabled all-spin stochastic spiking neural network," in *Proc. DATE*, 2017, pp. 530–535.
- [31] S. Angizi *et al.*, "Leveraging spintronic devices for efficient approximate logic and stochastic neural networks," in *Proc. Great Lakes Symp. VLSI*, 2018, pp. 397–402.
- [32] K. Cao *et al.*, "In-memory direct processing based on nanoscale perpendicular magnetic tunnel junctions," *Nanoscale*, vol. 10, no. 45, pp. 21225–21230, 2018.
- [33] X. Fong, M.-C. Chen, and K. Roy, "Generating true random numbers using on-chip complementary polarizer spin-transfer torque magnetic tunnel junctions," in *Proc. Device Res. Conf.*, 2014, pp. 103–104.
- [34] N. Onizawa, D. Katagiri, W. J. Gross, and T. Hanyu, "Analog-to-stochastic converter using magnetic tunnel junction devices for vision chips," *IEEE Trans. Nanotechnol.*, vol. 15, no. 5, pp. 705–714, Sep. 2016.
- [35] X. Jia *et al.*, "Spintronics based stochastic computing for efficient Bayesian inference system," in *Proc. ASPDAC*, 2018, pp. 580–585.
- [36] G. M. Masson, G. C. Gingher, and S. Nakamura, "A sampler of circuit switching networks," *Computer*, vol. 12, no. 6, pp. 32–48, Jun. 1979.
- [37] J. Kim, K. Ryu, S. H. Kang, and S.-O. Jung, "A novel sensing circuit for deep submicron spin transfer torque MRAM (STT-MRAM)," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 1, pp. 181–186, Jan. 2012.
- [38] P. M. Figueiredo and J. C. Vital, *Offset Reduction Techniques in High-Speed Analog-to-Digital Converters: Analysis, Design and Tradeoffs*. Dordrecht, The Netherlands: Springer, 2009.
- [39] J. Yang *et al.*, "Radiation-induced soft error analysis of STT-MRAM: A device to circuit approach," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 35, no. 3, pp. 380–393, Mar. 2016.
- [40] J. Yang *et al.*, "Exploiting spin-orbit torque devices as reconfigurable logic for circuit obfuscation," *IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.*, vol. 38, no. 1, pp. 57–69, Jan. 2019.

- [41] R. J. Baker, *CMOS: Circuit Design, Layout, and Simulation*, vol. 1. Hoboken, NJ, USA: Wiley, 2008.
- [42] Y. Wang *et al.*, "Compact model of magnetic tunnel junction with stochastic spin transfer torque switching for reliability analyses," *Microelectron. Rel.*, vol. 54, nos. 9–10, pp. 1774–1778, 2014.
- [43] S.-S. Choi, S.-H. Cha, and C. C. Tappert, "A survey of binary similarity and distance measures," *J. Syst. Cybern. Informat.*, vol. 8, no. 1, pp. 43–48, 2010.
- [44] A. Alaghi and J. P. Hayes, "Exploiting correlation in stochastic circuit design," in *Proc. ICCD*, 2013, pp. 39–46.
- [45] A. Nigam *et al.*, "Delivering on the promise of universal memory for spin-transfer torque RAM (STT-RAM)," in *Proc. ISLPED*, 2011, pp. 121–126.
- [46] J. Li, C. Augustine, S. Salahuddin, and K. Roy, "Modeling of failure probability and statistical design of spin-torque transfer magnetic random access memory (STT MRAM) array for yield enhancement," in *Proc. DAC*, 2008, pp. 278–283.
- [47] D. Worledge *et al.*, "Spin torque switching of perpendicular Ta|CoFeB|MgO-based magnetic tunnel junctions," *Appl. Phys. Lett.*, vol. 98, no. 2, 2011, Art. no. 022501.
- [48] R. Heindl *et al.*, "Validity of the thermal activation model for spin-transfer torque switching in magnetic tunnel junctions," *J. Appl. Phys.*, vol. 109, no. 7, 2011, Art. no. 073910.
- [49] A. Coninx *et al.*, "Bayesian sensor fusion with fast and low power stochastic circuits," in *Proc. ICRC*, San Diego, CA, USA, 2016, pp. 1–8.



**Xiaotao Jia** (S'13–M'17) received the B.S. degree in mathematics from Beijing Jiao Tong University, Beijing, China, in 2011 and the Ph.D. degree in computer science and technology from Tsinghua University, Beijing, in 2016.

He is currently a Post-Doctoral Researcher with the Fert Beijing Research Institute, Beihang University, Beijing. His current research interests include spintronic circuits and Bayesian learning systems.



**Pengcheng Dai** received the B.S. degree in electronic engineering from Beihang University, Beijing, China, in 2017, where he is currently pursuing the graduate degree with the School of Electronic and Information Engineering.

His current research interests include computing architectures for deep learning and machine vision.

**Runze Liu** received the B.S. degree from the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2018.

His current research interests include computing architectures for deep learning and machine vision.



**Yiran Chen** (M'04–SM'16–F'18) received the B.S. and M.S. degrees from Tsinghua University, Beijing, China, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2005.

After five years in industry, he joined the University of Pittsburgh, Pittsburgh, PA, USA, in 2010 as an Assistant Professor and then was promoted to an Associate Professor with tenure in 2014, and was named a Bicentennial Alumni Faculty Fellow. He is currently a Tenured Associate Professor with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, and serves as the Co-Director of the Duke Center for Evolutionary Intelligence, focusing on the research of new memory and storage systems, machine learning and neuromorphic computing, and mobile computing systems. He has published 1 book and over 300 technical publications and has been granted 93 U.S. patents.

Dr. Chen was a recipient of the NSF CAREER Award, the ACM SIGDA Outstanding New Faculty Award, and six best paper awards and 14 best paper nominations from international conferences. He is an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, IEEE D&T, IEEE EMBEDDED SYSTEMS LETTERS, ACM JETC, and ACM TCPS, and served on the technical and organization committees of over 40 international conferences.



**Jianlei Yang** (S'12–M'16) received the B.S. degree in microelectronics from Xidian University, Xi'an, China, in 2009 and the Ph.D. degree in computer science and technology from Tsinghua University, Beijing, China, in 2014.

He joined Beihang University, Beijing, in 2016, where he is currently an Associate Professor with the School of Computer Science and Engineering. From 2014 to 2016, he was a Post-Doctoral Researcher with the Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA. From 2013 to 2014, he was a Research Intern with Intel Labs China, Beijing, and Intel Corporation, Santa Clara, CA, USA. His current research interests include spintronics and neuromorphic computing systems.

Dr. Yang was a recipient of the First Place prize from the TAU Power Grid Simulation Contest in 2011, the Second Place prize from the TAU Power Grid Transient Simulation Contest in 2012, the IEEE ICCD Best Paper Award in 2013, the IEEE ICESS Best Paper Award in 2017, and the ACM GLSVLSI Best Paper Nomination in 2015.

**Weisheng Zhao** (M'06–SM'14–F'19) received the Ph.D. degree in physics from the University of Paris Sud, Paris, France, in 2007.

He was a Research Associate with the CEA's Embedded Computing Laboratory, Paris, from 2007 to 2009, and the French National Research Center (CNRS), Paris, as a Tenured Scientist from 2009 to 2014, where he led the Spintronics Integration Group. He is currently a Professor and the Director of the Fert Beijing Research Institute, Beihang University, Beijing, China. He has authored or co-authored 2 books and over 200 scientific papers in leading journals, such as *Nature Communications*, *Advanced Materials*, and the PROCEEDINGS OF THE IEEE. He holds over 4 international patents and over 50 Chinese patents.

Prof. Zhao is an Associate Editor of the IEEE TRANSACTIONS ON NANOTECHNOLOGY and *IET Electronics Letters*.