www.prismmodelchecker.org

NAND Multiplexing

(von Neumann)

Contents

Related publications: [NPKS05]

Introduction

The case study concerns NAND multiplexing, a technique for constructing reliable computation from unreliable devices. As we move from deep submicron technology to nanotechnology for device manufacture, the need for architectures to be defect-tolerant is gaining importance. This is because, at the nanoscale, devices will be prone to errors due to manufacturing defects, ageing, and transient faults. Micro-architects will be required to design their logic around defect tolerance through redundancy.

In 1952, von Neumann studied the problem of constructing reliable computation from unreliable devices (due to unreliable valve based computers at that time), introducing a redundancy technique called NAND multiplexing [vN56]. He showed that, if the failure probabilities of the gates are sufficiently small and failures are statistically independent, then computations may be done with a high probability of correctness. Later, in [DO77], it was shown that a logarithmic redundancy is necessary for some Boolean function computations, and sufficient for all Boolean functions. In [Pip88], Pippenger showed that von Neumann's construction works only when the probability of failure per gate has a limit strictly less than 1/2, and that computation in the presence of noise (which can be seen as the presence of defect), requires more layers of redundancy. In [HJ02, NSF01], von Neumann's NAND multiplexing was compared to other techniques for fault-tolerance (such as R-fold Modular Redundancy and Reconfiguration) and some theoretical calculations were done to show that the redundancy level must be quite high to obtain acceptable levels of reliability.

NAND Multiplexing

The basic technique of multiplexing is to replace a processing unit by multiplexed units which have N copies of every input and output of the unit. In each multiplexing unit, there are N devices which in parallel process the copies of the inputs to give N outputs which represent the output of the original processing unit. If the inputs and devices are reliable, then each element of the output set should be identical and equal to the output of the corresponding processing unit. However, in the case when there are errors in the inputs and errors due to faulty devices, the outputs will not all take the same value. Instead, after defining some critical level Δ ∈ (0,0.5), the output of the multiplexing unit is considered to be stimulated (taking logical value true or `1') if at least (1 - Δ) ⋅ N of the outputs are stimulated and non-stimulated (taking logical value false or `0') if no more than Δ ⋅ N lines are stimulated. In cases where the number of stimulated outputs does not meet either of these criteria, that is, the number of stimulated outputs is in the interval (Δ ⋅ N , (1 - Δ)⋅ N), then the output is undecided, and hence a malfunction will occur.

The basic design of a multiplexing unit consists of two stages: the executive stage and the restorative stage. The executive stage carries out the basic function of the performance unit to be replaced, while the restorative stage is used to reduce the degradation in the executive stage caused by errors in both the inputs and the faulty devices.

We now consider multiplexing in the case when the processing unit is a single NAND gate. We therefore replace both the inputs and output of the gate with N copies and in the executive stage duplicate the NAND gate N times, as shown below.

picture of NAND muliplexing unit

The unit U represents a random permutation of the input signals, that is, each signal of the first input is randomly paired with a signal from the second input to form an input pair for one of the copies of the NAND gate. Also shown in the figure is the restorative stage which is made using the same technique as the executive stage duplicating the outputs of the executive stage to use as its inputs. Note that, applying this approach only once will invert the result, therefore two steps are required. To give a more effective restoration mechanism the restorative stage can be iterated [vN56].

In [vN56], von Neumann concluded that, for extremely large N, the number of stimulated outputs of the executive stage is a stochastic variable, approximately normally distributed, and he gave an upper bound of 0.0107 for the probability of gate failure that can be tolerated. In other words, according to von Neumann, if the failure probability is greater than this threshold, then the probability of the NAND multiplexing system failing will be larger than a fixed, positive lower bound, no matter how large a bundle size is used. Recently, it was shown that, if each NAND gate fails independently, the tolerable threshold probability of each gate will be 0.08856 [EP98].

Modelling

We now explain the PRISM model of von Neumann's NAND multiplexing system. We first note that it was during this phase that we noticed the error made by [HJ02] in modelling the random permutation made by the unit U. In the analysis technique of [HJ02] the random permutation made by U is instead modelled by a random choice with replacement. More precisely, in the approach of [HJ02], if in the previous stage there are k stimulated outputs, then after passing through the unit U the probability that any one of the resulting inputs of the current stage is stimulated is k/N.

As our analysis will demonstrate, these different approaches to modelling the unit U will lead to different conclusions about the reliability of the system under study. We should note however, that, as the bundle size increases, the differences between the results obtained by these different approaches will converge and, in fact, if the bundle sizes were infinite then these approaches would be equivalent. This can be seen in our results where the difference between the two approaches when the bundle size 20 is greater than when the bundle size is 40.

The first approach to modelling NAND multiplexing in PRISM was to directly model the system given in the figure above: for each stage construct a PRISM module for each of the N NAND gates in the stage and then combine these modules through synchronous parallel composition. However, this approach leads to the well know state space explosion problem. For example, if I/O bundle size equals 20, then just modelling the executive stage of the NAND multiplexing unit required more than 1.0e+14 states.

The important observation, which allowed us to overcome this problem, was that the actual value of each input and output is not important: instead one need only store the total number of stimulated (and non-stimulated) inputs and outputs. Furthermore, to allow us to compute these values, without having to store all the outputs of the NAND gates in each stage, we replace the set of N NAND gates working in parallel with N NAND gates working in sequence. This allows us to fold space into time, or in other words reuse the same NAND gate over time rather than making redundancy over space. We also apply the same methodology to the stages, that is, reuse the same module for each of the stages while keeping a record of the outputs from the previous stage.

Following these observations, the PRISM description of the NAND multiplexing system, for the case when the bundle size equals 20, is given in below. Note that taking this approach does not influence the performance of the system since each NAND gates works independently and the probability of each NAND gate failing is also independent.

// nand multiplex system
// gxn/dxp 20/03/03

// U (correctly) performs a random permutation of the outputs of the previous stage

dtmc

const int N; // number of inputs in each bundle
const int K; // number of restorative stages

const int M = 2*K+1; // total number of multiplexing units

// parameters taken from the following paper
// A system architecture solution for unreliable nanoelectric devices
// J. Han & P. Jonker
// IEEEE trans. on nanotechnology vol 1(4) 2002

const double perr = 0.02; // probability nand works correctly
const double prob1 = 0.9; // probability initial inputs are stimulated

// model whole system as a single module by resuing variables 
// to decrease the state space
module multiplex

	u : [1..M]; // number of stages
	c : [0..N]; // counter (number of copies of the nand done)

	s : [0..4]; // local state
	// 0 - initial state
	// 1 - set x inputs
	// 2 - set y inputs
	// 3 - set outputs
	// 4 - done

	z : [0..N]; // number of new outputs equal to 1
	zx : [0..N]; // number of old outputs equal to 1
	zy : [0..N]; // need second copy for y
	// initially 9 since initially probability of stimulated state is 0.9

	x : [0..1]; // value of first input
	y : [0..1]; // value of second input
	
	[] s=0 & (c<N) -> (s'=1); // do next nand if have not done N yet
	[] s=0 & (c=N) & (u<M) -> (s'=1) & (zx'=z) & (zy'=z) & (z'=0) & (u'=u+1) & (c'=0); // move on to next u if not finished
	[] s=0 & (c=N) & (u=M) -> (s'=4) & (zx'=0) & (zy'=0) & (x'=0) & (y'=0); // finished (so reset variables not needed to reduce state space)

	// choose x permute selection (have zx stimulated inputs)
	// note only need y to be random	
	[] s=1 & u=1  -> prob1 : (x'=1) & (s'=2) + (1-prob1) : (x'=0) & (s'=2); // initially random
	[] s=1 & u>1 & zx>0 -> (x'=1) & (s'=2) & (zx'=zx-1);
	[] s=1 & u>1 & zx=0 -> (x'=0) & (s'=2);

	// choose x randomly from selection (have zy stimulated inputs)
	[] s=2 & u=1 -> prob1 : (y'=1) & (s'=3) + (1-prob1) : (y'=0) & (s'=3); // initially random
	[] s=2 & u>1 & zy<(N-c) & zy>0  -> zy/(N-c) : (y'=1) & (s'=3) & (zy'=zy-1) + 1-(zy/(N-c)) : (y'=0) & (s'=3);
	[] s=2 & u>1 & zy=(N-c) & c<N -> 1 : (y'=1) & (s'=3) & (zy'=zy-1);
	[] s=2 & u>1 & zy=0 -> 1 : (y'=0) & (s'=3);

	// use nand gate
	[] s=3 & z<N & c<N -> (1-perr) : (z'=z+(1-x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0) // not faulty
	         + perr    : (z'=z+(x*y))    & (s'=0) & (c'=c+1) & (x'=0) & (y'=0); // von neumann fault
	// [] s=3 & z<N -> (1-perr) : (z'=z+(1-x*y)) & (s'=0) & (c'=c+1) & (x'=0) & (y'=0) // not faulty
	//         + perr    : (z'=z+(x*y))    & (s'=0) & (c'=c+1) & (x'=0) & (y'=0); // von neumann fault
	
	[] s=4 -> true;
	
endmodule

// rewards: final value of gate
rewards
	[] s=0 & (c=N) & (u=M) : z/N;
endrewards

View: printable version          Download: nand.pm

In the PRISM model given above we have assumed that the inputs X and Y are identically distributed (with probability 0.9 of being stimulated), and the NAND gates suffer from a von Neumann fault (inverted output) with probability 0.02. Note that, in the executive stage, a random permutation of the inputs cannot be performed as we only know the probability of each individual input being stimulated. However, for a fixed initial input, one can easily modify the PRISM model so that the system performs a random permutation of the input. To change the number of restorative stages, bundle size, input probabilities or probability of the NAND gates failing requires only modification of the constants given at the start of the description.

To model the approach of [HJ02], the only modifications that need to be made are to the probabilities with which the variables x and y are set (when the local state s equals 1 and 2 respectively), that is, replace the random permutation of the inputs with a random choice with replacement.

Model Statistics

The tables below shows statistics and construction times of the models when the probability of gate failure equals 0.02. The tables include:

  • the number of states in the DTMC representing the model;
  • both the number of nodes and leaves of the MTBDD which represents the model;
  • the construct time which equals the time taken for the system description to be parsed and converted to the MTBDD representing it, and the time to perform reachability, identify the non-reachable states and filter MTBDD accordingly;
  • the number of iterations required to find the reachable states (which is performed via a BDD based fixpoint algorithm).

We also give the same statistics when the random permutation performed by the unit U is replaced by a random choice with replacement as used by [HJ02]. The results corresponding to this case are referenced 'UR', and we use 'U' to reference the results corresponding to the case when U does indeed perform a random permutation.

  • bundle size (N) equals 20

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction:
    Time (s): Iter.s: Time (s): Iter.s:
    178,311 20,8381311.373241 69,741 1,730230.21241
    2154,92120,8471312.53 401 137,7811,739230.32401
    3231,53120,8481313.706561 205,8211,740230.44561
    4308,14120,8561314.846721 273,8611,748230.56721
    5384,75120,8561316.093881 341,9011,748230.67881
    6461,36120,8571317.2791041409,9411,749230.811041
    7537,97120,8571318.5111201477,9811,749230.901201
    8614,58120,8651319.5981361546,0211,757231.071361
    9691,19120,86513110.941521614,0611,757231.191521
    10767,80120,86513111.981681682,1011,757231.271681
    11844,41120,86513113.441841750,1411,757231.451841
    12921,02120,86613114.432001818,1811,758231.532001
    13997,63120,86613116.012161886,2211,758231.652161
  • bundle size (N) equals 40

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction:
    Time (s): Iter.s: Time (s): Iter.s:
    11,004,82146,2218985.42481 534,681 3,262430.61481
    22,003,04146,2308989.97801 1,062,7613,271431.00801
    33,001,26146,23189816.41,1211,590,8413,272431.451,121
    43,999,48146,23989820.51,4412,118,9213,280432.031,441
    54,997,70146,23989824.21,7612,647,0013,280432.441,761
    65,995,92146,24089830.72,0813,175,0813,281432.692,081
    76,994,14146,24089835.12,4013,703,1613,281433.962,401
  • bundle size (N) equals 60

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    States: Nodes: Leaves: Construction: States: Nodes: Leaves: Construction:
    Time (s): Iter.s: Time (s): Iter.s:
    14,717,531 100,4432,42111.7721 1,778,821 4,328631.91721
    29,420,361 100,4522,42127.91,2013,542,941 4,337632.151,201
    314,123,191100,4532,42137.41,6815,307,061 4,338632.591,681
    418,826,021100,4612,42152.82,1617,071,181 4,346633.582,161
    523,528,851100,4612,42162.02,6418,835,301 4,346634.622,641
    628,231,681100,4622,42178.83,12110,599,4214,347634.963,121
    732,934,511100,4622,42199.23,60112,363,5414,347635.743,601
Model Checking Times

The properties we consider are the probability of there being k stimulated outputs (which in terms of the PRISM model presented in above, corresponds to the probability of reaching the state where z=k, u=3 and c=20), for k=0,...,N where N is the system's I/O bundle size. By performing this analysis we have in fact computed the output distribution of the system, and hence any measure of reliability can be calculated from these results. Note that PRISM can be used directly for computing these measures of reliability, for example, the probability of errors being less than than k% and the expected number of incorrect outputs of the system.

The model checking statistics given below are for calculating the probability of there being 0 stimulated outputs when the probability of gate failure equals 0.01 using PRISm's hybrid engine.

There are two precomputation steps in the model checking procedure involving BDD based fixpoint algorithms: Prob1 and Prob0 which finds those states such that the probability of satisfying the formula is 1 and 0 respectively.

In the main algorithm we stop iterating when the relative error between subsequent iteration vectors is less than 1e-6, that is:

relative error
  • bundle size (N) equals 20

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    Time: Iterations: Time: Iterations:
    Prob1: Prob0: Main: Prob1: Prob0: Main:
    13.29246 241 20.411246 241 2
    25.30406 401 20.779406 401 2
    37.61566 561 21.091566 561 2
    49.71726 721 21.521726 721 2
    512.3886 881 21.862886 881 2
    614.61,0461,04122.19 1,0461,0412
    716.01,2061,20122.42 1,2061,2012
    819.21,3661,36122.72 1,3661,3612
    921.11,5261,52123.11 1,5261,5212
    1023.91,6861,68123.44 1,6861,6812
    1127.81,8461,84124.00 1,8461,8412
    1228.92,0062,00124.39 2,0062,0012
    1332.02,1662,16124.55 2,1662,1612
  • bundle size (N) equals 40

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    Time: Iterations: Time: Iterations:
    Prob1: Prob0: Main: Prob1: Prob0: Main:
    128.4486 481 22.78 486 481 2
    246.1806 801 24.53 806 801 2
    364.31,1261,12125.90 1,1261,1212
    476.11,4461,44127.19 1,4461,4412
    588.31,7661,76129.93 1,7661,7612
    6118 2,0862,081212.4 2,0862,0812
    7130 2,4062,401213.6 2,4062,4012
  • bundle size (N) equals 40

    Restorative stages:   U (random permutation)   UR (random choice with replacement)
    Time: Iterations: Time: Iterations:
    Prob1: Prob0: Main: Prob1: Prob0: Main:
    148.7726 721 23.88 726 721 2
    297.11,2061,201210.6 1,2061,2012
    3140 1,6861,681213.6 1,6861,6812
    4170 2,1662,161217.4 2,1662,1612
    5254 2,6462,641220.8 2,6462,6412
    6427 3,1263,121222.7 3,1263,1212
    7608 3,6063,601224.2 3,6063,6012

Experimental Results

We now study the performance of NAND multiplexing systems both when the I/O bundles are of size 20, 40 and 60. In all the experiments, we assume that the inputs X and Y are identical (this is often true in circuits containing similar devices) and that initially 90% of the inputs are stimulated (correct). We suppose that the gate failure in each NAND is a von Neumann fault, that is, when a gate fails, the value of its output is inverted.

Our analysis of the reliability of the NAND multiplexing system using probabilistic model checking concentrates on the effects of the failure probabilities of the NAND gates and of the number of restorative stages. Recall that, to change either of these in the PRISM language description of the system, one needs only change the parameter probz or the parameter M. The results we present show:

  • the shape of the output distribution as the probability of gate failure varies;
  • an analysis of reliability, in terms of the probability that at most 10% of the outputs are incorrect, as the probability of gates failure varies;
  • for different probabilities of gate failure, the resulting change in shape of the output distribution when additional restorative stages are added;
  • how, in the case when the probability of gate failure is very small, the reliability can be improved by increasing the number of restorative stages;
  • by comparing the probability that at most 10% of the outputs are incorrect and the expected percentage of incorrect outputs for different numbers of restorative stages, the maximum probability of gate failure allowed for the system to function reliably.
Output Distribution when bundle size (N) equals 20

In graphs below we present, for systems with a different numbers of restorative stages and both in the case when the unit U performs a random permutation and when it performs a random choice with replacement, the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

  • 1 restorative stage (random permutation)

    plots output distribution (1 restorative stage, bundle size 20 and U)
  • 1 restorative stage (random choice with replacement)

    plots output distribution (1 restorative stage, bundle size 20 and UR)
  • 4 restorative stages (random permutation)

    plots output distribution (4 restorative stages, bundle size 20 and U)
  • 4 restorative stages (random choice with replacement)

    plots output distribution (4 restorative stages, bundle size 20 and UR)
  • 7 restorative stages (random permutation)

    plots output distribution (7 restorative stages, bundle size 20 and U)
  • 7 restorative stages (random choice with replacement)

    plots output distribution (7 restorative stages, bundle size 20 and UR)
Output Distribution when bundle size (N) equals 40

In graphs below we present, for systems with a different numbers of stages and both in the case when the unit U performs a random permutation and when it performs a random choice with replacement, the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

  • 1 restorative stage (random permutation)

    plots output distribution (1 restorative stage, bundle size 40 and U)
  • 1 restorative stage (random choice with replacement)

    plots output distribution (1 restorative stage, bundle size 40 and UR)
  • 4 restorative stages (random permutation)

    plots output distribution (4 restorative stages, bundle size 40 and U)
  • 4 restorative stages (random choice with replacement)

    plots output distribution (4 restorative stages, bundle size 40 and UR)
  • 7 restorative stages (random permutation)

    plots output distribution (7 restorative stages, bundle size 40 and U)
  • 7 restorative stages (random choice with replacement)

    plots output distribution (7 restorative stages, bundle size 40 and UR)
Output Distribution when bundle size (N) equals 60

In graphs below we present, for systems with a different numbers of stages and both in the case when the unit U performs a random permutation and when it performs a random choice with replacement, the output distribution when the probability of gate failure equals 0.1, 0.04, 0.02 and 0.0001.

Note that the system is supposed to model a correctly functioning NAND gate and we assume that the inputs are correct when stimulated. Hence, the more outputs that are non-stimulated, the greater the reliability of the system.

  • 1 restorative stage (random permutation)

    plots output distribution (1 restorative stage, bundle size 60 and U)
  • 1 restorative stage (random choice with replacement)

    plots output distribution (1 restorative stage, bundle size 60 and UR)
  • 4 restorative stages (random permutation)

    plots output distribution (4 restorative stages, bundle size 60 and U)
  • 4 restorative stages (random choice with replacement)

    plots output distribution (4 restorative stages, bundle size 60 and UR)
  • 7 restorative stages (random permutation)

    plots output distribution (7 restorative stages, bundle size 60 and U)
  • 7 restorative stages (random choice with replacement)

    plots output distribution (7 restorative stages, bundle size 60 and UR)
Reliability versus the probability of gate failure

The following graph plots the probability that at most 10% of the outputs of the overall system are incorrect (stimulated) against the failure probability of the gates.

  • bundle size (N) equals 20

    plots probability at most 10% outputs are incorrect for bundle size 60 (small failure rate)
  • bundle size (N) equals 40

    plots probability at most 10% outputs are incorrect for bundle size 60 (small failure rate)
  • bundle size (N) equals 60

    plots probability at most 10% outputs are incorrect for bundle size 60 (small failure rate)
Reliability versus the number of stages

The following graph plots the probability that at most 10% of the outputs of the overall system are incorrect (stimulated) against the number of stages of the system.

  • bundle size (N) equals 20

    plots probability at most 10% outputs are incorrect for bundle size 20 (large failure rate)
  • bundle size (N) equals 40

    plots probability at most 10% outputs are incorrect for bundle size 40 (large failure rate)
  • bundle size (N) equals 60

    plots probability at most 10% outputs are incorrect for bundle size 60 (large failure rate)

The following graph plots the expected percentage of incorrect (stimulated) inputs against the number of stages of the system.

  • bundle size N equals 20

    plots expected % of incorrect outputs for bundle size 20
  • bundle size N equals 40

    plots expected % of incorrect outputs for bundle size 40
  • bundle size N equals 60

    plots expected % of incorrect outputs for bundle size 60

Case Studies