{"title": "Backprop with Approximate Activations for Memory-efficient Network Training", "book": "Advances in Neural Information Processing Systems", "page_first": 2429, "page_last": 2438, "abstract": "Training convolutional neural network models is memory intensive since back-propagation requires storing activations of all intermediate layers. This presents a practical concern when seeking to deploy very deep architectures in production, especially when models need to be frequently re-trained on updated datasets. In this paper, we propose a new implementation for back-propagation that significantly reduces memory usage, by enabling the use of approximations with negligible computational cost and minimal effect on training performance. The algorithm reuses common buffers to temporarily store full activations and compute the forward pass exactly. It also stores approximate per-layer copies of activations, at significant memory savings, that are used in the backward pass. Compared to simply approximating activations within standard back-propagation, our method limits accumulation of errors across layers. This allows the use of much lower-precision approximations without affecting training accuracy. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method yields performance close to exact training, while storing activations compactly with as low as 4-bit precision.", "full_text": "Backprop with Approximate Activations\nfor Memory-ef\ufb01cient Network Training\n\nAyan Chakrabarti\n\nWashington University in St. Louis\n\n1 Brookings Dr., St. Louis, MO 63130\n\nayan@wustl.edu\n\nBenjamin Moseley\n\nCarnegie Mellon University\n\n5000 Forbes Ave., Pittsburgh, PA 15213\n\nmoseleyb@andrew.cmu.edu\n\nAbstract\n\nTraining convolutional neural network models is memory intensive since back-\npropagation requires storing activations of all intermediate layers. 
This presents a\npractical concern when seeking to deploy very deep architectures in production,\nespecially when models need to be frequently re-trained on updated datasets. In this\npaper, we propose a new implementation for back-propagation that signi\ufb01cantly\nreduces memory usage, by enabling the use of approximations with negligible\ncomputational cost and minimal effect on training performance. The algorithm\nreuses common buffers to temporarily store full activations and compute the for-\nward pass exactly. It also stores approximate per-layer copies of activations, at\nsigni\ufb01cant memory savings, that are used in the backward pass. Compared to sim-\nply approximating activations within standard back-propagation, our method limits\naccumulation of errors across layers. This allows the use of much lower-precision\napproximations without affecting training accuracy. Experiments on CIFAR-10,\nCIFAR-100, and ImageNet show that our method yields performance close to exact\ntraining, while storing activations compactly with as low as 4-bit precision.\n\n1\n\nIntroduction\n\nThe use of deep convolutional neural networks has become prevalent for a variety of visual and other\ninference tasks [9], with a trend to employ increasingly larger and deeper network architectures [8, 10]\nthat are able to express more complex functions. While deeper architectures have delivered signi\ufb01cant\nimprovements in performance, they have also increased demand on computational resources. In\nparticular, training such networks requires a signi\ufb01cant amount of on-device GPU memory\u2014much\nmore so than during inference\u2014in order to store activations of all intermediate layers of the network\nthat are needed for gradient computation during back-propagation.\nThis leads to large memory footprints during training for state-of-the-art deep architectures, especially\nwhen training on high-resolution images with a large number of activations per layer. 
This in turn\ncan lead to the computation being inef\ufb01cient and \u201cmemory-bound\u201d: it forces the use of smaller\ntraining batches for each GPU, leading to under-utilization of available GPU cores (smaller batches\nalso complicate the use of batch-normalization [12], with batch statistics computed over fewer\nsamples). Consequently, practitioners are forced to either use a larger number of GPUs for parallelism,\nor contend with slower training. This makes it expensive to deploy deep architectures for many\napplications, especially when networks need to be continually trained as more data becomes available.\nPrior work to address this has traded off memory for computation [2, 4, 5, 14], but with a focus on\nenabling exact gradient computation. However, stochastic gradient descent (SGD) inherently works\nwith noisy gradients at each iteration and, in the context of distributed training, has been shown to\nsucceed when aggregating approximate and noisy parameter gradients across multiple\ndevices [3, 16, 19, 20]. Motivated by this, we propose a method that uses approximate activations\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: Proposed Approach. Standard backprop requires storing activations of each layer for\ngradient computation during the backward pass, leading to a large memory footprint. Our method\npermits these activations to be approximated to lower memory usage, while preventing errors from\nbuilding up across layers. A common reusable buffer temporarily stores the exact activations for\neach layer during the forward pass, while retaining an approximate copy for the backward pass. Our\nmethod ensures the forward pass is exact and limits errors in gradients \ufb02owing back to earlier layers.\n\nto signi\ufb01cantly reduce memory usage on each GPU during training. 
Our method has virtually no\nadditional computational cost and, since it introduces only a modest approximation error in computed\nparameter gradients, has minimal effect on training performance.\nA major obstacle to using approximate activations for training is that, with the standard imple-\nmentation of back-propagation (backprop), approximation errors compound across the forward and\nthen backward pass through all layers. This limits the degree of approximation at each layer (e.g.,\nsome libraries support 16- instead of 32-bit \ufb02oats for training). In this work, we propose a new\nbackprop algorithm based on the following key observations: the forward pass through a network\ncan be computed exactly without storing intermediate activations, and approximating activations has\nminimal effect on gradients \u201c\ufb02owing back\u201d to the input of a layer in the backward pass.\nFor the forward pass during training, our method uses a layer\u2019s exact activations to compute those\nfor a subsequent layer. However, once they have been used, it overwrites the exact activations to\nreuse memory, and retains only an approximate copy with a lower memory footprint. During the\nbackward pass, gradients are computed based on these stored approximate copies. This incurs only\na modest error at each layer when computing parameter gradients due to multiplying the incoming\ngradient with the approximate activations, but ensures the error in gradients propagating back to\nthe previous layer is minimal. Experiments show that our method can permit the use of a large\ndegree of approximation without any signi\ufb01cant degradation in training quality. 
This substantially\nlowers memory usage (by up to a factor of 8) during training without requiring additional expensive\ncomputation, thus making network training ef\ufb01cient and practical for larger and deeper architectures.\n\n2 Related Work\n\nA number of works focus on reducing the memory footprint of a model during inference, e.g.,\nby compression [7] and quantization [11], to ensure that it can be deployed on resource-limited\nmobile devices, while still delivering reasonable accuracy. However, these methods do not address\nthe signi\ufb01cant amount of memory required to train the networks themselves (note [7, 11] require\nstoring full-precision activations during training), which is signi\ufb01cantly larger than what is needed for\ninference. The most common recourse to alleviate memory bottlenecks during training is to simply use\nmultiple GPUs. But this often under-utilizes the available parallelism on each device\u2014computation\nin deeper architectures is distributed more sequentially and, without suf\ufb01cient data parallelism, often\ndoes not saturate all GPU cores\u2014and also adds the overhead of intra-device communication.\nA popular strategy to reduce training memory requirements is \u201ccheckpointing\u201d. Activations for only\na subset of layers are stored at a time, and the rest recovered by repeating forward computations [2,\n5, 14]. This affords memory savings with the trade-off of additional computational cost\u2014e.g., [2]\n\n2\n\n\fpropose a strategy that requires memory proportional to the square-root of the number of layers.\nHowever, it requires additional computation, with cost proportional to that of an additional forward\npass. 
In a similar vein, [4] considered network architectures with \u201creversible\u201d or invertible layers to\nallow re-computing input activations of such layers from their outputs during the backward pass.\nThese methods likely represent the best possible solutions if the goal is restricted to computing exact\ngradients. But SGD is fundamentally a noisy process, and the exact gradients computed over a batch\nat each iteration are already an approximation\u2014of gradients of the model over the entire training\nset [17]. Researchers have posited that further approximations are possible without degrading training\nability. For distributed training, asynchronous methods [3, 16] delay synchronizing models across\ndevices to mitigate communication latency. Despite each device now working with stale models, there\nis no major degradation in training performance. Other methods quantize gradients to two [19] or\nthree levels [20] so as to reduce communication overhead, and again \ufb01nd that training remains robust\nto such approximation. Our work also adopts an approximation strategy to gradient computation,\nbut targets the problem of memory usage on each device. We approximate activations, rather than\ngradients, in order to achieve signi\ufb01cant reductions in a model\u2019s memory footprint during training.\nMoreover, since our method makes the underlying backprop engine more ef\ufb01cient for any group of\nlayers, it can also be used within checkpointing to further reduce memory cost.
In contrast, our approach allows for much greater approximation by limiting\naccumulation of errors across layers, and we are able to replace 32-bit \ufb02oats with 8- and even 4-bit\n\ufb01xed-point approximations, with little effect on training performance. Of course, performing all\ncomputation at lower-precision also has a computational advantage: due to a reduction in device\nmemory bandwidth usage (transferring data from global device memory to registers) in [15], and due\nto the use of specialized hardware in [6]. While the goal of our method is different, it can also be\ncombined with these ideas: compressing intermediate activations to a greater degree, while using\n16-bit precision for computational ef\ufb01ciency.\n\n3 Proposed Method\n\nWe now describe our approach to memory-ef\ufb01cient training. We begin by reviewing the compu-\ntational steps in the forward and backward pass for a typical network layer, and then describe our\napproximation strategy for reducing the memory requirements for storing intermediate activations.\n\n3.1 Background\n\nA neural network is a composition of linear and non-linear functions that map the input to the \ufb01nal\ndesired output. These functions are often organized into \u201clayers\u201d, where each layer consists of\na single linear transformation\u2014typically a convolution or a matrix multiply\u2014and a sequence of\nnon-linearities. We use the \u201cpre-activation\u201d de\ufb01nition of a layer, where we group the linear operation\nwith the non-linearities that immediately precede it. Consider a typical network whose lth layer\napplies batch-normalization and ReLU to its input Al:i followed by a linear transform:\n\n[B.Norm.] Al:1 = (\u03c3\u00b2l + \u03b5)\u22121/2 \u25e6 (Al:i \u2212 \u00b5l), \u00b5l = Mean(Al:i), \u03c3\u00b2l = Var(Al:i); (1)\n[Sc.&B.] Al:2 = \u03b3l \u25e6 Al:1 + \u03b2l; (2)\n[ReLU] Al:3 = max(0, Al:2); (3)\n[Linear] Al:o = Al:3 \u00d7 Wl; (4)\n\nto yield the output activations Al:o that are fed into subsequent layers. Here, each activation is a tensor\nwith two or four dimensions: the \ufb01rst indexing different training examples, the last corresponding\nto \u201cchannels\u201d, and others to spatial location. Mean(\u00b7) and Var(\u00b7) aggregate statistics over batch and\nspatial dimensions, to yield vectors \u00b5l and \u03c3\u00b2l with per-channel means and variances. Element-wise\naddition and multiplication (denoted by \u25e6) are carried out by \u201cbroadcasting\u201d when the tensors are not\nof the same size. The \ufb01nal operation represents the linear transformation, with \u00d7 denoting matrix\nmultiplication. This linear transform can also correspond to a convolution.\nNote that (1)-(4) are de\ufb01ned with respect to learnable parameters \u03b3l, \u03b2l, and Wl, where \u03b3l, \u03b2l are\nboth vectors of the same length as the number of channels in Al, and Wl denotes a matrix (for\n\n3\n\n\ffully-connected layers) or elements of a convolution kernel. These parameters are learned iteratively\nusing SGD, where at each iteration, they are updated based on gradients\u2014\u2207\u03b3l, \u2207\u03b2l, and \u2207Wl\u2014of\nsome loss function computed on a batch of training samples.\nTo compute gradients with respect to all parameters for all layers in the network, the training algorithm\n\ufb01rst computes activations for all layers in sequence, ordered such that each layer in the sequence\ntakes as input the output from a previous layer. The loss is computed with respect to activations of the\n\ufb01nal layer, and then the training algorithm goes through all layers again in reverse sequence, using\nthe chain rule to back-propagate gradients of this loss. 
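To make (1)-(4) concrete, the per-layer forward computation can be sketched in NumPy for the fully-connected case (a sketch only; the function and variable names are ours, not the paper's implementation):

```python
import numpy as np

def layer_forward(A_in, gamma, beta, W, eps=1e-5):
    """One pre-activation layer, following (1)-(4): batch-norm,
    scale-and-bias, ReLU, then a linear transform.
    A_in: (batch, channels); gamma, beta: (channels,); W: (channels, out)."""
    mu = A_in.mean(axis=0)                 # (1): per-channel mean
    var = A_in.var(axis=0)                 # (1): per-channel variance
    A1 = (A_in - mu) / np.sqrt(var + eps)  # (1): normalize
    A2 = gamma * A1 + beta                 # (2): scale and bias
    A3 = np.maximum(0.0, A2)               # (3): ReLU
    A_out = A3 @ W                         # (4): linear transform
    # A1, A2, A3 and var are the tensors the backward pass will need.
    return A_out, (A1, A2, A3, var)
```

The tuple returned alongside Al:o is exactly the set of tensors whose storage dominates training memory.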
For the lth layer, given the gradients \u2207Al:o\nof the loss with respect to the output, this involves computing gradients \u2207\u03b3l, \u2207\u03b2l, and \u2207Wl with\nrespect to the layer\u2019s learnable parameters, as well as gradients \u2207Al:i with respect to its input for\nfurther propagation. These gradients are given by:\n\n[Linear] \u2207W = A\u1d40l:3 \u00d7 (\u2207Al:o), \u2207Al:3 = (\u2207Al:o) \u00d7 W\u1d40l; (5)\n[ReLU] \u2207Al:2 = \u03b4(Al:2 > 0) \u25e6 (\u2207Al:3); (6)\n[Sc.&B.] \u2207\u03b2l = Sum(\u2207Al:2), \u2207\u03b3l = Sum(Al:1 \u25e6 (\u2207Al:2)), \u2207Al:1 = \u03b3l \u25e6 \u2207Al:2; (7)\n[B.Norm.] \u2207Al:i = (\u03c3\u00b2l + \u03b5)\u22121/2 \u25e6 [\u2207Al:1 \u2212 Mean(\u2207Al:1) \u2212 Al:1 \u25e6 Mean(Al:1 \u25e6 \u2207Al:1)]; (8)\n\nwhere Sum(\u00b7) and Mean(\u00b7) again aggregate over all but the last dimension, and \u03b4(A > 0) is a tensor\nthe same size as A that is 1 where the values in A are positive, and 0 otherwise.\nWhen the goal is to just compute the \ufb01nal output of the network, the activations of an intermediate\nlayer can be discarded during the forward pass as soon as we \ufb01nish processing the subsequent layer\nor layers that use it as input. However, we need to store all intermediate activations during training\nbecause they are needed to compute gradients during back-propagation: (5)-(8) involve not just the\nvalues of the incoming gradient, but also the values of the activations themselves. Thus, training\nrequires enough available memory to hold the activations of all layers in the network.\n\n3.2 Back-propagation with Approximate Activations\n\nWe begin by observing that we do not necessarily need to store all intermediate activations Al:1, Al:2,\nand Al:3 within a layer. For example, it is suf\ufb01cient to store the activation values Al:2 right before\nthe ReLU, along with the variance vector \u03c3\u00b2l (which is typically much smaller than the activations\nthemselves). 
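The backward computations (5)-(8) admit a matching NumPy sketch (again the fully-connected case, with our own naming), which makes explicit that only \u2207W, \u2207\u03b3, and one term of the batch-norm gradient touch the stored activations:

```python
import numpy as np

def layer_backward(dA_out, A1, A2, A3, var, gamma, W, eps=1e-5):
    """Backward pass for one pre-activation layer, following (5)-(8).
    In the proposed method, A1/A2/A3 would be the approximate
    (dequantized) copies rather than the exact activations."""
    dW = A3.T @ dA_out                 # (5): weight gradient (uses A3)
    dA3 = dA_out @ W.T                 # (5): exact, activation-free
    dA2 = (A2 > 0) * dA3               # (6): ReLU, needs only sign(A2)
    dbeta = dA2.sum(axis=0)            # (7): exact, activation-free
    dgamma = (A1 * dA2).sum(axis=0)    # (7): uses A1
    dA1 = gamma * dA2                  # (7): exact, activation-free
    # (8): only the final term depends on the stored activations.
    dA_in = (dA1 - dA1.mean(axis=0)
             - A1 * (A1 * dA1).mean(axis=0)) / np.sqrt(var + eps)
    return dA_in, dW, dgamma, dbeta
```
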
Given Al:2, we can reconstruct the other activations Al:1 and Al:3 needed in (5)-(8)\nusing element-wise operations at negligible cost. Some libraries already use such \u201cfused\u201d layers to\nconserve memory, and we use this to measure memory usage for exact training.\nStoring one activation tensor at full-precision for every layer still requires a considerable amount\nof memory. We therefore propose retaining an approximate low-precision version \u02dcAl:2 of Al:2,\nthat requires much less memory for storage, for use in (5)-(8) during back-propagation. As shown\nin Fig. 2, we use full-precision versions of all activations during the forward pass to compute\nAl:o from Al:i as per (1)-(4), and use Al:2 to compute its approximation \u02dcAl:2. The full-precision\nactivations are discarded (i.e., over-written) as soon as they have been used\u2014the intermediate\nactivations Al:1, Al:2, Al:3 are discarded as soon as the approximation \u02dcAl:2 and output Al:o have\nbeen computed (Al:o is itself discarded after it has been used by a subsequent layer). Thus, only the\napproximate activations \u02dcAl:2 and variance vector \u03c3\u00b2l for each layer are retained for back-propagation,\nwhere \u02dcAl:2 is also used to compute corresponding approximate versions \u02dcAl:1 and \u02dcAl:3 in (5)-(8).\nOur method allows for the use of any generic approximation strategy to derive \u02dcAl:2 from Al:2 that\nleads to memory-savings, with the only requirement being that the approximation preserve the sign of\nthese activations (as will be discussed below). In our experiments, we use quantization to K-bit \ufb01xed-\npoint representations as a simple and computationally-inexpensive approximation strategy to validate\nour method. 
However, we believe that future work on more sophisticated and data-driven approaches\nto per-layer approximation can yield even more favorable memory-performance trade-offs.\nSpeci\ufb01cally, given a desired bit-size K, and using the fact that Al:1 is batch-normalized and thus Al:2\nhas mean \u03b2l and variance \u03b3\u00b2l, we compute an integer tensor \u02dcA\u2217l:2 from Al:2 as:\n\n\u02dcA\u2217l:2 = ClipK(\u230aAl:2 \u25e6 2K(6 \u2217 \u03b3l)\u22121\u230b + 2K\u22121 \u2212 \u230a\u03b2l \u25e6 2K(6 \u2217 \u03b3l)\u22121\u230b), (9)\n\nwhere \u230a\u00b7\u230b indicates the \u201c\ufb02oor\u201d operator, and ClipK(x) = max(0, min(2K \u2212 1, x)). The resulting\nintegers (between 0 and 2K \u2212 1) can be directly stored with K bits.\n\n4\n\n\fFigure 2: Details of computations involved in the forward and backward pass during network training\nwith our method, for a single \u201cpre-activation\u201d layer with residual connections. We use two shared\nglobal buffers (for the straight and residual outputs) to store full-precision activations, ensuring the\nforward pass is exact. For each layer, we store approximate copies of the activations to save memory,\nand use these during back-propagation. Since our approximation preserves signs (needed to compute\nthe ReLU gradient), most computations for the gradient back to a layer\u2019s input are exact\u2014with only\nthe backprop through the variance-computation in batch-normalization being approximate.\n\nWhen needed during back-propagation, we recover a \ufb02oating-point tensor holding the approximate activations \u02dcAl:2 as:\n\n\u02dcAl:2 = 2\u2212K(6 \u2217 \u03b3l) \u25e6 (\u02dcA\u2217l:2 + 0.5 \u2212 2K\u22121 + \u230a\u03b2l \u25e6 2K(6 \u2217 \u03b3l)\u22121\u230b). (10)\n\nThis simply has the effect of clipping Al:2 to the range \u03b2l \u00b1 3\u03b3l (the range may be slightly asymmetric\naround \u03b2l because of rounding), and quantizing values into 2K \ufb01xed-size intervals (to the median\nof each interval) within that range. It is easy to see that for values that are not clipped, the upper\nbound on the absolute error between the true and approximate activations is (3/2K)\u03b3l for Al:2 and\nAl:3, and 3/2K for Al:1. Moreover, this approximation ensures that the sign of Al:2 is preserved, i.e.,\n\u03b4(Al:2 > 0) = \u03b4(\u02dcAl:2 > 0).\n\n3.3 Approximation Error in Training\n\nSince the forward computations happen in full-precision, there is no error introduced in any of the\nactivations Al prior to approximation. To analyze errors in the backward pass, we examine how\napproximation errors in the activations affect the accuracy of gradient computations in (5)-(8). During\nthe \ufb01rst back-propagation step in (5) through the linear transform, the gradient \u2207W to the learnable\ntransform weights will be affected by the approximation error in \u02dcAl:3. However, the gradient \u2207Al:2\ncan be computed exactly (as a function of the incoming gradient to the layer \u2207Al:o), because it\ndoes not depend on the activations. 
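A minimal NumPy sketch of the quantizer in (9) and the reconstruction in (10) (our own code, not the paper's library; uint8 storage shown for K up to 8, whereas a real implementation would pack two 4-bit codes per byte):

```python
import numpy as np

def quantize(A2, gamma, beta, K):
    """Eq. (9): K-bit fixed-point code for activations A2, which have
    per-channel mean beta and standard deviation gamma after batch-norm.
    Values are clipped to roughly beta +/- 3*gamma, binned into 2**K levels."""
    scale = 2.0 ** K / (6.0 * gamma)
    offset = np.floor(beta * scale)
    code = np.floor(A2 * scale) + 2 ** (K - 1) - offset
    return np.clip(code, 0, 2 ** K - 1).astype(np.uint8)

def dequantize(code, gamma, beta, K):
    """Eq. (10): recover approximate activations (midpoint of each bin)."""
    scale = 2.0 ** K / (6.0 * gamma)
    offset = np.floor(beta * scale)
    return (code + 0.5 - 2 ** (K - 1) + offset) / scale
```

For unclipped values the round-trip error stays within the (3/2^K)*gamma bound stated above, and signs are preserved.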
Back-propagation through the ReLU in (6) is also not affected,\nbecause it depends only on the sign of the activations, which is preserved by our approximation.\nWhen back-propagating through the scale and bias in (7), only the gradient \u2207\u03b3 to the scale depends\non the activations, but gradients to the bias \u03b2l and to Al:1 can be computed exactly.\nAnd so, while our approximation introduces some error in the computations of \u2207W and \u2207\u03b3, the\ngradient \ufb02owing towards the input of the layer is exact, until it reaches the batch-normalization\n\n5\n\n\foperation in (8). Here, we do incur an error, but note that this is only in one of the three terms of\nthe expression for \u2207Al:i\u2014which accounts for back-propagating through the variance computation,\nand is the only term that depends on the activations. Hence, while our activation approximation does\nintroduce some errors in the gradients for the learnable weights, we limit the accumulation of these\nerrors across layers, because a majority of the computations for back-propagation to the input of each\nlayer are exact. This is illustrated in Fig. 2, with the use of green arrows to show computations that\nare exact, and red arrows for those affected by the approximation.
Other activations (like sigmoid)\nmay incur additional errors. In particular, we do not approximate the activations of the \ufb01nal output\nlayer in classi\ufb01er networks that go through a Soft-Max. However, since this is typically the \ufb01nal\nlayer of most convolutional networks, and computing these activations is immediately followed by\nback-propagating through that layer, approximating these activations offers no savings in memory.\nNote that our method has limited utility for architectures where a majority of layers have saturating\nnon-linearities (as is the case for most recurrent networks).\nOur approach also handles average pooling by simply folding it in with the linear transform. For\nmax-pooling, exact back-propagation through the pooling operation would require storing the arg-max\nindices (the number of bits required to store these would depend on the max-pool receptive \ufb01eld\nsize). However, since max-pool layers are used less often in recent architectures in favor of learned\ndownsampling (ResNet architectures for image classi\ufb01cation use max-pooling only in one layer), we\ninstead choose not to approximate layers with max-pooling for simplicity.\nGiven a network with L layers, our memory usage depends on the connectivity of these layers. Our\napproach requires storing the approximate activations for each layer, each occupying memory at a\nfractional rate of \u03b1 < 1 relative to full precision. During the forward pass, we also need to store, at full-precision,\nthose activations that are yet to be used by subsequent layers. This is one layer\u2019s activations for\nfeed-forward networks, and two layers\u2019 for standard residual architectures. 
More generally, we will\nneed to store activations for up to W layers, where W is the \u201cwidth\u201d of the architecture\u2014which we\nde\ufb01ne as the maximum number of outstanding layer activations that remain to be used as we process\nlayers in sequence. This width is one for simple feed-forward architectures and two for standard\nresidual networks, but may be higher for DenseNet architectures [10]. During back-propagation, the\nsame amount of space is required for storing gradients till they are used by previous layers. We also\nneed space to re-create a layer\u2019s approximate activations as full-precision tensors from the low-bit\nstored representation, for use in computation.\nThus, assuming that all activations of layers are the same size, our algorithm requires O(W + 1 + \u03b1L)\nmemory, compared to the standard requirement of O(L). For our simple quantized \ufb01xed-point\napproximation strategy, this leads to substantial savings for deep networks with large L, since\n\u03b1 = 1/4, 1/8 when approximating 32-bit \ufb02oating point activations with K = 8, 4 bits.\n\n4 Experiments\n\nIn this section, we present experimental results which demonstrate that:\n\u2022 Our algorithm enables up to 8x memory savings, with negligible drop in training accuracy\ncompared to exact training, and signi\ufb01cantly superior performance over other baselines.\n\u2022 The lower memory footprint allows training to fully exploit available parallelism on each GPU,\nleading to faster training for deeper architectures.\n\nWe implement the proposed approximate memory-ef\ufb01cient training method as a general library\nthat accepts speci\ufb01cations for feed-forward architectures with possible residual connections (i.e.,\nW = 2). As illustrated in Fig. 2, it allocates a pair of common global buffers for the direct and\nresidual paths. 
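As a quick illustration of the O(W + 1 + \u03b1L) accounting from Sec. 3.4 (our own helper, assuming all layer activations are the same size):

```python
def activation_memory_ratio(L, width=2, alpha=0.125):
    """Approximate ratio of activation memory relative to standard backprop,
    for an L-layer network of 'width' W, with approximate copies stored at a
    fractional rate alpha (e.g., alpha = 1/8 for 4-bit vs. 32-bit floats)."""
    return (width + 1 + alpha * L) / L

# For a deep residual network (W = 2), savings approach the 8x factor:
# activation_memory_ratio(152, width=2, alpha=1/8) ~= 0.145
```
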
At any point during the forward pass, these buffers hold the full-precision activations\n\n6\n\n\fFigure 3: Approximate Training on CIFAR and ImageNet. We show the evolution of training losses\nfor ResNet-164 models trained on CIFAR-10 and CIFAR-100, and ResNet-152 models trained on\nImageNet with exact training and using our approximation strategy. CIFAR results are summarized\nacross ten random initializations with bands depicting minimum and maximum loss values. We \ufb01nd\nthat the loss using our method closely follows that of exact training through all iterations. For CIFAR,\nwe also include results for training when using a \u201cnaive\u201d 8-bit approximation baseline\u2014where the\napproximate activations are also used in the forward pass. In this case, errors accumulate across\nlayers and we \ufb01nd that training fails.\n\nthat are needed for computation of subsequent layers. The same buffers are used to store gradients\nduring the backward pass. Beyond these common buffers, the library only stores the low-precision\napproximate activations for each layer for use during the backward pass.\nWe compare our approximate training approach, with 8- and 4-bit activations, to exact training\nwith full-precision activations. For fair comparison, we only store one set of activations (like our\nmethod, but with full precision) for a group of batch-normalization, ReLU, and linear (convolution)\noperations. As baselines, we consider exact training with fewer feature channels per layer, as well as\na naive approximation method with standard backprop. This second baseline replaces activations\nwith low-precision versions directly during the forward pass. We do this conservatively: all\ncomputations are carried out in full precision, and for each layer, the activations are only approximated\nright before the ReLU like in our method, but this approximated version is then used as input to the subsequent\nconvolution operation. 
Note that all methods use exact computation at test time.\nCIFAR-10 and CIFAR-100. We begin with comparisons on 164-layer pre-activation residual\nnetworks [9] on CIFAR-10 and CIFAR-100 [13], using three-layer \u201cbottleneck\u201d residual units and\nparameter-free shortcuts for all residual connections. We train the network for 64k iterations with a\nbatch size of 128, momentum of 0.9, and weight decay of 2 \u00d7 10\u22124. Following [9], the learning rate\nis set to 10\u22122 for the \ufb01rst 400 iterations, then increased to 10\u22121, and dropped by a factor of 10 at\n32k and 48k iterations. We use standard data-augmentation with random translation and horizontal\n\ufb02ips. We train these networks with our approach using K = 8 and K = 4 bit approximations, and\nmeasure degradation in accuracy with respect to exact training\u2014repeating training for all cases with\nten random seeds. We also include comparisons to exact training with 1/4th the number of feature\nchannels in each layer, and to the naive approximation baseline (with K = 8 bits, i.e., \u03b1 = 1/4).\nWe visualize the evolution of training losses in Fig. 3, and report test set errors of the \ufb01nal model in\nTable 1. We \ufb01nd that the training losses when using our low-memory approximation strategy closely\nfollow those of exact back-propagation throughout the training process. Moreover, the \ufb01nal mean\ntest errors of models trained with even 4-bit approximations (i.e., corresponding to \u03b1 = 1/8) are only\nslightly higher than those trained with exact computations, with the difference being lower than the\nstandard deviation across different initializations. In contrast, training with fewer features per layer\nleads to much lower accuracy, while the naive approximation baseline simply fails, highlighting the\nimportance of preventing accumulation of errors across layers using our approach.\nImageNet. 
For ImageNet [18], we train models with a 152-layer residual architecture, again using\nthree-layer bottleneck units and pre-activation parameter-free shortcuts. We use a batch size of 256\nfor a total of 640k iterations with a momentum of 0.9, weight decay of 10\u22124, and standard scale,\ncolor, \ufb02ip, and translation augmentation. The initial learning rate is set to 10\u22121, with drops by a factor\nof 10 every 160k iterations. Figure 3 shows the evolution of training loss in this case as well, and\nTable 1 reports top-5 validation error (using 10 crops at a scale of 256) for models trained using\nexact computation, and our approach with K = 8 and K = 4 bit approximations.\n\n7\n\n\fTable 1: Accuracy Comparisons on CIFAR-10, CIFAR-100, and ImageNet with 164- (for CIFAR-10\nand CIFAR-100) and 152- (for ImageNet) layer ResNet architectures. CIFAR results show mean \u00b1\nstd over training with ten random initializations for each case. ImageNet results use 10-crop testing.\n\nMethod | CIFAR-10 Test Set Error | CIFAR-100 Test Set Error | ImageNet Val Set Top-5 Error\nExact (\u03b1 = 1) | 5.36% \u00b1 0.15 | 23.44% \u00b1 0.26 | 7.20%\nExact w/ fewer features (\u03b1 = 1/4) | 9.49% \u00b1 0.12 | 33.47% \u00b1 0.50 | -\nNaive 8-bit Approx. (\u03b1 = 1/4) | 75.49% \u00b1 9.09 | 95.41% \u00b1 2.16 | -\nProposed Method, 8-bit (\u03b1 = 1/4) | 5.48% \u00b1 0.13 | 23.63% \u00b1 0.32 | 7.70%\nProposed Method, 4-bit (\u03b1 = 1/8) | 5.49% \u00b1 0.16 | 23.58% \u00b1 0.30 | 7.72%\n\nTable 2: Comparison of maximum batch-size and wall-clock time per training example (i.e., training\ntime per-iteration divided by batch size) for different ResNet architectures on CIFAR-10.\n\n# Layers | 1001 (4x) | 1001 | 488 | 254 | 164\nMaximum Batch-size (Exact) | 26 | 134 | 264 | 474 | 688\nMaximum Batch-size (4-bit) | 182 | 876 | 1468 | 2154 | 2522\nRun-time per Sample (Exact) | 130.8 ms | 31.3 ms | 13.3 ms | 6.5 ms | 4.1 ms\nRun-time per Sample (4-bit) | 101.6 ms | 26.0 ms | 12.7 ms | 6.7 ms | 4.3 ms\n\n
As with the CIFAR experiments, training losses using our strategy closely follow those of exact training (interestingly, the loss using our method is slightly lower than that of exact training during the final iterations, although this is likely due to random initialization), and the drop in validation set accuracy is again relatively small: 0.5% for a memory savings factor of α = 1/8 with K = 4 bit approximations.

Memory and Computational Efficiency. For the CIFAR experiments, we were able to fit the full 128-size batch on a single 1080Ti GPU for both exact training and our method. For ImageNet training, we parallelized computation across two GPUs, and while our method was able to fit half a batch (size 128) on each GPU, exact training required two forward-backward passes (followed by averaging gradients) with batches of size 64 per GPU per pass. In both cases, the per-iteration run-times were nearly identical. However, these comparisons were restricted to the same total batch size (needed to evaluate relative accuracy). For a more precise evaluation of memory usage, and the resulting computational efficiency from parallelism, we considered residual networks for CIFAR-10 of various depths up to 1001 layers, and additionally, for the deepest network, a version with four times as many feature channels in each layer. For each network, we measured the largest batch size that could be fit in memory with our method (with K = 4) vs. exact training, i.e., b such that a batch of b + 1 caused an out-of-memory error on a 1080Ti GPU. We also measured the corresponding wall-clock training time per sample, computed as the training time per iteration divided by this batch size. These results are summarized in Table 2. We find that in all cases, our method allows significantly larger batches to be fit in memory.
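The maximum batch sizes in Table 2 were found as the largest b that fits in memory, with b + 1 triggering an out-of-memory error. This kind of probe can be sketched as exponential growth followed by binary search; the `fits` callable here is a hypothetical stand-in for attempting one forward-backward pass at a given batch size and catching the OOM error.

```python
def max_batch_size(fits, cap=1 << 16):
    """Return the largest batch size b <= cap for which fits(b) is True.

    `fits` is assumed monotone: if a batch of size b fits, so does b - 1.
    """
    if not fits(1):
        return 0
    lo, hi = 1, 2
    # Grow exponentially until a batch no longer fits (or we hit the cap).
    while hi <= cap and fits(hi):
        lo, hi = hi, hi * 2
    if hi > cap:
        return lo
    # Invariant: fits(lo) is True, fits(hi) is False; binary search between.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with a memory model matching the 164-layer row of Table 2 (a batch fits iff b <= 688), the search returns 688 using only a logarithmic number of probes.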
Moreover, for larger networks, our method yields a notable computational advantage, since larger batches permit full exploitation of available GPU cores.

Visualizing Accuracy of Parameter Gradients. To examine the reason behind the robustness of our method, Fig. 4 visualizes the error in the final parameter gradients used to update the model. Specifically, we take two models for CIFAR-100, at the start and end of training, and then compute gradients for 100 batches with respect to the convolution kernels of all layers exactly, and using our approximate strategy. We plot the average squared error between these gradients. We compare this approximation error to the "noise" inherent in SGD, due to the fact that each iteration considers a random batch of training examples. This is measured by the average variance between the (exact) gradients computed in the different batches. We see that our approximation error is between one and two orders of magnitude below the SGD noise for all layers, both at the start and end of training. So while we do incur an error due to approximation, it is added to the much higher error that already exists due to SGD even in exact training, and hence further degradation is limited.

Figure 4: We visualize errors in the computed gradients of learnable parameters (convolution kernels) for different layers for two snapshots of a CIFAR-100 model at the start and end of training. We plot errors between the true gradients and those computed by our approximation, averaged over 100 batches. We compare to the errors from SGD itself: the variance between the (exact) gradients computed from different batches.
We \ufb01nd this SGD noise to be 1-2 orders of magnitude higher,\nexplaining why our approximation does not signi\ufb01cantly impact training performance.\n\n5 Conclusion\n\nWe introduced a new method for implementing back-propagation that is able to effectively use\napproximations to signi\ufb01cantly reduces the amount of required on-device memory required for neural\nnetwork training. Our experiments show that this comes at a minimal cost in terms of quality of\nthe learned models. With a lower memory footprint, our method allows training with larger batches\nin each iteration and better utilization of available GPU resources, making it more practical and\neconomical to deploy and train very deep architectures in production environments. Our reference\nimplementation is available at http://projects.ayanc.org/blpa/.\nOur method shows that SGD is reasonably robust to working with approximate activations. While\nwe used an extremely simple approximation strategy\u2014uniform quantization\u2014in this work, we are\ninterested in exploring whether more sophisticated techniques\u2014e.g., based on random projections or\nvector quantization\u2014can provide better trade-offs, especially if informed by statistics of gradients\nand errors from prior iterations.\nIt is also worth investigating whether our approach to partial\napproximation can be utilized in other settings, for example, to reduce inter-device communication\nfor distributed training with data or model parallelism.\n\nAcknowledgments\n\nA. Chakrabarti acknowledges support from NSF grant IIS-1820693. B. Moseley was supported in\npart by a Google Research Award and NSF grants CCF-1830711, CCF-1824303, and CCF-1733873.\n\nReferences\n[1] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training\n\nof neural networks. In Advances in Neural Information Processing Systems, 2018.\n\n[2] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 
Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012.

[4] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pages 2211–2221, 2017.

[5] Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pages 4125–4133, 2016.

[6] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), pages 1737–1746, 2015.

[7] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Proc. European Conference on Computer Vision (ECCV), 2016.

[10] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[11] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio.
Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research (JMLR), 2017.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (ICML), pages 448–456, 2015.

[13] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

[14] James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade - Second Edition, pages 479–535. 2012.

[15] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.

[16] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.

[17] Herbert Robbins and Sutton Monro. A stochastic approximation method. In Herbert Robbins Selected Papers, pages 102–109. Springer, 1985.

[18] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.

[19] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proc. Interspeech, 2014.

[20] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning.
In Advances in Neural Information Processing Systems, 2017.