1 Background and Notation
In reinforcement learning, the sequential decision-making problem is modeled using the Markov Decision Process formulation defined by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma \rangle$. In this formulation, an agent, which is both a learner and an actor, and an environment interact over a sequence of discrete time steps $t = 0, 1, 2, \dots$. At every time step the agent observes a state $S_t \in \mathcal{S}$, which encodes information about the environment. Based on that information, the agent chooses and executes an action $A_t \in \mathcal{A}$. As a consequence of the executed action, the environment sends back to the agent a new state $S_{t+1} \in \mathcal{S}$ and a reward $R_{t+1} \in \mathcal{R}$. The reward and new state are modeled jointly by the transition dynamics probability function $p(s', r \mid s, a)$, which defines the probability of observing $(S_{t+1}, R_{t+1}) = (s', r)$ given $(S_t, A_t) = (s, a)$.

Actions are selected according to a policy $\pi(a \mid s)$, a probability distribution over the actions given the current state. The goal of the agent is to maximize the expected sum of discounted rewards $\mathbb{E}_\pi[G_t] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\right]$, where the expectation is with respect to $\pi$ and $p$, and $\gamma$ is a discount factor in the interval $[0, 1]$. In order to make informed decisions, agents often estimate the action-value function $q_\pi$, which maps states and actions to a value in $\mathbb{R}$ and is defined as

$$q_\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right]. \tag{1}$$

Then, through the process of policy iteration Sutton2018Book, the agent can learn the optimal action-value function $q_*(s, a)$ for all state-action pairs $(s, a) \in \mathcal{S} \times \mathcal{A}$.
The optimal action-value function obeys an important relationship called the Bellman equation, which relates the action-value for a state-action pair at time $t$ to the action-values at the next time step. Using this identity, we can derive stochastic approximation algorithms for estimating $q_*$. One of the most popular of these algorithms is the Q-Learning algorithm Watkins1989QLearn, which iteratively computes estimates of the action-value function, $Q_t$, using the update rule:

$$Q_{t+1}(S_t, A_t) = Q_t(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q_t(S_{t+1}, a) - Q_t(S_t, A_t) \right], \tag{2}$$

where $\alpha$ is the step size, $Q_{t+1}(s, a) = Q_t(s, a)$ for all $(s, a) \neq (S_t, A_t)$, and $Q_0$ is initialized arbitrarily.
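The update rule in Equation (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the dictionary-based table, the function name `q_learning_update`, the `terminal` flag, and the default step size and discount factor are all assumptions made for the example.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the greedy value of the next state (zero if terminal).
    next_value = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    td_error = r + gamma * next_value - Q[(s, a)]
    # Only the entry of the visited state-action pair changes.
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical usage: a table initialized to zero and a single transition.
Q = defaultdict(float)
actions = [0, 1]
delta = q_learning_update(Q, s="s0", a=1, r=-1.0, s_next="s1", actions=actions)
```

Initializing the table to zero is just one of the arbitrary initializations the update rule allows.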
1.1 Function Approximation
The update rule in Equation (2) can be easily implemented with a lookup table representation. However, we are interested in the case where the action-value is approximated as a function of a vector of parameters $\mathbf{w} \in \mathbb{R}^d$ and a parameterized representation $\phi_\theta(s, a) \in \mathbb{R}^d$. In such a case, we approximate $q_*$ as:

$$\hat{q}(s, a; \theta, \mathbf{w}) = \mathbf{w}^\top \phi_\theta(s, a). \tag{3}$$

Specifically, we want $\phi_\theta$ to be a sparse representation with very few active (nonzero) features for any given state-action pair.
In their paper, Liu2018SRUtility (Liu2018SRUtility) used the last layer of a fully-connected neural network as the representation $\phi_\theta$ and learned the parameters $\theta$ using stochastic gradient descent to minimize the Mean-Squared Temporal Difference Error of a fixed policy. After learning the representation, they learned the weights $\mathbf{w}$ using the semi-gradient version of the Sarsa(0) algorithm, an on-policy alternative to Q-Learning Rummery1995Sarsa, Singh2000SarsaZ, Sutton1996Generalization, Sutton1998Book.

In this work, we are interested in learning both the representation and the weight vector $\mathbf{w}$ simultaneously. Hence, we will model both sets of parameters, $\theta$ and $\mathbf{w}$, as part of a single feed-forward neural network and learn them using the well-known DQN architecture Mnih2015HumanLevel, Mnih2013PlayingAtari. DQN seeks to minimize the loss function:

$$L(\theta) = \mathbb{E}\left[ \left( R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a; \theta^-) - \hat{q}(S_t, A_t; \theta) \right)^2 \right], \tag{4}$$

where $\hat{q}(\cdot, \cdot; \theta)$ is a neural network parameterized by $\theta$ (the policy network; here $\theta$ collects all the parameters of the network, including $\mathbf{w}$) and $\theta^-$ is a separate set of parameters (the target network) that is updated every certain number of training steps by setting it equal to $\theta$. It is important to emphasize that $\theta$ is used when selecting actions and is updated at every training step, whereas $\theta^-$ is exclusively used to compute the loss function. To minimize the loss function we compute stochastic gradient descent updates on a minibatch of transitions sampled from the experience replay buffer, which stores transitions of the form $(S_t, A_t, R_{t+1}, S_{t+1})$.
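The role of the two sets of parameters in the loss can be illustrated with a small sketch. It assumes the action-value functions are available as plain Python callables (`q_policy` and `q_target`, both hypothetical names) and that each transition carries a `done` flag for terminal states, which the text does not specify; it computes only the mean squared TD error on a minibatch, not the gradient step.

```python
def dqn_loss(batch, q_policy, q_target, actions, gamma=0.99):
    """Mean squared TD error over a minibatch of transitions.

    Each transition is a tuple (s, a, r, s_next, done).
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        # The bootstrap target is computed with the target network only.
        bootstrap = 0.0 if done else max(q_target(s_next, b) for b in actions)
        target = r + gamma * bootstrap
        # The policy network provides the estimate that is being corrected.
        total += (target - q_policy(s, a)) ** 2
    return total / len(batch)
```

In a full agent, the target callable would be refreshed from the policy network every fixed number of training steps, as described above.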
2 Regularization Techniques
In order to learn a sparse representation while training the DQN architecture, we will employ similar regularization techniques as in Liu2018SRUtility (Liu2018SRUtility) with a few modifications.
2.1 L1 and L2 Regularization
We employed L1 and L2 regularization in two different ways: on the weights of the network or on the activations of the hidden layers. In both cases, this involves modifying the loss function in Equation (4) to include a penalty that is a function of the size of the weights or the activations. For example, for a neural network with one matrix of parameters $\theta$ and no bias term, the L1 and L2 weight-regularized losses are:

$$L_{\text{L1W}}(\theta) = L(\theta) + \lambda \lVert \theta \rVert_1, \qquad L_{\text{L2W}}(\theta) = L(\theta) + \lambda \lVert \theta \rVert_2, \tag{5}$$

where $L(\theta)$ is defined as in Equation (4), $\lambda > 0$ is the regularization factor, and $\lVert \cdot \rVert_1$ and $\lVert \cdot \rVert_2$ correspond to the L1 and L2-norm, respectively. We will refer to these two types of regularization techniques as L1W and L2W.

In the case of the L1 and L2 regularization on the activations, consider a neural network with input $x$, weights $\theta$, no bias term, and activation function $f$. In such a case, the activations of the hidden layer are computed as $h = f(\theta x)$, where $f$ is applied componentwise. In this case, we define the L1 and L2-regularized losses as:

$$L_{\text{L1A}}(\theta) = L(\theta) + \lambda \lVert h \rVert_1, \qquad L_{\text{L2A}}(\theta) = L(\theta) + \lambda \lVert h \rVert_2, \tag{6}$$

where everything is the same as in the previous equations except for the norm, which is applied to the activations $h$. We will refer to these two regularization techniques as L1A and L2A.
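The difference between penalizing weights and penalizing activations can be made concrete with a short sketch for a single bias-free ReLU layer. The function names, the list-of-rows matrix format, and the use of the unsquared L2 norm are assumptions of this example; many implementations penalize the squared L2 norm instead.

```python
def l1_norm(values):
    """Sum of absolute values."""
    return sum(abs(v) for v in values)


def l2_norm(values):
    """Euclidean norm (not squared)."""
    return sum(v * v for v in values) ** 0.5


def hidden_activations(theta, x):
    """ReLU(theta x) for a weight matrix given as a list of rows, no bias."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in theta]


def regularized_loss(base_loss, theta, x, lam, norm, on="weights"):
    """Add lam * norm(.) of either the weights or the activations to base_loss."""
    if on == "weights":
        values = [w for row in theta for w in row]
    else:
        values = hidden_activations(theta, x)
    return base_loss + lam * norm(values)
```

With `on="weights"` the penalty depends only on the parameters, whereas with `on="activations"` it also depends on the input, so it must be evaluated on the same minibatch as the base loss.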
2.2 Distributional Regularizers
An alternative to norm-based regularizers is the family of distributional regularizers, which were introduced by Nguyen2011Sparse (Nguyen2011Sparse) and then further developed by Liu2018SRUtility (Liu2018SRUtility). In this section, we propose another way to use this regularization method with a different type of distribution.
The main idea of this type of regularization is to model the activations of the neurons of each layer after a target exponential family distribution with natural parameter
that specifies the level of sparsity of the layer (e.g., for layer , neuron , and an exponential family distribution ). To encourage this, a regularization penalty is added according to how far the empirical distribution of the activation of a neuron, , is from the target distribution. The regularization penalty is proportional to the KLdivergence between the two distributions, . However, since it is very difficult for the empirical distribution to exactly match the target distribution, Liu2018SRUtility (Liu2018SRUtility) relaxed this condition by comparing the distance of the empirical distribution to a set of target distributions, e.g., , and defined such a distance as the Set KLdivergence . They showed that, if is a convex set, then the SKLdivergence has the form:(7) 
We can then add this regularization term weighted by a positive regularization factor to the DQN loss in Equation (4) to induce a sparsity level between and . The loss function is well defined since can be estimated for each neuron from the minibatch sampled from the experience replay buffer and is differentiable with respect to the parameters of the network,
. If we model the activations of the neurons as an Exponential distribution and use the convex set
, then the set KLdivergence is:(8) 
We refer to this method as DRE. We also test another type of distributional regularizer where, instead of modeling each individual neuron as an Exponential distribution, we model each layer as a Gamma distribution with a natural parameter and a shape parameter equal to the size of the layer. This should encourage the entire layer to have an average activation within the target range, but not enforce a specific level of sparsity for each individual neuron. The SKL-divergence is the same as for DRE but multiplied by the size of the layer, and the empirical parameter can be estimated by averaging all the activations in the layer. We refer to this regularization method as DRG.

2.3 Dropout
We also study the effect of Dropout Hinton2012Dropout1 on the representation learned by DQN. In this type of regularization, units are randomly dropped from a layer with a certain probability. In practice, this means that each neuron is set to zero with a probability of $p$ for each minibatch of data during training. During evaluation, all the neurons are active and weighted by $1 - p$, which is equivalent to using the average activation of the corresponding neuron. We refer to these two ways of processing the data as training and evaluation. In our DQN architecture, the target network, $\theta^-$ in Equation (4), is always set to evaluation. On the other hand, the policy network, $\theta$ in Equation (4), is set to evaluation when choosing actions and is set to training when computing a training step.
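The training and evaluation modes described above can be sketched as follows. The function name and the list-based activations are assumptions of the example. Note that many modern libraries implement "inverted" dropout, which rescales the surviving activations during training instead of scaling at evaluation; the sketch follows the description in the text.

```python
import random


def dropout(activations, p, training, rng=random):
    """Apply dropout to a list of activations.

    Training: each unit is set to zero independently with probability p.
    Evaluation: every unit is kept and scaled by (1 - p).
    """
    if training:
        return [0.0 if rng.random() < p else h for h in activations]
    return [(1.0 - p) * h for h in activations]


# Hypothetical usage mirroring the DQN setup: evaluation mode when acting or
# computing targets, training mode when computing a training step.
rng = random.Random(0)
acting = dropout([2.0, 4.0], p=0.5, training=False)
learning = dropout([2.0, 4.0], p=0.5, training=True, rng=rng)
```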
3 Experiments
Our goal is to investigate whether it is possible to learn a sparse representation incrementally and whether there is a benefit from doing so. To accomplish this goal, we studied three main hypotheses that were formulated based on the work by Liu2018SRUtility (Liu2018SRUtility):

Hypothesis 1: L1W and L2W will learn a denser representation than DQN, whereas L1A, L2A, Dropout, DRE, and DRG will learn a sparser representation than DQN.

Hypothesis 2: Methods that learned a sparse representation will perform better than methods that learned a dense representation.

Hypothesis 3: The performance of the methods that learned a sparse representation will be more robust to the size of the experience replay buffer than the performance of methods that learned a dense representation.

To test these hypotheses, we used the benchmark domains mountain car and 4-dimensional catcher. We trained each agent for 200k steps in mountain car and 500k steps in catcher without resets. The measure of performance was the cumulative reward over the whole training period. For this reason, we modified the mountain car environment by giving a reward of 0, instead of -1, when the agent reaches the terminal state; this way, the cumulative reward is informative of the learning progress. We chose these environments so that our results can be directly compared to the results from Liu2018SRUtility (Liu2018SRUtility) and because they are light enough to allow for a large number of runs, which allows us to make statistical arguments about the performance of each algorithm.
We used the same architecture in all of our experiments, consisting of two hidden layers with 32 and 256 units, respectively, ReLU activations, and a linear output layer with no bias term. We initialized the weights of each layer of size $n_{\text{in}} \times n_{\text{out}}$ in the network according to a zero-mean Gaussian distribution with variance $2 / n_{\text{in}}$; the bias terms of the hidden layers are all initialized to zero He2015Init. To minimize the loss function, we used the Adam optimizer King2015Adam with , and . The minibatch size was set to 32 for all the experiments.

For each different method, we found the parameter combination that maximized the cumulative reward by performing a grid search over the learning rate and the method's parameters using 30 samples for each parameter combination. For DQN, we tested buffer size values in {100, 1k, 5k, 20k, 80k} and target network update frequencies in {10, 50, 100, 200, 400}. All the other methods used the same buffer size and target network update frequency as the best combination found for DQN. For more details about the values of each parameter used in the grid search, see Appendix A.

Finally, in the case of L1W, L2W, and dropout, regularization was applied to all the parameters, or all the activations in the case of dropout, of the representation. On the other hand, for L1A, L2A, DRE, and DRG, regularization was applied only to the activations of the last layer of the representation. This was done to emulate the experimental setup of Liu2018SRUtility (Liu2018SRUtility).
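The weight initialization mentioned in the architecture description can be sketched as below. The sketch assumes the standard form of the initialization from He2015Init, a zero-mean Gaussian with variance 2 divided by the number of inputs to the layer; the function name and the list-of-rows matrix layout are choices made for this example only.

```python
import random


def he_normal_weights(n_in, n_out, rng):
    """Sample an n_out x n_in weight matrix from N(0, 2 / n_in)."""
    std = (2.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


# Hypothetical usage for the two hidden layers (32 and 256 units), taking the
# 2-dimensional mountain car state as the network input.
rng = random.Random(0)
hidden_1 = he_normal_weights(n_in=2, n_out=32, rng=rng)
hidden_2 = he_normal_weights(n_in=32, n_out=256, rng=rng)
hidden_biases = [[0.0] * 32, [0.0] * 256]  # biases start at zero
```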
3.1 Hypothesis 1: Learning Sparse Representations
To test our first hypothesis, we first found the best combination of buffer size and target network update frequency for a DQN agent (5k and 10, respectively, for mountain car and 80k and 400, respectively, for catcher). Then, we fixed the buffer size and target network update frequency to be the same as for DQN and swept over each of the parameters of each regularization method to find the best parameter combination. After finding the best parameter combination for each different method, we ran another 500 runs to eliminate possible maximization bias. Our analyses were performed on the second hidden layer of the network at the end of training.
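Table 1: (A) Measures of sparsity and (B) performance of each method.

Mountain Car

| Method | (A) Overlap Avg | (A) Overlap ME | (A) Neurons Avg | (A) Neurons ME | (A) Normalized Overlap Avg | (A) Normalized Overlap ME | (B) Cumulative Reward Avg | (B) Cumulative Reward ME |
|---|---|---|---|---|---|---|---|---|
| DQN | 17.92 | 0.64 | 29.21 | 1.14 | 0.64 | 0.01 | -198 884.57 | 12.61 |
| Dropout | 85.8 | 2.01 | 164.35 | 1.08 | 0.53 | 0.014 | -198 970.4 | 14.27 |
| DRE | 13.26 | 0.47 | 21.04 | 0.86 | 0.65 | 0.01 | -198 869.49 | 12.28 |
| DRG | 12.96 | 0.55 | 21.05 | 0.91 | 0.63 | 0.01 | -198 870.35 | 11.92 |
| L1A | 11.2 | 0.43 | 18.25 | 0.74 | 0.63 | 0.011 | -198 872.44 | 10.09 |
| L1W | 93.66 | 1.94 | 207.75 | 1.54 | 0.45 | 0.007 | -198 593.53 | 8.25 |
| L2A | 4.52 | 0.11 | 22.21 | 0.73 | 0.22 | 0.005 | -198 598.9 | 4.48 |
| L2W | 39.52 | 0.73 | 116.92 | 1.25 | 0.34 | 0.005 | -198 633.16 | 6.04 |

Catcher

| Method | (A) Overlap Avg | (A) Overlap ME | (A) Neurons Avg | (A) Neurons ME | (A) Normalized Overlap Avg | (A) Normalized Overlap ME | (B) Cumulative Reward Avg | (B) Cumulative Reward ME |
|---|---|---|---|---|---|---|---|---|
| DQN | 58.42 | 0.81 | 154.51 | 1.09 | 0.38 | 0.004 | 11 657.88 | 42.23 |
| Dropout | 101.26 | 0.99 | 243.33 | 0.48 | 0.42 | 0.004 | 9 565.69 | 97.22 |
| DRE | 53.7 | 0.92 | 158.48 | 1.15 | 0.34 | 0.005 | 11 730.06 | 41.05 |
| DRG | 49.17 | 0.6 | 197.97 | 0.9 | 0.25 | 0.003 | 11 868.85 | 69.21 |
| L1A | 5.9 | 0.45 | 37.86 | 0.84 | 0.15 | 0.008 | 10 666.23 | 114.84 |
| L1W | 90.28 | 1.6 | 161.39 | 1.77 | 0.56 | 0.008 | 11 370.72 | 99.72 |
| L2A | 35.94 | 0.94 | 118.66 | 1.35 | 0.3 | 0.006 | 11 874.5 | 68.22 |
| L2W | 72.34 | 0.81 | 182.43 | 1.37 | 0.4 | 0.004 | 11 746.97 | 44.56 |

Performance: cumulative reward over the entire training period (200k steps for mountain car and 500k for catcher). The sample average (Avg) and margin of error of the 95% confidence interval (ME) were computed based on 500 independent runs.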
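To study the sparsity of the learned representation, we computed the version of activation overlap proposed by Liu2018SRUtility (Liu2018SRUtility). For two observations $x_1$ and $x_2$ and a hidden layer with $k$ neurons, i.e., $\phi(x) \in \mathbb{R}^k$, the activation overlap is:

$$\text{overlap}\left(\phi(x_1), \phi(x_2)\right) = \sum_{i=1}^{k} \mathbb{1}\left[ \phi_i(x_1) > 0 \text{ and } \phi_i(x_2) > 0 \right]. \tag{9}$$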
To compute this measure, we covered the state space with a grid with 10k vertices by partitioning each dimension in the mountain car environment into 100 equal partitions and each dimension in the catcher environment into 10 equal partitions. We computed the activation overlap on each pair of vertices in the grid and averaged over 500 runs. As we were computing the activation overlap we found that many methods had a large number of dead neurons (neurons that were zero for every observation in the data set) and noticed that the measure in Equation (9) did not capture this. Consequently, a method can appear to have low activation overlap because it retained a small number of live neurons. In Table 1A, we present the average activation overlap, the number of live neurons, and the normalized activation overlap—normalized by the number of live neurons—along with the margin of error of the 95% confidence interval.
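The three quantities reported in Table 1A can be sketched for a single run as follows. The sketch assumes that the activation overlap counts the neurons that are active (greater than zero) for both observations, that pairs of distinct grid vertices are used, and that the normalization divides the average overlap of a run by its number of live neurons; the function names are hypothetical.

```python
from itertools import combinations


def activation_overlap(phi_1, phi_2):
    """Number of neurons that are active (> 0) for both observations."""
    return sum(1 for a, b in zip(phi_1, phi_2) if a > 0 and b > 0)


def overlap_statistics(features):
    """Return (average overlap, live neurons, normalized overlap).

    features: list of activation vectors, one per grid vertex.
    """
    k = len(features[0])
    # A neuron is live if it is active for at least one observation.
    live = [i for i in range(k) if any(f[i] > 0 for f in features)]
    pairs = list(combinations(features, 2))
    avg_overlap = sum(activation_overlap(a, b) for a, b in pairs) / len(pairs)
    normalized = avg_overlap / len(live) if live else 0.0
    return avg_overlap, len(live), normalized
```

For the 10k-vertex grids used here, enumerating all pairs amounts to roughly 50 million comparisons per run, so a vectorized implementation would be preferable in practice.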
In both environments, we found that a higher activation overlap corresponded to a higher number of live neurons. On the other hand, the normalized activation overlap did not show any correspondence to the number of live neurons. This is problematic since depending on the measure, we can draw different conclusions about the sparsity of the learned representation of each algorithm, which raises the question: what measure of overlap should we use?
To corroborate the results in Table 1A, we computed the instance sparsity measure Liu2018SRUtility for each different method using the same samples used to compute the activation overlap. The instance sparsity corresponds to the percentage of active neurons (excluding dead neurons) for each instance in a data set. A sparse representation should result in a small percentage of active neurons for each instance. Figure 1 shows the instance sparsity of each different method for each different environment aggregated over 500 runs; we used light colours for catcher and dark colours for mountain car.
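The instance sparsity measure described above can be sketched in the same style. The function name is hypothetical, and returning zeros when every neuron is dead is a convention chosen for this example.

```python
def instance_sparsity(features):
    """Percentage of live neurons that are active for each instance.

    features: list of activation vectors, one per instance.
    """
    k = len(features[0])
    # Dead neurons (never active on this data set) are excluded.
    live = [i for i in range(k) if any(f[i] > 0 for f in features)]
    if not live:
        return [0.0 for _ in features]
    return [100.0 * sum(1 for i in live if f[i] > 0) / len(live)
            for f in features]
```

The distribution of these per-instance percentages is what a plot such as Figure 1 summarizes; lower values indicate a sparser representation.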
The results show that there is not a clear relationship between the activation overlap and the instance sparsity measures. For example, in mountain car, both L1W and L2W resulted in higher activation overlap than DQN, which indicates that both of these methods learned a denser representation than DQN if we accept activation overlap as a measure of sparsity. However, the instance sparsity plot shows that the representation learned by L1W and L2W is sparser than the representation learned by DQN, contradicting the conclusion drawn from the activation overlap. On the other hand, the normalized activation overlap shows a strong relationship with the instance sparsity plots. The clearest example is , which shows a similar level of sparsity as DQN in mountain car, but a higher level of sparsity than DQN in catcher according to the instance sparsity plots. The normalized activation overlap corroborates this conclusion, unlike the activation overlap without normalization. Consequently, we will use the normalized activation overlap as the main measure of sparsity.
Overall, the results show that it is possible to learn a sparse representation incrementally by using appropriate regularization. However, L1W and L2W do not necessarily result in a denser representation than DQN. Moreover, Dropout, DRE, DRG, and L1A do not consistently result in a sparser representation than DQN. The only method that resulted in a sparser representation than DQN in both environments was L2A. Since it seems difficult to learn a sparse representation incrementally, one must ask: is there any benefit from learning sparse representations?
3.2 Hypothesis 2: The Utility of Sparse Representations
To test hypothesis 2, we took a closer look at the performance of the algorithms from the previous experiment. If we accept hypothesis 2 to be true, then we would expect the methods with a smaller normalized activation overlap to have the best performance among all the algorithms. In other words, we would expect L2A to perform the best in mountain car, and L1A to perform the best in catcher. Table 1B shows that this is true in mountain car, where L1W and L2A had the best performance. However, we can already see evidence of a more complex effect. For instance, L1W learned a denser representation than in mountain car, yet it resulted in better performance. Similarly, in catcher, L1A, the method with the lowest normalized activation overlap, performed worse than many of the methods that resulted in denser representations.
The results indicate that learning a sparse representation can improve performance, but only if this does not result in a large number of dead neurons. Conversely, learning a slightly denser representation, as in the case of L1W compared to in mountain car, can result in good performance as long as many neurons stay alive. This suggests that methods that learn a sparse representation while preserving as many live neurons as possible would perform better than methods that solely learn a sparse representation or solely preserve as many live neurons as possible. We postpone the investigation of this hypothesis for future work.
3.3 Hypothesis 3: Robustness of Sparse Representations to the Replay Buffer Size
Beyond improving performance in terms of cumulative reward, sparse representations may also be useful for overcoming the catastrophic interference problem often encountered in DNNs. Since the experience replay buffer mitigates the catastrophic interference suffered by a DNN, it should be possible to control the amount of interference by adjusting the size of the buffer. In fact, previous results have shown that a buffer that is either too small or too big can have a negative effect on performance Zhang2017ER, LiuR2017Effects, suggesting the occurrence of catastrophic interference at either extreme. Consequently, if sparse representations help mitigate catastrophic interference, we would expect the performance of those methods that learned a sparser representation to be more robust to the size of the experience replay buffer.
To test this hypothesis, we implemented several agents with buffer size values of 100, 1k, 2k, 5k, 20k, and 80k. For each of these values, we found the best parameter combination for DQN and each of the different regularization methods. The regularization methods used the same target network update frequency as DQN to eliminate possible confounding effects. After finding the best parameter combination for each different method, we ran each method for another 500 runs to eliminate possible maximization bias.
Our results (Figure 2B) provide evidence in favour of our hypothesis. Methods that learn a sparser representation were more robust to the size of the experience replay buffer. This is most evident in mountain car, where the performances of L2A and L2W, the two methods with the lowest normalized activation overlap, are more robust to the effect of the buffer size. A similar effect can be observed in catcher to a lesser degree; the performance of is more robust to the effect of the buffer size. However, once again we found evidence of a more complex effect. If learning a sparse representation were solely responsible for the robustness of each method to the size of the experience replay buffer, then we would expect L1A to be more robust to the effect of the buffer size in catcher, yet its performance is one of the worst among all the methods. We hypothesize that this effect is the result of regularization killing too many neurons during learning.
4 Conclusions
In this paper we empirically demonstrated that it is possible to learn a sparse representation and the action-value function simultaneously. Moreover, we corroborated the results from Liu2018SRUtility (Liu2018SRUtility) by showing that sparse representations are useful for improving performance and for overcoming catastrophic interference in reinforcement learning. Most importantly, we found that how we learn is just as important as what we learn; learning a sparse representation seems to be useful for improving performance, but killing too many neurons in the process could be counterproductive. This insight suggests that we should strive for methods that learn a sparse representation while retaining as many live neurons as possible; however, further work is needed to confirm this hypothesis.
Appendix A: Grid Search
In order to find the best parameters in our experiments we performed a grid search with a sample size of 30. To evaluate each parameter combination we computed the 95% confidence interval of the cumulative reward over the whole training period and selected the parameter combination that resulted in the highest lower confidence bound. This criterion favours parameter combinations that achieve a high cumulative reward and also have a small variance. Once we found the best parameter combination, we reran every method for another 500 runs in order to eliminate maximization bias.
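The selection criterion described above can be sketched as follows. The sketch uses the normal quantile 1.96 for the 95% interval, which is an assumption of this example; the text does not state which quantile was used, and with only 30 samples a Student-t quantile would give a slightly wider interval. The function names and the dictionary format are also hypothetical.

```python
from statistics import mean, stdev


def lower_confidence_bound(rewards, z=1.96):
    """Sample mean minus the margin of error of an approximate 95% interval."""
    margin = z * stdev(rewards) / len(rewards) ** 0.5
    return mean(rewards) - margin


def select_best(results):
    """Pick the parameter combination with the highest lower confidence bound.

    results: dict mapping a parameter combination to a list of cumulative
    rewards, one per run.
    """
    return max(results, key=lambda params: lower_confidence_bound(results[params]))
```

Under this criterion, a combination with a slightly lower average but much smaller spread across runs can be preferred over one with a higher but noisier average.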
These are the values of the parameters that we used in our grid search:
| Hyperparameter | Method | Values |
|---|---|---|
| Learning Rate | All | Mountain Car: 0.01, 0.004, 0.001, 0.00025; Catcher: 0.001, 0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625 |
| Experience Replay Buffer Size | DQN | 100, 1k, 5k, 20k, 80k |
| Target Network Update Frequency | DQN | 10, 50, 100, 200, 400 |
| Dropout Probability | Dropout | 0.1, 0.2, 0.3, 0.4, 0.5 |
| Beta Upper Bound | DRE, DRG | 0.1, 0.2, 0.5 |
| Regularization Factor: Distributional Regularizers | DRE, DRG | 0.1, 0.01, 0.001 |
| Regularization Factor: Norm-Based Regularizers | L1A, L1W, L2A, L2W | 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001 |
Appendix B: Extended Tables
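We omitted the standard deviation in Table 1 because of space and for conciseness. For similar reasons, in Figure 2 we did not provide any details about the specific values of the average performance measure and their confidence intervals. In this appendix we present extended results to facilitate reproducibility and allow the reader to double check our results.

Mountain Car

| Method | Overlap Avg | Overlap SD | Overlap ME | Neurons Avg | Neurons SD | Neurons ME | Normalized Overlap Avg | Normalized Overlap SD | Normalized Overlap ME |
|---|---|---|---|---|---|---|---|---|---|
| DQN | 17.92 | 7.26 | 0.64 | 29.21 | 12.97 | 1.14 | 0.64 | 0.12 | 0.01 |
| Dropout | 85.8 | 22.83 | 2.01 | 164.35 | 12.25 | 1.08 | 0.53 | 0.16 | 0.014 |
| DRE | 13.26 | 5.36 | 0.47 | 21.04 | 9.77 | 0.86 | 0.65 | 0.11 | 0.01 |
| DRG | 12.96 | 6.27 | 0.55 | 21.05 | 10.36 | 0.91 | 0.63 | 0.12 | 0.01 |
| L1A | 11.2 | 4.95 | 0.43 | 18.25 | 8.42 | 0.74 | 0.63 | 0.12 | 0.011 |
| L1W | 93.66 | 22.04 | 1.94 | 207.75 | 17.55 | 1.54 | 0.45 | 0.08 | 0.007 |
| L2A | 4.52 | 1.2 | 0.11 | 22.21 | 8.27 | 0.73 | 0.22 | 0.06 | 0.005 |
| L2W | 39.52 | 8.34 | 0.73 | 116.92 | 14.21 | 1.25 | 0.34 | 0.05 | 0.005 |

Catcher

| Method | Overlap Avg | Overlap SD | Overlap ME | Neurons Avg | Neurons SD | Neurons ME | Normalized Overlap Avg | Normalized Overlap SD | Normalized Overlap ME |
|---|---|---|---|---|---|---|---|---|---|
| DQN | 58.42 | 9.17 | 0.81 | 154.51 | 12.35 | 1.09 | 0.38 | 0.05 | 0.004 |
| Dropout | 101.26 | 11.3 | 0.99 | 243.33 | 5.52 | 0.48 | 0.42 | 0.05 | 0.004 |
| DRE | 53.7 | 10.49 | 0.92 | 158.48 | 13.12 | 1.15 | 0.34 | 0.05 | 0.005 |
| DRG | 49.17 | 6.82 | 0.6 | 197.97 | 10.2 | 0.9 | 0.25 | 0.03 | 0.003 |
| L1A | 5.9 | 5.14 | 0.45 | 37.86 | 9.57 | 0.84 | 0.15 | 0.1 | 0.008 |
| L1W | 90.28 | 18.16 | 1.6 | 161.39 | 20.16 | 1.77 | 0.56 | 0.09 | 0.008 |
| L2A | 35.94 | 10.67 | 0.94 | 118.66 | 15.39 | 1.35 | 0.3 | 0.07 | 0.006 |
| L2W | 72.34 | 9.22 | 0.81 | 182.43 | 15.54 | 1.37 | 0.4 | 0.05 | 0.004 |
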
Mountain Car

| Method | Buffer Size | Avg | SD | ME | C.I. |
|---|---|---|---|---|---|
| DQN | 100 | -199 856.86 | 150.98 | 13.27 | (-199 870.12, -199 843.59) |
| | 1 K | -199 026.66 | 244.65 | 21.5 | (-199 048.15, -199 005.16) |
| | 5 K | -198 884.57 | 143.48 | 12.61 | (-198 897.17, -198 871.96) |
| | 20 K | -198 937.06 | 141.23 | 12.41 | (-198 949.47, -198 924.65) |
| | 80 K | -199 304.41 | 314.35 | 27.62 | (-199 332.03, -199 276.79) |
| DRE | 100 | -199 346.84 | 364.26 | 32.01 | (-199 378.84, -199 314.83) |
| | 1 K | -199 009.05 | 236.99 | 20.82 | (-199 029.87, -198 988.23) |
| | 5 K | -198 869.49 | 139.73 | 12.28 | (-198 881.77, -198 857.21) |
| | 20 K | -198 895.09 | 184.29 | 16.19 | (-198 911.29, -198 878.9) |
| | 80 K | -199 128.19 | 254.1 | 22.33 | (-199 150.52, -199 105.87) |
| DRG | 100 | -199 722.63 | 271.86 | 23.89 | (-199 746.52, -199 698.74) |
| | 1 K | -198 998.54 | 192.22 | 16.89 | (-199 015.43, -198 981.65) |
| | 5 K | -198 870.35 | 135.71 | 11.92 | (-198 882.27, -198 858.42) |
| | 20 K | -198 928.31 | 171.58 | 15.08 | (-198 943.39, -198 913.23) |
| | 80 K | -199 280.11 | 276.14 | 24.26 | (-199 304.38, -199 255.85) |
| L1A | 100 | -199 778.29 | 221.7 | 19.48 | (-199 797.77, -199 758.81) |
| | 1 K | -198 993.98 | 200.59 | 17.63 | (-199 011.61, -198 976.36) |
| | 5 K | -198 872.44 | 114.8 | 10.09 | (-198 882.52, -198 862.35) |
| | 20 K | -198 940.82 | 124.59 | 10.95 | (-198 951.77, -198 929.87) |
| | 80 K | -199 214.54 | 278.61 | 24.48 | (-199 239.02, -199 190.06) |
| L1W | 100 | -199 246.67 | 376.84 | 33.11 | (-199 279.79, -199 213.56) |
| | 1 K | -198 714.28 | 57.01 | 5.01 | (-198 719.29, -198 709.27) |
| | 5 K | -198 593.53 | 93.91 | 8.25 | (-198 601.78, -198 585.28) |
| | 20 K | -198 592.51 | 94.1 | 8.27 | (-198 600.78, -198 584.24) |
| | 80 K | -198 652.34 | 200.2 | 17.59 | (-198 669.93, -198 634.75) |
| L2A | 100 | -199 468.76 | 328.71 | 28.88 | (-199 497.64, -199 439.88) |
| | 1 K | -198 666.12 | 58.59 | 5.15 | (-198 671.26, -198 660.97) |
| | 5 K | -198 598.9 | 50.96 | 4.48 | (-198 603.38, -198 594.42) |
| | 20 K | -198 576.14 | 74.58 | 6.55 | (-198 582.69, -198 569.59) |
| | 80 K | -198 562.71 | 55.41 | 4.87 | (-198 567.58, -198 557.84) |
| L2W | 100 | -199 465.72 | 339.12 | 29.8 | (-199 495.52, -199 435.92) |
| | 1 K | -198 678.4 | 75.54 | 6.64 | (-198 685.04, -198 671.76) |
| | 5 K | -198 633.16 | 68.7 | 6.04 | (-198 639.2, -198 627.13) |
| | 20 K | -198 683.92 | 73.24 | 6.44 | (-198 690.35, -198 677.48) |
| | 80 K | -198 896.84 | 356.71 | 31.34 | (-198 928.19, -198 865.5) |
| Dropout | 100 | -199 560.45 | 115.59 | 10.16 | (-199 570.6, -199 550.29) |
| | 1 K | -198 875.45 | 80.23 | 7.05 | (-198 882.5, -198 868.4) |
| | 5 K | -198 970.4 | 162.35 | 14.27 | (-198 984.66, -198 956.13) |
| | 20 K | -199 090.34 | 232.55 | 20.43 | (-199 110.78, -199 069.91) |
| | 80 K | -199 194.09 | 211.28 | 18.56 | (-199 212.65, -199 175.52) |

Catcher

| Method | Buffer Size | Avg | SD | ME | C.I. |
|---|---|---|---|---|---|
| DQN | 100 | 1529.36 | 992.14 | 87.17 | (1442.19, 1616.54) |
| | 1 K | 3374.82 | 1606.29 | 141.14 | (3233.68, 3515.96) |
| | 5 K | 8711.13 | 813.98 | 71.52 | (8639.61, 8782.65) |
| | 20 K | 11090.68 | 638.03 | 56.06 | (11034.62, 11146.74) |
| | 80 K | 11657.88 | 480.58 | 42.23 | (11615.65, 11700.11) |
| DRE | 100 | 3995.81 | 1522.67 | 133.79 | (3862.02, 4129.6) |
| | 1 K | 4377.34 | 1704.83 | 149.8 | (4227.54, 4527.14) |
| | 5 K | 8968.46 | 1247.77 | 109.64 | (8858.82, 9078.1) |
| | 20 K | 11308.25 | 612.84 | 53.85 | (11254.4, 11362.1) |
| | 80 K | 11730.06 | 467.21 | 41.05 | (11689.01, 11771.12) |
| DRG | 100 | 2610.76 | 1316.8 | 115.7 | (2495.06, 2726.47) |
| | 1 K | 3612.21 | 1819.77 | 159.89 | (3452.32, 3772.11) |
| | 5 K | 9047.32 | 1174.89 | 103.23 | (8944.09, 9150.56) |
| | 20 K | 11178.37 | 576.01 | 50.61 | (11127.76, 11228.98) |
| | 80 K | 11868.85 | 787.65 | 69.21 | (11799.64, 11938.06) |
| L1A | 100 | 1273.31 | 1597.59 | 140.37 | (1132.94, 1413.68) |
| | 1 K | 2237.3 | 2092.8 | 183.89 | (2053.41, 2421.18) |
| | 5 K | 7121.2 | 2104.6 | 184.92 | (6936.28, 7306.12) |
| | 20 K | 10376.92 | 1225.84 | 107.71 | (10269.21, 10484.63) |
| | 80 K | 10666.23 | 1307.04 | 114.84 | (10551.39, 10781.08) |
| L1W | 100 | -894.34 | 1368.01 | 120.2 | (-1014.54, -774.14) |
| | 1 K | 2989.15 | 1163.98 | 102.27 | (2886.87, 3091.42) |
| | 5 K | 8353.5 | 763.06 | 67.05 | (8286.45, 8420.54) |
| | 20 K | 10655.52 | 910.01 | 79.96 | (10575.56, 10735.47) |
| | 80 K | 11370.72 | 1134.97 | 99.72 | (11270.99, 11470.44) |
| L2A | 100 | 3606.2 | 2138.51 | 187.9 | (3418.3, 3794.1) |
| | 1 K | 5481.86 | 2317.08 | 203.59 | (5278.27, 5685.45) |
| | 5 K | 9276.56 | 1209.6 | 106.28 | (9170.28, 9382.84) |
| | 20 K | 11417.48 | 832.52 | 73.15 | (11344.33, 11490.63) |
| | 80 K | 11874.5 | 776.4 | 68.22 | (11806.29, 11942.72) |
| L2W | 100 | 1302.85 | 987.5 | 86.77 | (1216.09, 1389.62) |
| | 1 K | 3082.81 | 1098.62 | 96.53 | (2986.28, 3179.34) |
| | 5 K | 8853.96 | 766.19 | 67.32 | (8786.63, 8921.28) |
| | 20 K | 11167.58 | 580.9 | 51.04 | (11116.54, 11218.62) |
| | 80 K | 11746.97 | 507.19 | 44.56 | (11702.41, 11791.53) |
| Dropout | 100 | 1779.72 | 1580.31 | 138.85 | (1640.86, 1918.57) |
| | 1 K | 4505.58 | 1891.71 | 166.22 | (4339.36, 4671.79) |
| | 5 K | 5762.51 | 1299.27 | 114.16 | (5648.35, 5876.67) |
| | 20 K | 7967.46 | 1058.12 | 92.97 | (7874.49, 8060.43) |
| | 80 K | 9565.69 | 1106.42 | 97.22 | (9468.47, 9662.9) |