4.1 Experimental Setups
To answer the research questions above, we design our evaluation from the following perspectives.
Subject Datasets and Repairing Scenarios. Given a deployed DNN trained on a training dataset
\(\mathcal {D}^\text{t}\), we can evaluate it on a testing dataset
\(\mathcal {D}^\text{v}\). In the real world, many scenarios cannot be covered by \(\mathcal {D}^\text{v}\), and the DNN's performance may decrease significantly after it is deployed in its operational environment. For example, there are common corruptions (i.e., noise patterns) in the real world that can significantly affect a DNN [26]: Gaussian noise (GN), shot noise (SN), impulse noise (IN), defocus blur (DB), Gaussian blur (GB), motion blur (MB), zoom blur (ZB), snow (SNW), frost (FRO), fog (FOG), brightness (BR), contrast (CTR), elastic transform (ET), pixelate (PIX), and JPEG compression (JPEG).
According to the aforementioned situations, we consider two repairing scenarios that commonly occur in practice:
—
Repairing the accuracy drift on the testing dataset. When we evaluate the DNN on the testing dataset \(\mathcal {D}^\text{v}\), we can collect a few failure examples (i.e., 1,000 examples) denoted as \(\mathcal {D}^\text{v}_{\text{fail}}\). Then, we set \(\mathcal {D}^{\text{repair}}=\mathcal {D}^\text{v}_{\text{fail}}\cup \mathcal {D}^\text{t}\) and use the proposed or baseline repairing methods to enhance the deployed DNN. We evaluate the accuracy on the testing dataset with \(\mathcal {D}^\text{v}_{\text{fail}}\) excluded (i.e., \(\mathcal {D}^\text{v}\setminus \mathcal {D}^\text{v}_{\text{fail}}\)); a sketch of this data assembly is given after this list. Note that repairing a DNN with only a few testing examples is a meaningful and important setting that has been adopted by recent works [54, 72]. In addition, in many practical scenarios, collecting buggy examples is very difficult or very costly, so only a few buggy examples can be collected in total. Hence, we follow the common choice in recent works [54, 72] and select only 1,000 failure examples from the testing data.
—
Repairing the robustness on corrupted datasets. When we evaluate the DNN on a corrupted testing dataset \(\mathcal {D}^\text{c}\), we can also collect a few failure examples (i.e., 1,000 examples) denoted as \(\mathcal {D}^\text{c}_{\text{fail}}\) and set \(\mathcal {D}^{\text{repair}}=\mathcal {D}^\text{c}_{\text{fail}}\,\cup \,\mathcal {D}^\text{t}\). The repairing goal is to enhance the accuracy on \(\mathcal {D}^\text{c}\setminus \mathcal {D}^\text{c}_{\text{fail}}\) and other corrupted datasets while maintaining the accuracy on the clean testing dataset (i.e., \(\mathcal {D}^\text{v}\setminus \mathcal {D}^\text{v}_{\text{fail}}\)).
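To make the construction of \(\mathcal {D}^{\text{repair}}\) and the held-out evaluation split concrete, the following sketch shows one way to assemble them in PyTorch. The helper names (collect_failures, build_repair_split) and the batch size are illustrative assumptions rather than part of ArchRepair's released implementation; only the 1,000-example budget and the set operations mirror the description above.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def collect_failures(model, dataset, device, budget=1000):
    """Return indices of up to `budget` misclassified examples (D_fail)."""
    model.eval()
    failure_idx, offset = [], 0
    loader = DataLoader(dataset, batch_size=256, shuffle=False)
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            wrong = (preds != labels).nonzero(as_tuple=True)[0] + offset
            failure_idx.extend(wrong.tolist())
            offset += labels.size(0)
            if len(failure_idx) >= budget:
                break
    return failure_idx[:budget]

def build_repair_split(model, train_set, test_set, device, budget=1000):
    """D_repair = D_fail + D_train; evaluation uses D_test with D_fail removed."""
    fail_idx = collect_failures(model, test_set, device, budget)
    kept = [i for i in range(len(test_set)) if i not in set(fail_idx)]
    d_repair = ConcatDataset([Subset(test_set, fail_idx), train_set])
    d_eval = Subset(test_set, kept)
    return d_repair, d_eval
```

The same routine applies to the second scenario by passing a corrupted testing dataset \(\mathcal {D}^\text{c}\) in place of the clean testing set.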
We choose CIFAR-10 [33], CIFAR-100 [33], Tiny-ImageNet [36], and ImageNet [11] as the evaluation datasets. They are commonly used datasets in recent DNN repair studies, enabling us to perform comparative studies in a relatively fair way. Each dataset contains its respective training dataset
\(\mathcal {D}^\text{t}\) and testing dataset
\(\mathcal {D}^\text{v}\). CIFAR-10 contains a total of 60,000 images in 10 categories, of which 50,000 images are for
\(\mathcal {D}^\text{t}\) and the other 10,000 are for
\(\mathcal {D}^\text{v}\).
CIFAR-100 has 100 classes containing 600 images each. There are 500 images in the training dataset \(\mathcal {D}^\text{t}\) and 100 images in the testing dataset \(\mathcal {D}^\text{v}\) for each class. Tiny-ImageNet has a training dataset
\(\mathcal {D}^\text{t}\) with the size of 100,000 images, and a testing dataset
\(\mathcal {D}^\text{v}\) with the size of 10,000 images.
ImageNet contains over 14 million images. In our experiment, the training dataset \(\mathcal {D}^\text{t}\) uses 1.3 million images, and the testing dataset \(\mathcal {D}^\text{v}\) uses over 50,000 images. In addition, following [26], we obtain corrupted testing datasets \(\lbrace \mathcal {D}^\text{c}_i\rbrace\), where \(i=1,2,\dots , 15\) corresponds to the fifteen corruptions listed above.
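The 15 corruptions are those released with the benchmark of [26] and are normally generated with its published code. Purely as an illustration of how one corrupted copy \(\mathcal {D}^\text{c}_i\) of the testing set can be built, the sketch below hand-rolls Gaussian noise (GN); the severity values are placeholder assumptions and do not reproduce the benchmark's exact parameters.

```python
import numpy as np
import torchvision
import torchvision.transforms as T

# Illustrative corruption: additive Gaussian noise (GN). The actual 15
# corruptions follow [26]; the severity levels here are placeholders.
def gaussian_noise(img, severity=3):
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = np.clip(x + np.random.normal(scale=sigma, size=x.shape), 0.0, 1.0)
    return (x * 255).astype(np.uint8)

# A corrupted testing dataset D^c_i is the clean testing set with the i-th
# corruption applied to every image before the usual tensor transform.
corrupt_transform = T.Compose([
    T.Lambda(gaussian_noise),
    T.ToTensor(),
])
cifar10_c = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=corrupt_transform)
```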
DNN architectures. We select six different DNN architectures, i.e., VGGNet-16 [58], ResNet-18, ResNet-50, ResNet-101 [25], DenseNet-121 [27], and EfficientNet-B0 [60]. Given that ArchRepair is a block-based repairing method, the block-like ResNet architectures are natural research subjects. For a broader comparison, we also choose a non-block-like architecture, DenseNet-121, to examine the repairing capability of ArchRepair.1 For each architecture, we first pre-train the model on the original training dataset \(\mathcal {D}^\text{t}\) (from CIFAR-10, CIFAR-100, Tiny-ImageNet, or ImageNet); the model with the highest accuracy on the corresponding testing dataset \(\mathcal {D}^\text{v}\) is saved as the pre-trained model \(\phi _\theta\). As the original ResNet and DenseNet are not designed for the CIFAR-10 and Tiny-ImageNet datasets, we use the unofficial architecture code offered by a popular GitHub project,2 which has more than 4.1K stars.
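As an illustration of this checkpoint-selection step, the sketch below pre-trains a model and keeps the weights with the highest accuracy on \(\mathcal {D}^\text{v}\) as \(\phi _\theta\). The evaluate helper, the checkpoint path, and the epoch budget are illustrative placeholders; the SGD settings follow the hyper-parameters reported below.

```python
import torch

def evaluate(model, loader, device):
    """Top-1 accuracy of `model` on the given data loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

def pretrain(model, train_loader, test_loader, epochs=500, device="cuda"):
    """Pre-train and keep the checkpoint with the best accuracy on D^v."""
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc = 0.0
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            opt.zero_grad()
            loss = criterion(model(images.to(device)), labels.to(device))
            loss.backward()
            opt.step()
        acc = evaluate(model, test_loader, device)
        if acc > best_acc:  # save phi_theta, the best pre-trained model
            best_acc = acc
            torch.save(model.state_dict(), "pretrained_phi_theta.pt")
    return best_acc
```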
Block definition. We divide each of the six selected DNN architectures into several blocks. For each of ResNet-18, ResNet-50, and ResNet-101, we follow its block structure and divide it into four blocks, as shown in Table 2. For DenseNet-121, we treat every two convolutional layers as one block. For VGGNet-16, we manually divide it into six blocks at the max-pooling layers, as Table 2 shows, and select five of them as repairing targets (i.e., Blocks 1~5; Block 6 produces the output, so we leave it out of the repairing targets). For EfficientNet-B0, we follow its block structure and divide it into seven blocks (see Table 2). An illustrative partition of ResNet-18 is sketched below.

NAS method. We select PC-DARTS [70] as the NAS solution for ArchRepair. While all popular NAS techniques should fit into ArchRepair (e.g., DARTS, SNAS, and BayesNAS), these techniques take more time than PC-DARTS to search for a better network architecture. Given that DNN repair is a time-sensitive task, we choose PC-DARTS [70] as it is among the fastest NAS methods. Nonetheless, ArchRepair retains an interface for switching to other NAS techniques when a task prioritizes the performance of repaired models over the time cost of repairing.

Hyper-parameters. Regarding the training setup, we employ stochastic gradient descent (SGD) as the optimizer, setting the batch size to 128, the initial learning rate to 0.1, and the weight decay to 0.0005. We use the cross-entropy loss as the loss function. The maximum number of epochs is 500, and an early-stopping criterion terminates training when the validation loss has not decreased for 10 epochs.
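Following up on the block definition above, the sketch below illustrates such a partition for ResNet-18, assuming that the four residual stages (layer1 to layer4) of the torchvision implementation correspond to the four blocks in Table 2. The freeze_all_but helper is only a simplified stand-in for selecting a block as the repairing target; ArchRepair additionally rewires the selected block's architecture via NAS rather than merely fine-tuning its weights.

```python
from torchvision.models import resnet18

# Assumed mapping: the four residual stages of ResNet-18 correspond to the
# four blocks of Table 2; the stem and classifier head stay outside.
model = resnet18(num_classes=10)  # e.g., CIFAR-10 has 10 classes
blocks = {
    "block1": model.layer1,
    "block2": model.layer2,
    "block3": model.layer3,
    "block4": model.layer4,
}

def freeze_all_but(model, target):
    """Freeze every parameter except those of the block selected for repair."""
    for p in model.parameters():
        p.requires_grad = False
    for p in blocks[target].parameters():
        p.requires_grad = True

freeze_all_but(model, "block3")  # e.g., select the third block as the target
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} parameter tensors are trainable, all within layer3")
```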
Baselines. To demonstrate the repairing capability of the proposed
ArchRepair, we select six SOTA DNN repair methods from two different categories as baselines: neuron-level repairing methods and network-level repairing methods. Neuron-level repairing methods focus on fixing certain neurons' weights in order to repair the DNN; representative methods from this category are MODE [45], Apricot [74], and Arachne [59]. Network-level repairing methods mainly repair DNNs by using augmented datasets to fine-tune the whole network; SENSEI [19], Few-Shot [54], and DeepRepair [72] are the most popular ones. For a fair comparison, we employ the same settings for all six repairing methods and ArchRepair. To fully evaluate the effectiveness of the proposed method, we apply all methods (the six baselines and ArchRepair) to fix four different DNN architectures on large-scale datasets, including the clean version and the 15 corrupted versions of CIFAR-10 and Tiny-ImageNet, to assess the repairing capability.
Other configurations. We implement ArchRepair in Python 3.9 based on the PyTorch framework. All experiments were performed on a server with a 12-core 3.60 GHz Xeon E5-1650 CPU, 128 GB RAM, and four NVIDIA GeForce RTX 3090 GPUs (each with 24 GB of memory), running Ubuntu 18.04.
In summary, for each baseline method and
ArchRepair,
our evaluation consists of 96 configurations (6 DNN architectures \(\times\) 16 versions of a dataset3) on four datasets (i.e., CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet). For the CIFAR-10 dataset, an execution of training and repairing a model under one specific configuration costs about 12 hours on average (the maximum is about 50 hours); for the Tiny-ImageNet dataset, an execution of training and repairing a model takes about 18 hours on average (the maximum is about 64 hours).
We measured the execution time of repairing the six DNN architectures (i.e., VGGNet-16, ResNet-18, ResNet-50, ResNet-101, DenseNet-121, and EfficientNet-B0) on CIFAR-10 within 100 epochs by the different repairing methods. The results are reported in Table 4. According to Table 4, the neuron-level methods use less execution time than the other repairing methods (the cells with green background), and the network-level methods use more execution time than the others. Our method, ArchRepair, uses more execution time than the neuron-level methods but less than the network-level methods. This is because ArchRepair repairs DNN models at the block level, which operates at a larger granularity than the neuron level but a smaller one than the network level. This is also consistent with our expectation in Section 2.2, i.e., block-level repairing makes a good tradeoff between effectiveness and efficiency. Overall, the total execution time of our experiments exceeds two months.