As deep learning is applied to high-stakes scenarios, it is increasingly important that a model is not only making accurate decisions, but doing so for the right reasons. Common explainability methods provide pixel attributions as an explanation for a model's decision on a single image. However, using these input-level explanations to understand patterns in model behavior is challenging for large datasets, as it requires selecting and analyzing an interesting subset of inputs. By utilizing human-generated bounding boxes that represent ground truth object locations, we introduce metrics for scoring and ranking inputs based on the correspondence between an input's ground truth object location and the explainability method's explanation region. Our methodology is agnostic to model architecture, explanation method, and modality, allowing it to be applied to many tasks and domains. By aligning model explanations with human annotations, our method surfaces patterns in model behavior when applied to two high-profile case studies: a widely used image classification model and a cancer prediction model.
Try the demo here!
In AI applications such as cancer diagnosis, autonomous driving, and facial recognition, it is crucial to not only understand model performance, but also the reasons behind model decisions. Prior work has demonstrated weaknesses in these models — even highly accurate ones — including reliance on spurious, non-salient features.
In this work, we explore model decisions using saliency methods in conjunction with the ground truth object
bounding boxes provided in
many computer vision datasets. We apply three scoring functions — explanation
coverage, ground truth coverage, and intersection over union — to sort images based on the overlap between the explanation region and the ground truth object location. By sorting images in this way, we discover insights into when and why the model was “right for the right reasons”, “wrong for the wrong reasons”, or, perhaps most interestingly, “right for the wrong reasons”. We show our methodology is applicable to various model architectures, explanation methods, and input data by evaluating on two representative tasks: image classification on ImageNet and melanoma prediction from dermoscopic images.
Image datasets from domains such as object recognition and medical imaging often include human-generated ground truth annotations, such as bounding boxes or segmentation masks, that identify the location of the object of interest.
In conjunction with ground truth annotations, our method relies on explanation methods (e.g., saliency maps) that attribute a model's decision on a single image to the input pixels most responsible for it.
Aside from explainability methods, a growing number of techniques have been developed to help users interpret model behavior.
In our method, we leverage the ground truth annotations along with instance-level explanations to compute coverage scores for each image. By sorting the images using these scores, we can query for instances that give us insight into model behavior. Our method only assumes a set of inputs with ground truth regions and explanation regions, making it agnostic to model architecture, dataset, and explanation technique.
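Note that many saliency methods return continuous pixel attributions rather than a discrete region. Below is a minimal sketch of one common way to binarize a saliency map into an explanation region; the percentile threshold is an illustrative assumption, not a prescribed part of our method.

```python
import numpy as np

def explanation_region(saliency: np.ndarray, percentile: float = 80.0) -> np.ndarray:
    """Binarize a continuous saliency map into an explanation region.

    Keeps the most salient pixels: those at or above the given percentile.
    The percentile is an illustrative choice; any scheme that produces a
    binary mask can feed the coverage scores defined below.
    """
    threshold = np.percentile(saliency, percentile)
    return saliency >= threshold  # boolean mask: True marks explanation pixels
```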
We compute three coverage metrics to allow a breadth of exploration: explanation coverage, ground truth coverage, and intersection over union (IoU). Each score takes as input a set of pixels $$GT$$ corresponding to the known ground truth region and a set of pixels $$E$$ corresponding to the explanation region.
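Concretely, writing $$|\cdot|$$ for the number of pixels in a set, the three scores (each discussed in turn below) are:

$$\text{Explanation Coverage}(GT, E) = \frac{|GT \cap E|}{|E|}$$

$$\text{Ground Truth Coverage}(GT, E) = \frac{|GT \cap E|}{|GT|}$$

$$\text{IoU}(GT, E) = \frac{|GT \cap E|}{|GT \cup E|}$$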
As shown in Figure 1, a low score under all three metrics indicates that an image’s explanation region and ground truth region are disjoint. In Figure 2, we show example scenarios using an image classification task. When a correctly classified image has a low score, it often indicates the model was relying on background information such as a snowmobile helmet to make the prediction of snowmobile or train tracks to make the prediction of electric locomotive. When an image has a low score and is incorrectly classified, it can indicate the model is focusing on a secondary object in the image or incorrectly relying on background context (e.g., using snow to predict arctic fox).
Explanation coverage represents the proportion of the explanation region covered by the ground truth region. High explanation coverage indicates the explanation region lies almost entirely within the ground truth region, meaning the model is relying on a subset of salient features to make its prediction. Filtering for correctly classified inputs with high explanation coverage can surface instances where a subset of the object, such as the dog’s face, was sufficient to make a correct prediction. Looking at incorrectly classified images with high explanation coverage can help us find instances where the model uses an insufficient portion of the object to make a prediction (e.g., using a small region of black and white spots to predict dalmatian).
Ground truth coverage represents the proportion of the ground truth region covered by the explanation region. High ground truth coverage indicates that the explanation spans the entire object, and possibly its surroundings as well. In Figure 2, we see filtering for correctly classified images with high ground truth coverage uncovers instances where the model relies on the object and relevant background pixels (e.g., the cab and the street) to make a correct prediction. Looking at incorrectly classified instances with high ground truth coverage shows examples where the model overrelies on contextual information, such as using the keyboard and person’s lap to predict laptop.
IoU is the strictest metric: it is maximized only when the explanation and ground truth regions are identical, so a high IoU score indicates the two regions are very similar. Looking at correctly classified images with high IoU scores can help identify instances where the model was right for exactly the right reasons. Incorrectly classified images with high IoU scores can surface examples where the image labels are ambiguous, such as moped and motor scooter.
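For reference, here is a minimal sketch of the three scores over binary pixel masks (boolean NumPy arrays of the same shape, assumed non-empty); this is illustrative rather than the exact implementation behind the demo.

```python
import numpy as np

def explanation_coverage(gt: np.ndarray, e: np.ndarray) -> float:
    """Proportion of the explanation region covered by the ground truth region."""
    return np.logical_and(gt, e).sum() / e.sum()

def ground_truth_coverage(gt: np.ndarray, e: np.ndarray) -> float:
    """Proportion of the ground truth region covered by the explanation region."""
    return np.logical_and(gt, e).sum() / gt.sum()

def iou(gt: np.ndarray, e: np.ndarray) -> float:
    """Intersection over union of the ground truth and explanation regions."""
    return np.logical_and(gt, e).sum() / np.logical_or(gt, e).sum()
```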
In our first case study, we apply our methodology to an image classification task using a publicly available pretrained PyTorch model. We apply the model to vehicle images from ImageNet, using the dataset's human-annotated bounding boxes as ground truth regions.
We begin our exploration by looking at instances where the model performs well. In particular, we choose to look at images labeled as Jeep that are classified correctly. To see if the explanations correspond to the ground truth regions, we look at images with high IoU scores. We see the model explanations have high agreement with the ground truth regions, suggesting its performance on these images is due to having learned salient features of Jeeps.
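Each of these queries reduces to a filter and a sort over the per-image scores. A hypothetical sketch using pandas, where the column names and values are illustrative:

```python
import pandas as pd

# Hypothetical per-image results table; names and values are illustrative.
scores = pd.DataFrame([
    {"image": "jeep_01.jpg", "label": "jeep", "prediction": "jeep", "iou": 0.81},
    {"image": "jeep_02.jpg", "label": "jeep", "prediction": "jeep", "iou": 0.07},
    {"image": "fox_01.jpg", "label": "arctic fox", "prediction": "snowmobile", "iou": 0.02},
])

# Correctly classified Jeep images, ranked from most to least aligned explanations.
correct = scores["prediction"] == scores["label"]
jeeps = scores[correct & (scores["label"] == "jeep")]
print(jeeps.sort_values("iou", ascending=False))
```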
Looking at the other end of the score distribution, we filter for correctly classified Jeep images with low IoU scores. Many of the images still have explanations focused on salient features of Jeeps such as their wheels and distinct body shape; however, we notice an example where the explanation region for the class Jeep is focused on a black dog. This may indicate that the model has memorized the existence of the black dog in the image, raising the question of whether the pixels of the dog contain adversarial properties that could cause the model to predict Jeep for any image edited to contain the dog.
Finally, we look at incorrectly classified images with low explanation coverage to determine what causes the model to fail. Images with low explanation coverage have disjoint explanation and ground truth regions. Without looking at the results, we may hypothesize the model makes the incorrect prediction and has a disjoint explanation because it is guessing at random. However, looking at the images we see the main failure mode occurs when the model predicts a secondary object in the image. Despite each image only having a single label and ground truth annotation, we see a large number of images contain multiple objects.
Our method gives insight into the pretrained PyTorch model's predictions on vehicle images from ImageNet, showing that the model uses human-interpretable explanations for some classes. Further, sorting by our metrics allows us to discover that the dataset contains images with multiple objects, which is unexpected given each image has only a single label and ground truth annotation.
In our second case study, we evaluate our method on a melanoma diagnostic task. This case study represents a real-world scenario where AI-based melanoma classification could inform both clinical use and at-home risk assessment.
In this case study, we use dermoscopic image data from the ISIC Skin Lesion Analysis Towards Melanoma Detection 2016 Challenge, which provides expert-annotated lesion segmentations that we use as ground truth regions.
We begin our exploration by analyzing correctly classified images with the highest IoU scores. These examples show the instances where the lesion segmentation and explanation region are most similar. We see there are a number of images for which the explanation is focused on the lesion, suggesting the model has learned a relationship between lesion characteristics and malignancy.
Since our model seems to be learning some salient features, we next filter to malignant lesions incorrectly classified as benign. Sorting by low ground truth coverage, we see instances where our model makes incorrect predictions relying only on peripheral skin regions. This is particularly concerning in the case of at-home risk assessment, where a cancerous lesion could be classified as benign because the model relied on the skin surrounding the lesion.
Since our model incorrectly classified malignant lesions using non-salient background information, we explore whether the model can also correctly classify lesions without looking at the lesion. We filter to correctly classified benign lesions and look for images with low explanation coverage. We find a number of images where the model relies on the presence of in-frame dermatological tools to make a benign prediction. While not salient, these tools appear only in benign images in this dataset, so their presence is sufficient to make a correct classification.
Using our methodology reveals insights into the melanoma model's behavior, showing that while the model uses salient pixels to make some decisions, it dangerously misclassifies malignant tumors based on peripheral skin regions and latches onto spurious dataset features.
In this work, we present a methodology that enables humans to understand model behavior using the alignment between ground truth object labels and saliency method explanations. Our method is agnostic to model architecture, explanation method, and image dataset, allowing it to be used in a range of applications. Using real-world case studies, we show our method allows users to identify where the model is “right for the right reasons”, when the model makes correct predictions using non-salient features, and when the dataset contains unexpected features.