Multimodal Neurons in Artificial Neural Networks
Multimodal neurons were first discovered in the human brain. Rather than responding to any specific visual feature, these neurons respond to clusters of abstract concepts centered on a common high-level theme. The "Halle Berry" neuron, which was covered in both Scientific American and The New York Times, responds to photographs, drawings, and the text "Halle Berry" (but not to other names).
CLIP is a general-purpose vision system developed by OpenAI that matches the performance of a ResNet-50 but outperforms existing vision systems on some of the most difficult datasets. Each of these challenge datasets (ObjectNet, ImageNet Rendition, and ImageNet Sketch) stress-tests the model by requiring it to recognize not only simple distortions or changes in lighting or pose, but also complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.
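To make this kind of usage concrete, here is a minimal sketch of zero-shot classification with the openai/CLIP package: an image is scored against natural-language class descriptions with no task-specific training. The image path and the label prompts are placeholders, not part of the original study.

```python
import clip
import torch
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path and candidate labels.
image = preprocess(Image.open("example.png")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a sketch of a dog", "a statue of a dog"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Because the labels are free-form text, the same model can be pointed at sketches, renditions, or statues simply by describing them, which is what makes the challenge datasets above tractable without retraining.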
Multimodal Neurons in CLIP
The study builds on nearly a decade of research into interpreting convolutional networks, and it began with the observation that many of these classical techniques apply directly to CLIP. To better understand the model's activations, we use two tools: feature visualization, which maximizes a neuron's firing through gradient-based optimization of the input, and dataset examples, which examines the distribution of maximally activating images for a neuron across a dataset.
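As a rough illustration of the first tool, below is a minimal sketch of gradient-based feature visualization in PyTorch. The `model`, `layer`, and `channel` arguments are assumptions for illustration; production tooling (e.g., the lucid/lucent libraries) adds input transformations and regularizers that this sketch omits.

```python
import torch

def visualize_neuron(model, layer, channel, steps=256, lr=0.05):
    """Gradient-ascent feature visualization: optimize an input image
    so that it maximizes the mean activation of one channel."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # only the image is optimized

    # Start from random noise and optimize the pixels directly.
    img = torch.rand(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)

    acts = {}
    hook = layer.register_forward_hook(
        lambda mod, inp, out: acts.update(out=out))

    for _ in range(steps):
        opt.zero_grad()
        model(img.clamp(0.0, 1.0))              # hook captures the activation
        loss = -acts["out"][0, channel].mean()  # negate to ascend the gradient
        loss.backward()
        opt.step()

    hook.remove()
    return img.detach().clamp(0.0, 1.0)
```

For example, `visualize_neuron(model, model.layer4[1].conv2, 42)` would target one (arbitrarily chosen) channel of a torchvision ResNet. Dataset examples, the second tool, are complementary: instead of synthesizing an input, one ranks real images by that same activation.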
Bias and Overgeneralization
Despite being trained on a curated fraction of the internet, the model still inherits many unchecked biases and associations. While many of the associations we found appear benign, we discovered numerous cases where CLIP holds associations that could cause representational harm, such as the denigration of certain individuals or groups.
These associations pose clear obstacles to deploying such powerful visual systems. Whether the model is fine-tuned or used zero-shot, these biases and associations are likely to remain in the system, with their effects surfacing in both visible and nearly undetectable ways during deployment. Because many biased behaviors are difficult to anticipate a priori, measuring and correcting them can be hard. We believe that by identifying some of these associations and ambiguities ahead of time, our interpretability tools can help practitioners anticipate problems before they arise.
Our understanding of CLIP is still evolving, and we are still deciding whether and how large versions of CLIP should be released. We believe that continued community study of the available versions, along with the tools we are presenting today, will contribute to a better understanding of multimodal systems and help us make better decisions.