Rights:
Attribution-NonCommercial-NoDerivs 3.0 Spain
Abstract:
In this thesis we study loss functions that allow training Convolutional Neural
Networks (CNNs) on noisy datasets for the particular task of Content-Based
Image Retrieval (CBIR). In particular, we propose two novel losses to fit
models that generate global image representations. First, a Soft-Matching (SM)
loss, exploiting both image content and metadata, is used to specialize general
CNNs to particular cities or regions using weakly annotated datasets. Second,
a Bag Exponential (BE) loss inspired by the Multiple Instance Learning (MIL)
framework is employed to train CNNs for CBIR on noisy datasets.
The first part of the thesis introduces a novel training framework that, relying
on image content and metadata, learns location-adapted deep models that
provide fine-tuned image descriptors for specific visual content. Our networks,
which start from a baseline model originally learned for a different task, are specialized
using a custom pairwise loss function, our proposed SM loss, that uses
weak labels based on image content and metadata.
The experimental results show that the proposed location-adapted CNNs
achieve an improvement of up to 55% over the baseline networks on a landmark
discovery task. This implies that the models successfully learn the visual
clues and peculiarities of the region for which they are trained, and generate
image descriptors that are better location-adapted. In addition, for those landmarks
that are not present in the training set, or that even belong to other cities, our proposed
models perform at least as well as the baseline network, which indicates a good
resilience against overfitting.
The second part of the thesis introduces the BE Loss function to train CNNs
for image retrieval borrowing inspiration from the MIL framework. The loss
combines the use of an exponential function acting as a soft margin, and a MIL-based
mechanism working with bags of positive and negative pairs of images.
The method allows training deep retrieval networks on noisy datasets, by
weighting the influence of the different samples at the loss level, which increases the
performance of the generated global descriptors. The rationale behind the improvement
is that we are handling noise in an end-to-end manner and, therefore,
avoiding its negative influence as well as the unintentional biases due to fixed
pre-processing cleaning procedures. In addition, our method is general enough
to suit other scenarios requiring different weights for the training instances (e.g.
boosting the influence of hard positives during training). The proposed bag exponential
function can be seen as a back door to guide the learning process
according to a certain objective in an end-to-end manner, allowing the model to
approach such an objective smoothly and progressively.
Our results show that our loss allows CNN-based retrieval systems to be
trained with noisy training sets and achieve state-of-the-art performance. Furthermore,
we have found that it is better to use training sets that are highly
correlated with the final task, even if they are noisy, than training with a clean
set that is only weakly related to the topic at hand. From our point of view,
this result represents a big leap in the applicability of retrieval systems and helps
to reduce the effort needed to set up new CBIR applications: e.g., by allowing
a fast automatic generation of noisy training datasets and then using our bag
exponential loss to deal with noise. Moreover, we also consider that this result
opens a new line of research for CNN-based image retrieval: let the models decide
not only on the best features to solve the task but also on the most relevant
samples to do it.