A paper entitled Robust Video-Text Retrieval via Noisy Pair Calibration (DOI: 10.1109/TMM.2023.3239183) was recently published in IEEE Transactions on Multimedia, a first-class international journal in the field of multimedia, by the team of Research Fellow Zhang Huaiwen from the National and Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian (Inner Mongolia Key Laboratory of Mongolian Information Processing Technology), College of Computer Science (College of Software), IMU, together with the team of Research Fellow Xu Changsheng from the Institute of Automation, Chinese Academy of Sciences.
With the popularity of mobile devices and the ever-expanding scale of video data, video-text retrieval is becoming increasingly important. The mainstream approach is to project video and text samples into a common representation space in which semantically similar samples lie close to each other. However, existing methods can be affected by two kinds of noise when constructing this common representation space: first, the video-text positive pairs may not be precisely matched, since many current datasets are annotated via crowdsourcing and the participation of non-professional annotators inevitably introduces label noise; second, the negative pairs randomly sampled for learning video-text representations may contain samples that are semantically similar to the query and are thus mistakenly treated as negatives.
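As background, the sketch below (not taken from the paper; the module name, feature dimensions, and encoder layout are illustrative assumptions) shows the typical recipe this describes: project video and text features into a shared embedding space and compare them by cosine similarity, with the batch's diagonal pairs serving as positives and the remaining pairs as randomly formed negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceEncoder(nn.Module):
    """Illustrative projection of video and text features into a shared space."""

    def __init__(self, video_dim=1024, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)   # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)     # text branch

    def forward(self, video_feats, text_feats):
        # L2-normalize so that dot products equal cosine similarities
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

# Pairwise similarity matrix for a batch: diagonal entries are the
# (possibly mislabeled) positive pairs; off-diagonal entries are the
# randomly formed negatives, some of which may actually be semantically similar.
encoder = CommonSpaceEncoder()
video = torch.randn(8, 1024)   # e.g. pooled clip features
text = torch.randn(8, 768)     # e.g. sentence embeddings
v, t = encoder(video, text)
similarity = v @ t.T           # shape: (batch, batch)
```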

Fig. 1: Two kinds of noisy pairs in video-text retrieval
To alleviate the negative impact of such noisy data on training, the paper proposes a new robust video-text retrieval method.

Fig. 2: Framework of robust video-text retrieval
An uncertainty estimation module is first designed to identify noisy data by their uncertainty levels; a contrastive loss built on a triplet loss with adaptive margins and uncertainty-based weighting is then proposed. The two kinds of noisy pairs are calibrated according to their estimated uncertainty to alleviate the negative effects they produce.
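The paper's exact loss is not reproduced here; the following is a minimal, hypothetical sketch of the general idea of an adaptive-margin, uncertainty-weighted ranking loss. The function name, the `base_margin` parameter, and the specific way uncertainty scales the margin and the per-anchor weight are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_triplet_loss(sim, uncertainty, base_margin=0.2):
    """Hypothetical sketch: a triplet-style loss whose margin and weight shrink
    for positive pairs estimated to be noisy (high uncertainty).

    sim:         (B, B) video-text similarity matrix; diagonal = labeled positives.
    uncertainty: (B,) estimated uncertainty of each positive pair, in [0, 1].
    """
    batch = sim.size(0)
    pos = sim.diag().unsqueeze(1)                  # similarity of labeled positives
    mask = ~torch.eye(batch, dtype=torch.bool)     # off-diagonal negatives

    # Adaptive margin: uncertain (likely mislabeled) positives get a smaller
    # margin, so they are pushed less aggressively away from the negatives.
    margin = base_margin * (1.0 - uncertainty).unsqueeze(1)

    # Hinge terms in both retrieval directions (text-to-video, video-to-text).
    t2v = F.relu(margin + sim - pos)
    v2t = F.relu(margin + sim.t() - pos)
    loss_per_anchor = (t2v[mask].view(batch, -1).mean(1)
                       + v2t[mask].view(batch, -1).mean(1))

    # Down-weight anchors whose positive pair is probably noisy.
    weights = 1.0 - uncertainty
    return (weights * loss_per_anchor).mean()
```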

Table 1: Performance comparison of video-text retrieval methods under different noise ratios
To verify the effectiveness of the proposed method, the paper presents extensive experiments on widely used video-text retrieval datasets. The experimental results in Table 1 indicate that the proposed method effectively alleviates the negative effects of noisy data and improves video-text retrieval performance.

Fig. 3: (a)-(d): Visualization of noisy-data identification; (e): Retrieval results
The paper presents the distribution of noisy data in the datasets (a), the distribution of noisy data identified by the proposed method (b), a comparison of the actual and predicted uncertainties of clean data (c), and a comparison of the actual and predicted uncertainties of noisy data (d). It can be seen that the proposed method distinguishes the noise in the training data with considerable accuracy, thereby ensuring the model's performance (Fig. 3(e)).
IEEE Transactions on Multimedia is a first-class international journal in multimedia technology and applications, listed as a JCR Q1 (TOP) journal with an impact factor of 8.182. Research Fellow Zhang Huaiwen from the College of Computer Science (College of Software) of IMU is the first author, and Yang Yang, a doctoral candidate enrolled in the College of Computer Science (College of Software) in 2021, is the second author of the paper. The project is sponsored by the Steed Plan of IMU.
URL for the paper: https://ieeexplore.ieee.org/document/10024790