:::
Contemporary deep learning-based object and action localization algorithms depend on large-scale annotated data. In real-world scenarios, however, there is an effectively unlimited amount of unlabeled data beyond the categories of publicly available datasets; annotating all of it is time- and labor-intensive, and training detectors on it requires substantial computational resources. To address these issues, we present a simple and reliable baseline that is easy to obtain and works directly for zero-shot text-guided object and action localization without additional training cost. The baseline applies Grad-CAM, the widely used class visual saliency map generator, to the recently released Contrastive Language-Image Pre-Training (CLIP) model by OpenAI, which is trained contrastively on a dataset of 400 million image-sentence pairs with rich cross-modal information between text semantics and image appearances. Extensive experiments on the Open Images and HICO-DET datasets demonstrate the effectiveness of the proposed approach for text-guided localization of unseen objects and actions in images.
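The core Grad-CAM computation referred to above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the feature-map activations and the gradients of an image-text similarity score with respect to those activations have already been extracted from the image encoder, and the thresholded bounding-box step (`box_from_map`, `threshold`) is a hypothetical post-processing choice added for illustration.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM saliency map from feature maps and their gradients.

    activations, gradients: arrays of shape (K, H, W), where gradients are
    d(score)/d(activations) for some scalar score (assumed here to be an
    image-text similarity).
    """
    # Channel weights: global-average-pool the gradients over spatial positions.
    weights = gradients.mean(axis=(1, 2))  # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # (H, W)
    # Normalize to [0, 1]; guard against an all-zero map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

def box_from_map(cam, threshold=0.5):
    """Hypothetical helper: tight box (x_min, y_min, x_max, y_max) around
    all cells whose saliency is at least `threshold`."""
    ys, xs = np.where(cam >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

Because the map is computed from one forward and one backward pass through a frozen model, no detector training is involved, which is consistent with the zero additional training cost claimed above.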
Item concept modeling is commonly achieved by leveraging textual information. However, many existing models do not leverage the inferential property of concepts to capture word meanings, which therefore ignores the relatedness between correlated concepts, a phenomenon which we term conceptual “correlation sparsity.” In this paper, we distinguish between word modeling and concept modeling and propose an item concept modeling framework centering around the item concept network (ICN). ICN models and further enriches item concepts by leveraging the inferential property of concepts and thus addresses the correlation sparsity issue. Specifically, there are two stages in the proposed framework: ICN construction and embedding learning. In the first stage, we propose a generalized network construction method to build ICN, a structured network which infers expanded concepts for items via matrix operations. The second stage leverages neighborhood proximity to learn item and concept embeddings. With the proposed ICN, the resulting embedding facilitates both homogeneous and heterogeneous tasks, such as item-to-item and concept-to-item retrieval, and delivers related results which are more diverse than traditional keyword-matching-based approaches. As our experiments on two real-world datasets show, the framework encodes useful conceptual information and thus outperforms traditional methods in various item classification and retrieval tasks.
Predicting extreme weather events such as tropical and extratropical cyclones is of vital scientific and societal importance. Of late, machine learning methods have found their way to weather analysis and prediction, but mostly, these methods use machine learning merely as a complement to traditional numerical weather prediction models. Although some pure machine learning and data-driven approaches for weather prediction have been developed, they mainly formulate the problem similar to pattern recognition or follow the train of thought of traditional time-series models for extreme weather event forecasting; for the former, this usually yields only single-step ahead prediction, and for the latter, this lacks the flexibility to account for observed weather features as such methods concern only the patterns of the extreme weather occurrences. In this paper, we depart from the typical practice of pattern recognition and time-series approaches and focus on employing machine learning to estimate the probabilities of extreme weather occurrences in a multi-step-ahead (MSA) fashion given information on both weather features and the realized occurrences of extreme weather. Specifically, we propose a Markov conditional forward (MCF) model that adopts the Markov property between the occurrences of extreme weather for MSA extreme weather forecasting. Moreover, for better long-term prediction, we propose three novel cube perturbation methods to address error accumulation in our model. Experimental results on a real-world extreme weather dataset show the superiority of the proposed MCF model in terms of prediction accuracy for both short-term and long-term forecasting; moreover, the three cube perturbation methods successfully increase the fault tolerance and generalization ability of the MCF model, yielding significant improvements for long-term prediction.
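To illustrate how a Markov property between occurrences supports multi-step-ahead probabilities, the following sketch propagates an occurrence probability forward one step at a time. It is a generic first-order Markov marginalization written under stated assumptions, not the MCF model itself: `cond_probs` stands in for the output of some hypothetical classifier that estimates the event probability given the previous state and the weather features.

```python
import numpy as np

def forward_probabilities(p0, cond_probs):
    """Multi-step-ahead occurrence probabilities under a first-order Markov assumption.

    p0: probability that an event occurred at the last observed step
        (0.0 or 1.0 when the occurrence has been realized).
    cond_probs: array of shape (T, 2); cond_probs[t, s] is an assumed estimate of
        P(event at step t+1 | previous state s, weather features), s in {0, 1}.
    Returns an array of T marginal event probabilities.
    """
    probs = []
    p = float(p0)
    for p_given_no_event, p_given_event in np.asarray(cond_probs, dtype=float):
        # Marginalize over the unknown previous state.
        p = (1.0 - p) * p_given_no_event + p * p_given_event
        probs.append(p)
    return np.array(probs)
```

Each step reuses the previous step's estimate, so any error in an early probability carries into all later ones; this is the error accumulation that the cube perturbation methods are described as addressing.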
Textual data is common and informative auxiliary information for recommender systems. Most prior art utilizes text for rating prediction, but little work connects it to top-N recommendation. Moreover, although advanced recommendation models capable of incorporating auxiliary information have been developed, none of these are specifically designed to model textual information, yielding a limited usage scenario for typical user-to-item recommendation. In this work, we present a framework of text-aware preference ranking (TPR) for top-N recommendation, in which we comprehensively model the joint association of user-item interaction and relations between items and associated text. Using the TPR framework, we construct a joint likelihood function that explicitly describes two ranking structures: 1) item preference ranking (IPR) and 2) word relatedness ranking (WRR), where the former captures the item preference of each user and the latter captures the word relatedness of each item. As these two explicit structures are by nature mutually dependent, we propose TPR-OPT, a simple yet effective learning criterion that additionally includes implicit structures, such as relatedness between items and relatedness between words for each user, for model optimization. Such a design not only describes the joint association among users, words, and text comprehensively but also naturally yields powerful representations that are suitable for a range of recommendation tasks, including user-to-item, item-to-item, and user-to-word recommendation, as well as item-to-word reconstruction. Extensive experiments on eight recommendation datasets demonstrate that by including textual information from item descriptions, the proposed TPR model consistently outperforms state-of-the-art baselines on various recommendation tasks.
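A pairwise ranking term of the kind that IPR and WRR suggest can be sketched as follows. This is an assumption-laden illustration in the style of Bayesian personalized ranking, not the TPR likelihood, whose exact form is not given here; the dot-product scoring and the names `anchor`, `positive`, and `negative` are hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(anchor, positive, negative):
    """Negative log-likelihood that `positive` ranks above `negative` for `anchor`.

    All arguments are embedding vectors of equal length. For an item preference
    term, `anchor` would be a user and the others items; for a word relatedness
    term, `anchor` would be an item and the others words (both are assumptions).
    """
    diff = anchor @ positive - anchor @ negative
    # -log(sigmoid(diff)) computed stably as log(1 + exp(-diff)).
    return np.logaddexp(0.0, -diff)
```

Summing such terms over user-item and item-word triples gives one joint objective over a shared embedding space, which is the general idea behind modeling both ranking structures together.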
In this paper, we propose a novel optimization criterion that leverages features of the skew normal distribution to better model the problem of personalized recommendation. Specifically, the developed criterion borrows the concept and the flexibility of the skew normal distribution, based on which three hyperparameters are attached to the optimization criterion. Furthermore, from a theoretical point of view, we not only establish the relation between the maximization of the proposed criterion and the shape parameter in the skew normal distribution, but also provide the analogies and asymptotic analysis relating the proposed criterion to maximization of the area under the ROC curve. Experimental results on a range of large-scale real-world datasets show that our model significantly outperforms the state of the art and consistently yields the best performance on all tested datasets.
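For reference, the standard skew normal density with location $\xi$, scale $\omega > 0$, and shape $\alpha$ is given below, where $\phi$ and $\Phi$ are the standard normal density and distribution function. Whether the three hyperparameters mentioned above correspond exactly to $\xi$, $\omega$, and $\alpha$ is an assumption made here for illustration.

```latex
f(x;\,\xi,\omega,\alpha)
  = \frac{2}{\omega}\,
    \phi\!\left(\frac{x-\xi}{\omega}\right)
    \Phi\!\left(\alpha\,\frac{x-\xi}{\omega}\right),
\qquad
\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2},
\quad
\Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt .
```

Setting $\alpha = 0$ recovers the ordinary normal density, since $\Phi(0) = 1/2$; positive $\alpha$ skews the mass to the right and negative $\alpha$ to the left.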
In recent years, the research community has approached the problem of vehicle re-identification (re-id) with attention-based models, specifically focusing on regions of a vehicle containing discriminative information. These re-id methods rely on expensive key-point labels, part annotations, and additional attributes including vehicle make, model, and color. Given the large number of vehicle re-id datasets with various levels of annotations, strongly supervised methods are unable to scale across different domains. In this paper, we present Self-supervised Attention for Vehicle Re-identification (SAVER), a novel approach to effectively learn vehicle-specific discriminative features. Through extensive experimentation, we show that SAVER improves upon the state of the art on the challenging VeRi, VehicleID, Vehicle-1M, and VERI-Wild datasets.
Recent advances in deep convolutional neural networks (DCNNs) and generative adversarial networks (GANs) have significantly improved the performance of single image blind deblurring algorithms. However, most of the existing algorithms require paired training data. In this paper, we present an unsupervised method for single-image deblurring without paired training images. We introduce a disentangled framework to split the content and blur features of a blurred image, which yields improved deblurring performance. To handle the unpaired training data, a blurring branch and the cycle-consistency loss are added to guarantee that the content structures of the restored results match the original images. We also add a perceptual loss to further mitigate the artifacts. For natural image deblurring, we introduce a color loss to reduce color distortions in outputs. Extensive experiments on both domain-specific and natural image deblurring show the proposed method achieves competitive results compared to recent state-of-the-art deblurring approaches.

To exploit rich information from unlabeled data, in this work, we propose a novel self-supervised framework for visual tracking which can easily adapt the state-of-the-art supervised Siamese-based trackers into unsupervised ones by utilizing the fact that an image and any cropped region of it can form a natural pair for self-training. Besides common geometric transformation-based data augmentation and hard negative mining, we also propose adversarial masking, which helps the tracker learn other context information by adaptively blacking out salient regions of the target. The proposed approach can be trained offline using images only, without any requirement of manual annotations or temporal information from multiple consecutive frames. Thus, it can be used with any kind of unlabeled data, including images and video frames. For evaluation, we take SiamFC as the base tracker and name the proposed self-supervised method S²SiamFC. Extensive experiments and ablation studies on the challenging VOT2016 and VOT2018 datasets demonstrate the effectiveness of the proposed method, which achieves comparable performance to its supervised counterpart and to other unsupervised methods requiring multiple frames.
In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are in waveform format, the waveform-mapping-based SE methods can overcome the distortion caused by imperfect phase estimation, which may be encountered in spectral-mapping-based SE systems. So far, most waveform-mapping-based SE methods have focused on single-channel tasks. In this paper, we propose a novel fully convolutional network (FCN) with Sinc and dilated convolutional layers (termed SDFCN) for multichannel SE that operates in the time domain. We also propose an extended version of SDFCN, called the residual SDFCN (termed rSDFCN). The proposed methods are evaluated on three multichannel SE tasks, namely the dual-channel inner-ear microphones SE task, the distributed microphones SE task, and the CHiME-3 dataset. The experimental results confirm the outstanding denoising capability of the proposed SE systems on all three tasks and the benefits of using the residual architecture on the overall SE performance.
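A Sinc convolutional layer constrains each filter to be a band-pass kernel defined by two cutoff frequencies. The sketch below builds one such kernel as the difference of two windowed ideal low-pass filters. It is a minimal illustration rather than the SDFCN layer: the cutoffs are fixed instead of learned, and the kernel size, Hamming window, and 16 kHz sample rate are assumptions.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=257, sample_rate=16000):
    """Windowed-sinc band-pass kernel passing roughly f_low..f_high (in Hz)."""
    # Sample indices centered on zero (kernel_size should be odd).
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs as fractions of the sampling rate (cycles per sample).
    f1, f2 = f_low / sample_rate, f_high / sample_rate
    # Ideal low-pass with cutoff f is 2f * sinc(2f * n), where
    # np.sinc(x) = sin(pi x) / (pi x); band-pass is the difference of two.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Hamming window to reduce ripple from truncating the infinite sinc.
    return kernel * np.hamming(kernel_size)
```

Applying the kernel to one microphone channel is a plain 1-D convolution, for example `np.convolve(signal, sinc_bandpass(300, 3400), mode="same")`. In a learned layer only the two cutoffs per filter would be trainable, which keeps the parameter count far below that of an unconstrained convolution.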