Contemporary deep learning-based object and action localization algorithms depend on large-scale annotated data. In real-world scenarios, however, there is a virtually unlimited amount of unlabeled data beyond the categories of publicly available datasets, so annotating all of it is time- and labor-intensive, and training detectors on it requires substantial computational resources. To address these issues, we present a simple and reliable baseline that is easy to obtain and works directly for zero-shot text-guided object and action localization without additional training cost. The baseline uses Grad-CAM, a widely used class visual saliency map generator, together with the recently released Contrastive Language-Image Pre-Training (CLIP) model by OpenAI, which is trained contrastively on a dataset of 400 million image-sentence pairs with rich cross-modal information between text semantics and image appearances. Extensive experiments on the Open Images and HICO-DET datasets demonstrate the effectiveness of the proposed approach for text-guided localization of unseen objects and actions in images.
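As a rough illustration of the saliency step named above, the following minimal sketch computes a Grad-CAM map from precomputed feature maps and gradients. It is not the authors' pipeline: the toy numbers stand in for real activations, and in a text-guided setting the differentiated score would presumably be an image-text similarity, which is an assumption here. The function name `grad_cam` is hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM on K channel maps of size H x W (nested lists).

    activations[k][i][j]: feature map of channel k.
    gradients[k][i][j]: gradient of the target score w.r.t. that feature map.
    How the target score is obtained (e.g. from an image-text similarity)
    is outside this sketch and is an assumption.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weight = spatial average of the gradient for that channel.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of the channel maps.
    cam = [[0.0] * width for _ in range(height)]
    for weight, channel in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * channel[i][j]

    # ReLU keeps only regions that raise the target score.
    cam = [[max(0.0, value) for value in row] for row in cam]

    # Scale to [0, 1] when the map is not all zeros.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input: channel 0 supports the score, channel 1 opposes it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
saliency = grad_cam(toy_activations, toy_gradients)
```

In this toy case only the top-left cell stays active, because the channel with negative gradients is suppressed by the ReLU.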
Vehicle positioning is a key component of autonomous driving. The global positioning system (GPS) is currently the most commonly used vehicle positioning system. However, its accuracy varies with the environment, and it therefore fails to meet the requirement of meter-level accuracy. We consider a coordinate neighboring vehicle positioning system (CNVPS) that builds on GPS, omnidirectional radar, and V2V communication to obtain additional information from neighboring vehicles and thereby improve the GPS positioning accuracy of vehicles in various environments. We further apply transfer learning (TL), designing an adversarial mechanism that eliminates the deviation across environments so that a single model can optimize positioning accuracy in multiple environments. Simulation results show that, compared with existing methods, the proposed system architecture not only improves performance but also effectively reduces the amount of data required for training.
Device-free wireless indoor localization is an essential technology for the Internet of Things (IoT), and fingerprint-based methods are widely used. A common challenge for fingerprint-based methods is data collection and labeling. This paper proposes a few-shot transfer learning system that uses only a small amount of labeled data from the current environment and reuses a large amount of existing labeled data previously collected in other environments, thereby significantly reducing the data collection and labeling cost for localization in each new environment. The core method is graph neural network (GNN) based few-shot transfer learning and its modifications. Experiments conducted in real-world environments show that the proposed system achieves performance comparable to that of a convolutional neural network (CNN) model with 40 times less labeled data.
Millimeter wave (mmWave) is a key technology for fifth-generation (5G) and beyond communications. Hybrid beamforming has been proposed for large-scale antenna systems in mmWave communications. Existing hybrid beamforming designs based on infinite-resolution phase shifters (PSs) are impractical due to hardware cost and power consumption. In this paper, we propose an unsupervised-learning-based scheme to jointly design the analog precoder and combiner with low-resolution PSs for multiuser multiple-input multiple-output (MU-MIMO) systems. We transform the analog precoder and combiner design problem into a phase classification problem and propose a generic neural network architecture, termed the phase classification network (PCNet), capable of producing solutions of various PS resolutions. Simulation results demonstrate the superior sum-rate and complexity performance of the proposed scheme, as compared to state-of-the-art hybrid beamforming designs for the most commonly used low-resolution PS configurations.
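To make the idea of treating phase selection as classification concrete, the sketch below enumerates the discrete phases available to a B-bit phase shifter and maps a class index to a constant-modulus weight. It does not reproduce PCNet; the function names and the 1/sqrt(N) normalization are assumptions chosen for illustration.

```python
import cmath
import math


def phase_codebook(bits):
    """All phases a `bits`-bit phase shifter can realize, in radians."""
    levels = 2 ** bits
    return [2 * math.pi * k / levels for k in range(levels)]


def class_to_weight(index, bits, num_antennas):
    """Map a class index to a constant-modulus analog weight.

    The 1/sqrt(num_antennas) scaling is a common convention and is
    assumed here, not taken from the abstract.
    """
    phase = phase_codebook(bits)[index]
    return cmath.exp(1j * phase) / math.sqrt(num_antennas)


def nearest_class(phase, bits):
    """Index of the codebook phase closest to a continuous phase."""
    levels = 2 ** bits
    step = 2 * math.pi / levels
    # Wrap into [0, 2*pi), round to the nearest level, wrap the top level to 0.
    return round((phase % (2 * math.pi)) / step) % levels


two_bit_phases = phase_codebook(2)          # four phases: 0, pi/2, pi, 3*pi/2
label = nearest_class(math.pi / 2 + 0.1, 2)  # closest level is pi/2
weight = class_to_weight(label, 2, 16)
```

With this framing, a network with a 2^B-way output per phase shifter could serve different resolutions by changing only the number of classes.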
Item concept modeling is commonly achieved by leveraging textual information. However, many existing models do not leverage the inferential property of concepts to capture word meanings and therefore ignore the relatedness between correlated concepts, a phenomenon that we term conceptual “correlation sparsity.” In this paper, we distinguish between word modeling and concept modeling and propose an item concept modeling framework centered on the item concept network (ICN). ICN models and further enriches item concepts by leveraging the inferential property of concepts and thus addresses the correlation sparsity issue. Specifically, the proposed framework has two stages: ICN construction and embedding learning. In the first stage, we propose a generalized network construction method to build ICN, a structured network that infers expanded concepts for items via matrix operations. The second stage leverages neighborhood proximity to learn item and concept embeddings. With the proposed ICN, the resulting embeddings facilitate both homogeneous and heterogeneous tasks, such as item-to-item and concept-to-item retrieval, and deliver related results that are more diverse than those of traditional keyword-matching-based approaches. As our experiments on two real-world datasets show, the framework encodes useful conceptual information and thus outperforms traditional methods in various item classification and retrieval tasks.
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers; it consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what such methods avoid. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
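The notion of depending on lags rather than absolute positions can be illustrated with a standard trigonometric identity: the dot product of cosine/sine position features equals a sum of cos(w(i - j)) terms, so it depends only on the lag i - j. The sketch below checks this numerically. It is not the Stochastic Positional Encoding construction itself, and the function names and frequencies are hypothetical.

```python
import math


def position_features(position, frequencies):
    """Concatenate [cos(w * p), sin(w * p)] for each frequency w."""
    features = []
    for w in frequencies:
        features.extend([math.cos(w * position), math.sin(w * position)])
    return features


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


frequencies = [1.0, 0.1, 0.01]  # arbitrary values chosen for the demo

# Both pairs have the same lag (4) but different absolute positions.
score_near = dot(position_features(3, frequencies), position_features(7, frequencies))
score_far = dot(position_features(10, frequencies), position_features(14, frequencies))
lag_only = sum(math.cos(w * 4) for w in frequencies)
```

Because the score factorizes into per-position features, it can be formed without first materializing the full attention matrix, which is the property linear-complexity variants rely on.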
Predicting extreme weather events such as tropical and extratropical cyclones is of vital scientific and societal importance. Machine learning methods have recently found their way into weather analysis and prediction, but most of them use machine learning merely as a complement to traditional numerical weather prediction models. Although some pure machine learning and data-driven approaches to weather prediction have been developed, they mainly either formulate the problem similarly to pattern recognition or follow the reasoning of traditional time-series models for extreme weather event forecasting. The former usually yields only single-step-ahead prediction, and the latter lacks the flexibility to account for observed weather features, since such methods consider only the patterns of extreme weather occurrences. In this paper, we depart from the typical practice of pattern recognition and time-series approaches and focus on employing machine learning to estimate the probabilities of extreme weather occurrences in a multi-step-ahead (MSA) fashion, given information on both weather features and the realized occurrences of extreme weather. Specifically, we propose a Markov conditional forward (MCF) model that adopts the Markov property between the occurrences of extreme weather for MSA extreme weather forecasting. Moreover, for better long-term prediction, we propose three novel cube perturbation methods to address error accumulation in our model. Experimental results on a real-world extreme weather dataset show the superiority of the proposed MCF model in terms of prediction accuracy for both short-term and long-term forecasting; moreover, the three cube perturbation methods successfully increase the fault tolerance and generalization ability of the MCF model, yielding significant improvements for long-term prediction.
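One way to picture multi-step-ahead forecasting under a Markov assumption is the generic forward recursion below, which chains per-step conditional probabilities of an event given whether an event occurred at the previous step. This is a textbook two-state recursion, not the MCF model; in practice the conditional probabilities would come from a learned model fed with weather features, and the numbers here are made up.

```python
def forward_probabilities(initial_prob, transitions):
    """Propagate the event probability over several future steps.

    initial_prob: probability that an event is occurring now.
    transitions: one pair per future step,
        (P(event | no event at previous step), P(event | event at previous step)).
    Returns the marginal event probability at each future step.
    """
    probabilities = []
    prob = initial_prob
    for prob_after_quiet, prob_after_event in transitions:
        # Law of total probability over the previous step's state.
        prob = (1.0 - prob) * prob_after_quiet + prob * prob_after_event
        probabilities.append(prob)
    return probabilities


# An event is observed now; the same hypothetical conditionals apply at each step.
forecast = forward_probabilities(1.0, [(0.1, 0.8), (0.1, 0.8)])
```

The recursion also shows why errors can accumulate over long horizons: each step's estimate feeds into the next one.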
Device-free wireless indoor localization is a key enabling technology for the Internet of Things (IoT). Fingerprint-based indoor localization techniques are a commonly used solution. This paper proposes a semi-supervised, generative adversarial network (GAN)-based device-free fingerprinting indoor localization system. The proposed system uses a small amount of labeled data and a large amount of unlabeled data (i.e., it is semi-supervised), thus considerably reducing the expensive data labeling effort. Experimental results show that, compared to the state-of-the-art supervised scheme, the proposed semi-supervised system achieves comparable performance when both are given an equal, sufficient amount of labeled data, and significantly superior performance when both are given an equal, highly limited amount of labeled data. In addition, the proposed semi-supervised system retains its performance over a broad range of labeled data amounts. The interactions between the generator, discriminator, and classifier models of the proposed GAN-based system are visually examined and discussed. A mathematical description of the proposed system is also presented.
Current peripheral execution approaches for intermittently powered systems either require full access to the internal hardware state for checkpointing or rely on application-level energy estimation for task partitioning to make correct forward progress. Both requirements present significant practical challenges for energy-harvesting, intelligent edge IoT devices that perform hardware-accelerated DNN inference. Sophisticated compute peripherals may have inaccessible internal state, and the complexity of DNN models makes it difficult for programmers to partition the application into suitably sized tasks that fit within an estimated energy budget. This paper presents the concept of inference footprinting for intermittent DNN inference, where accelerator progress is cumulatively preserved across power cycles. Our middleware stack, HAWAII, tracks and restores inference footprints efficiently and transparently to make inference forward progress, without requiring access to the accelerator's internal state or application-level energy estimation. Evaluations were carried out on a Texas Instruments device under varied energy budgets and network workloads. Compared to a variety of task-based intermittent approaches, HAWAII improves the inference throughput by 5.7% to 95.7%, achieving particularly high gains on heavily accelerated DNNs.
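The idea of preserving progress across power cycles can be pictured with the toy simulation below, in which a counter kept in simulated nonvolatile memory records how many accelerator jobs have completed, so that execution resumes from that point after each power failure. This is only a conceptual sketch: it is not HAWAII's implementation, and the function name, the job model, and the failure schedule are all hypothetical.

```python
def run_with_footprints(num_jobs, failure_points):
    """Simulate an inference split into jobs under intermittent power.

    failure_points: job indices at which power is lost once, before the
    job's result and footprint are preserved (a made-up failure model).
    Returns (outputs, number of power cycles used).
    """
    # Simulated nonvolatile memory: survives every power failure.
    nonvolatile = {"footprint": 0, "outputs": []}
    pending_failures = set(failure_points)
    power_cycles = 0

    while nonvolatile["footprint"] < num_jobs:
        power_cycles += 1
        # After power-up, resume from the last preserved footprint.
        job = nonvolatile["footprint"]
        while job < num_jobs:
            if job in pending_failures:
                pending_failures.discard(job)
                break  # power lost; the unfinished job leaves no trace
            result = job * job  # stand-in for one accelerator job
            # A real system would have to commit these two updates atomically.
            nonvolatile["outputs"].append(result)
            nonvolatile["footprint"] = job + 1
            job += 1

    return nonvolatile["outputs"], power_cycles


outputs, cycles = run_with_footprints(5, {2, 4})
```

In this model only the interrupted job is repeated after a failure, and no knowledge of the accelerator's internal state or of the energy budget is needed to resume.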
Sim-to-real, in which a model is trained in a simulator and then transferred to the real world, is a technique that enables faster deep reinforcement learning (DRL) training. However, differences between the simulator and the real world often cause the model to perform poorly in the real world. Domain randomization bridges the sim-to-real gap by exposing the model to a wide range of scenarios so that it can generalize to real-world situations. However, using domain randomization to train an autonomous car racing model with DRL can lead to undesirable outcomes: a model trained with randomization tends to run slower, so a higher completion rate on the testing track comes at the expense of longer lap times. This paper aims to boost the robustness of a trained race car model without compromising lap times. For a training track and a testing track that have the same shape (and the same optimal paths) but differ in lighting, background, etc., we first train a teacher model that overfits the training track and moves along a near-optimal path. We then use this model to teach a student model the correct actions while applying randomization. With our method, a model with an 18.4% completion rate on the testing track is able to help teach a student model with a 52% completion rate. Moreover, averaged over 50 trials, the student finishes a lap 0.23 seconds faster than the teacher. This 0.23-second gap is significant in tight races, where lap times are about 10 to 12 seconds. Index Terms: reinforcement learning, deep learning, miniature autonomous race car, sim-to-real, AWS DeepRacer