One of the most exciting yet challenging endeavors in music research is to develop a computational model that comprehends the affective content of music signals and organizes a music collection according to emotion. In this paper, we propose a novel acoustic emotion Gaussians (AEG) model that defines a proper generative process of emotion perception in music. As a generative model, AEG permits straightforward interpretation of the model learning process. To bridge the acoustic feature space and the music emotion space, a set of latent feature classes, learned from data, is introduced to perform end-to-end semantic mapping between the two spaces. Based on the space of latent feature classes, the AEG model is applicable to both automatic music emotion annotation and emotion-based music retrieval. To gain insight into the AEG model, we also illustrate its learning process. A comprehensive performance study on two emotion-annotated music corpora, MER60 and MTurk, demonstrates that AEG outperforms state-of-the-art methods in automatic music emotion annotation. Moreover, we report, for the first time, a quantitative evaluation of emotion-based music retrieval.
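To make the idea concrete, here is a minimal sketch (not the authors' code) of how latent feature classes can bridge the two spaces: a GMM over frame-level audio features supplies the latent classes, each class is paired with a 2-D valence-arousal Gaussian estimated from annotated clips, and a new clip is annotated by the posterior-weighted combination of those Gaussians. All function names are hypothetical, and scikit-learn stands in for the paper's EM derivation.

```python
# Minimal sketch of the AEG annotation idea (not the authors' code):
# 1) learn latent feature classes as a GMM over frame-level audio features,
# 2) attach a 2-D valence-arousal Gaussian to each latent class,
# 3) annotate a clip as the posterior-weighted mixture of those Gaussians.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_aeg_sketch(features, va_labels, n_classes=8, seed=0):
    """features: list of (n_frames, d) arrays; va_labels: (n_clips, 2) valence-arousal."""
    gmm = GaussianMixture(n_components=n_classes, random_state=seed)
    gmm.fit(np.vstack(features))                      # latent feature classes
    # clip-level posteriors over latent classes
    post = np.array([gmm.predict_proba(f).mean(axis=0) for f in features])
    # per-class emotion Gaussians via posterior-weighted moments (simplified M-step)
    w = post / post.sum(axis=0, keepdims=True)        # (n_clips, n_classes)
    mu = w.T @ va_labels                              # per-class (valence, arousal) means
    cov = np.stack([np.cov(va_labels.T, aweights=w[:, k]) for k in range(n_classes)])
    return gmm, mu, cov                               # full model is a Gaussian mixture

def annotate(gmm, mu, clip_features):
    """Predict a clip's emotion as the posterior-weighted mean of class Gaussians."""
    post = gmm.predict_proba(clip_features).mean(axis=0)
    return post @ mu                                  # expected (valence, arousal)
```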
A major challenge in the design of multicore embedded systems is how to handle communications among tasks with performance requirements and precedence constraints. In this paper, we consider the problem of scheduling real-time tasks over multilayer bus systems with the objective of minimizing the communication cost. We show that the problem is NP-hard and determine the best approximation ratio achievable by any polynomial-time approximation algorithm. First, we propose a polynomial-time optimal algorithm for a restricted case with a single multilayer bus and unit execution and communication times. The result is then extended to a pseudopolynomial-time optimal algorithm that handles multiple multilayer buses with arbitrary execution and communication times, as well as different timing constraints and objective functions. We compare the performance of the proposed algorithm with that of several popular heuristics, and provide further insights into multilayer bus system design.
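As a point of reference for the heuristics mentioned above, the following sketch implements a simple greedy list scheduler for precedence-constrained tasks with unit execution times and a unit penalty for cross-core communication; it is an illustrative baseline under assumed inputs, not the paper's optimal algorithm, and the data structures (`deps`, `core_free`) are hypothetical.

```python
# Hedged sketch of a list-scheduling heuristic for precedence-constrained
# tasks with unit execution/communication times (an illustrative baseline,
# not the paper's optimal algorithm).
def list_schedule(tasks, deps, n_cores):
    """tasks: iterable of ids; deps: dict task -> set of predecessors.
    Returns dict task -> (core, start_time); a unit communication delay is
    charged whenever a predecessor ran on a different core."""
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    ready = [t for t in tasks if indeg[t] == 0]
    core_free = [0] * n_cores
    placed, succ = {}, {}
    for t, ps in deps.items():
        for p in ps:
            succ.setdefault(p, []).append(t)
    while ready:
        t = ready.pop(0)
        best = None
        for c in range(n_cores):
            # earliest start: core availability and predecessor finish (+1 if off-core)
            est = core_free[c]
            for p in deps.get(t, ()):
                pc, ps_ = placed[p]
                est = max(est, ps_ + 1 + (0 if pc == c else 1))
            if best is None or est < best[1]:
                best = (c, est)
        c, s = best
        placed[t] = (c, s)
        core_free[c] = s + 1                 # unit execution time
        for u in succ.get(t, ()):
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return placed
```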
Layer-based video coding, together with adaptive modulation and coding, is a promising technique for providing real-time video multicast services to heterogeneous mobile devices. With the rapid growth of data communications for emerging applications, reducing the energy consumption of mobile devices is a major challenge. This paper addresses the problem of resource allocation for video multicast in fourth-generation wireless systems, with the objective of minimizing the total energy consumed for data reception. First, we consider the problem when scalable video coding is applied. We prove that the problem is NP-hard and propose a 2-approximation algorithm to solve it. Then, we investigate the problem under multiple description coding, and show that it is also NP-hard and cannot be approximated in polynomial time with a ratio better than 2, unless P=NP. For this case, we develop a pseudopolynomial-time 2-approximation algorithm. Simulations comparing the proposed algorithms with a brute-force optimal algorithm and a conventional approach yield very encouraging results.
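The following toy sketch illustrates the flavor of the scalable-video-coding allocation problem under strong simplifying assumptions (one modulation-and-coding scheme per layer, reception energy proportional to receivers times airtime, and layered decoding prerequisites); it is not the paper's 2-approximation algorithm, and all names are hypothetical.

```python
# Hedged toy model: for each SVC layer, the chosen MCS must be decodable by
# every user that still needs the layer, so the fastest feasible MCS is the
# weakest such user's cap; reception energy is proxied by receivers x airtime.
def allocate_layers(layer_bits, rates, demand, cap):
    """layer_bits[l]: bits of layer l; rates[m]: bits/s of MCS m (increasing);
    demand[u]: number of layers user u needs; cap[u]: highest MCS u decodes.
    Returns (per-layer MCS choices, total reception-energy proxy)."""
    total, choice = 0.0, []
    for l, bits in enumerate(layer_bits):
        need = [u for u in range(len(demand)) if demand[u] > l]
        if not need:
            break
        m = min(cap[u] for u in need)          # most robust MCS all receivers decode
        choice.append(m)
        total += len(need) * bits / rates[m]   # energy ~ receivers x airtime
    return choice, total
```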
Most existing studies on music mood classification have focused on Western music, and little research has investigated whether the mood categories, audio features, and classification models developed for Western music are applicable to non-Western music. This paper attempts to answer this question through a comparative study of English and Chinese songs. Specifically, a set of Chinese pop songs was annotated using an existing mood taxonomy developed for English songs. Six sets of audio features commonly used on Western music (e.g., timbre, rhythm) were extracted from both the Chinese and English songs, and mood classification performance based on these feature sets was compared. In addition, experiments were conducted to test the generalizability of classification models across English and Chinese songs. The results of this study shed light on the cross-cultural applicability of research on music mood classification.
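A minimal sketch of the cross-cultural protocol might look as follows: train a mood classifier on songs from one culture and test on the other, holding the feature set fixed. librosa MFCC statistics and an RBF SVM are our stand-ins here, not necessarily the study's feature extractors or toolchain.

```python
# Hedged sketch of the cross-cultural experiment: train on English songs and
# test on Chinese songs (or vice versa) with one shared feature set.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def timbre_features(path, sr=22050):
    """One clip-level timbre vector: MFCC means and standard deviations."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # 26-D

def cross_culture_accuracy(train_paths, train_moods, test_paths, test_moods):
    """Fit on one culture's songs, report accuracy on the other's."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit([timbre_features(p) for p in train_paths], train_moods)
    preds = clf.predict([timbre_features(p) for p in test_paths])
    return np.mean(preds == np.asarray(test_moods))
```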
We present a framework for counting the number of people in an environment where multiple cameras with different angles of view are available. We treat the visual cues captured by each camera as a knowledge source and carry out cross-camera knowledge transfer to alleviate the difficulties of people counting, such as partial occlusions, low-quality images, and cluttered backgrounds. Specifically, this work makes the following contributions. First, we overcome the variations among heterogeneous cameras with different perspective settings by matching the same groups of pedestrians captured by these cameras, and present an algorithm for establishing cross-camera correspondence. Second, the proposed counting model is composed of a pair of collaborative regressors: one estimates the people count from features extracted from intra-camera visual evidence, while the other recovers the resulting residual by taking into account conflicts among inter-camera predictions. The two regressors are elegantly coupled and jointly yield an accurate counting system. In addition, we provide a set of manually annotated pedestrian labels on the PETS 2010 videos for performance evaluation. Our approach is comprehensively tested in various settings and compared with competitive baselines; the significant improvement in performance demonstrates its effectiveness.
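To illustrate the collaborative-regressor idea, here is a hedged sketch in which ridge regression stands in for both regressors: the primary one maps per-camera features to counts, and the second predicts a correction from the disagreement among per-camera predictions. The feature layout and class name are hypothetical.

```python
# Hedged sketch of the two collaborative regressors: a primary regressor maps
# each camera's visual features to a count; a residual regressor corrects the
# estimate using inter-camera disagreement.
import numpy as np
from sklearn.linear_model import Ridge

class CollaborativeCounter:
    def __init__(self):
        self.primary = Ridge(alpha=1.0)    # intra-camera features -> count
        self.residual = Ridge(alpha=1.0)   # inter-camera conflicts -> correction

    def fit(self, feats, counts):
        """feats: (n_frames, n_cams, d); counts: (n_frames,) ground truth."""
        n, c, d = feats.shape
        self.primary.fit(feats.reshape(n * c, d), np.repeat(counts, c))
        per_cam = self.primary.predict(feats.reshape(n * c, d)).reshape(n, c)
        conflict = per_cam - per_cam.mean(axis=1, keepdims=True)  # disagreement
        self.residual.fit(conflict, counts - per_cam.mean(axis=1))
        return self

    def predict(self, feats):
        n, c, d = feats.shape
        per_cam = self.primary.predict(feats.reshape(n * c, d)).reshape(n, c)
        return per_cam.mean(axis=1) + self.residual.predict(
            per_cam - per_cam.mean(axis=1, keepdims=True))
```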
The Gaussian mixture model (GMM)-based method has dominated the field of voice conversion (VC) for the last decade. However, the converted spectra are excessively smoothed, producing a muffled converted sound. In this study, we improve speech quality by enhancing the dependency between the source feature vectors (natural sound) and the converted feature vectors (converted sound); it is believed that strengthening this dependency makes the converted sound closer to the natural sound. To this end, we propose an integrated maximum a posteriori and mutual information (MAPMI) criterion for parameter generation in spectral conversion. Experimental results from formal listening tests demonstrate that speech converted by the proposed MAPMI method is of higher quality than that produced by the conventional method.
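For context, the conventional GMM mapping that MAPMI improves upon can be sketched as follows: fit a GMM on joint source-target spectral vectors, then convert each source frame by the posterior-weighted conditional mean E[y|x]. This is the standard baseline, not the proposed MAPMI criterion, and the helper names are ours.

```python
# Hedged sketch of conventional joint-GMM voice conversion (the baseline that
# over-smooths): MMSE mapping via posterior-weighted conditional means.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_mix=32, seed=0):
    """X, Y: time-aligned (n_frames, d) source/target spectral features."""
    gmm = GaussianMixture(n_components=n_mix, covariance_type="full",
                          random_state=seed)
    gmm.fit(np.hstack([X, Y]))                     # model p(x, y) jointly
    return gmm

def _log_gauss(x, mu, S):
    diff = x - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (logdet + diff @ np.linalg.solve(S, diff)
                   + len(x) * np.log(2 * np.pi))

def convert(gmm, x, d):
    """MMSE conversion of one d-dim source frame x to a target frame E[y|x]."""
    # mixture posteriors given the source part only
    logp = np.array([
        _log_gauss(x, gmm.means_[k][:d], gmm.covariances_[k][:d, :d])
        for k in range(gmm.n_components)]) + np.log(gmm.weights_)
    post = np.exp(logp - logp.max())
    post /= post.sum()
    y_hat = np.zeros(d)
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k][:d], gmm.means_[k][d:]
        Sxx = gmm.covariances_[k][:d, :d]
        Syx = gmm.covariances_[k][d:, :d]
        y_hat += post[k] * (mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
    return y_hat
```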
We aim to resolve the difficulties of action recognition arising from large intra-class variations. These unfavorable variations make it infeasible to represent one action instance by other instances of the same action. We therefore propose to extract both instance-specific and class-consistent features to facilitate action recognition. Specifically, the instance-specific features explore the self-similarities among frames of each video instance, while the class-consistent features summarize within-class similarities. We introduce a generative formulation to combine these two diverse types of features. The experimental results demonstrate the effectiveness of our approach.
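A minimal sketch of the instance-specific cue: compute a self-similarity matrix (SSM) over the frames of a single video and subsample it into a fixed-length feature. The per-frame descriptors and the Gaussian similarity are illustrative choices, not necessarily the paper's.

```python
# Hedged sketch of the instance-specific cue: a self-similarity matrix over
# the frames of one video, subsampled to a fixed-length feature vector.
import numpy as np

def self_similarity(frames):
    """frames: (n_frames, d) per-frame descriptors -> (n, n) SSM."""
    sq = (frames ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * frames @ frames.T  # pairwise squared distances
    d2 = np.maximum(d2, 0.0)
    return np.exp(-d2 / (d2.mean() + 1e-12))       # similarity in (0, 1]

def ssm_feature(frames, size=16):
    """Fixed-length instance-specific feature: subsample the SSM to size x size."""
    s = self_similarity(frames)
    idx = np.linspace(0, len(frames) - 1, size).astype(int)
    return s[np.ix_(idx, idx)].ravel()
```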
Most digital cameras capture one primary color at each pixel with a single sensor overlaid with a color filter array. To recover a full-color image from the incomplete color samples, one needs to restore the two missing color values at each pixel. This restoration process is known as color demosaicking. In this paper, we present a novel self-learning approach to this problem based on support vector regression. Unlike prior learning-based demosaicking methods, our approach extracts image-dependent information to construct the learning model and requires no additional training data. Experimental results show that the proposed method outperforms many state-of-the-art techniques in both subjective and objective image quality measures.
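A toy illustration of the self-learning idea on a Bayer pattern: positions where green is actually sampled provide training pairs for an SVR that predicts a pixel's green value from its raw 3x3 neighborhood, and the trained model is then applied at positions where green is missing. This is a simplified sketch of one channel, not the paper's full pipeline.

```python
# Hedged sketch of image-dependent (self-learned) demosaicking: train an SVR
# on the input image itself, using positions where green IS sampled, then
# predict the missing green values elsewhere. Toy RGGB illustration only.
import numpy as np
from sklearn.svm import SVR

def green_mask(h, w):
    """Green locations of an RGGB Bayer pattern."""
    m = np.zeros((h, w), dtype=bool)
    m[0::2, 1::2] = True
    m[1::2, 0::2] = True
    return m

def demosaic_green(raw):
    """raw: (h, w) CFA image. Returns the full green channel."""
    h, w = raw.shape
    mask = green_mask(h, w)
    X, y, Q, pos = [], [], [], []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            patch = raw[i - 1:i + 2, j - 1:j + 2].ravel()
            if mask[i, j]:
                X.append(np.delete(patch, 4)); y.append(raw[i, j])  # known green
            else:
                Q.append(np.delete(patch, 4)); pos.append((i, j))   # to predict
    svr = SVR(kernel="rbf", C=10.0).fit(X, y)      # self-learned from this image
    green = np.where(mask, raw, 0.0)
    for (i, j), g in zip(pos, svr.predict(Q)):
        green[i, j] = g
    return green
```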
Rain removal from a single image is one of the most challenging image denoising problems. In this paper, we present a learning-based framework for single-image rain removal that focuses on learning context information from the input image, so that the rain patterns present in it can be automatically identified and removed. We approach single-image rain removal as the integration of image decomposition and self-learning processes. More precisely, our method first performs context-constrained image segmentation on the input image, and we learn dictionaries for the high-frequency components of the different context categories via sparse coding for reconstruction purposes. For image regions with rain streaks, the dictionaries of distinct context categories share common atoms that correspond to the rain patterns. By applying PCA and SVM classifiers to the learned dictionaries, our framework automatically identifies the common rain patterns, allowing us to remove rain streaks as particular high-frequency components of the input image. Unlike prior work on rain removal from images/videos, which requires image priors or training data from multiple frames, our self-learning approach requires only the input image itself, saving substantial pre-training effort. Experimental results demonstrate the subjective and objective visual quality improvements achieved by the proposed method.
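One stage of the framework can be sketched as follows: separate the image into low- and high-frequency layers and learn a sparse dictionary on high-frequency patches; in the full method, atoms shared across context categories (identified via PCA and SVM) are treated as rain patterns, a step elided here. The parameters and helper name are illustrative.

```python
# Hedged sketch: learn a sparse dictionary over high-frequency patches of the
# input image itself; atom classification into rain/non-rain is elided.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import MiniBatchDictionaryLearning

def hf_dictionary(img, patch=8, n_atoms=64, seed=0):
    """img: (h, w) grayscale in [0, 1]. Returns (n_atoms, patch*patch) dictionary."""
    low = gaussian_filter(img, sigma=2.0)
    high = img - low                                   # high-frequency layer
    P = extract_patches_2d(high, (patch, patch), max_patches=5000,
                           random_state=seed)
    P = P.reshape(len(P), -1)
    P -= P.mean(axis=1, keepdims=True)                 # remove per-patch DC
    dl = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                     random_state=seed)
    dl.fit(P)
    return dl.components_                              # candidate rain/texture atoms
```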
Improving the endurance of phase-change memory (PCM) is a fundamental issue when the technology is considered as a main-memory alternative. In the design of memory-based wear-leveling approaches, a major challenge is how to efficiently determine the appropriate memory pages for allocation or swapping. In this paper, we present an efficient wear-leveling design that is compatible with existing virtual memory management. Two implementations, namely bucket-based and array-based wear leveling, with nearly zero search cost, are proposed to trade off time and space complexity. The results of experiments conducted on popular benchmarks to evaluate the efficacy of the proposed design are very encouraging.
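A minimal sketch of the bucket-based idea, with hypothetical structure names: free pages are grouped into buckets keyed by coarsened write count, so a least-worn page can be found for allocation in near-constant time rather than by searching.

```python
# Hedged sketch of bucket-based wear leveling: buckets group free pages by
# (coarsened) write count, making least-worn allocation near O(1).
class BucketWearLeveler:
    def __init__(self, n_pages, granularity=64):
        self.gran = granularity                   # writes per bucket step
        self.writes = [0] * n_pages
        self.buckets = {0: set(range(n_pages))}   # bucket id -> free pages
        self.min_bucket = 0

    def allocate(self):
        """Pop a page from the least-worn non-empty bucket."""
        while self.min_bucket not in self.buckets or not self.buckets[self.min_bucket]:
            if self.min_bucket > max(self.buckets, default=-1):
                raise MemoryError("no free pages")
            self.min_bucket += 1
        return self.buckets[self.min_bucket].pop()

    def free(self, page):
        """Return a page, rebucketing it by its current wear."""
        b = self.writes[page] // self.gran
        self.buckets.setdefault(b, set()).add(page)
        self.min_bucket = min(self.min_bucket, b)

    def record_write(self, page):
        self.writes[page] += 1
```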