In this paper, we present an automatic foreground object detection method for videos captured by freely moving cameras. While we focus on extracting a single foreground object of interest throughout a video sequence, our approach requires neither training data nor user interaction. Based on SIFT correspondences across video frames, we construct robust SIFT trajectories in terms of a calculated foreground feature point probability. This probability identifies candidate foreground feature points in each frame, without the need for user interaction such as parameter or threshold tuning. Furthermore, we propose a probabilistic consensus foreground object template (CFOT), which is applied directly to the input video for moving object detection via template matching. Our CFOT can detect the foreground object in videos captured by a fast-moving camera, even when the contrast between the foreground and background regions is low. Moreover, the proposed method generalizes to foreground object detection in dynamic backgrounds and is robust to viewpoint changes across video frames. The contribution of this paper is threefold: (1) we provide a robust decision process to detect the foreground object of interest in videos with contrast and viewpoint variations; (2) our method builds longer SIFT trajectories, which are shown to be robust and effective for object detection tasks; and (3) the construction of our CFOT is not sensitive to the initial estimate of the foreground region of interest, and its use achieves excellent foreground object detection results on real-world video data.
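As a concrete illustration of the trajectory-building step, the sketch below chains frame-to-frame SIFT matches into trajectories using OpenCV's SIFT detector and Lowe's ratio test; these tool choices are assumptions, and the foreground feature point probability and CFOT construction described above are not reproduced here.

```python
import cv2

def build_sift_trajectories(frames, ratio=0.75):
    """Chain SIFT matches between consecutive grayscale frames into trajectories."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    prev_kp, prev_des = sift.detectAndCompute(frames[0], None)
    # Each trajectory is a list of (frame_index, (x, y)) points.
    trajectories = [[(0, kp.pt)] for kp in prev_kp]
    owner = list(range(len(prev_kp)))      # keypoint index -> trajectory index
    for t in range(1, len(frames)):
        kp, des = sift.detectAndCompute(frames[t], None)
        new_owner = [-1] * len(kp)
        if des is not None and prev_des is not None:
            for pair in matcher.knnMatch(prev_des, des, k=2):
                if len(pair) < 2:
                    continue
                m, n = pair
                if m.distance < ratio * n.distance:   # Lowe's ratio test
                    traj = owner[m.queryIdx]
                    trajectories[traj].append((t, kp[m.trainIdx].pt))
                    new_owner[m.trainIdx] = traj
        for i, o in enumerate(new_owner):  # unmatched keypoints start new tracks
            if o == -1:
                new_owner[i] = len(trajectories)
                trajectories.append([(t, kp[i].pt)])
        prev_des, owner = des, new_owner
    return trajectories
```

Longer trajectories then fall out of simply filtering this list by length, which is the property the paper exploits for robust object detection.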
Multimedia Broadcast/Multicast Service (MBMS) is a bandwidth-efficient broadcast scheme for multimedia communications. To support prioritized transmission, unequal error protection (UEP) for multi-resolution multimedia sources can be realized through MBMS. Nevertheless, enhancing the transmission fidelity of the base layer typically sacrifices the fidelity of the enhancement layers. Herein, a novel dual diversity space-time coding (DDSTC) scheme is proposed to exploit the intrinsic UEP capability of space-time codes by utilizing a constellation mapping duo over two consecutive transmission periods in multiple-input multiple-output (MIMO) systems. Compared with Alamouti coding, DDSTC achieves coding gains on the transmission error rates of the base layer without significant degradation of the enhancement layers. With the transmission rates of the base and enhancement layers both equal to 2 bits per transmission, DDSTC obtains 1.3 dB and 3.0 dB coding gains for the base layer in $2 \times 2$ and $2 \times 3$ MIMO systems, respectively. In addition, an analysis of symbol error probabilities verifies that a 6 dB asymptotic coding gain is reachable in rich transmit diversity scenarios. While attaining these considerable improvements in error rates, DDSTC avoids high decoding complexity by adopting our proposed decoding schemes. Simulation results show that DDSTC outperforms conventional UEP schemes based on hierarchical modulation or power allocation.
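The constellation mapping duo that defines DDSTC is not specified in this abstract, so the sketch below simulates only the Alamouti baseline it is compared against: a 2x1 Alamouti code over Rayleigh fading with QPSK symbols (an assumed constellation), decoded by the standard linear combining.

```python
import numpy as np

rng = np.random.default_rng(0)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # unit energy

def alamouti_ser(snr_db, n_pairs=100_000):
    """Symbol error rate of the 2x1 Alamouti code over Rayleigh fading."""
    idx = rng.integers(0, 4, size=(n_pairs, 2))
    s = qpsk[idx]                                   # symbol pairs (s1, s2)
    h = (rng.standard_normal((n_pairs, 2)) +
         1j * rng.standard_normal((n_pairs, 2))) / np.sqrt(2)
    sigma = np.sqrt(10 ** (-snr_db / 10) / 2)
    n = sigma * (rng.standard_normal((n_pairs, 2)) +
                 1j * rng.standard_normal((n_pairs, 2)))
    # Two receive periods: r1 = h1*s1 + h2*s2, r2 = -h1*conj(s2) + h2*conj(s1)
    r1 = h[:, 0] * s[:, 0] + h[:, 1] * s[:, 1] + n[:, 0]
    r2 = -h[:, 0] * np.conj(s[:, 1]) + h[:, 1] * np.conj(s[:, 0]) + n[:, 1]
    # Linear combining decouples the two symbols (gain = |h1|^2 + |h2|^2).
    s1_hat = np.conj(h[:, 0]) * r1 + h[:, 1] * np.conj(r2)
    s2_hat = np.conj(h[:, 1]) * r1 - h[:, 0] * np.conj(r2)
    gain = (np.abs(h) ** 2).sum(axis=1)[:, None]
    cand = np.stack([s1_hat, s2_hat], axis=1)[..., None]
    det = np.argmin(np.abs(cand - gain[..., None] * qpsk) ** 2, axis=-1)
    return np.mean(det != idx)

for snr in (5, 10, 15, 20):
    print(f"SNR {snr:2d} dB: SER = {alamouti_ser(snr):.4f}")
```

A DDSTC-style scheme would replace the fixed QPSK mapping with a pair of mappings alternating over the two transmission periods, which is where the base-layer coding gain reported above comes from.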
Linear discriminant analysis (LDA) is a popular supervised dimension reduction algorithm, which projects the data into an effective low-dimensional linear subspace in which the separation between projected data from different classes is improved. While this subspace is typically determined by solving a generalized eigenvalue decomposition problem, the high computational cost prohibits the use of LDA when the scale and dimensionality of the data are large. Building on the recent success of least squares LDA (LSLDA), we propose a novel rank-one update method with a simplified class indicator matrix, which allows the LSLDA model to be derived efficiently. Moreover, our LSLDA model can be extended to address concept drift, in which recently received data exhibit gradual or abrupt changes in distribution. In other words, our LSLDA observes and models changes in the data distribution while suppressing the dependency on outdated data. The proposed LSLDA benefits applications in streaming data classification and mining, and it can recognize data with newly added class labels during the learning process. Experimental results on both synthetic and real datasets (with and without concept drift) confirm the effectiveness of our proposed LSLDA.
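The paper's simplified class indicator matrix is not given here; the following sketch illustrates the general idea with an assumed one-hot indicator and a Sherman-Morrison rank-one update of the inverse scatter matrix, with a forgetting factor standing in for the suppression of outdated data.

```python
import numpy as np

class StreamingLSLDA:
    """Sketch: least-squares LDA updated one sample at a time.

    Maintains P = (X^T X + lam*I)^{-1} via the Sherman-Morrison identity,
    with a forgetting factor gamma < 1 down-weighting outdated samples
    (one way to track concept drift; not the paper's exact update).
    """

    def __init__(self, dim, n_classes, lam=1e-3, gamma=0.99):
        self.P = np.eye(dim) / lam           # inverse regularized scatter
        self.B = np.zeros((dim, n_classes))  # accumulates X^T Y
        self.gamma = gamma

    def update(self, x, label):
        x = x.reshape(-1, 1)
        y = np.zeros(self.B.shape[1]); y[label] = 1.0  # one-hot indicator row
        Px = self.P @ x
        denom = self.gamma + float(x.T @ Px)
        self.P = (self.P - Px @ Px.T / denom) / self.gamma
        self.B = self.gamma * self.B + x @ y.reshape(1, -1)

    @property
    def W(self):
        """Current least-squares projection matrix."""
        return self.P @ self.B

    def predict(self, X):
        return np.argmax(X @ self.W, axis=1)
```

Each update costs O(d^2) rather than the cost of re-solving an eigenproblem, which is the efficiency argument behind rank-one LSLDA updates.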
One of the most exciting but challenging endeavors in music research is to develop a computational model that comprehends the affective content of music signals and organizes a music collection according to emotion. In this paper, we propose a novel acoustic emotion Gaussians (AEG) model that defines a proper generative process of emotion perception in music. As a generative model, AEG permits straightforward interpretation of the model learning process. To bridge the acoustic feature space and the music emotion space, a set of latent feature classes, learned from data, is introduced to perform the end-to-end semantic mapping between the two spaces. Based on the space of latent feature classes, the AEG model is applicable to both automatic music emotion annotation and emotion-based music retrieval. To gain insights into the AEG model, we also illustrate the model learning process. A comprehensive performance study on two emotion-annotated music corpora, MER60 and MTurk, demonstrates the superior accuracy of AEG over its predecessors. Our results show that the AEG model outperforms state-of-the-art methods in automatic music emotion annotation. Moreover, for the first time, a quantitative evaluation of emotion-based music retrieval is reported.
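As an illustrative sketch of the AEG idea, the code below learns latent feature classes with a plain GMM (an assumed choice), attaches a valence-arousal Gaussian to each class, and annotates a clip by the posterior-weighted mixture; it is not the paper's exact learning procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 8  # number of latent feature classes (assumed)

def fit_aeg(acoustic_X, va_Y):
    """acoustic_X: (n, d) acoustic features; va_Y: (n, 2) valence-arousal labels."""
    gmm = GaussianMixture(n_components=K, random_state=0).fit(acoustic_X)
    resp = gmm.predict_proba(acoustic_X)              # (n, K) class posteriors
    w = resp / resp.sum(axis=0, keepdims=True)
    mu = w.T @ va_Y                                   # (K, 2) per-class VA means
    # Responsibility-weighted VA covariance per latent class.
    cov = np.stack([(w[:, k, None] * (va_Y - mu[k])).T @ (va_Y - mu[k])
                    for k in range(K)])
    return gmm, mu, cov

def annotate(gmm, mu, clip_X):
    """Predicted VA point for a clip: posterior-weighted class means."""
    post = gmm.predict_proba(clip_X).mean(axis=0)     # average over frames
    return post @ mu
```

Retrieval then follows by ranking clips according to the likelihood of a query emotion under each clip's predicted Gaussian mixture over the VA space.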
A major challenge in the design of multicore embedded systems is how to handle communication among tasks with performance requirements and precedence constraints. In this paper, we consider the problem of scheduling real-time tasks over multilayer bus systems with the objective of minimizing the communication cost. We show that the problem is NP-hard and establish the best approximation ratio achievable by approximation algorithms. First, we propose a polynomial-time optimal algorithm for a restricted case in which a single multilayer bus and unit execution and communication times are considered. The result is then extended to a pseudopolynomial-time optimal algorithm that handles multiple multilayer buses with arbitrary execution and communication times, as well as different timing constraints and objective functions. We compare the performance of the proposed algorithm with that of several popular heuristics and provide further insights into multilayer bus system design.
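The optimal algorithms themselves are not reproduced here; the sketch below implements a generic list-scheduling heuristic of the kind such algorithms are compared against, greedily placing each task on the core that minimizes the communication cost with its already-placed predecessors (all costs and capacities are illustrative).

```python
from graphlib import TopologicalSorter

def list_schedule(preds, comm, n_cores, cap):
    """preds: {task: set of predecessor tasks};
    comm: {(u, v): cost of shipping u's output to v over the bus};
    cap: maximum tasks per core. Returns {task: core}."""
    placement, load = {}, [0] * n_cores
    for task in TopologicalSorter(preds).static_order():
        best_core, best_cost = None, float("inf")
        for core in range(n_cores):
            if load[core] >= cap:
                continue
            # Communication incurred with predecessors on other cores.
            cost = sum(comm.get((p, task), 0)
                       for p in preds.get(task, ())
                       if placement[p] != core)
            if cost < best_cost:
                best_core, best_cost = core, cost
        placement[task] = best_core
        load[best_core] += 1
    return placement

# Diamond-shaped task graph on 2 cores, at most 2 tasks per core.
preds = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
comm = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 3, ("c", "d"): 1}
print(list_schedule(preds, comm, n_cores=2, cap=2))
```

Such greedy placements can be far from optimal on adversarial graphs, which is precisely what the hardness and approximation-ratio results above formalize.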
We present a framework for counting the number of people in an environment monitored by multiple cameras with different angles of view. We consider the visual cues captured by each camera as a knowledge source and carry out cross-camera knowledge transfer to alleviate the difficulties of people counting, such as partial occlusions, low-quality images, and cluttered backgrounds. Specifically, this work makes the following contributions. First, we overcome the variations among multiple heterogeneous cameras with different perspective settings by matching the same groups of pedestrians across cameras, and present an algorithm for accomplishing this cross-camera correspondence. Second, the proposed counting model is composed of a pair of collaborative regressors. While one regressor estimates the people count from features extracted from intra-camera visual evidence, the other recovers the resulting residual by taking the conflicts among inter-camera predictions into account. The two regressors are elegantly coupled and jointly lead to an accurate counting system. In addition, we provide a set of manually annotated pedestrian labels on the PETS 2010 videos for performance evaluation. Our approach is comprehensively tested in various settings and compared with competitive baselines; the significant improvement in performance demonstrates the effectiveness of the proposed approach.
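The following sketch illustrates the two-regressor coupling under stated assumptions (ridge regressors, conflict features defined as each camera's deviation from the cross-camera mean prediction); the paper's exact features and coupling are not reproduced.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_collaborative(F, counts):
    """F: list of (n, d) feature matrices, one per camera, for the same
    n annotated frames; counts: (n,) ground-truth people counts."""
    base = [Ridge().fit(Fc, counts) for Fc in F]
    preds = np.stack([m.predict(Fc) for m, Fc in zip(base, F)], axis=1)
    # Conflict features: each camera's deviation from the cross-camera mean.
    conflict = preds - preds.mean(axis=1, keepdims=True)
    resid = [Ridge().fit(conflict, counts - preds[:, c])
             for c in range(len(F))]
    return base, resid

def predict_collaborative(base, resid, F):
    """Per-camera count = intra-camera prediction + inter-camera correction."""
    preds = np.stack([m.predict(Fc) for m, Fc in zip(base, F)], axis=1)
    conflict = preds - preds.mean(axis=1, keepdims=True)
    return np.stack([preds[:, c] + resid[c].predict(conflict)
                     for c in range(len(base))], axis=1)
```

The second regressor never sees raw pixels; it only learns how disagreement among cameras predicts each camera's error, which is the transfer mechanism in miniature.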
Most existing studies on music mood classification have focused on Western music, and little research has investigated whether mood categories, audio features, and classification models developed for Western music are applicable to non-Western music. This paper attempts to answer this question through a comparative study of English and Chinese songs. Specifically, a set of Chinese pop songs was annotated using an existing mood taxonomy developed for English songs. Six sets of audio features commonly used for Western music (e.g., timbre, rhythm) were extracted from both Chinese and English songs, and mood classification performance based on these feature sets was compared. In addition, experiments were conducted to test the generalizability of classification models across English and Chinese songs. The results of this study shed light on the cross-cultural applicability of research on music mood classification.
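A cross-corpus experiment of the kind described above can be sketched as follows, assuming precomputed feature matrices and mood labels for each corpus and an SVM classifier (the paper's exact classifier is not specified here).

```python
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cross_corpus_accuracy(X_train, y_train, X_test, y_test):
    """Train a mood classifier on one corpus, evaluate on the other."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# e.g., train on English songs and test on Chinese songs, and vice versa:
# acc_en_to_zh = cross_corpus_accuracy(X_en, y_en, X_zh, y_zh)
# acc_zh_to_en = cross_corpus_accuracy(X_zh, y_zh, X_en, y_en)
```

Comparing the two directions against within-corpus accuracy is what quantifies the cross-cultural generalizability the study investigates.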
Layer-based video coding, together with adaptive modulation and coding, is a promising technique for providing real-time video multicast services on heterogeneous mobile devices. With the rapid growth of data communications for emerging applications, reducing the energy consumption of mobile devices is a major challenge. This paper addresses the problem of resource allocation for video multicast in fourth-generation wireless systems, with the objective of minimizing the total energy consumption for data reception. First, we consider the problem when scalable video coding is applied. We prove that the problem is NP-hard and propose a 2-approximation algorithm to solve it. Then, we investigate the problem under multiple description coding, and show that it is also NP-hard and cannot be approximated in polynomial time with a ratio better than 2, unless P=NP. To solve this case, we develop a pseudopolynomial time 2-approximation algorithm. The results of simulations conducted to compare the proposed algorithms with a brute-force optimal algorithm and a conventional approach are very encouraging.
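The 2-approximation algorithms are not detailed in this abstract; the sketch below implements only a brute-force optimal baseline of the kind they are compared against, enumerating the modulation-and-coding scheme (MCS) per layer and minimizing total reception energy under an airtime budget (all rates, powers, and budgets are illustrative).

```python
from itertools import product

LAYER_BITS = [4000, 6000]        # base + enhancement layer sizes (bits)
MCS_RATE = [1000, 2000, 4000]    # bits per unit airtime for each MCS
RX_POWER = 1.0                   # receiver power drawn per unit airtime
BUDGET = 8.0                     # airtime budget per frame
# users: (highest decodable MCS index, number of layers subscribed)
USERS = [(0, 1), (1, 2), (2, 2)]

def brute_force():
    """Enumerate all per-layer MCS assignments; keep the feasible minimum."""
    best = None
    for mcs in product(range(len(MCS_RATE)), repeat=len(LAYER_BITS)):
        time = [b / MCS_RATE[m] for b, m in zip(LAYER_BITS, mcs)]
        if sum(time) > BUDGET:
            continue
        if any(mcs[l] > cap for cap, n in USERS for l in range(n)):
            continue  # some subscriber cannot decode a layer it needs
        # Each user stays awake for the airtime of the layers it receives.
        energy = sum(RX_POWER * sum(time[:n]) for _, n in USERS)
        if best is None or energy < best[0]:
            best = (energy, mcs)
    return best

print(brute_force())
```

The exponential blow-up in the number of layers and MCS options is exactly why the paper develops polynomial and pseudopolynomial 2-approximations instead.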
The Gaussian mixture model (GMM)-based method has dominated the field of voice conversion (VC) for the last decade. However, the converted spectra are excessively smoothed and thus produce a muffled converted sound. In this study, we improve speech quality by enhancing the dependency between the source (natural sound) and converted feature vectors (converted sound). It is believed that enhancing this dependency makes the converted sound closer to the natural sound. To this end, we propose an integrated maximum a posteriori and mutual information (MAPMI) criterion for parameter generation in spectral conversion. Experimental results from formal listening tests demonstrate that the quality of speech converted by the proposed MAPMI method surpasses that of the conventional method.
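The MAPMI criterion itself is not reproduced here; the sketch below implements the conventional joint-density GMM conversion that it improves upon, mapping a source frame to the posterior-weighted conditional mean of the target features (the source of the over-smoothing noted above).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_jdgmm(X_src, Y_tgt, n_mix=8):
    """X_src, Y_tgt: time-aligned (n, d) source/target spectral features."""
    return GaussianMixture(n_components=n_mix, covariance_type="full",
                           random_state=0).fit(np.hstack([X_src, Y_tgt]))

def convert(gmm, x):
    """Map one source frame x of dimension d to the expected target frame."""
    d = x.size
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S = gmm.covariances_
    # Mixture posteriors given x alone, via the marginal N(mu_x_m, S_xx_m).
    post = np.array([w * multivariate_normal.pdf(x, mu_x[m], S[m, :d, :d])
                     for m, w in enumerate(gmm.weights_)])
    post /= post.sum()
    # Posterior-weighted conditional means E[y | x, m].
    return sum(post[m] * (mu_y[m] + S[m, d:, :d]
               @ np.linalg.solve(S[m, :d, :d], x - mu_x[m]))
               for m in range(len(post)))
```

Because the output is an average of conditional means, fine spectral detail is washed out; MAPMI-style criteria add a term rewarding statistical dependency between source and converted features to counteract this.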
Most digital cameras capture one primary color at each pixel by a single sensor overlaid with a color filter array. To recover a full color image from incomplete color samples, one needs to restore the two missing color values for each pixel. This restoration process is known as color demosaicking. In this paper, we present a novel self-learning approach to this problem via support vector regression. Unlike prior learning-based demosaicking methods, our approach aims at extracting image-dependent information in constructing the learning model, and we do not require any additional training data. Experimental results show that our proposed method outperforms many state-of-the-art techniques in both subjective and objective image quality measures.
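One plausible instantiation of such a self-learning scheme (not necessarily the paper's exact formulation) is sketched below: mosaic positions where a channel is observed serve as training samples for an SVR that predicts the channel from the surrounding raw patch, shown for the green plane of an RGGB Bayer pattern.

```python
import numpy as np
from sklearn.svm import SVR

def recover_green(mosaic, patch=2):
    """mosaic: (H, W) raw Bayer samples, RGGB pattern; returns green plane."""
    H, W = mosaic.shape
    pad = np.pad(mosaic, patch, mode="reflect")
    ys, xs = np.mgrid[0:H, 0:W]
    green_mask = (ys % 2) != (xs % 2)        # G sits off-diagonal in RGGB
    feats, targets, queries, qpos = [], [], [], []
    for y in range(H):
        for x in range(W):
            f = pad[y:y + 2 * patch + 1, x:x + 2 * patch + 1].ravel()
            if green_mask[y, x]:             # observed G: training sample
                feats.append(f); targets.append(mosaic[y, x])
            else:                            # missing G: query sample
                queries.append(f); qpos.append((y, x))
    svr = SVR(kernel="rbf").fit(np.array(feats), np.array(targets))
    green = np.where(green_mask, mosaic, 0.0)
    for (y, x), v in zip(qpos, svr.predict(np.array(queries))):
        green[y, x] = v
    return green
```

Red and blue planes are handled analogously; the key property is that the regressor is trained entirely on the image being demosaicked, so no external training set is needed.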