Heterogeneous face recognition (HFR) is of increasing importance nowadays. However, even though common subspace learning serves as an effective technique for HFR, a common subspace constructed solely from external subjects cannot simply be trusted, owing to large intra-person differences. In this paper, we propose a person-specific domain adaptation model for each test image. Our model is combined with a common subspace constructed by eliminating the heterogeneous components of images in different domains. In our experiments, we evaluate on the CUFS database and the NIR-VIS 2.0 database, and the results demonstrate the effectiveness of our proposed model.
Hearing-impaired patients have a limited hearing dynamic range for speech perception, which partially accounts for their poor speech understanding, particularly in noise. Wide dynamic range compression aims to compress the speech signal into the usable hearing dynamic range of hearing-impaired listeners; however, it normally relies on a static compression strategy. This work proposes a strategy to continuously adjust the envelope compression ratio for speech processing in cochlear implants. This adaptive envelope compression (AEC) strategy aims to keep the compression processing as close to linear as possible while still confining the compressed amplitude envelope within the pre-set dynamic range. Vocoder simulation experiments showed that, when the dynamic range was narrowed to a small value, the intelligibility of AEC-processed sentences was significantly better than that of sentences processed by static envelope compression. This makes the proposed AEC strategy a promising way to improve speech recognition performance for implanted patients in the future.
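The adaptive principle can be illustrated with a minimal frame-based sketch (a hypothetical formulation for illustration only; the actual cochlear implant processing chain, frame structure, and dB mapping differ): per frame, pick the compression ratio closest to 1 (i.e., closest to linear) that still confines the envelope within the target dynamic range.

```python
def compress_frame(env_db, lo_db, hi_db):
    """Map one frame of envelope levels (in dB) into [lo_db, hi_db].

    Toy sketch of adaptive envelope compression: use no compression
    (ratio 1.0) when the frame already fits the range, and otherwise
    apply just enough compression about the range midpoint to pull
    the frame's peak deviation inside the range.
    """
    mid = 0.5 * (lo_db + hi_db)
    half_range = 0.5 * (hi_db - lo_db)
    # Largest deviation of the frame from the range midpoint.
    peak = max(abs(x - mid) for x in env_db)
    # Stay linear if possible; otherwise compress minimally.
    ratio = 1.0 if peak <= half_range else peak / half_range
    return [mid + (x - mid) / ratio for x in env_db]
```

Frames that already fit the range pass through unchanged, which is exactly the "as close to linear as possible" behavior the strategy aims for.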
For the task of robust face recognition, we particularly focus on the scenario in which both training and test image data are corrupted due to occlusion or disguise. Prior standard face recognition methods such as Eigenfaces, as well as state-of-the-art approaches such as sparse representation-based classification, do not consider possible contamination of data during training, and thus their recognition performance on corrupted test data is degraded. In this paper, we propose a novel face recognition algorithm based on low-rank matrix decomposition to address this problem. Besides decomposing the raw training data into a set of representative bases for better modeling of the face images, we introduce a structural incoherence constraint into the proposed algorithm, which enforces the bases learned for different classes to be as independent as possible. As a result, additional discriminating ability is added to the derived basis matrices for improved recognition performance. Experimental results on different face databases with a variety of variations verify the effectiveness and robustness of our proposed method.
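As a toy illustration of the low-rank decomposition step, a rank-1 approximation obtained by power iteration splits a matrix into a low-rank part and a residual error term (the paper's method recovers higher-rank structure and adds the structural incoherence constraint, neither of which is reproduced here):

```python
def rank1_approx(M, iters=100):
    """Leading rank-1 approximation of matrix M (list of lists) via
    power iteration; returns (low_rank_part, residual_error)."""
    rows, cols = len(M), len(M[0])
    v = [1.0] * cols
    for _ in range(iters):
        # u <- M v, normalized; v <- M^T u, normalized.
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        nu = max(sum(x * x for x in u) ** 0.5, 1e-12)
        u = [x / nu for x in u]
        v = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        nv = max(sum(x * x for x in v) ** 0.5, 1e-12)
        v = [x / nv for x in v]
    # Leading singular value sigma = u^T M v.
    sigma = sum(u[i] * M[i][j] * v[j] for i in range(rows) for j in range(cols))
    low_rank = [[sigma * u[i] * v[j] for j in range(cols)] for i in range(rows)]
    error = [[M[i][j] - low_rank[i][j] for j in range(cols)] for i in range(rows)]
    return low_rank, error
```

For data that is genuinely low-rank plus a few corrupted entries, the residual concentrates on the corruptions, which is the intuition behind separating representative bases from occlusion or disguise artifacts.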
Recently, numerous imaging methods have been proposed for fast MR imaging, including the Single Carrier Wideband MRI proposed by Huang et al. [1] and other techniques performed in k-space. However, these methods suffer from blurring and ringing artifacts due to insufficient k-space sampling. To take advantage of fast imaging techniques while preserving high-resolution MR images, we propose an image-filtering-based algorithm, adaptive guided filtering, which is able to suppress the above artifacts while enhancing/preserving the contrast of image details.
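As background, the plain (non-adaptive) guided filter that such an approach builds on can be sketched in one dimension; this is an illustrative toy following He et al.'s formulation, not the paper's adaptive variant, and the radius/epsilon values are arbitrary:

```python
def box_mean(x, r):
    """Sliding-window mean with radius r (window clipped at borders)."""
    n = len(x)
    out = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def guided_filter_1d(guide, src, r=2, eps=1e-4):
    """1-D guided filter: edge-preserving smoothing of `src` steered
    by `guide`, via the local linear model q = a * guide + b."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    mean_ip = box_mean([g * s for g, s in zip(guide, src)], r)
    mean_ii = box_mean([g * g for g in guide], r)
    # Per-window linear coefficients a = cov(I, p) / (var(I) + eps).
    a = [(ip - mi * mp) / (ii - mi * mi + eps)
         for ip, mi, mp, ii in zip(mean_ip, mean_i, mean_p, mean_ii)]
    b = [mp - ai * mi for mp, ai, mi in zip(mean_p, a, mean_i)]
    # Average the coefficients over overlapping windows, then apply.
    mean_a, mean_b = box_mean(a, r), box_mean(b, r)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]
```

The epsilon term controls how strongly edges in the guide are preserved versus smoothed, which is the knob an adaptive variant would tune per region.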
Recognizing image data across different domains has been a challenging task. For biometrics, heterogeneous face recognition (HFR) deals with recognition problems in which training/gallery images are collected in one modality (e.g., photos), while test/probe images are observed in another (e.g., sketches). In this paper, we present a domain adaptation approach for solving HFR problems. By utilizing external face images (i.e., those collected from subjects not of interest) from both source and target domains, we propose a novel Domain-independent Component Analysis (DiCA) algorithm for deriving a common subspace for relating and representing cross-domain image data. To further improve representation ability, we advance the self-taught learning strategy for learning a domain-independent dictionary in our DiCA subspace, which can be applied to both gallery and probe images of interest to improve representation and recognition. Unlike some prior domain adaptation approaches, we require neither data correspondences (i.e., data pairs) when collecting external cross-domain image data, nor label information for learning the common feature space that associates the different domains. Thus, our method is practical for real-world cross-domain classification problems. In our experiments, we consider sketch-to-photo and near-infrared (NIR) to visible-spectrum (VIS) face recognition problems to evaluate the performance of our proposed approach.
Due to the ambiguity in describing and discriminating between clothing images of different styles, clothing image characterization has been a challenging task. Based on the use of multiple types of visual features, we propose a novel multi-view nonnegative matrix factorization (NMF) algorithm for solving the above task. Our multi-view NMF not only derives image representations that describe clothing images in terms of their visual appearance, but also learns an optimal combination of such features for each clothing style, while preserving the separation between different styles. To verify the effectiveness of our method, we conduct experiments on two image datasets and confirm that our method produces satisfactory performance in terms of both clustering and categorization.
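The factorization underlying the multi-view model is standard NMF; a minimal single-view sketch using Lee-Seung multiplicative updates is shown below (illustrative only; the paper's multi-view objective additionally learns per-view combination weights and style-separation terms, which are not reproduced here):

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def nmf(V, k, iters=200, seed=0):
    """Factor nonnegative V (n x m) as W (n x k) times H (k x m)
    via multiplicative updates, which keep all entries nonnegative."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = [list(r) for r in zip(*W)]
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = [list(r) for r in zip(*H)]
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H
```

In the multi-view setting, one such factorization per feature type would share structure across views while the model weighs how much each view contributes to each style.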
We present a trajectory-based approach to detecting salient regions in videos by removing the dominant camera motion. Our approach is designed in a general way so that it can be applied to videos taken by either stationary or moving cameras, without any prior information. Moreover, multiple salient regions of different temporal lengths can also be detected. To this end, we extract a set of spatially and temporally coherent keypoint trajectories in a video. Then, velocity and acceleration entropies are proposed to represent the trajectories. In this way, long-term object motions are exploited to filter out short-term noise, and object motions of various temporal lengths can be represented in the same way. On the other hand, we are inspired by the observation that the trajectories in the background, i.e., the non-salient trajectories, are usually consistent with the dominant camera motion, regardless of whether the camera is stationary. We make use of this property to develop a unified approach to saliency generation for both stationary and moving cameras. Specifically, a one-class SVM is employed to remove the trajectories consistent with the dominant motion. It follows that the salient regions can be highlighted by applying a diffusion process to the remaining trajectories. In addition, we create a set of manually annotated ground-truth labels for the collected videos, which are then used for performance evaluation and comparison. The promising results on various types of videos demonstrate the effectiveness and broad applicability of our approach.
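The velocity-entropy descriptor can be sketched as follows (a toy formulation: frame-to-frame speeds are quantized into hypothetical equal-width bins before computing Shannon entropy; acceleration entropy would be computed analogously on successive velocity differences):

```python
import math
from collections import Counter

def motion_entropy(points, nbins=8):
    """Entropy of quantized velocity magnitudes along a trajectory,
    given as a list of (x, y) keypoint positions over frames.

    Steady motion (e.g., background moving with the camera) yields
    low entropy; erratic or changing motion yields high entropy.
    """
    vel = [math.hypot(x2 - x1, y2 - y1)
           for (x1, y1), (x2, y2) in zip(points, points[1:])]
    vmax = max(vel) or 1.0  # guard against an all-zero trajectory
    bins = Counter(min(int(v / vmax * nbins), nbins - 1) for v in vel)
    n = len(vel)
    return -sum((c / n) * math.log(c / n, 2) for c in bins.values())
```

A uniformly moving trajectory falls into a single velocity bin and scores zero entropy, while a trajectory with varying speed spreads over several bins and scores higher.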
Native Mandarin normal-hearing (NH) listeners can easily perceive lexical tones even under large voice pitch variations across speakers, by using the pitch contrast between context and target stimuli. It is, however, unclear whether cochlear implant (CI) users with limited access to pitch cues can make similar use of context pitch cues for tone normalization. In this study, native Mandarin NH listeners and pre-lingually deafened, unilaterally implanted CI users were asked to recognize a series of Mandarin tones varying from Tone 1 (high-flat) to Tone 2 (mid-rising), with or without a preceding sentence context. Most of the CI subjects used a hearing aid (HA) in the non-implanted ear (i.e., bimodal users) and were tested both with CI alone and with CI + HA. In the test without context, typical S-shaped tone recognition functions were observed for most CI subjects, and the function slopes and perceptual boundaries were similar with either CI alone or CI + HA. Compared to NH subjects, CI subjects were less sensitive to the pitch changes in target tones. In the test with context, NH subjects gave more (resp. fewer) Tone-2 responses in a context with high (resp. low) fundamental frequencies, known as the contrastive context effect. For CI subjects, a similar contrastive context effect was found to be statistically significant for tone recognition with CI + HA but not with CI alone. The results suggest that the pitch cues from CIs may not be sufficient to consistently support the pitch-contrast processing required for tone normalization. The additional pitch cues from aided residual acoustic hearing can, however, provide CI users with a tone normalization capability similar to that of NH listeners.
We present a novel domain adaptation approach for solving cross-domain pattern recognition problems, i.e., problems in which the data or features to be processed and recognized are collected from different domains of interest. Inspired by canonical correlation analysis (CCA), we utilize the derived correlation subspace as a joint representation for associating data across different domains, and we advance reduced kernel techniques for kernel CCA (KCCA) when nonlinear correlation subspaces are desirable. Such techniques not only make KCCA computationally more efficient but also alleviate potential over-fitting problems. Instead of directly performing recognition in the derived CCA subspace (as prior CCA-based domain adaptation methods did), we advocate exploiting the domain transfer ability of this subspace, in which each dimension has a unique capability for associating cross-domain data. In particular, we propose a novel support vector machine (SVM) with a correlation regularizer, named the correlation-transfer SVM, which incorporates domain adaptation ability into the classifier design for cross-domain recognition. We show that our proposed domain adaptation and classification approach can be successfully applied to a variety of cross-domain recognition tasks such as cross-view action recognition, handwritten digit recognition with different features, and image-to-text or text-to-image classification. Our empirical results verify that the proposed method outperforms state-of-the-art domain adaptation approaches in terms of recognition performance.
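For intuition, CCA seeks one projection direction per view such that the projected data from the two views are maximally correlated; a toy two-dimensional sketch via grid search over projection angles is shown below (real CCA solves a generalized eigenproblem, and the paper's kernelized, regularized version differs further):

```python
import math

def cca_2d(X, Y, steps=180):
    """Toy CCA for two lists of 2-D points: grid-search the pair of
    projection angles maximizing |correlation| of the projected,
    centered views. Returns (best_correlation, angle_x, angle_y)."""
    def center(Z):
        mx = sum(z[0] for z in Z) / len(Z)
        my = sum(z[1] for z in Z) / len(Z)
        return [(z[0] - mx, z[1] - my) for z in Z]
    def corr(u, v):
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(x * x for x in v) ** 0.5
        return sum(x * y for x, y in zip(u, v)) / (nu * nv + 1e-12)
    Xc, Yc = center(X), center(Y)
    best = (-1.0, 0.0, 0.0)
    for i in range(steps):
        for j in range(steps):
            a, b = math.pi * i / steps, math.pi * j / steps
            u = [x[0] * math.cos(a) + x[1] * math.sin(a) for x in Xc]
            v = [y[0] * math.cos(b) + y[1] * math.sin(b) for y in Yc]
            c = abs(corr(u, v))  # sign of a direction is arbitrary
            if c > best[0]:
                best = (c, a, b)
    return best
```

When one view is a rigid rotation of the other, some pair of directions aligns the projections perfectly and the maximal correlation approaches 1, which is the sense in which the correlation subspace "associates" the two domains.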
While bag-of-features (BOF) models have been widely applied to image retrieval problems, the resulting performance is typically limited because they disregard the spatial information of local image descriptors (and of the associated visual words). In this paper, we present a novel spatial pooling scheme, called extended bag-of-features (EBOF), for solving the above task. Besides improving image representation capability, incorporating our EBOF model with a proposed circular-correlation-based similarity measure allows us to perform translation-, rotation-, and scale-invariant image retrieval. We conduct experiments on two benchmark image datasets, and the performance confirms the effectiveness and robustness of our proposed approach.
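A circular-correlation similarity can be sketched directly as the maximum correlation over all cyclic shifts of one feature sequence against another (a direct O(n^2) toy version; in practice an FFT-based implementation would give O(n log n), and the actual EBOF descriptors differ from these plain vectors):

```python
def circular_corr_sim(a, b):
    """Similarity between two equal-length feature sequences, taken
    as the maximum inner product over all cyclic shifts of b.

    Because an image rotation cyclically shifts an angularly pooled
    descriptor, maximizing over shifts makes the match rotation-
    invariant.
    """
    n = len(a)
    return max(sum(a[i] * b[(i + s) % n] for i in range(n))
               for s in range(n))
```

A cyclically shifted copy of a descriptor thus scores exactly as high as the descriptor matched against itself, which is what makes the retrieval rotation-invariant.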