Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, for which label collection is time-consuming and expensive. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The training of the VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve the correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism can also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement model require only clean speech for training, without any labels. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines.
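As a rough illustration of the core idea, the sketch below scores an utterance by how far its latent frames fall from a codebook learned on clean speech. Here `encoder` and `codebook` are hypothetical stand-ins for the paper's trained VQ-VAE components, and the exact distance and normalization used by VQScore may differ.

```python
import torch

def vqscore_sketch(encoder, codebook, waveform):
    """Minimal sketch: rate speech quality by VQ-VAE quantization error.

    encoder:  maps a waveform to latent frames of shape (T, D)
    codebook: (K, D) code vectors learned on clean speech
    Both are hypothetical stand-ins for the paper's trained model.
    """
    with torch.no_grad():
        z = encoder(waveform)            # (T, D) latent frames
        d = torch.cdist(z, codebook)     # (T, K) frame-to-code distances
        q_err = d.min(dim=1).values      # nearest-code (quantization) error
    # Clean speech quantizes well -> small error -> high score.
    return -q_err.mean().item()
```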
This study investigates importance sampling under the mini-batch stochastic gradient descent scheme, and its contributions are twofold. First, theoretically, we develop a neat tilting formula, which can be regarded as a general device for asymptotically optimal importance sampling. Second, practically, guided by the formula, we present an effective importance sampling algorithm that accounts for the effects of mini-batches and leverages the Markovian property of the gradients between iterations. Experiments conducted on artificial data confirm that our algorithm consistently delivers superior variance reduction. Furthermore, experiments carried out on real-world data demonstrate that our method, when paired with relatively straightforward models such as multilayer perceptrons (MLP) and convolutional neural networks (CNN), achieves better training loss and testing error.
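The snippet below is a minimal sketch of the general mechanism, assuming per-example weights (e.g., approximate gradient norms, which the paper suggests can be reused across iterations thanks to their Markovian behaviour) are already available; it is not the paper's tilting formula, only the standard reweighted estimator it would plug into.

```python
import numpy as np

def sample_minibatch(weights, batch_size, rng):
    """Importance-sampled mini-batch selection with unbiasedness correction.

    weights: per-example importance scores (assumed here to approximate
             gradient norms; the paper's tilting formula is more refined).
    """
    p = weights / weights.sum()                      # sampling distribution
    idx = rng.choice(len(p), size=batch_size, p=p)   # sample proportionally
    # Reweight each sampled gradient by 1 / (N * p_i) so the mini-batch
    # average remains an unbiased, lower-variance full-gradient estimate.
    iw = 1.0 / (len(p) * p[idx])
    return idx, iw

rng = np.random.default_rng(0)
grad_norms = np.abs(rng.normal(size=1000)) + 1e-3    # toy importance scores
idx, iw = sample_minibatch(grad_norms, 32, rng)
```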
In this paper, we address the challenge of discovering financial signals in narrative financial reports. As these documents are often lengthy and tend to blend routine information with new information, it is challenging for professionals to discern critical financial signals. To this end, we leverage the inherent year-to-year structure of these reports to define a novel signal-highlighting task; more importantly, we propose a compare-and-contrast multistage pipeline that recognizes different relationships between the reports and locates relevant rationales for these relationships. We also create and publicly release a human-annotated dataset for our task. Our experiments on the dataset validate the effectiveness of our pipeline, and we provide detailed analyses and ablation studies to support our findings.
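To make the year-to-year comparison concrete, here is a hypothetical first-pass stage: align this year's sentences with last year's by TF-IDF cosine similarity and flag poorly matched ones as candidate new signals. The paper's pipeline is multistage and learned; this only illustrates the compare-and-contrast idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def highlight_new(prev_sents, curr_sents, threshold=0.5):
    """Flag sentences in this year's report with no close counterpart
    in last year's report (threshold is an illustrative choice)."""
    vec = TfidfVectorizer().fit(prev_sents + curr_sents)
    sim = cosine_similarity(vec.transform(curr_sents),
                            vec.transform(prev_sents))
    # A sentence that matches nothing from last year is likely "new".
    return [s for s, row in zip(curr_sents, sim) if row.max() < threshold]
```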
Transfer learning is empirically known to perform efficiently in many applications, yet limited literature explains the mechanism behind the scenes. This study establishes both formal derivations and heuristic analysis to formulate a theory of transfer learning in deep learning. Our framework, utilizing layer variational analysis, proves that the success of transfer learning can be guaranteed under corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations of the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation, and it is particularly advantageous when new-domain data is scarce during adaptation. Numerical experiments over diverse tasks validate our theory and verify that our analytic expression achieves better performance in domain adaptation than the gradient descent method.
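For flavor only: the sketch below adapts the final linear layer of a pretrained network in closed form (ridge least squares on frozen features) instead of by gradient descent. This mirrors the spirit of an analytic adaptation expression, under assumptions of ours, and is not the paper's exact derivation.

```python
import numpy as np

def adapt_last_layer(features, targets, ridge=1e-3):
    """Closed-form readout adaptation on frozen features.

    features: (n, d) target-domain representations from the frozen backbone
    targets:  (n, k) desired outputs
    ridge:    regularizer keeping the solution stable when data is scarce
    """
    d = features.shape[1]
    # Ridge-regularized normal equations solved directly, no iterations.
    W = np.linalg.solve(features.T @ features + ridge * np.eye(d),
                        features.T @ targets)
    return W  # new (d, k) readout weights
```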
D4AM: A General Denoising Framework for Downstream Acoustic Models
Chi-Chang Lee, Yu Tsao, Hsin-Min Wang, Chu-Song Chen. ICLR 2023 (poster).
Keywords: audio processing, speech enhancement, robust automatic speech recognition, auxiliary task learning
TL;DR: We propose a general denoising framework for various downstream acoustic models (D4AM) by adopting an effective joint training scheme with the regression (denoising) objective and the classification (ASR) objective.
Abstract: The performance of acoustic models degrades notably in noisy environments. Speech enhancement (SE) can be used as a front-end strategy to aid automatic speech recognition (ASR) systems. However, existing training objectives of SE methods are not fully effective at integrating speech-text and noise-clean paired data for training toward unseen ASR systems. In this study, we propose a general denoising framework, D4AM, for various downstream acoustic models. Our framework fine-tunes the SE model with the backward gradient according to a specific acoustic model and the corresponding classification objective. In addition, our method considers the regression objective as an auxiliary loss to make the SE model generalize to other unseen acoustic models. To jointly train an SE unit with regression and classification objectives, D4AM uses an adjustment scheme to directly estimate suitable weighting coefficients rather than undergoing a grid search process with additional training costs. The adjustment scheme consists of two parts: gradient calibration and regression objective weighting. The experimental results show that D4AM can consistently and effectively provide improvements to various unseen acoustic models and outperforms other combination setups. Specifically, when evaluated on the Google ASR API with real noisy data completely unseen during SE training, D4AM achieves a relative WER reduction of 24.65% compared with the direct feeding of noisy input. To our knowledge, this is the first work that deploys an effective combination scheme of regression (denoising) and classification (ASR) objectives to derive a general pre-processor applicable to various unseen ASR systems.
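A heavily simplified sketch of the joint objective follows. The names `se_model` and `asr_loss_fn` are hypothetical, and the real D4AM adjustment scheme estimates the weighting coefficient via gradient calibration and regression objective weighting rather than the bare gradient-norm ratio used here for illustration.

```python
import torch
import torch.nn.functional as F

def joint_step(se_model, optimizer, noisy, clean, asr_loss_fn):
    """Sketch: classification (ASR) loss backpropagated through the SE
    front-end, with the regression (denoising) loss as an auxiliary term.
    `asr_loss_fn` is assumed to score enhanced speech against its
    transcript using a frozen acoustic model."""
    params = list(se_model.parameters())
    enhanced = se_model(noisy)
    l_cls = asr_loss_fn(enhanced)          # ASR loss through the SE output
    l_reg = F.l1_loss(enhanced, clean)     # denoising (regression) loss

    g_cls = torch.autograd.grad(l_cls, params, retain_graph=True)
    g_reg = torch.autograd.grad(l_reg, params, retain_graph=True)
    norm = lambda gs: torch.sqrt(sum(g.pow(2).sum() for g in gs))
    # Crude stand-in for D4AM's adjustment scheme: match gradient scales.
    alpha = (norm(g_cls) / (norm(g_reg) + 1e-12)).detach()

    optimizer.zero_grad()
    (l_cls + alpha * l_reg).backward()     # weighted joint objective
    optimizer.step()
    return l_cls.item(), l_reg.item(), alpha.item()
```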
High-fidelity kinship face synthesis is a challenging task due to the limited amount of kinship data available for training and the low quality of the images. In addition, it is hard to trace the genetic traits between parents and children from such low-quality training images. To address these issues, we leverage a pre-trained state-of-the-art face synthesis model, StyleGAN2, for kinship face synthesis. To handle large age, gender, and other attribute variations between parents and their children, we conduct a thorough study of its rich latent spaces and of different encoder architectures to arrive at an optimized encoder design that repurposes StyleGAN2 for kinship face synthesis. The latent representation obtained from our encoder pipeline with stage-wise training strikes a better balance between editability and synthesis fidelity for identity preservation and attribute manipulation than compared approaches. With extensive subjective, quantitative, and qualitative evaluations, the proposed approach consistently achieves better facial attribute heredity and image generation fidelity than compared state-of-the-art methods, demonstrating that promising and satisfactory kinship face synthesis can be achieved with only a single, straightforward encoder architecture.
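As a hypothetical illustration of operating in StyleGAN2's latent space (not the paper's trained encoder), once both parents are inverted into W+ space, a child code can be sketched as a per-layer blend of the two parent codes:

```python
import torch

def blend_parent_codes(w_father, w_mother, weights=None):
    """Illustrative per-layer convex blend of two W+ codes.

    w_father, w_mother: (num_layers, 512) codes from a GAN-inversion
    encoder (assumed available). Coarse layers govern pose/shape,
    fine layers govern texture/color, so per-layer weights allow
    attribute-aware mixing.
    """
    if weights is None:
        weights = torch.full((w_father.shape[0], 1), 0.5)  # even blend
    return weights * w_father + (1 - weights) * w_mother
```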
This paper proposes an encoder-decoder architecture for kidney segmentation. A hyperparameter optimization process is implemented, covering model architecture development, the selection of a windowing method and a loss function, and data augmentation. The model, which consists of EfficientNet-B5 as the encoder and a feature pyramid network as the decoder, yields the best performance with a Dice score of 0.969 on the 2019 Kidney and Kidney Tumor Segmentation Challenge dataset. The proposed model is tested with different voxel spacings, anatomical planes, and kidney and tumor volumes. Moreover, case studies are conducted to analyze segmentation outliers. Finally, five-fold cross-validation and the 3D-IRCAD-01 dataset are used to evaluate the developed model in terms of the Dice score, recall, precision, and the Intersection over Union score. This paper thus demonstrates a new development and application of artificial intelligence algorithms for image analysis and interpretation. Overall, our experimental results show that the proposed kidney segmentation solution for CT images can be applied to clinical needs to assist surgeons in surgical planning. It enables calculation of the total kidney volume for kidney function estimation in ADPKD and supports radiologists and doctors in disease diagnosis and in tracking disease progression.
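The reported architecture can be assembled in a few lines with the segmentation_models_pytorch package (an assumption of ours; the paper may use its own implementation, and the channel/class counts below are illustrative):

```python
import segmentation_models_pytorch as smp

# EfficientNet-B5 encoder + feature pyramid network (FPN) decoder.
model = smp.FPN(
    encoder_name="efficientnet-b5",
    encoder_weights="imagenet",   # pretrained encoder weights
    in_channels=1,                # single-channel CT slices (assumed)
    classes=3,                    # background / kidney / tumor (assumed)
)
# Dice loss matches the Dice-score evaluation metric.
loss_fn = smp.losses.DiceLoss(mode="multiclass")
```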
Previously, doctors interpreted computed tomography (CT) images based on their experience when diagnosing kidney diseases. However, with the rapid increase in the number of CT images, such interpretation requires considerable time and effort and produces inconsistent results. To solve this problem, several novel neural network models have been proposed to automatically identify kidney or tumor areas in CT images. In most of these models, only the neural network structure was modified to improve accuracy; however, data pre-processing is also a crucial step for improving the results. This study systematically discusses the pre-processing methods required before medical images are fed into a neural network model. The experimental results show that the proposed pre-processing methods and models significantly improve the accuracy compared with the case without data pre-processing. Specifically, the Dice score improved from 0.9436 to 0.9648 for kidney segmentation and reached 0.7294 for all types of tumor detection. With the proposed medical image processing methods and deep learning models, the performance is suitable for clinical applications with modest computational resources, achieving cost-efficient and accurate automatic kidney volume calculation and tumor detection.
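CT windowing is a representative pre-processing step of the kind the study evaluates: clip Hounsfield units to a diagnostic window and rescale. The center/width values below are typical soft-tissue settings, not necessarily the ones chosen in the paper.

```python
import numpy as np

def window_ct(hu, center=50, width=400):
    """Clip a CT slice (in Hounsfield units) to a window and scale to [0, 1].

    center/width: window settings; 50/400 is a common soft-tissue window
    (an illustrative choice, not the paper's tuned values).
    """
    lo, hi = center - width / 2, center + width / 2
    hu = np.clip(hu, lo, hi)        # suppress irrelevant intensity range
    return (hu - lo) / (hi - lo)    # normalize for the network input
```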
In visual search, the gallery set could be incrementally growing and added to the database in practice. However, existing methods rely on the model trained on the entire dataset, ignoring the continual updating of the model. Besides, as the model updates, the new model must re-extract features for the entire gallery set to maintain a compatible feature space, imposing a high computational cost for a large gallery set. To address the issues of long-term visual search, we introduce a continual learning (CL) approach that can handle the incrementally growing gallery set with backward embedding consistency. We enforce the losses of inter-session data coherence, neighbor-session model coherence, and intra-session discrimination to train the continual learner. In addition to the disjoint setup, our CL solution also tackles the situation of increasingly adding new classes with blurry boundaries, without assuming that all categories are known at the beginning or during model updates. To our knowledge, this is the first CL method that both tackles the issue of backward-consistent feature embedding and allows novel classes to occur in new sessions. Extensive experiments on various benchmarks show the efficacy of our approach under a wide range of setups.
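A minimal sketch of the training signal is given below, assuming an embedding network with a linear classification head; the cosine-based consistency term and the weighting are illustrative simplifications of the paper's inter-session, neighbor-session, and intra-session losses.

```python
import torch
import torch.nn.functional as F

def session_loss(new_model, old_model, images, labels, lam=1.0):
    """Sketch: discrimination on the current session plus a backward
    consistency term that pins new embeddings to the frozen old model's
    space, so the gallery never needs re-extraction.

    new_model is assumed to expose a `head` linear classifier; names
    and the lam weighting are hypothetical.
    """
    z_new = new_model(images)
    with torch.no_grad():
        z_old = old_model(images)           # frozen previous-session model
    # Intra-session discrimination (cross-entropy over a linear head).
    l_cls = F.cross_entropy(new_model.head(z_new), labels)
    # Backward consistency: keep new features compatible with old ones.
    l_consist = 1 - F.cosine_similarity(z_new, z_old).mean()
    return l_cls + lam * l_consist
```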