In-memory techniques keep data in faster and more expensive storage media to improve the performance of big data processing. However, existing mechanisms do not consider how to expedite data processing applications that access their input datasets only once. Another problem is how to reclaim memory without affecting other running applications. In this paper, we provide scheduling-aware data prefetching and eviction mechanisms based on Spark, Alluxio, and Hadoop. The mechanisms prefetch data and release memory resources based on scheduling information. A mathematical method is proposed for maximizing the reduction of data access time. To make the mechanisms applicable in large-scale environments, we propose a heuristic algorithm that reduces the computation time. Furthermore, an enhanced version of the heuristic algorithm is proposed to increase the amount of prefetched data. Finally, we perform real-testbed and simulation experiments to show the effectiveness of the proposed mechanisms.
Anomaly detection has been an important research topic in data mining and machine learning. Many real-world applications, such as intrusion detection or credit card fraud detection, require an effective and efficient framework to identify deviated data instances. However, most anomaly detection methods are implemented in batch mode, and thus cannot be easily extended to large-scale problems without excessive computation and memory requirements. In this paper, we propose an online oversampling principal component analysis (osPCA) algorithm to address this problem, and we aim at detecting the presence of outliers in a large amount of data via an online updating technique. Unlike prior principal component analysis (PCA)-based approaches, we do not store the entire data matrix or covariance matrix, and thus our approach is of particular interest for online or large-scale problems. By oversampling the target instance and extracting the principal direction of the data, the proposed osPCA allows us to determine how anomalous the target instance is according to the variation of the resulting dominant eigenvector. Since osPCA need not perform eigen analysis explicitly, the proposed framework is well suited to online applications that have computation or memory limitations. Compared with the well-known power method for PCA and other popular anomaly detection algorithms, our experimental results verify the feasibility of the proposed method in terms of both accuracy and efficiency.
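The oversampling idea in the abstract above can be illustrated with a minimal batch sketch. It is not the online update that the abstract describes: for clarity it stores the second-moment matrix and runs the power method for every target instance, which is exactly the cost osPCA is designed to avoid. The function names, the `ratio` parameter, and the score `1 - |cos|` between the original and the perturbed dominant eigenvectors are assumptions made for illustration.

```python
import numpy as np

def dominant_direction(cov, iters=200, seed=0):
    """Power method: approximate the unit dominant eigenvector of cov."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(cov.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = cov @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break  # cov is (numerically) zero; keep the current vector
        v = w / norm
    return v

def oversampling_scores(X, ratio=0.1):
    """Score each row of X by how much duplicating it rotates the
    principal direction. `ratio` is the assumed oversampling ratio:
    the target is treated as if it were copied ratio * n times."""
    n, _ = X.shape
    mu = X.mean(axis=0)
    Q = X.T @ X / n                      # second-moment matrix
    u = dominant_direction(Q - np.outer(mu, mu))
    scores = np.empty(n)
    for t, x in enumerate(X):
        # Closed-form mean and second moment after oversampling x,
        # so the duplicated rows are never materialized.
        mu_t = (mu + ratio * x) / (1.0 + ratio)
        Q_t = (Q + ratio * np.outer(x, x)) / (1.0 + ratio)
        u_t = dominant_direction(Q_t - np.outer(mu_t, mu_t))
        scores[t] = 1.0 - abs(u @ u_t)   # 0: unchanged, 1: orthogonal
    return scores
```

An instance that lies far from the principal direction of the rest of the data pulls the dominant eigenvector toward itself once it is oversampled, so it receives a score near 1, while typical instances leave the direction almost unchanged and score near 0.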
In this paper, we improve the efficiency of kernelized support vector machines (SVMs) for image classification using a linearized kernel data representation. Inspired by the Nyström approximation, we propose a decomposition technique for converting the kernel data matrix into an approximated primal form. This allows us to perform data classification in the kernel space using linear SVMs. Our method offers several benefits. First, we advance basis matrix selection for decomposing our proposed Nyström approximation, which can be considered as feature/instance selection with performance guarantees. As a result, classifying approximated kernelized data in primal form using linear SVMs achieves recognition performance comparable to that of nonlinear SVMs. More importantly, the proposed selection technique significantly reduces the computational complexity of both training and testing, and thus makes our computation time comparable to that of linear SVMs. Experiments on two benchmark datasets support the use of our approach for image classification tasks.
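A minimal sketch of the standard Nyström feature map may help make the "approximated primal form" concrete. It assumes an RBF kernel and takes the basis instances as an argument; the basis selection technique proposed in the abstract is not reproduced here, and the function names, `gamma`, and `eps` are hypothetical choices for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||a - b||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :]
    sq -= 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, basis, gamma=1.0, eps=1e-10):
    """Map X to explicit features Z such that Z @ Z.T approximates the
    full kernel matrix as K_nm @ pinv(K_mm) @ K_nm.T."""
    K_mm = rbf_kernel(basis, basis, gamma)   # kernel among basis points
    K_nm = rbf_kernel(X, basis, gamma)       # kernel between data and basis
    w, V = np.linalg.eigh(K_mm)              # K_mm = V diag(w) V^T
    keep = w > eps                           # drop near-null directions
    return K_nm @ (V[:, keep] / np.sqrt(w[keep]))
```

The rows of `Z` can then be passed to any linear SVM solver, so training and prediction scale with the number of basis instances rather than with the full kernel matrix. When every training instance is used as a basis point, `Z @ Z.T` recovers the kernel matrix up to numerical error; with fewer basis points it is a low-rank approximation whose quality depends on how the basis is chosen.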