Funding: Projects (61003075, 61170261) supported by the National Natural Science Foundation of China
Abstract: Most transactional memory (TM) research has focused on multi-core processors, and some work has investigated clusters, leaving non-uniform memory access (NUMA) systems largely unexplored. Existing TM implementations suffer significant performance degradation on NUMA systems because they ignore the higher latency of remote memory accesses. To solve this problem, a latency-based conflict detection method and a forecasting-based conflict prevention method were proposed, and a NUMA-aware TM system was built on these techniques. Experimental results show that, by reducing remote memory accesses and the transaction abort rate, the NUMA-aware strategies deliver good practical TM performance on NUMA systems.
Funding: Project (IRT0725) supported by the Changjiang Innovative Group of Ministry of Education, China
Abstract: Data deduplication, as a compression method, has been widely used in most backup systems to improve bandwidth and space efficiency. As the volume of data to be backed up has exploded, two main challenges in data deduplication are the CPU-intensive chunking and hashing work and the I/O-intensive disk-index access latency. Because the CPU-intensive work has been extensively parallelized and accelerated by multi-core and many-core processors, I/O latency is likely to become the bottleneck in data deduplication. To alleviate I/O latency in multi-core systems, a multi-threaded deduplication (Multi-Dedup) architecture was proposed. The main idea of Multi-Dedup is to use parallel deduplication threads to hide the I/O latency. A prefix-based concurrent index was designed to maintain the internal consistency of the deduplication index with low synchronization overhead. In addition, a collisionless cache array was designed to preserve locality and similarity within the parallel threads. In experiments on various real-world datasets, Multi-Dedup achieves 3-5 times performance improvement when combined with the locality-based ChunkStash and local-similarity-based SiLo methods. Multi-Dedup also dramatically decreases the synchronization overhead and achieves 1.5-2 times performance improvement compared with traditional lock-based synchronization methods.
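The abstract does not describe how the prefix-based concurrent index works internally. As a rough illustration only, one common way to cut contention on a fingerprint index is to split it into shards selected by a prefix of the chunk hash, so that threads handling chunks with different prefixes never touch the same shard. The sketch below assumes that reading; the class `PrefixShardedIndex`, the helper `count_unique`, the use of SHA-1, and the per-shard locks are all hypothetical stand-ins and not the Multi-Dedup design.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class PrefixShardedIndex:
    """Hypothetical fingerprint index split into shards by hash prefix."""

    def __init__(self, prefix_bits=4):
        # prefix_bits must be between 1 and 8 because only the first byte is used.
        self.prefix_bits = prefix_bits
        shard_count = 1 << prefix_bits
        self.shards = [dict() for _ in range(shard_count)]
        # Assumption: one lock per shard; the paper's mechanism may differ.
        self.locks = [threading.Lock() for _ in range(shard_count)]

    def _shard_of(self, fingerprint):
        # Select the shard from the top bits of the first fingerprint byte.
        return fingerprint[0] >> (8 - self.prefix_bits)

    def insert_if_absent(self, fingerprint, location):
        """Return True if the fingerprint is new, False if it is a duplicate."""
        i = self._shard_of(fingerprint)
        with self.locks[i]:
            if fingerprint in self.shards[i]:
                return False
            self.shards[i][fingerprint] = location
            return True


def count_unique(chunks, index, workers=4):
    """Fingerprint chunks on parallel threads and count the unique ones."""

    def process(chunk):
        fingerprint = hashlib.sha1(chunk).digest()
        return index.insert_if_absent(fingerprint, len(chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process, chunks))
```

Because each shard is guarded separately, two threads block each other only when their fingerprints share a prefix, which is the property this sketch is meant to show.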
Funding: Projects (2009AA01Z124, 2009AA01Z102) supported by the National High Technology Research and Development Program of China; Projects (60970036, 61076025) supported by the National Natural Science Foundation of China
Abstract: Chip multiprocessors (CMPs) allow thread-level parallelism and thus increase performance. However, this comes at the cost of thermal problems: CMPs require more power, creating a non-uniform power map and hotspots. To address this problem, a thread scheduling algorithm, the greedy scheduling algorithm, was proposed to reduce thermal emergencies and improve throughput. The greedy scheduling algorithm was implemented in the Linux kernel on an Intel quad-core system. Experimental results show that it reduces hardware dynamic thermal management (DTM) by 9.6%-78.5% across various combinations of workloads, and achieves throughput that is 5.2% higher on average, and up to 9.7% higher, than the standard Linux scheduler.
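The abstract names the greedy scheduling algorithm but does not give its selection rule. A minimal sketch of one plausible greedy policy, assuming the scheduler knows a current temperature per core and a heat estimate per thread, is to pair the hottest thread with the coolest core. The function `greedy_assign` and both input dictionaries are hypothetical and are not taken from the paper or from the Linux kernel.

```python
def greedy_assign(core_temps, thread_heat):
    """Pair the hottest threads with the coolest cores.

    core_temps:  {core_id: current temperature}  (assumed input)
    thread_heat: {thread_id: estimated heat contribution}  (assumed input)
    Returns {thread_id: core_id}. If there are more threads than cores,
    the threads with the lowest heat estimates are left unassigned.
    """
    coolest_first = sorted(core_temps, key=core_temps.get)
    hottest_first = sorted(thread_heat, key=thread_heat.get, reverse=True)
    return dict(zip(hottest_first, coolest_first))
```

For example, with cores at 70, 55, 62 and 58 degrees, the thread with the highest heat estimate is placed on the 55-degree core and the one with the lowest on the 70-degree core.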