Funding: Supported by the Basic Science Research Program through the National Research Foundation (2015R1D1A3A01019869), Korea.
Abstract: In modern high-performance computing, GPUs are regarded as excellent accelerators for general-purpose, data-intensive parallel applications. Many performance-oriented optimization techniques have been proposed to achieve application speedup on GPUs. However, given the growing emphasis on power and energy consumption, power- and energy-aware GPU optimization also needs detailed analysis alongside performance-oriented optimization. To explore the impact of various optimization strategies on GPU performance, power, and energy consumption, this work evaluates a well-known application running on different commercial GPU devices under different optimization strategies. To capture more general performance and power consumption patterns of GPU-based acceleration, the evaluations cover three NVIDIA GPU generations (Fermi, Kepler, and Maxwell architectures) and various core and memory clock frequencies. We analyze how GPU kernel execution is affected by optimization and which GPU architectural factors most affect its performance and power/energy consumption. The paper also categorizes which optimization technique primarily improves which metric (performance, power, or energy efficiency). Furthermore, voltage frequency scaling (VFS) is applied to examine the effect of changing the clock frequency on these metrics. Overall, the work shows that effective GPU optimization strategies can significantly improve application performance without increasing power and energy consumption.
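The three metrics in the abstract above are linked by the identity energy = average power × runtime. The sketch below shows how an optimized kernel might be compared with a baseline on all three. The function names and the wattage and runtime figures are hypothetical illustrations, not measurements from the paper.

```python
def energy_joules(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy consumed by a kernel run: average power multiplied by runtime."""
    return avg_power_watts * runtime_seconds


def compare_runs(base_power, base_time, opt_power, opt_time):
    """Return speedup, power ratio, and energy ratio of optimized vs. baseline.

    A ratio below 1.0 means the optimized run uses less power or energy.
    """
    base_energy = energy_joules(base_power, base_time)
    opt_energy = energy_joules(opt_power, opt_time)
    return {
        "speedup": base_time / opt_time,
        "power_ratio": opt_power / base_power,
        "energy_ratio": opt_energy / base_energy,
    }


# Hypothetical numbers: same average power, shorter runtime after optimization.
result = compare_runs(base_power=100.0, base_time=10.0,
                      opt_power=100.0, opt_time=4.0)
```

With equal average power, the energy ratio is simply the inverse of the speedup, which is one way an optimization can improve performance without raising power or energy.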
Abstract: To address premature convergence in the search process of genetic algorithms, a chaotic migration-based pseudo parallel genetic algorithm (CMPPGA) is proposed. It applies the isolated evolution and information exchange of distributed parallel genetic algorithms within a serial program structure to solve optimization problems with low real-time demands. In this algorithm, asynchronous migration of individuals during parallel evolution is guided by a chaotic migration sequence. Because the sequence is ergodic and stochastic, information exchange among sub-populations is efficient and sufficient. A simulation study of CMPPGA shows strong global search ability, superiority over the standard genetic algorithm, and high immunity to premature convergence. Based on raw material supply practice, an inventory programming model is set up and solved by CMPPGA with satisfactory results.
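A chaotic migration sequence of the kind described above could be sketched as follows. The abstract does not say which chaotic map CMPPGA uses, so the logistic map with r = 4 is an assumption here, and `logistic_sequence` and `migration_pairs` are hypothetical names chosen for illustration.

```python
def logistic_sequence(x0: float, n: int, r: float = 4.0) -> list:
    """Iterate the logistic map x -> r*x*(1-x); values stay in [0, 1] for r = 4."""
    xs = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs


def migration_pairs(x0: float, n_events: int, n_subpops: int) -> list:
    """Turn chaotic values into (source, destination) sub-population indices.

    Two chaotic values are consumed per migration event: one picks the source,
    the other picks a non-zero offset so the destination differs from the source.
    Requires n_subpops >= 2.
    """
    xs = logistic_sequence(x0, 2 * n_events)
    pairs = []
    for i in range(n_events):
        src = min(int(xs[2 * i] * n_subpops), n_subpops - 1)
        offset = 1 + min(int(xs[2 * i + 1] * (n_subpops - 1)), n_subpops - 2)
        pairs.append((src, (src + offset) % n_subpops))
    return pairs


# Assumed seed; values such as 0.5 should be avoided because they collapse to 0.
pairs = migration_pairs(x0=0.3141, n_events=20, n_subpops=4)
```

Because the pairs come from a deterministic chaotic orbit rather than a fixed schedule, a serial program can visit sub-populations in an irregular order while remaining reproducible for a given seed.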
Funding: Supported by the National Natural Science Foundation of China (61771293) and the Key Project of Shandong Province (2019JZZY010111).
Abstract: As a typical NP-complete problem, the traveling salesman problem (TSP) is widely used in computer networks, logistics distribution, and other fields. In this paper, a discrete lion swarm optimization (DLSO) algorithm is proposed to solve the TSP. First, discrete coding and order crossover operators are introduced into DLSO. Second, the complete 2-opt (C2-opt) algorithm is used to enhance local search ability. Then, to improve efficiency, a parallel discrete lion swarm optimization (PDLSO) algorithm is proposed. PDLSO has multiple populations, and each sub-population independently runs the DLSO algorithm in parallel. A ring topology is used to transfer information between sub-populations. Experiments on several TSP benchmark problems show that the DLSO algorithm is more accurate than the other algorithms compared, and that the PDLSO algorithm effectively shortens the running time.
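Two ingredients named in the abstract above, 2-opt local search and ring-topology information transfer, can be sketched in a few lines. This is a plain first-improvement 2-opt, not the paper's complete 2-opt (C2-opt) variant, and `two_opt` and `ring_migrate` are hypothetical helper names.

```python
import math


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def two_opt(tour, pts):
    """Repeatedly reverse tour segments while doing so shortens the tour."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, pts) < tour_length(best, pts) - 1e-12:
                    best = candidate
                    improved = True
    return best


def ring_migrate(best_per_subpop):
    """Ring topology: sub-population k receives the best tour of sub-population k-1."""
    n = len(best_per_subpop)
    return [best_per_subpop[(k - 1) % n] for k in range(n)]


# Unit square visited in a self-crossing order; 2-opt should uncross it.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
improved_tour = two_opt([0, 1, 2, 3], points)
```

In a parallel setting each sub-population would run its own search and call something like `ring_migrate` at fixed intervals, so every sub-population communicates with only one neighbor.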
Funding: Projects (11372055, 11302033) supported by the National Natural Science Foundation of China; Project supported by the Huxiang Scholar Foundation from Changsha University of Science and Technology, China; Project (2012KFJJ02) supported by the Key Laboratory of Lightweight and Reliability Technology for Engineering Vehicle, Education Department of Hunan Province, China.
Abstract: A methodology for topology optimization based on element independent nodal density (EIND) is developed. Nodal densities serve as the design variables and are interpolated onto the element space with the Shepard interpolation function to determine the density at any point. The influence of the interpolation diameter is discussed, and the method shows good robustness to it. The new approach is demonstrated on the minimum volume problem subject to a displacement constraint. The rational approximation for material properties (RAMP) method and a dual programming optimization algorithm are used to penalize intermediate density points and achieve nearly 0-1 solutions. Solutions are shown to satisfy the stability, mesh dependence, and non-checkerboard requirements of topology optimization without additional constraints. Finally, computational efficiency is greatly improved by multithreaded parallel computing with OpenMP.
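The interpolation and penalization steps in the abstract above can be illustrated with the standard textbook forms of Shepard (inverse-distance) interpolation and the RAMP rule E(ρ) = E_min + ρ / (1 + q(1 − ρ)) · (E_0 − E_min). The paper's exact weight function and parameters are not given in the abstract, so the inverse-distance exponent, the `radius` cutoff, and the function names below are assumptions.

```python
import math


def shepard_density(point, nodes, nodal_densities, radius, power=2.0):
    """Inverse-distance-weighted density at `point` from nodes within `radius`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for node, rho in zip(nodes, nodal_densities):
        d = math.dist(point, node)
        if d > radius:
            continue  # node lies outside the influence domain
        if d == 0.0:
            return rho  # the point coincides with a node
        w = d ** (-power)
        weighted_sum += w * rho
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no nodes inside the influence radius")
    return weighted_sum / weight_total


def ramp_modulus(rho, e0, q, e_min=0.0):
    """RAMP interpolation: penalizes intermediate densities when q > 0."""
    return e_min + rho / (1.0 + q * (1.0 - rho)) * (e0 - e_min)


# A point midway between a void node (0.0) and a solid node (1.0).
mid_density = shepard_density((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)],
                              [0.0, 1.0], radius=1.5)
```

With q > 0, a density of 0.5 yields well under half of the full stiffness, which makes intermediate densities uneconomical and pushes the optimizer toward nearly 0-1 designs.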
Abstract: Remarks are made on a benchmark nonlinear constrained optimization problem. Owing to a citation error, independent researchers have obtained two entirely different results for the benchmark problem. In this study, parallel simulated annealing using the simplex method is employed to solve the benchmark nonlinear constrained problem with the mistaken formula, and the best-known solution is obtained; its optimality is verified by the Kuhn-Tucker conditions.
Funding: Project (2008AA01A201) supported by the National High-tech Research and Development Program of China; Projects (60833004, 60633050) supported by the National Natural Science Foundation of China.
Abstract: Developing parallel applications on heterogeneous processors faces the "memory wall" challenge, caused by the limited capacity of local storage, limited bandwidth, and long memory access latency. To address this problem, a parallelization approach with six memory optimization schemes is proposed for CG, four of which target all kinds of sparse matrix-vector multiplication (SPMV) operations. On the IBM QS20, the parallelization approach reaches speedups of up to 21 and 133 times for sizes A and B, respectively, compared with a single Power Processor Element. The conclusions are that the peak memory access bandwidth on the Cell BE can be reached in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local storage access latency while executing scalar operations on SIMD cores.
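The SPMV kernel that dominates this workload can be sketched as below. The abstract does not state the storage format used, so the compressed sparse row (CSR) layout and the name `spmv_csr` are assumptions, and this serial version leaves out the Cell-specific memory optimizations.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : non-zero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect access x[col_idx[k]] is the irregular, memory-bound part.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]] in CSR form.
values = [4.0, 1.0, 1.0, 3.0, 2.0]
col_idx = [0, 1, 0, 1, 2]
row_ptr = [0, 2, 4, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Each non-zero contributes one multiply-add but requires loading a value, a column index, and an indirectly addressed vector element, which is why memory bandwidth and latency rather than arithmetic tend to limit this kernel.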