HPC Plus Academic Forum

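The 2015 National Annual Conference on High Performance Computing (HPC China 2015) will be held in Wuxi on November 10-12, 2015. The conference is hosted by the China Computer Federation (CCF) and jointly organized by the CCF Technical Committee on High Performance Computing and Jiangnan University. The conference theme is "Rooted in Chinese Chips, Dreaming of a Supercomputing Power" (根植中国芯 超算强国梦).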

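After deliberation, the secretariat of the CCF Technical Committee on High Performance Computing has decided to hold the "HPC Plus" themed forum at HPC China 2015. Committee members who authored papers published in CCF A-ranked journals or conferences during 2014-2015 are invited to give academic talks at the forum, representing the best academic results of the HPC field over the past year.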

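Time | Title | Speaker
14:00-14:25 | yaSpMV: Yet Another SpMV Framework on GPUs | Shengen Yan (颜深根), SenseTime
14:25-14:50 | Large-scale Neo-heterogeneous Programming and Optimization of SNP Detection on Tianhe-2 | Shaoliang Peng (彭绍亮), National University of Defense Technology
14:50-15:15 | PathGraph: A Path-Centric System for Large-Scale Graph Data Processing | Pingpeng Yuan (袁平鹏), Huazhong University of Science and Technology
15:15-15:35 | Robust Structured Subspace Learning for Data Representation | Zechao Li (李泽超), Nanjing University of Science and Technology
15:35-15:50 | Tea break |
15:50-16:15 | Coordinated Static and Dynamic Cache Bypassing for GPUs | Yun Liang (梁云), Peking University
16:15-16:40 | DwarfCode: A Performance Prediction Tool for Parallel Applications | Weizhe Zhang (张伟哲), Harbin Institute of Technology
16:40-17:05 | When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems | Bingsheng He (何丙胜), Nanyang Technological University
17:05-17:30 | Enabling Renewable Energy Powered Sustainable High-Performance Computing | Chao Li (李超), Shanghai Jiao Tong University
17:30-17:55 | Multi-objective Scheduling of Workflow Applications in IaaS | Zhaomeng Zhu (朱昭萌), Nanjing University of Science and Technology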

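Yu Hua (华宇) is an associate professor and doctoral supervisor at Huazhong University of Science and Technology, a senior member of IEEE and of the China Computer Federation (CCF), a corresponding member of the CCF academic working committee, and a member of the CCF technical committees on high performance computing, information storage, and computer architecture. Hua previously carried out postdoctoral research at the University of Nebraska-Lincoln in the United States. Hua's main research topics include semantic management of metadata in mass storage systems, data deduplication mechanisms, and approximate storage system architectures. Hua has published numerous papers in the international journals TC and TPDS and at the international conferences USENIX FAST, USENIX ATC, INFOCOM, SC, ICDCS, and MSST, among others, and these papers have been cited more than 500 times. Hua has served on the program or organizing committees of more than 30 international conferences, including RTSS, INFOCOM, ICDCS, ICPP, and IWQoS; is an editor of international journals such as FCS and JCN; and is a reviewer for TC, TPDS, TCC, TVT, JSAC, and VLDB Journal, among others. Hua has led and participated in multiple projects funded by the National Natural Science Foundation of China, major projects of the 973 and 863 Programs, and Ministry of Education innovation team projects; is a recognized advisor of outstanding master's theses in Hubei Province; and has received a second prize of the Chinese Institute of Electronics award for electronic information science and technology.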

yaSpMV: Yet Another SpMV Framework on GPUs

Sparse matrix-vector multiplication (SpMV) is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory bandwidth requirements remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor-tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some of the most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs).
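
To make the idea of bit flags plus a segmented sum concrete, here is a minimal sequential sketch in Python. It is not the BCCOO format or the GPU kernel from the paper: it skips blocking, vertical slicing, and parallel scan, and it assumes that every row has at least one nonzero. The function names and the flag convention (True marks the last nonzero of a row) are illustrative assumptions.

```python
def to_bitflag_coo(dense):
    """Convert a dense matrix (list of rows) into three parallel lists:
    nonzero values, their column indices, and one flag per nonzero that is
    True when the nonzero is the last one in its row.
    Simplifying assumption of this sketch: no row is empty."""
    values, cols, row_end = [], [], []
    for row in dense:
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        if not nonzeros:
            raise ValueError("this sketch assumes every row has a nonzero")
        for k, (j, v) in enumerate(nonzeros):
            values.append(v)
            cols.append(j)
            row_end.append(k == len(nonzeros) - 1)
    return values, cols, row_end


def spmv_segmented_sum(values, cols, row_end, x):
    """Compute y = A * x with a sequential segmented sum: products are
    accumulated until a row-end flag closes the current segment."""
    y = []
    acc = 0
    for v, j, end in zip(values, cols, row_end):
        acc += v * x[j]
        if end:
            y.append(acc)
            acc = 0
    return y


A = [
    [1, 0, 2, 0],
    [0, 3, 0, 0],
    [4, 0, 5, 6],
]
x = [1, 2, 3, 4]
values, cols, row_end = to_bitflag_coo(A)
y = spmv_segmented_sum(values, cols, row_end, x)
print(y)
```

Storing one flag per nonzero instead of a full row index is what reduces the index traffic; the row index is recovered implicitly by counting the flags seen so far.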

Dr. Shaoliang Peng is a professor at the National University of Defense Technology (国防科技大学, NUDT, Changsha, China) and an adjunct professor at BGI.

Large-scale Neo-heterogeneous Programming and Optimization of SNP Detection on Tianhe-2

SNP detection is a fundamental procedure in genome analysis. A popular SNP detection tool, SOAPsnp, can take more than one week to analyze one human genome with 20-fold coverage. To improve the efficiency, we developed mSNP, a parallel version of SOAPsnp. mSNP uses CPUs in cooperation with Intel Xeon Phi coprocessors for large-scale SNP detection. First, we redesigned the key data structure of SOAPsnp, which significantly reduces the overhead of memory operations. Second, we devised a coordinated parallel framework in which the CPU collaborates with the Xeon Phi for higher hardware utilization. Third, we proposed a read-based window division strategy to improve throughput and parallel scale on multiple nodes. To the best of our knowledge, mSNP is the first SNP detection tool empowered by Xeon Phi. We achieved a 45x speedup on a single node of Tianhe-2 without any loss in precision. Moreover, mSNP showed promising scalability on 8192 nodes of Tianhe-2.

Robust Structured Subspace Learning for Data Representation

To uncover an appropriate latent subspace for data representation, in this paper we propose a novel Robust Structured Subspace Learning (RSSL) algorithm by integrating image understanding and feature learning into a joint learning framework. The learned subspace is adopted as an intermediate space to reduce the semantic gap between the low-level visual features and the high-level semantics. To guarantee that the subspace is compact and discriminative, the intrinsic geometric structure of the data and the local and global structural consistencies over labels are exploited simultaneously in the proposed algorithm. In addition, we adopt the $\ell_{2,1}$-norm in the formulations of both the loss function and the regularization to make our algorithm robust to outliers and noise. An efficient algorithm is designed to solve the proposed optimization problem. The proposed framework is a general one that covers several well-known algorithms as special cases and elucidates their intrinsic relationships. To validate the effectiveness of the proposed method, extensive experiments are conducted on diverse datasets for different image understanding tasks, i.e., image tagging, clustering, and classification, and more encouraging results are achieved compared with some state-of-the-art approaches.
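
As a small illustration of why the $\ell_{2,1}$-norm is less sensitive to outliers than a squared loss, the sketch below computes it in Python. It assumes the common row-wise convention, $\|M\|_{2,1} = \sum_i \|m_i\|_2$ with $m_i$ the $i$-th row; the abstract does not state which convention the paper uses, and the function names and the example residual matrix are hypothetical.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows (row-wise l2,1 convention)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def squared_frobenius_norm(matrix):
    """Sum of squared entries, shown for comparison with a squared loss."""
    return sum(v * v for row in matrix for v in row)


# Hypothetical residual matrix: each row is the error of one sample.
# The third row is an outlier twice as large as the first.
residual = [
    [3.0, 4.0],
    [0.0, 0.0],
    [6.0, 8.0],
]

l21 = l21_norm(residual)
fro2 = squared_frobenius_norm(residual)
print(l21, fro2)
```

Doubling the size of a row doubles its contribution to the $\ell_{2,1}$-norm but quadruples its contribution to the squared loss, so a single outlying sample dominates the squared loss much more strongly.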

Coordinated Static and Dynamic Cache Bypassing for GPUs

The massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise from excessive thread contention for cache resources.

Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this work, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile time, we identify through profiling the global loads that show a strong preference for caching or bypassing. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of threads. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (average 1.32X) performance speedup for a variety of GPU applications.

DwarfCode: A Performance Prediction Tool for Parallel Applications

We present DwarfCode, a performance prediction tool for MPI applications on diverse computing platforms. The goal is to accurately predict the running time of applications for task scheduling and job migration. First, DwarfCode collects execution traces to record the computing and communication events. Then, it merges the traces from different processes into a single trace. After that, DwarfCode identifies and compresses the repeating patterns in the final trace to shrink the size of the events. Finally, a dwarf code is generated to mimic the original program behavior. This smaller running benchmark is replayed on the target platform to predict the performance of the original application. In generating such a benchmark, the two major challenges are reducing the time complexity of the trace merging and repeat compression algorithms. We propose an O(mpn) trace merging algorithm to combine the traces generated by separate MPI processes, where m denotes the upper bound of the tracing distance, p denotes the number of processes, and n denotes the maximum number of events in any trace. More importantly, we put forward a novel repeat compression algorithm whose time complexity is O(n log n). Experimental results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is below 10% for compute-intensive and communication-intensive applications. The toolkit has been released for free download under the GNU General Public License v3.
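
The repeat compression step can be pictured with the following Python sketch, which greedily collapses immediately repeated blocks of events into (block, count) pairs. This is a naive illustration only, not DwarfCode's O(n log n) algorithm; the function names, the `max_period` parameter, and the event names are hypothetical.

```python
def compress_repeats(trace, max_period=8):
    """Greedily collapse immediately repeated blocks of events.

    Returns a list of (block, count) pairs, where block is a tuple of events
    repeated count times in a row. Naive sketch: at each position it tries
    every block length up to max_period and keeps the one covering the most
    events."""
    out = []
    i = 0
    n = len(trace)
    while i < n:
        best_p, best_count = 1, 1
        for p in range(1, min(max_period, (n - i) // 2) + 1):
            block = trace[i:i + p]
            count = 1
            # Count how many times the block repeats back to back.
            while trace[i + count * p:i + (count + 1) * p] == block:
                count += 1
            if count >= 2 and count * p > best_count * best_p:
                best_p, best_count = p, count
        out.append((tuple(trace[i:i + best_p]), best_count))
        i += best_p * best_count
    return out


def expand(compressed):
    """Inverse of compress_repeats: rebuild the flat event list."""
    events = []
    for block, count in compressed:
        events.extend(block * count)
    return events


trace = ["send", "recv", "compute"] * 3 + ["barrier"]
compressed = compress_repeats(trace)
print(compressed)
```

A loop body that runs many times in the original program then appears once in the compressed trace together with its repetition count, which is what lets the generated benchmark stay small.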

Dr. Bingsheng He is currently an Associate Professor at the School of Computer Engineering, Nanyang Technological University, Singapore. Before that, he held a research position in the System Research group of Microsoft Research Asia (2008-2010), where his major research was building high-performance cloud computing systems for Microsoft. He received his bachelor's degree from Shanghai Jiao Tong University (1999-2003) and his Ph.D. degree from the Hong Kong University of Science & Technology (2003-2008). His current research interests include cloud computing, database systems, and high-performance computing. His papers have been published in prestigious international journals (such as ACM TODS and IEEE TKDE/TPDS/TC) and proceedings (such as ACM SIGMOD, VLDB/PVLDB, ACM/IEEE SuperComputing, ACM HPDC, and ACM SoCC). He was awarded the IBM Ph.D. Fellowship (2007-2008) and the NVIDIA Academic Partnership (2010-2011). Since 2010, he has (co-)chaired a number of international conferences and workshops, including CloudCom 2014/2015 and HardBD2015. He has served on the editorial boards of international journals, including IEEE Transactions on Cloud Computing (IEEE TCC) and IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS).

When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems

Big data has become a buzzword. Among the various big-data challenges, high performance is a must, not an option. We face challenges (and also opportunities) at all levels, ranging from sophisticated algorithms and procedures that mine the gold from massive data to high-performance computing (HPC) techniques and systems that deliver the useful data in time. Our research has focused on the system design and implementation of HPC technologies as weapons to address the performance requirements of data management systems. Interestingly, we have also observed the interplay between HPC architectures and (big) data management systems. In this talk, I will present our recent research efforts in developing high-performance data management systems with GPUs and in the cloud. Finally, I will outline our research agenda. More details about our research can be found at http://pdcc.ntu.edu.sg/xtra/.

Enabling Renewable Energy Powered Sustainable High-Performance Computing

Multi-objective Scheduling of Workflow Applications in IaaS


Cloud computing provides promising platforms for executing large applications, offering enormous computational resources on demand. In the IaaS model, users are charged based on their usage of provisioned VMs and the required QoS specifications. Although many workflow scheduling algorithms exist for traditional distributed or heterogeneous computing environments, they are difficult to apply directly to IaaS, because IaaS differs from traditional environments in its service-based resource management and pay-per-use pricing strategies. In this talk, we will highlight such difficulties and formulate the multi-objective workflow scheduling problem in IaaS. We will also present our recent research on scheduling workflow applications on IaaS platforms, including two novel evolutionary approaches for the QoS-optimization scheduling problem and a list-based heuristic for the budget-constrained, performance-effective scheduling problem. Experimental comparisons between the proposed algorithms and state-of-the-art algorithms will be presented and discussed.
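
One common way to compare candidate schedules under several objectives is Pareto dominance, sketched below in Python for two objectives, makespan and monetary cost. The abstract does not say how the talk's algorithms compare solutions, so the use of Pareto dominance, the function names, and the example numbers are all assumptions made for illustration.

```python
def dominates(a, b):
    """Return True if schedule a is at least as good as b in every objective
    and strictly better in at least one. Each schedule is a tuple
    (makespan, cost); lower is better for both."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(schedules):
    """Keep the names of schedules that no other schedule dominates."""
    return sorted(
        name
        for name, objectives in schedules.items()
        if not any(
            dominates(other, objectives)
            for other_name, other in schedules.items()
            if other_name != name
        )
    )


# Hypothetical candidates: (makespan in hours, cost in dollars).
candidates = {
    "A": (10, 8),
    "B": (12, 5),
    "C": (12, 9),
    "D": (20, 3),
}
front = pareto_front(candidates)
print(front)
```

Schedule C is slower and more expensive than A, so it drops out; A, B, and D each trade time against cost differently, which is the kind of trade-off a pay-per-use pricing model introduces.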