HPC Plus Academic Forum

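About the forum:

The 2015 National Annual Conference on High Performance Computing (HPC China 2015) will be held in Wuxi on November 10-12, 2015. The conference is hosted by the China Computer Federation (CCF) and co-organized by the CCF Technical Committee on High Performance Computing and Jiangnan University. The conference theme is "Rooted in Chinese Chips, Realizing the Dream of a Supercomputing Power" (根植中国芯超算强国梦).

After discussion, the secretariat of the CCF Technical Committee on High Performance Computing decided to hold the "HPC Plus" themed forum at HPC China 2015. The forum invites authors of papers by committee members that were published in CCF A-ranked journals or conferences during 2014-2015 to give academic talks, representing the best research results in the HPC field for the year.

Time: afternoon of November 11, 2015

Venue: Taihu Hall 3, Wuxi Taihu Crowne Plaza Hotel (无锡太湖皇冠假日酒店)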

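Time	Talk title	Speaker
14:00-14:25	yaSpMV: Yet Another SpMV Framework on GPUs	Shengen Yan (颜深根), SenseTime
14:25-14:50	Large-scale Neo-heterogeneous Programming and Optimization of SNP Detection on Tianhe-2	Shaoliang Peng (彭绍亮), National University of Defense Technology
14:50-15:15	PathGraph: A Path-Centric Large-Scale Graph Data Processing System	Pingpeng Yuan (袁平鹏), Huazhong University of Science and Technology
15:15-15:35	Robust Structured Subspace Learning for Data Representation	Zechao Li (李泽超), Nanjing University of Science and Technology
15:35-15:50	Tea break
15:50-16:15	Coordinated Static and Dynamic Cache Bypassing for GPUs	Yun Liang (梁云), Peking University
16:15-16:40	DwarfCode: A Performance Prediction Tool for Parallel Applications	Weizhe Zhang (张伟哲), Harbin Institute of Technology
16:40-17:05	When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems	Bingsheng He (何丙胜), Nanyang Technological University
17:05-17:30	Enabling Renewable Energy Powered Sustainable High-Performance Computing	Chao Li (李超), Shanghai Jiao Tong University
17:30-17:55	Multi-objective Scheduling of Workflow Applications in IaaS	Zhaomeng Zhu (朱昭萌), Nanjing University of Science and Technology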

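Forum chair:

Yu Hua (华宇), Huazhong University of Science and Technology

Yu Hua is an associate professor and doctoral supervisor at Huazhong University of Science and Technology, a senior member of IEEE and of the China Computer Federation (CCF), a corresponding member of the CCF Academic Affairs Committee, and a member of the CCF technical committees on high performance computing, information storage, and computer architecture. Hua previously carried out postdoctoral research at the University of Nebraska-Lincoln in the United States. Hua's main research topics include semantic management of metadata in mass storage systems, data deduplication mechanisms, and approximate storage system architectures. Hua has published papers in international journals such as TC and TPDS and at international conferences such as USENIX FAST, USENIX ATC, INFOCOM, SC, ICDCS, and MSST, and these papers have been cited more than 500 times. Hua has served on the program or organizing committees of more than 30 international conferences, including RTSS, INFOCOM, ICDCS, ICPP, and IWQoS, is an editor of international journals such as FCS and JCN, and reviews for TC, TPDS, TCC, TVT, JSAC, and the VLDB Journal, among others. Hua has led or taken part in projects funded by the National Natural Science Foundation of China, major projects under the 973 and 863 programs, and a Ministry of Education innovation team project. Hua has been recognized as an advisor of an Outstanding Master's Thesis of Hubei Province and has received a second prize of the Electronic Information Science and Technology Award of the Chinese Institute of Electronics.

Forum speakers: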

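Shengen Yan (颜深根), SenseTime

Speaker biography:

Dr. Shengen Yan received a Ph.D. from the University of Chinese Academy of Sciences and previously worked at the deep learning lab of Baidu Research. Yan's research interests include large-scale heterogeneous parallel computing, deep learning, and image recognition. From June 2013 to February 2014, Yan visited North Carolina State University in the United States on an academic exchange. Yan's doctoral work focused on performance-related research such as algorithm design and code optimization on heterogeneous platforms, and two papers from that period were accepted at PPoPP'13 and PPoPP'14, editions of a top conference in parallel computing. At Baidu, Yan was mainly responsible for building large-scale heterogeneous clusters and for the parallel optimization of large-scale deep learning algorithms. During the Ph.D., Yan also took part in the Chinese translation of the book "Heterogeneous Computing with OpenCL" (《OpenCL异构计算》) and was a core contributor to the OpenCL version of OpenCV.

Talk title:

yaSpMV: Yet Another SpMV Framework on GPUs

Abstract: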

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although previous work has shown impressive progress, load imbalance and high memory bandwidth demand remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor-tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and over some of the most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs).
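To illustrate the bit-flag idea mentioned in the abstract, the following minimal Python sketch replaces the explicit row indices of a COO matrix with one flag per nonzero and computes the product with a sequential segmented sum. It is a simplification under stated assumptions, not the paper's implementation: there is no blocking, no vertical slicing, no GPU parallelism, and no auto-tuning; the matrix is assumed to have no empty rows; and the flag convention (True marks the last nonzero of a row) and the function names are chosen here for illustration only.

```python
def coo_to_row_end_flags(rows):
    """Replace sorted COO row indices with one flag per nonzero.

    Assumption for this sketch: `rows` is sorted and every row has at least
    one nonzero. A flag is True where the nonzero is the last one in its row.
    """
    n = len(rows)
    return [i == n - 1 or rows[i + 1] != rows[i] for i in range(n)]


def spmv_segmented_sum(values, cols, row_end, x):
    """Compute y = A @ x with a sequential segmented sum over the nonzeros."""
    y = []
    acc = 0.0
    for v, c, end in zip(values, cols, row_end):
        acc += v * x[c]
        if end:  # the flag closes the current row segment
            y.append(acc)
            acc = 0.0
    return y


# Hypothetical 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
rows = [0, 0, 1, 2, 2, 2]
cols = [0, 2, 1, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
flags = coo_to_row_end_flags(rows)
print(spmv_segmented_sum(values, cols, flags, [1.0, 2.0, 3.0]))
```

Storing one bit per nonzero instead of a full integer row index is what reduces the memory traffic; the segmented sum is what lets rows of very different lengths be processed as one flat stream of nonzeros.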

Shaoliang Peng (彭绍亮), National University of Defense Technology

Speaker biography:

Dr. Shaoliang Peng is a professor at the National University of Defense Technology (国防科技大学, NUDT, Changsha, China) and an adjunct professor at BGI (华大基因). He received his M.S. and Ph.D. degrees in Computer Science (CS) from NUDT in 2003 and 2008, respectively, and his B.S. degree in Mathematics from NUDT in 2001. He was a visiting scholar at the CS Department of City University of Hong Kong (CityU) from 2007 to 2008 and at BGI Hong Kong from 2013 to 2014. His research interests are high-performance computing, bioinformatics, drug design, mobile computing, and biology simulation. He has participated in many key national projects in China, such as the TianHe supercomputers, and is the PI of several projects, including projects funded by the National Natural Science Foundation of China (NSFC). He has authored many original papers, some of which are published in Nature Communications, Genome Biology, PLoS ONE, IEEE Transactions, Science China, BMC Bioinformatics, and ISC.

Talk title:

Large-scale Neo-heterogeneous Programming and Optimization of SNP Detection on Tianhe-2

Abstract:

SNP detection is a fundamental procedure in genome analysis. SOAPsnp, a popular SNP detection tool, can take more than one week to analyze one human genome with 20-fold coverage. To improve the efficiency, we developed mSNP, a parallel version of SOAPsnp. mSNP uses CPUs in cooperation with Intel® Xeon Phi™ coprocessors for large-scale SNP detection. First, we redesigned the key data structure of SOAPsnp, which significantly reduces the overhead of memory operations. Second, we devised a coordinated parallel framework in which the CPU collaborates with the Xeon Phi for higher hardware utilization. Third, we proposed a read-based window division strategy to improve throughput and parallel scale on multiple nodes. To the best of our knowledge, mSNP is the first SNP detection tool empowered by Xeon Phi. We achieved a 45x speedup on a single node of Tianhe-2 without any loss in precision. Moreover, mSNP showed promising scalability on 8192 nodes of Tianhe-2.

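Pingpeng Yuan (袁平鹏), Huazhong University of Science and Technology

Speaker biography:

Pingpeng Yuan is an associate professor at the School of Computer Science, Huazhong University of Science and Technology, and a senior member of CCF. Yuan works on massive data management, the Semantic Web, and parallel and distributed computing. Yuan led the development of TripleBit, a large-scale RDF graph management system, and of its distributed version. In comparisons with well-known RDF graph stores such as RDF-3X, MonetDB, Trinity.RDF, and Shard, TripleBit delivered better query performance in every case. Building on this work, Yuan developed PathGraph, an iterative graph computation system. The related research has been published at venues including VLDB 2013, ICDE 2015, SC 2014, and CIKM 2014.

Talk title:

PathGraph: A Path-Centric Large-Scale Graph Data Processing System

Abstract: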

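The sheer scale of today's graph data poses major challenges for graph processing techniques. PathGraph, a path-centric large-scale graph data processing system, can process billion-scale graphs on a single machine. PathGraph adopts a path-centric parallel computation model that executes Scatter or Gather operations. To group highly correlated paths together, the system partitions graph data using trees as the basic partitioning unit. Each resulting subgraph is stored as a compressed adjacency list, and vertex IDs are encoded as variable-length integers to further reduce the storage space of the graph data. Experiments on datasets of different sizes, ranging from hundreds of thousands to billions in scale, consistently show that PathGraph substantially outperforms well-known current systems such as GraphChi and X-Stream.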
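The storage idea in this abstract (compressed adjacency lists with variable-length integer vertex IDs) can be sketched in Python as follows. This is a generic illustration, not PathGraph's actual on-disk format: the base-128 varint layout and the delta (gap) encoding of sorted neighbor IDs are assumptions made for the example, and all function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if n < 0:
        raise ValueError("varint encoding expects a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_adjacency(neighbors):
    """Sort the neighbor IDs, store gaps between them, varint-encode each gap."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        out += encode_varint(v - prev)
        prev = v
    return bytes(out)


def decode_adjacency(data):
    """Invert encode_adjacency and return the sorted neighbor IDs."""
    neighbors = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            neighbors.append(prev)
            value = 0
            shift = 0
    return neighbors


neighbors = [1000000, 1000003, 1000010, 1000300]
encoded = encode_adjacency(neighbors)
print(len(encoded), "bytes instead of", 4 * len(neighbors), "with fixed 32-bit IDs")
```

Small gaps between neighboring IDs fit in one or two bytes, which is why variable-length encoding pays off most when related vertices receive nearby IDs.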

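Zechao Li (李泽超), Nanjing University of Science and Technology

Speaker biography:

Dr. Zechao Li received a Ph.D. from the Institute of Automation, Chinese Academy of Sciences, and currently works at Nanjing University of Science and Technology. Li's research interests include large-scale multimedia analysis and heterogeneous data mining. From December 2010 to November 2011, Li visited the China-Singapore Institute of Digital Media in Singapore on an academic exchange. Li's doctoral work addressed multimedia analysis and understanding with subspace learning as the main thread, proposing novel methods for feature selection, semantic mapping, personalized tag recommendation, and news retrieval. This work has led to more than 30 papers published or accepted in international journals and conferences, including more than 10 first-author papers in leading international journals (such as IEEE TPAMI, IEEE TIP, IEEE TMM, and IEEE TKDE). Li's honors include the 2013 President's Award of the Chinese Academy of Sciences and the 2015 Outstanding Doctoral Dissertation Award of the Chinese Academy of Sciences.

Talk title:

Robust Structured Subspace Learning for Data Representation

Abstract: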

To uncover an appropriate latent subspace for data representation, in this paper we propose a novel Robust Structured Subspace Learning (RSSL) algorithm that integrates image understanding and feature learning into a joint learning framework. The learned subspace is adopted as an intermediate space to reduce the semantic gap between the low-level visual features and the high-level semantics. To guarantee that the subspace is compact and discriminative, the intrinsic geometric structure of the data and the local and global structural consistencies over labels are exploited simultaneously in the proposed algorithm. In addition, we adopt the $\ell_{2,1}$-norm in the formulations of both the loss function and the regularization to make our algorithm robust to outliers and noise. An efficient algorithm is designed to solve the proposed optimization problem. The proposed framework is a general one that can recover several well-known algorithms as special cases and elucidate their intrinsic relationships. To validate the effectiveness of the proposed method, extensive experiments are conducted on diverse datasets for different image understanding tasks, i.e., image tagging, clustering, and classification, and more encouraging results are achieved compared with some state-of-the-art approaches.
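For readers unfamiliar with the $\ell_{2,1}$-norm, the standard definition is $\|M\|_{2,1} = \sum_i \|m_i\|_2$, the sum of the Euclidean norms of the rows $m_i$ of $M$. The small Python sketch below compares it with the squared Frobenius norm on a hypothetical residual matrix in which one row is an outlier; it only illustrates why a row-wise norm is less dominated by a single bad sample and does not reproduce the RSSL objective itself. The function names and the data are made up for the example.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of `matrix`."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)


def squared_frobenius_norm(matrix):
    """Sum of the squares of all entries of `matrix`."""
    return sum(x * x for row in matrix for x in row)


# Hypothetical per-sample residuals; the last row plays the role of an outlier.
residuals = [[3.0, 4.0], [0.0, 0.0], [30.0, 40.0]]
print(l21_norm(residuals), squared_frobenius_norm(residuals))
```

Each row contributes its length linearly to the $\ell_{2,1}$-norm but quadratically to the squared Frobenius norm, so a single large residual row weighs far less in the former.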

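Yun Liang (梁云), Peking University

Speaker biography:

Yun Liang received his B.Eng. degree from Tongji University in 2004 and his Ph.D. in computer science from the National University of Singapore in 2010. From 2010 to 2012 he was a researcher at the University of Illinois at Urbana-Champaign (UIUC). Since August 2012 he has been a researcher at the School of Electronics Engineering and Computer Science (信息科学技术学院), Peking University. His research areas are compilation techniques, high performance computing, computer architecture, and GPU/FPGA. Dr. Liang has published more than 30 papers at top international conferences and in journals in these areas (including HPCA, ISCA, CGO, DAC, ICCAD, FCCM, FPGA, RTSS, IPDPS, and TPDS). One of his papers won the FCCM 2011 Best Paper Award, and his papers have been nominated for best paper awards several times. Dr. Liang serves the community at many international conferences and journals, including as a session chair at ASP-DAC, as a technical program committee member for conferences such as PACT, DATE, and CASES, and as a reviewer for journals such as IEEE TC, TPDS, TVLSI, TCAD, and ACM TACO.

Talk title:

Coordinated Static and Dynamic Cache Bypassing for GPUs

Abstract: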

Their massively parallel architecture enables graphics processing units (GPUs) to boost performance for a wide range of applications. Recently, to broaden the scope of applications that can be accelerated by GPUs, GPU vendors have used caches in conjunction with scratchpad memory as on-chip memory in the new generations of GPUs. Unfortunately, GPU caches face many performance challenges that arise from excessive thread contention for cache resources.

Cache bypassing, where memory requests can selectively bypass the cache, is one solution that can help to mitigate the cache resource contention problem. In this work, we propose coordinated static and dynamic cache bypassing to improve application performance. At compile time, we identify through profiling the global loads that show a strong preference for caching or for bypassing. For the remaining global loads, our dynamic cache bypassing has the flexibility to cache only a fraction of the threads. Our coordinated static and dynamic cache bypassing technique achieves up to 2.28X (1.32X on average) performance speedup for a variety of GPU applications.

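Weizhe Zhang (张伟哲), Harbin Institute of Technology

Speaker biography:

Weizhe Zhang is a professor and doctoral supervisor at the School of Computer Science and Technology, Harbin Institute of Technology (HIT). Zhang has been a visiting scholar at the University of Illinois at Urbana-Champaign and at the Department of Computer Science of the University of Houston in the United States, and is a member of the CCF Technical Committee on High Performance Computing. Zhang received B.S., M.S., and Ph.D. degrees from HIT in 1999, 2001, and 2006, respectively. Zhang's research covers high performance computing, parallel and distributed systems, and computer network security. Zhang has led a number of national and provincial or ministerial research projects, including projects funded by the National Natural Science Foundation of China and key projects of the national 863 program. Zhang has published more than 100 papers in journals and conferences in China and abroad, including IEEE Transactions on Computers, IEEE Transactions on Cloud Computing, Science in China Series F: Information Sciences, IEEE Cluster, IEEE IPDPS, and ACM CIKM, and has served as a guest editor for nearly 10 international journals, including Future Generation Computer Systems, Microprocessors and Microsystems, and IJHIT.

Talk title:

DwarfCode: A Performance Prediction Tool for Parallel Applications

Abstract: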

We present DwarfCode, a performance prediction tool for MPI applications on diverse computing platforms. The goal is to accurately predict the running time of applications for task scheduling and job migration. First, DwarfCode collects execution traces to record the computing and communication events. Then, it merges the traces from different processes into a single trace. After that, DwarfCode identifies and compresses the repeating patterns in the final trace to shrink the number of events. Finally, a dwarf code is generated to mimic the original program behavior. This smaller benchmark is replayed on the target platform to predict the performance of the original application. Generating such a benchmark poses two major challenges: reducing the time complexity of the trace merging algorithm and of the repeat compression algorithm. We propose an O(mpn) trace merging algorithm to combine the traces generated by separate MPI processes, where m denotes the upper bound of the tracing distance, p denotes the number of processes, and n denotes the maximum number of events over all the traces. More importantly, we put forward a novel repeat compression algorithm whose time complexity is O(n log n). Experimental results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is below 10% for compute-intensive and communication-intensive applications. The toolkit has been released for free download under the GNU General Public License v3.
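The repeat compression step can be pictured with the following naive Python sketch, which greedily folds immediately repeated event patterns in a trace into (pattern, count) pairs. It is explicitly not the O(n log n) algorithm from the paper, whose details the abstract does not give: this greedy version only detects back-to-back repetitions of patterns up to a fixed length, and the function names, the `max_pattern_len` parameter, and the event labels are all hypothetical.

```python
def compress_repeats(events, max_pattern_len=4):
    """Greedily fold immediate repetitions into (pattern, count) pairs."""
    out = []
    i = 0
    n = len(events)
    while i < n:
        best_len, best_count = 1, 1
        for plen in range(1, max_pattern_len + 1):
            pattern = events[i:i + plen]
            if len(pattern) < plen:
                break  # not enough events left for a pattern of this length
            count = 1
            while events[i + count * plen:i + (count + 1) * plen] == pattern:
                count += 1
            # Prefer the repeating candidate that covers the most events.
            if count > 1 and count * plen > best_count * best_len:
                best_len, best_count = plen, count
        out.append((tuple(events[i:i + best_len]), best_count))
        i += best_len * best_count
    return out


def expand(compressed):
    """Rebuild the original event list from (pattern, count) pairs."""
    events = []
    for pattern, count in compressed:
        events.extend(pattern * count)
    return events


trace = ["init"] + ["compute", "send", "recv"] * 3 + ["finalize"]
print(compress_repeats(trace))
```

A loop body that runs thousands of times in the original program collapses to one pattern plus a counter, which is what makes a much shorter replay benchmark possible.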
　

Bingsheng He (何丙胜), Nanyang Technological University

Speaker biography:

Dr. Bingsheng He is currently an Associate Professor at the School of Computer Engineering, Nanyang Technological University, Singapore. Before that, he held a research position in the System Research group of Microsoft Research Asia (2008-2010), where his major research was building high performance cloud computing systems for Microsoft. He received his Bachelor's degree from Shanghai Jiao Tong University (1999-2003) and his Ph.D. degree from Hong Kong University of Science & Technology (2003-2008). His current research interests include cloud computing, database systems, and high performance computing. His papers are published in prestigious international journals (such as ACM TODS and IEEE TKDE/TPDS/TC) and proceedings (such as ACM SIGMOD, VLDB/PVLDB, ACM/IEEE SuperComputing, ACM HPDC, and ACM SoCC). He was awarded an IBM Ph.D. Fellowship (2007-2008) and an NVIDIA Academic Partnership (2010-2011). Since 2010, he has (co-)chaired a number of international conferences and workshops, including CloudCom 2014/2015 and HardBD2015. He has served on the editorial boards of international journals, including IEEE Transactions on Cloud Computing (IEEE TCC) and IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS).

Talk title:

When HPC Meets Big Data: Emerging HPC Technologies for High-Performance Data Management Systems

Abstract:

Big data has become a buzzword. Among the various big-data challenges, high performance is a must, not an option. We face challenges (and also opportunities) at all levels, ranging from sophisticated algorithms and procedures that mine the gold from massive data to high-performance computing (HPC) techniques and systems that deliver the useful data in time. Our research has focused on the system design and implementation of HPC technologies as weapons to address the performance requirements of data management systems. Interestingly, we have also observed the interplay between HPC architectures and (big) data management systems. In this talk, I will present our recent research efforts in developing high performance data management systems with GPUs and in the cloud. Finally, I will outline our research agenda. More details about our research can be found at http://pdcc.ntu.edu.sg/xtra/.

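Chao Li (李超), Shanghai Jiao Tong University

Speaker biography:

Chao Li received his B.Eng. degree from Zhejiang University and his Ph.D. in computer engineering from the University of Florida in the United States. He is currently a Special Associate Research Fellow (特别副研究员) and doctoral supervisor at the Department of Computer Science and Engineering, Shanghai Jiao Tong University. His research focuses on novel computer systems, energy-efficient data centers, and infrastructure design related to green and sustainable computing. He has published six first-author papers at major international conferences in computer architecture such as ISCA, HPCA, and MICRO, and received the HPCA Best Paper Award in 2011. In 2013 he was selected as a Facebook Graduate Fellow and received the Chinese Government Award for Outstanding Self-Financed Students Abroad. He is a member of CCF, IEEE, and ACM.

Talk title:

Enabling Renewable Energy Powered Sustainable High-Performance Computing

Abstract: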

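As data volumes keep accumulating and growing explosively in the information age, designing green, environmentally friendly high-performance computing systems is becoming increasingly important. Over the past five years, many research groups have worked on renewable-energy-powered green computing, and industry has produced a number of data center designs that feature renewable energy. In this talk, we introduce different types of renewable-energy data center designs and discuss in detail an intelligent load management approach: workload micro-shaping (负载微整形). The approach lets workloads run efficiently under a complex and highly variable renewable energy supply and effectively reduces data center cost.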

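Zhaomeng Zhu (朱昭萌), Nanjing University of Science and Technology

Speaker biography:

Zhaomeng Zhu received a B.Eng. degree from Nanjing University of Science and Technology in 2010 and is currently a Ph.D. candidate at the School of Computer Science and Engineering, Nanjing University of Science and Technology. From January 2014 to January 2015, Zhu was a government-sponsored visiting student at Brunel University in the UK. Zhu's main research interests during the Ph.D. include heterogeneous parallel computing, cloud computing, and distributed systems. Several of Zhu's papers have been accepted or published in well-known journals and conferences such as TPDS and IPDPS.

Talk title:

Multi-objective Scheduling of Workflow Applications in IaaS

Abstract: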

Cloud computing provides promising platforms for executing large applications by offering enormous computational resources on demand. In the IaaS model, users are charged based on their usage of provisioned VMs and the required QoS specifications. Although many workflow scheduling algorithms exist for traditional distributed or heterogeneous computing environments, they are difficult to apply directly to IaaS, because IaaS differs from traditional environments in its service-based resource management and pay-per-use pricing strategies. In this talk, we will highlight these difficulties and formulate the multi-objective workflow scheduling problem in IaaS. We will also present our recent research on scheduling workflow applications on IaaS platforms, including two novel evolutionary approaches for the QoS-optimization scheduling problem and a list-based heuristic for the budget-constrained, performance-effective scheduling problem. Experimental comparisons between the proposed algorithms and state-of-the-art algorithms will be presented and discussed.
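To make the multi-objective setting concrete, the Python sketch below scores a few hypothetical schedules by makespan and monetary cost and keeps only the Pareto-optimal ones. It is a generic illustration rather than any of the algorithms from the talk: the billing rule (each started hour is charged in full) is an assumed pay-per-use model, and the schedule names, prices, and runtimes are invented for the example.

```python
import math


def vm_cost(busy_hours, hourly_price):
    """Assumed billing rule: every started hour is charged in full."""
    return math.ceil(busy_hours) * hourly_price


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere.

    Both objectives (makespan, cost) are minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    ]


# Hypothetical schedules: (makespan in hours, cost in currency units).
schedules = {
    "fast_vm": (2.5, vm_cost(2.5, 4)),
    "slow_vm": (6.0, vm_cost(6.0, 1)),
    "slow_vm_poor_packing": (6.5, vm_cost(6.5, 1)),
    "fast_vm_premium": (2.5, vm_cost(2.5, 5)),
}
print(pareto_front(schedules))
```

Neither the fast nor the cheap schedule dominates the other, so a multi-objective scheduler has to return both and leave the trade-off to the user or to a budget constraint.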