Semi-hierarchical Semantic-aware Storage
[TPDS] Yu Hua, Hong Jiang, Yifeng Zhu, Dan Feng, Lei Xu, "SANE: Semantic-Aware Namespace in Ultra-large-scale File Systems", IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol.25, No.5, May 2014, pages: 1328-1338.
The explosive growth in data volume and complexity poses great challenges to file systems, and a new namespace management scheme is urgently needed to provide both ease and efficiency of data access. In almost all of today's file systems, namespace management is based on hierarchical directory trees. This tree-based namespace scheme is prone to severe performance bottlenecks and often fails to provide real-time responses to complex data lookups. This paper proposes a Semantic-Aware Namespace scheme, called SANE, which provides dynamic and adaptive namespace management for ultra-large storage systems with billions of files. SANE introduces a new naming methodology based on the notion of a semantic-aware per-file namespace, which exploits semantic correlations among files to dynamically aggregate correlated files into small, flat, but readily manageable groups for fast and accurate lookups. SANE is implemented as middleware in conventional file systems and works orthogonally with hierarchical directory trees. The semantic correlations and file groups identified in SANE can also be used to facilitate file prefetching and data de-duplication, among other system-level optimizations.
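The idea of aggregating semantically correlated files into small, flat groups can be illustrated with a toy sketch (hypothetical file attributes and a simple quantized signature; this is not SANE's actual algorithm):

```python
# Minimal sketch: files are described by attribute vectors, and files whose
# vectors fall into the same coarse signature are aggregated into one small,
# flat group (standing in for a semantic-aware per-file namespace).

def signature(attrs, width=10.0):
    # Coarse quantization: correlated files (similar attributes) tend to
    # share a signature and therefore land in the same group.
    return tuple(int(a // width) for a in attrs)

def group_files(files):
    groups = {}
    for name, attrs in files.items():
        groups.setdefault(signature(attrs), []).append(name)
    return groups

files = {
    "a.log": (12.0, 3.0),   # hypothetical attribute vectors
    "b.log": (14.0, 4.0),
    "c.mp4": (95.0, 70.0),
}
print(group_files(files))
```

Within each small group, a lookup only needs to scan a handful of entries instead of traversing a directory tree.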
[SC] Yu Hua, Hong Jiang, Dan Feng, "FAST: Near Real-time Searchable Data Analytics for the Cloud", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2014, pages: 754-765. (Acceptance rate: 82/394=20.8%)
With the explosive growth in data volume and complexity and the increasing need for highly efficient searchable data analytics, existing cloud storage systems have largely failed to offer adequate capability for real-time data analytics. Since the true value of data heavily depends on how efficiently analytics can be carried out on the data in (near-) real time, large fractions of data end up with their value lost or significantly reduced due to data staleness. To address this problem, we propose a near-real-time and cost-effective searchable data analytics methodology, called FAST. The idea behind FAST is to explore and exploit the semantic correlation within and among datasets via correlation-aware hashing and manageable flat-structured addressing, significantly reducing processing latency while incurring an acceptably small loss of data-search accuracy. The near-real-time property of FAST enables rapid identification of correlated files and significantly narrows the scope of data to be processed. FAST supports several types of data analytics, which can be implemented in existing searchable storage systems. We conduct a real-world case study in which children reported missing in an extremely crowded environment (e.g., a highly popular scenic spot on a peak tourist day) are identified in a timely fashion by analyzing 60 million images using FAST.
[USENIX ATC] Min Fu, Dan Feng, Yu Hua, Xubin He, Zuoning Chen, Wen Xia, Fangting Huang, Qing Liu, "Accelerating Restore and Garbage Collection in Deduplication-based Backup Systems via Exploiting Historical Information", Proceedings of USENIX Annual Technical Conference (USENIX ATC), June 2014, pages: 181-192. (Acceptance rate: 36/241=14.9%).
[USENIX ATC] Wen Xia, Hong Jiang, Dan Feng, Yu Hua, "SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput", Proceedings of USENIX Annual Technical Conference (USENIX ATC), June 2011. (Acceptance rate: 15%)
Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locality in datasets, the latter can fail to identify, and thus remove, significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index lookup and maintain a very high deduplication throughput.
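The similarity side of this design can be sketched as follows (a simplified model with a hypothetical segment size and a min-fingerprint sampling choice; this is not SiLo's actual implementation):

```python
# Sketch: chunks are grouped into segments, each segment is represented in
# the RAM index by a single sampled fingerprint (probabilistic similarity
# detection), and dedup only consults the fingerprints of segments that
# share that representative.

import hashlib

SEGMENT_SIZE = 4  # chunks per segment (illustrative value)

def fp(chunk):
    return hashlib.sha1(chunk).hexdigest()

def build_segments(chunks):
    # Group correlated small chunks into fixed-size segments (large files
    # would likewise be split across several segments).
    return [chunks[i:i + SEGMENT_SIZE] for i in range(0, len(chunks), SEGMENT_SIZE)]

def representative(seg):
    # Sample one fingerprint per segment (a min-hash style choice):
    # similar segments are likely to share this representative.
    return min(fp(c) for c in seg)

def dedup(stream, index):
    """Near-exact dedup: a chunk is written only if no earlier segment with
    the same representative already holds its fingerprint.
    index: representative -> set of chunk fingerprints (small RAM footprint)."""
    new_chunks = []
    for seg in build_segments(stream):
        known = index.setdefault(representative(seg), set())
        for c in seg:
            f = fp(c)
            if f not in known:
                new_chunks.append(c)
                known.add(f)
    return new_chunks
```

Because only one fingerprint per segment serves as an index key, a duplicate chunk can be missed when its segment maps to a different representative; this is why the scheme is near-exact rather than exact. SiLo additionally exploits locality by grouping contiguous segments into blocks, which is omitted here.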
[USENIX ATC] Yuanyuan Sun, Yu Hua, Song Jiang, Qiuyu Li, Shunde Cao, Pengfei Zuo, "SmartCuckoo: A Fast and Cost-Efficient Hashing Index Scheme for Cloud Storage Systems", Proceedings of USENIX Annual Technical Conference (USENIX ATC), July 2017, pages: 553-566. (Acceptance rate: 60/283=21%)
Fast query services are important to improving the overall performance of large-scale storage systems that handle a large number of files. Open-addressing cuckoo hash schemes have been widely used to support query services due to their salient features of simplicity and ease of use. Conventional schemes are unfortunately inadequate to address the potential problem of endless loops during item insertion, which degrades query performance. To address this problem, we propose a cost-efficient cuckoo hashing scheme, named SmartCuckoo. The idea behind SmartCuckoo is to represent the hashing relationship as a directed pseudoforest and use it to track item placements for accurately predetermining the occurrence of endless loops. SmartCuckoo can efficiently predetermine insertion failures without paying the high cost of step-by-step probing. We have implemented SmartCuckoo in a large-scale cloud storage system. Extensive evaluations using three real-world traces and the YCSB benchmark demonstrate the efficiency and efficacy of SmartCuckoo. We have released the source code of SmartCuckoo for public use.
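The pseudoforest tracking can be sketched with a union-find structure (a simplified model assuming two hash functions and one slot per bucket; this is not SmartCuckoo's released code):

```python
# Sketch: buckets are vertices; each stored item is an edge between its two
# candidate buckets. A pseudoforest component holds at most one cycle, i.e.
# at most as many items as buckets, so an insertion is doomed iff it would
# force a second cycle into one component. Union-find with per-component
# edge/vertex counts predetermines this without step-by-step cuckoo probing.

class CuckooPredictor:
    def __init__(self, num_buckets):
        self.parent = list(range(num_buckets))
        self.vertices = [1] * num_buckets  # buckets per component (at root)
        self.edges = [0] * num_buckets     # items per component (at root)

    def find(self, b):
        while self.parent[b] != b:
            self.parent[b] = self.parent[self.parent[b]]  # path halving
            b = self.parent[b]
        return b

    def insert(self, b1, b2):
        """b1, b2: the item's two candidate bucket indices. Returns True if
        the insertion can succeed, False if it would loop endlessly."""
        r1, r2 = self.find(b1), self.find(b2)
        if r1 == r2:
            if self.edges[r1] == self.vertices[r1]:
                return False  # component already carries its one cycle
            self.edges[r1] += 1
            return True
        if (self.edges[r1] == self.vertices[r1]
                and self.edges[r2] == self.vertices[r2]):
            return False      # both components saturated: guaranteed loop
        self.parent[r2] = r1  # merge the two components
        self.vertices[r1] += self.vertices[r2]
        self.edges[r1] += self.edges[r2] + 1
        return True
```

For example, with one slot per bucket, a third item whose both candidate buckets fall in a two-bucket component must fail, and the predictor reports this in near-constant time.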
[SoCC] Yuanyuan Sun, Yu Hua, Xue Liu, Shunde Cao, Pengfei Zuo, "DLSH: A Distribution-aware LSH Scheme for Approximate Nearest Neighbor Query in Cloud Computing", Proceedings of ACM Symposium on Cloud Computing (SoCC), 2017, pages: 242-255.
Cloud computing needs to process and analyze massive high-dimensional data in real time. Approximate queries in cloud computing systems can provide timely query results with acceptable accuracy, thus reducing the consumption of a large amount of resources. Locality Sensitive Hashing (LSH) is able to maintain data locality and support approximate queries. However, because it chooses hash functions randomly, LSH has to use many functions to guarantee query accuracy. The extra computation and storage overheads degrade the practical performance of LSH. In order to reduce these overheads and deliver high performance, we propose a distribution-aware scheme, called DLSH, to offer a cost-effective approximate nearest neighbor query service for cloud computing. The idea of DLSH is to leverage the principal components of the data distribution as the projection vectors of the hash functions in LSH, and to further quantify the weight of each hash function and adjust the interval value in each hash table. We then refine the queried result set based on hit frequency to significantly decrease the time overhead of distance computation. Extensive experiments on a large-scale cloud computing testbed demonstrate significant improvements in multiple system performance metrics. We have released the source code of DLSH for public use.
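The distribution-aware projection idea can be sketched in two dimensions (hypothetical data and parameters; the hash-function weighting and interval-adjustment steps of DLSH are omitted):

```python
# Sketch: instead of a random LSH projection vector, use the data's leading
# principal component, so a single well-aimed hash function separates the
# data where randomly chosen ones might need many.

import math

def leading_pc(points):
    # Closed-form leading eigenvector of the 2x2 covariance matrix.
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    vx, vy = (lam - c, b) if abs(b) > 1e-12 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

def lsh_bucket(point, u, w=1.0):
    # Standard quantized-projection LSH hash, with u drawn from the data
    # distribution rather than at random.
    return math.floor((point[0] * u[0] + point[1] * u[1]) / w)

points = [(0.0, 0.0), (0.1, 0.1), (3.0, 3.1), (3.1, 3.0)]
u = leading_pc(points)
buckets = [lsh_bucket(p, u, w=2.0) for p in points]
```

Here the two nearby pairs fall into the same bucket along the principal direction, illustrating how a data-aware projection preserves neighborhoods with fewer hash functions.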
[INFOCOM] Yu Hua, Wenbo He, Xue Liu, Dan Feng, "SmartEye: Real-time and Efficient Cloud Image Sharing for Disaster Environments", Proceedings of the 34th IEEE International Conference on Computer Communications (INFOCOM), 2015, pages: 1616-1624. (Acceptance rate: 316/1640=19%).
Rapid disaster relief is important to save human lives and reduce property loss. With the wide use of smartphones and their ubiquitous, easy access to the Internet, sharing and uploading images to the cloud via smartphones offers a nontrivial opportunity to provide information about disaster zones. However, due to limited available bandwidth and energy, smartphone-based crowdsourcing fails to support real-time data analytics. The key to sharing and analyzing images efficiently and in a timely manner is to determine the value of the images based on their significance and redundancy, and to upload only those valuable and unique images. In this paper, we propose a near-real-time and cost-efficient scheme, called SmartEye, for cloud-assisted disaster environments. The idea behind SmartEye is to implement QoS-aware in-network deduplication over DiffServ in software-defined networks (SDN). Due to its ease of use, simplicity, and scalability, DiffServ supports in-network deduplication to meet the needs of differentiated QoS. SmartEye aggregates flows with similar features via semantic hashing and provides communication services for the aggregated, rather than individual, flows. To achieve these goals, we leverage two main optimization schemes: semantic hashing and space-efficient filters. Efficient image sharing is helpful for disaster detection and scene recognition. To demonstrate the feasibility of SmartEye, we conduct two real-world case studies in which the losses from Typhoon Haiyan (2013) and Hurricane Sandy (2012) are identified in a timely fashion by analyzing more than 22 million images using our SmartEye system. Extensive experimental results illustrate that SmartEye is efficient and effective in achieving real-time analytics in disasters.
[INFOCOM] Yu Hua, Bin Xiao, Xue Liu, "NEST: Locality-aware Approximate Query Service for Cloud Computing", Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM), April 2013, pages: 1327-1335. (Acceptance rate: 17%)
Cloud computing applications face the challenge of dealing with huge volumes of data and need fast approximate queries to enhance system scalability and improve quality of service, especially when users are not aware of exact query inputs. Locality-Sensitive Hashing (LSH) can support approximate queries but unfortunately suffers from imbalanced load and space inefficiency among distributed data servers, which severely limits query accuracy and incurs long query latency between users and cloud servers. In this paper, we propose a novel scheme, called NEST, which offers an easy-to-use and cost-effective approximate query service for cloud computing. The novelty of NEST is to leverage cuckoo-driven locality-sensitive hashing to find similar items, which are then placed close together to obtain load-balanced buckets in the hash tables. NEST hence carries out flat and manageable addressing in adjacent buckets and obtains constant-scale query complexity even in the worst case. The benefits of NEST include improved space utilization and fast query responses. Theoretical analysis and extensive experiments on a large-scale cloud testbed demonstrate the salient properties of NEST in meeting the needs of approximate query services in cloud computing environments.
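The flat, adjacent-bucket addressing can be illustrated with a toy one-dimensional sketch (hypothetical parameters; the cuckoo-driven placement that load-balances the buckets is omitted):

```python
# Sketch: LSH maps similar items to the same or nearby bucket indices, so a
# query probes only a constant window of adjacent buckets. This is flat
# addressing with constant worst-case lookup cost, instead of scanning many
# independent hash tables.

NUM_BUCKETS = 16
W = 1.0  # LSH quantization width (assumed)

def lsh_index(x):
    return int(x // W) % NUM_BUCKETS

def build(items):
    table = [[] for _ in range(NUM_BUCKETS)]
    for x in items:
        table[lsh_index(x)].append(x)
    return table

def approx_query(table, q, window=1):
    # Probe the home bucket plus `window` neighbors on each side:
    # a constant number of probes regardless of table size.
    i = lsh_index(q)
    out = []
    for d in range(-window, window + 1):
        out.extend(table[(i + d) % NUM_BUCKETS])
    return out

table = build([0.2, 0.9, 1.4, 7.5])
```

A query near 1.0 returns the three nearby items while never touching the distant one, at a fixed probing cost.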
Today’s cloud data centers contain millions of servers and offer high bandwidth. A fundamental problem is how to significantly improve system scalability to interconnect a large number of servers while supporting various online services in cloud computing. One way is to deal with the challenge of potential mismatches between the network architecture and the data placement. To address this challenge, we present ANTELOPE, a scalable distributed data-centric scheme for cloud data centers, in which we systematically take into account both the properties of the network architecture and the optimization of data placement. The basic idea behind ANTELOPE is to leverage a precomputation-based data cube to support online cloud services. Since the construction of a data cube suffers from the high costs of full materialization, we use a semantic-aware partial materialization solution to significantly reduce the operation and space overheads.
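Partial materialization of a data cube can be sketched as follows (a toy fact table; the semantic-aware selection of which cuboids to materialize is simplified here to an explicit list):

```python
# Sketch: instead of precomputing all 2^d group-by cuboids of a data cube
# (full materialization), precompute aggregates only for a chosen subset of
# dimension combinations, trading completeness for far lower build cost and
# space overhead.

rows = [  # (region, product, sales): toy fact table
    ("east", "disk", 10),
    ("east", "ssd", 20),
    ("west", "disk", 5),
]
DIMS = {"region": 0, "product": 1}

def cuboid(rows, dims):
    # Aggregate sales grouped by the given dimensions.
    out = {}
    for r in rows:
        key = tuple(r[DIMS[d]] for d in dims)
        out[key] = out.get(key, 0) + r[2]
    return out

def materialize(rows, chosen):
    # Partial materialization: only the selected cuboids are precomputed;
    # other group-bys would be answered from the base table or a nearby cuboid.
    return {dims: cuboid(rows, dims) for dims in chosen}

cube = materialize(rows, [("region",), ("region", "product")])
```

An online query over a materialized cuboid, such as total sales per region, then becomes a dictionary lookup rather than a scan of the base table.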
Related Technical Talks
"Edge-aware Efficient Data Storage Systems", Forum "storage systems in data centers" in CCF CNCC (China National Computer Congress), October2017.
"Semi-hierarchical Architecture for Smart Storage Systems", Keynote Speaker in the 23rd National Conference of Information Storage (NCIS), September 2017.
"Cost-efficient and Reliable Storage System in the Edge", Guangzhou University, September 2017.
"Cost-efficient Remote Transmission Services in Cloud Storage Systems", ACM Chongqing Workshop on Smart Sensing and Advanced Computing, August 2017.
"Cloud Storage Hardware Systems for Data Redundancy", The 7th Chinese Conference on Cloud Computing (CCCC), Dec. 2016.
"FCoE-based Large-scale Storage Systems", Forum "cloud computing and data center networks" in HPC China, October 2016.
"Searchable Storage Systems", Chinese Computer Systems Teaching Workshop, June 2016.
"Smarter Storage for Big Data Analytics: Architecture and Systems", Storage System Forum in University of Science and Technology of China, May 2016.
"Correlation-aware Large-scale Storage Systems", Industrial Technology Innovation Strategic Alliance of Storage, April 2016.
"Real-time Storage Systems for Disaster Management", Excellent Young Scholar Forum in the 21st National Conference of Information Storage (NCIS), September 2015.
"System and Architecture of Approximate Data Deduplication in Cloud Storage", Wuhan University, May 2015. See the News.
"Large-scale Network Storage Systems", Jinan University, November 2014. See the News.
"Approximate Data Deduplication in Storage Systems for Big Data", Excellent Young Scholar Forum in the 20th National Conference of Information Storage (NCIS), September 2014.
"Data Deduplication based Cloud Backup Systems", The workshop of smart city and cloud computing by ACM (China) Conference Committee, May 2014.
" HPC Plus(2015).
" The CCF Advanced Disciplines Lectures "Frontiers of Storage Systems and Devices" (2017).