
Big Data Management and Analytics

Release time: 2017-10-08

For the past three decades, classical database management systems (DBMSs) have advanced at a rapid pace, realizing significant efficiencies in handling the vast amount of information needed to model the operational characteristics of large-scale enterprises. In the intervening years, especially in the 1990s, data warehousing and data analysis emerged as a major research and technology frontier. While earlier DBMSs focused on modeling the operational characteristics of enterprises, big data systems are now expected to model vast amounts of heterogeneous and complex data. Scalable data management and complex data analytics for big data have therefore emerged as a new research frontier. The group's main contributions are as follows.

1) Indexing and query processing techniques: One major research topic in the data management field is extending data management systems to support more complex data types, such as spatial objects, images, and biological information. The research group at PKU proposed new data representation methods and indexing structures to organize different types of complex data, and designed multi-feature extraction, feature fusion, and dimension transformation techniques to alleviate the curse of dimensionality. Query processing is a key factor in the performance of database systems. Relational database systems support simple query tasks; however, many complex query tasks are not well supported because of the constraints of the relational model and SQL. The group proposed new solutions for query reformulation, query expression, P2P query processing, and buffer management to improve the performance of database systems.

2) Data mining and machine learning techniques: Complex data, such as social media data, is characterized by large volume, heterogeneity, rich structure, and correlation, which pose great challenges to researchers in this field. The group proposed new data mining techniques that integrate both content and structural features to mine the knowledge embedded in big data; proposed a novel network embedding method called “LINE”, which can embed very large information networks into low-dimensional vector spaces and is suitable for arbitrary types of information networks; and proposed new solutions for incorporating world knowledge into text mining via heterogeneous information networks.

3) Graph data management and mining: Graph processing systems are widely used in enterprises such as online social networks to process their daily jobs. With the fast growth of social applications, these systems have to handle massive numbers of concurrent jobs efficiently. However, because they are inherently designed for a single job, existing systems are highly inefficient in memory usage, execution, and fault tolerance. Motivated by this issue, the group designed and implemented a novel graph processing system that enables efficient job-level parallelism. The new design allows multiple concurrent jobs to share graph structure data in memory, which fundamentally increases job-level concurrency and reduces fault tolerance overhead. When executing concurrent graph jobs, the system significantly outperforms popular systems in both memory usage and job completion time. We have also developed various graph processing and mining methods, including an efficient top-k shortest path discovery method for large graphs and a distributed graph pattern-matching algorithm that achieves high efficiency on graphs with billions of edges.
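To make the shared-structure idea in item 3) concrete, here is a minimal Python sketch. It is not the group's system: the names build_shared_graph, bfs_levels_job, out_degree_job, and run_concurrent_jobs are hypothetical, and threads stand in for a real job scheduler. It only illustrates that when the graph structure is read-only and shared, each concurrent job needs memory only for its own state.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType


def build_shared_graph(edges):
    """Build one read-only adjacency map (directed) that all jobs share."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, [])
    # Freeze neighbour lists and wrap the dict so no job can mutate it.
    return MappingProxyType({v: tuple(ns) for v, ns in adjacency.items()})


def bfs_levels_job(graph, source):
    """Job A: hop distance from `source`; only `levels` is job-private."""
    levels = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbour in graph[vertex]:
            if neighbour not in levels:
                levels[neighbour] = levels[vertex] + 1
                queue.append(neighbour)
    return levels


def out_degree_job(graph):
    """Job B: out-degree per vertex, reading the same shared structure."""
    return {v: len(ns) for v, ns in graph.items()}


def run_concurrent_jobs(edges, source):
    """Run both jobs concurrently against a single in-memory graph."""
    graph = build_shared_graph(edges)  # built once, never copied per job
    with ThreadPoolExecutor(max_workers=2) as pool:
        bfs_future = pool.submit(bfs_levels_job, graph, source)
        degree_future = pool.submit(out_degree_job, graph)
        return bfs_future.result(), degree_future.result()


EDGES = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

Calling run_concurrent_jobs(EDGES, "a") returns the two jobs' private results. In a design of this kind, recovering a failed job would plausibly mean restoring only that job's state rather than the graph itself, which is one way to read the reduced fault tolerance overhead described above.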

The results of this research on big data management and analytics have been widely published at premier conferences, including SIGMOD, VLDB, SIGKDD, SIGIR, ICML, WWW, and AAAI, and in high-impact journals, including ACM TOIS/TODS and IEEE TKDE/TC/TPDS, and have received more than 3,000 Google Scholar citations. The work has received best paper or best student paper awards at WISE’10, WISE’13, and ICML’14. ICML is the top international conference in the machine learning area, and the 2014 Best Paper award to the group of Professor ZHANG, Ming was the first awarded to a group in China. The research achievements in data management and analytics have won the Second-Class State Award for Science and Technology Progress and the Natural Science Award of MOE China.

Beyond research papers, the big data group has also invested considerable effort in system building and applications.

1) The group has provided several web service systems to the public, such as the “TianWang” search engine (http://e.pku.edu.cn); “WebInfomall”, a Chinese Web archive covering the past 15 years (http://www.infomall.cn); “Maze”, a P2P file-sharing system with millions of users; and “AmazingStore”, which provides distributed data storage services.

2) Our techniques have been deployed in a graph-based recommendation system at Alibaba. Alibaba, a leading e-commerce company, provides a platform for users and shops, where users can browse, save, or purchase items in shops. The interactions between users, items, and shops can be modelled as a large, dynamic graph with rich features. Our methods run on a 1000-node cluster and process 10 billion newly added edges each week to learn user profiles and item similarities.

3) We implemented a new system, named Angel, to facilitate the development of large-scale ML applications in production environments. Angel employs hybrid parallelism to accelerate ML algorithms. The pulling of parameters and the pushing of updates are fully optimized in Angel to reduce network overhead. Angel has been deployed in a Tencent production cluster with thousands of nodes, where it supports various applications.
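The pull/push pattern mentioned in item 3) can be sketched generically as follows. This is not Angel's API: ParameterServer, worker_step, and train are hypothetical names, the model is a toy sparse linear regression, and workers run sequentially as a stand-in for parallel execution. The sketch assumes each worker pulls only the coordinates its data shard touches and pushes back sparse additive updates, which is one common way to keep network traffic small.

```python
class ParameterServer:
    """Holds the global weights; workers pull values and push updates."""

    def __init__(self, dim):
        self.weights = [0.0] * dim

    def pull(self, keys):
        # Return only the requested coordinates, not the whole model.
        return {k: self.weights[k] for k in keys}

    def push(self, updates):
        # Apply sparse additive updates sent by a worker.
        for k, delta in updates.items():
            self.weights[k] += delta


def worker_step(server, shard, lr):
    """One gradient step on a shard of (sparse_features, label) pairs."""
    keys = sorted({k for features, _ in shard for k in features})
    local = server.pull(keys)
    grad = {k: 0.0 for k in keys}
    for features, label in shard:
        pred = sum(local[k] * v for k, v in features.items())
        err = pred - label
        for k, v in features.items():
            grad[k] += err * v  # gradient of 0.5 * err**2
    server.push({k: -lr * g / len(shard) for k, g in grad.items()})


def train(shards, dim, lr=0.5, rounds=200):
    """Each shard plays the role of one worker's local data."""
    server = ParameterServer(dim)
    for _ in range(rounds):
        for shard in shards:  # sequential stand-in for parallel workers
            worker_step(server, shard, lr)
    return server.weights


# Two workers, each touching a different coordinate of a 3-dim model.
SHARDS = [
    [({0: 1.0}, 2.0)],
    [({1: 1.0}, 3.0)],
]
```

With these toy shards, train(SHARDS, dim=3) drives the first two weights toward 2 and 3, while the third weight is never pulled or pushed because no shard references it.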

A system framework for data extraction, management, and analysis, named “APEMA”, was also developed. Over the past 10 years, these achievements have been deployed in nationally important application fields, such as space vehicle design and national information security management. They have also been applied in products of companies such as Microsoft and GeoCyber Solution Inc.