Language Knowledge Base for the Development of Chinese Information Processing

2017-10-05


Language information processing aims to enable computers to understand and use human languages. The Comprehensive Language Knowledge Base (CLKB) is infrastructure for language information processing technology and industry. It provides formalized, standardized language knowledge and core software for new-generation information technologies such as smart search, machine translation and human-machine conversation.

In China, formal research on language information processing emerged in the mid-1980s, when very few Chinese language resources existed anywhere in the world. Unlike English and Japanese, Chinese lacked formalized annotation, which made it difficult to construct the Chinese language knowledge base that Chinese information processing badly needed. The research group at the Institute of Computational Linguistics at Peking University, whose members have deep expertise in the Chinese language and Chinese culture, has studied computational models for Chinese and methodologies for building language knowledge bases since 1986. The comprehensive language knowledge base developed over these 30 years strongly supports original scientific research and application development in Chinese information processing.

CLKB is novel in the following respects: (1) it serves as a practical Chinese language resource for information processing, something that did not exist before; (2) it builds a language knowledge description system and computational models for Chinese; (3) it formulates a series of standards for language knowledge construction, some of which have been adopted by China's national bureau of standards; (4) it presents an integration solution based on word senses, breaking the technical bottleneck of integrating heterogeneous knowledge bases; (5) it proposes an engineering method for building a language knowledge base.

CLKB, which won the National Scientific and Technological Progress Award in 2011, includes 6 language knowledge bases, 10 standards, 4 core software packages and 4 application systems. These components support one another and form a closely connected whole.

The language knowledge bases are the core of CLKB. They include:

1) The grammatical knowledge base of contemporary Chinese, containing 3.6 million grammatical attribute descriptions for 80,000 words.

2) The bank of Chinese phrase structure rules, containing more than 600 grammatical rules.

3) The multistage processing corpus of contemporary Chinese, which has word segmentation and part-of-speech tags for 150 million Chinese characters. Of these, 52 million characters carry fine-grained annotation and 28 million characters are tagged with word senses.

4) The multilingual concept dictionary, containing 100,000 synset concepts.

5) The parallel corpus, containing 1 million English-Chinese parallel sentence pairs.

6) The multi-domain term bank, containing 350,000 Chinese-English bilingual terms.

The series of language knowledge bases provided by CLKB covers several levels of language units (words, phrases, sentences and discourse) and several kinds of knowledge (lexical, syntactic and semantic). It can be used in both general and specialized domains, and in multiple languages including Chinese. CLKB is the largest and most widely acknowledged Chinese language knowledge resource.

CLKB has not only had a profound effect in academia but has also produced substantial social and economic benefits. Its manual and proposed standards have been widely cited, and two excellent doctoral theses have focused on CLKB. Both projects that won the first prize of the Qian Weichang Chinese Information Processing Science and Technology Award (the top award in the field of Chinese information processing) in 2010 state that all the linguistic knowledge in their systems comes from CLKB. CLKB has more than 10,000 free users, and its contract users since 1996 are spread all over the world, including Apple, Google, IBM, Intel, Microsoft and Huawei. CLKB also contributes to the processing of minority languages in China, sign language translation and the international promotion of Chinese. CLKB has had an unusually long life cycle for the IT field.

The grammatical knowledge base of contemporary Chinese, which serves as the core part of CLKB, won the second prize of the Ministry of Education's Technology Progress Award in 1998. Prof. Shiwen Yu, the first inventor of CLKB, received the first Lifetime Achievement Award from the Chinese Information Processing Society at the society's 30th anniversary in 2011. CLKB and its inventors have won many other awards as well. After winning the National Scientific and Technological Progress Award in 2011, CLKB was awarded Peking University's special contribution prize for the combination of production, education and scientific research in 2013.

The language knowledge resources in CLKB, which are based on the grammatical knowledge base of contemporary Chinese, have been transferred to users through negotiated agreements since 1996. As of 2017, these transfers have continued without interruption. It is uncommon for a research achievement to have such a long life in the IT field.