A Discourse-Based Chinese ChunkBank

Lu Lu, Jiao Hong-Yan, Li Meng, Xun En-Dong

Citation: Lu Lu, Jiao Hong-Yan, Li Meng, Xun En-Dong. A discourse-based Chinese chunkbank. Acta Automatica Sinica, 2020, 46(x): 1−11 doi: 10.16383/j.aas.c190828

doi: 10.16383/j.aas.c190828
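Funding: Supported by the National Social Science Fund of China, Key Program (16AYY007), and the 2019 Graduate Innovation Fund of Beijing Language and Culture University (19YCX121)
    Author biographies:

    Lu Lu: Master's student at Beijing Language and Culture University, majoring in Language Intelligence and Technology. E-mail: 201821198367@stu.blcu.edu.cn

    Jiao Hong-Yan: Master's student at Beijing Language and Culture University. Main research interest: natural language processing. E-mail: jiaohongyan0815@163.com

    Li Meng: Master's student at Beijing Language and Culture University, majoring in Computer Application Technology.

    Xun En-Dong: Corresponding author. Professor at the School of Information Science, Beijing Language and Culture University. Main research interest: natural language processing. E-mail: edxun@blcu.edu.cn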

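  • Abstract: To rapidly build a large-scale, multi-domain, high-quality treebank, this paper proposes an annotation scheme based on chunks defined by phrase function and syntactic role, designed so that multi-level structures are easy to annotate. At the discourse level, punctuation, syntactic structure, and expressive function are used jointly as criteria for judging sentence boundaries, so that reasonable sentence boundaries and levels can be established. Within the sentence, the scheme relies mainly on the syntactic function of chunks, with reference to their discourse and interpersonal functions. It uses 4 property tags, 8 function tags, and 4 sentence tags to describe 3 classes and 5 kinds of chunks, annotating the basic sentence-pattern skeleton and highlighting head-word information. So far, a quality-assured set of shallow-structure analysis trees covering about ten million Chinese characters has been preliminarily built, including a sentence-pattern library of more than 9,000 patterns drawn from over 600,000 clauses. The corpus comprises more than 10,000 texts from application domains such as encyclopedia, news, and patents. An efficient crowdsourced annotation management model has also been explored.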
  • Fig. 1  A sample chunk annotation result and the corresponding shallow chunk-based syntactic tree

    Fig. 2  The hierarchy of Chinese sentences in discourse

    A discourse (text) is composed of paragraphs, and each paragraph is composed of simple or compound sentences. The clause is the basic sentence unit: it can stand alone as a simple sentence, or combine with other clauses to form a compound sentence. A compound sentence may also contain sentence fragments acting as clauses, which mainly serve connective and modal functions.
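The hierarchy described in this caption can be sketched as a small data structure. This is only an illustration, not the authors' storage format: the class names, the `is_compound` rule (more than one clause-level unit), and the English placeholder text are all assumptions. The tag names `ROOT`, `IP`, and `HLP` are taken from the article's tag set.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """A clause-level unit: a complete clause ("IP") or a fragment ("HLP")."""
    tag: str
    text: str


@dataclass
class Sentence:
    """A simple or compound sentence (tagged ROOT in the tag set)."""
    units: List[Unit] = field(default_factory=list)

    def clauses(self) -> List[Unit]:
        return [u for u in self.units if u.tag == "IP"]

    def fragments(self) -> List[Unit]:
        return [u for u in self.units if u.tag == "HLP"]

    def is_compound(self) -> bool:
        # Assumption: more than one clause-level unit means a compound sentence.
        return len(self.units) > 1


@dataclass
class Paragraph:
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Discourse:
    paragraphs: List[Paragraph] = field(default_factory=list)

    def count_units(self, tag: str) -> int:
        """Count clause-level units carrying the given tag in the whole text."""
        return sum(
            1
            for p in self.paragraphs
            for s in p.sentences
            for u in s.units
            if u.tag == tag
        )


# Hypothetical toy content (English placeholders, not treebank data).
simple = Sentence([Unit("IP", "the match started")])
compound = Sentence([
    Unit("IP", "it rained"),
    Unit("HLP", "as a result"),
    Unit("IP", "the match was cancelled"),
])
doc = Discourse([Paragraph([simple, compound])])
print(doc.count_units("IP"), doc.count_units("HLP"), compound.is_compound())
```

Keeping fragments as siblings of clauses under the sentence node mirrors the caption's statement that fragments act as clauses inside compound sentences.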

    Fig. 3  The interface of the annotation platform and the management tool

    Fig. 4  Log-log distribution of sentence-pattern skeletons that have more than one example sentence

    High-frequency sentence patterns are few and rarely share the same frequency; low-frequency patterns show the opposite tendency.
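A log-log plot of this kind can be derived from a list of sentence-pattern skeletons roughly as follows. This is a minimal sketch under assumptions: the x-axis is taken to be a pattern's frequency and the y-axis the number of patterns sharing that frequency, the skeleton strings are hypothetical, and `loglog_points` is an illustrative helper rather than the authors' tooling.

```python
import math
from collections import Counter


def loglog_points(skeletons, min_examples=2):
    """Return sorted (log10 frequency, log10 number of patterns) pairs.

    Patterns with fewer than `min_examples` example sentences are dropped,
    matching the caption's restriction to patterns with more than one example.
    """
    pattern_freq = Counter(skeletons)
    kept = [f for f in pattern_freq.values() if f >= min_examples]
    freq_of_freq = Counter(kept)
    return sorted(
        (math.log10(freq), math.log10(n_patterns))
        for freq, n_patterns in freq_of_freq.items()
    )


# Hypothetical skeletons written with function tags from the tag set.
skeletons = (
    ["SBJ PRD OBJ"] * 8
    + ["SBJ PRD"] * 2
    + ["PRD OBJ"] * 2
    + ["SBJ MOD PRD"] * 1
)
points = loglog_points(skeletons)
for x, y in points:
    print(f"{x:.3f}\t{y:.3f}")
```

In the toy data, the single high-frequency pattern has a frequency that no other pattern shares, while the two low-frequency patterns share theirs, which is the tendency the caption describes.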

    Fig. 5  Distribution in IP clauses of invalid or missing sentence-division punctuation

    Fig. 6  Distribution of Kappa coefficients over all texts in the treebank, and the mean Kappa coefficient for each annotation label

    The Kappa coefficients of all texts in the treebank fall between 0.8 and 1. "@" denotes nominal subjects and objects.
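For readers unfamiliar with the statistic, the sketch below computes plain (unmodified) Cohen's kappa for two annotators over an already aligned label sequence. It is an assumption that labels can be aligned one-to-one; the authors' exact agreement procedure is not given in this section, and the label sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long, aligned label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of positions where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of ten chunks by two annotators.
annotator_a = ["SBJ", "PRD", "OBJ", "MOD", "PRD", "OBJ", "SBJ", "PRD", "OBJ", "W"]
annotator_b = ["SBJ", "PRD", "OBJ", "MOD", "PRD", "SBJ", "SBJ", "PRD", "OBJ", "W"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

With one disagreement in ten chunks, observed agreement is 0.9 and chance agreement is 0.23, so kappa is about 0.87, inside the 0.8 to 1 range reported in the caption.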

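    Table 1  Tag set of the chunk treebank

    | No. | Symbol | Tag type |
    |---|---|---|
    | 1 | VP | Predicate (verbal) chunk |
    | 2 | NP | Nominal chunk |
    | 3 | UNK | Chunk coordinating predicate and nominal elements |
    | 4 | NULL | Chunk of another nature |
    | 5 | PRD | Predicate |
    | 6 | NPRE | Nominal predicate |
    | 7 | MOD | Adverbial or complement |
    | 8 | SBJ | Subject |
    | 9 | OBJ | Object |
    | 10 | CON | Connective chunk |
    | 11 | AUX | Auxiliary chunk |
    | 12 | ROOT | Simple or compound sentence |
    | 13 | IP | Complete clause |
    | 14 | HLP | One-word sentence or fragment |
    | 15 | W | Punctuation |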
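The tag set in Table 1 can be held in a simple lookup table, for instance to check that an annotation uses only known symbols. This is a sketch: the English glosses are translations, and `unknown_tags` is a hypothetical helper, not part of any released tool.

```python
# Symbols as listed in Table 1; English glosses are translations.
TAGS = {
    "VP": "predicate (verbal) chunk",
    "NP": "nominal chunk",
    "UNK": "chunk coordinating predicate and nominal elements",
    "NULL": "chunk of another nature",
    "PRD": "predicate",
    "NPRE": "nominal predicate",
    "MOD": "adverbial or complement",
    "SBJ": "subject",
    "OBJ": "object",
    "CON": "connective chunk",
    "AUX": "auxiliary chunk",
    "ROOT": "simple or compound sentence",
    "IP": "complete clause",
    "HLP": "one-word sentence or fragment",
    "W": "punctuation",
}


def unknown_tags(sequence):
    """Return the symbols in `sequence` that are not in the tag set, in order."""
    return [tag for tag in sequence if tag not in TAGS]


# Hypothetical tag sequence containing one symbol outside the tag set.
print(unknown_tags(["SBJ", "PRD", "XYZ", "OBJ", "W"]))
```

A check like this is cheap to run on every submitted file, which can matter when annotation is crowdsourced.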

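    Table 2  Distribution of the currently valid annotated corpus

    | Category | ROOT | IP | HLP | Chinese characters | Files | Description |
    |---|---|---|---|---|---|---|
    | News | 132009 | 359373 | 50378 | 4920170 | 4813 | Sina news (2006) and Xinhua News Agency news (2012-2018) |
    | Encyclopedia | 76595 | 149097 | 14823 | 2376151 | 2982 | Automatic control systems; electronics and computers; light industry; atmospheric, ocean, and hydrological sciences; aerospace; economics |
    | Patents | 69260 | 166462 | 16966 | 2839935 | 3915 | Description and claims sections of 2018 national patent application documents |

    Length of ROOT in Chinese characters, and average number of characters per IP:

    | Category | Mean | Max | Min | Median | Mode | Avg. characters per IP |
    |---|---|---|---|---|---|---|
    | News | 37 | 837 | 1 | 33 | 1 | 13.69 |
    | Encyclopedia | 31 | 251 | 0 | 27 | 20 | 15.94 |
    | Patents | 40 | 819 | 0 | 29 | 16 | 17.06 |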
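The counts in Table 2 can be cross-checked with a few lines of arithmetic. The sketch below sums the per-domain counts and recomputes the average number of characters per IP as total characters divided by the number of IP clauses. That reading of the column is an inference, although it reproduces the tabulated values. The dictionary keys are illustrative names.

```python
# Per-domain counts copied from Table 2.
corpus = {
    "news": {"ROOT": 132009, "IP": 359373, "HLP": 50378, "chars": 4920170, "files": 4813},
    "encyclopedia": {"ROOT": 76595, "IP": 149097, "HLP": 14823, "chars": 2376151, "files": 2982},
    "patents": {"ROOT": 69260, "IP": 166462, "HLP": 16966, "chars": 2839935, "files": 3915},
}

# Column totals across the three domains.
totals = {
    key: sum(row[key] for row in corpus.values())
    for key in ("ROOT", "IP", "HLP", "chars", "files")
}

# Assumed definition: average characters per IP = characters / IP clauses.
chars_per_ip = {
    name: round(row["chars"] / row["IP"], 2) for name, row in corpus.items()
}

print(totals)
print(chars_per_ip)
```

The totals (674,932 IP clauses, 10,136,256 characters, 11,710 files) agree with the abstract's figures of over 600,000 clauses, about ten million characters, and more than 10,000 texts.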
Publication history
  • Received: 2019-12-15
  • Revised: 2020-04-10
