A Discourse-based Chinese ChunkBank

Lu Lu, Jiao Hong-Yan, Li Meng, Xun En-Dong

Citation: Lu Lu, Jiao Hong-Yan, Li Meng, Xun En-Dong. A discourse-based Chinese chunkbank. Acta Automatica Sinica, 2022, 48(12): 1−11 doi: 10.16383/j.aas.c190828

doi: 10.16383/j.aas.c190828
Funds: Supported by National Social Science Foundation of China (16AYY007) and Graduate Research and Innovation Foundation of Beijing Language and Culture University (19YCX121)
    Author Bio:

    LU Lu  Master student at the College of Information Science, Beijing Language and Culture University. Her main research interest is linguistic intelligence and technology. E-mail: 201821198367@stu.blcu.edu.cn

    JIAO Hong-Yan  Master student at the College of Information Science, Beijing Language and Culture University. Her main research interest is natural language processing. E-mail: jiaohongyan0815@163.com

    LI Meng  Master student at the College of Information Science, Beijing Language and Culture University. Her main research interest is computer applications technology. E-mail: limeng_gertrude@163.com

    XUN En-Dong  Professor at the College of Information Science, Beijing Language and Culture University. His main research interest is natural language processing. Corresponding author of this paper. E-mail: edxun@blcu.edu.cn

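  • Abstract: To rapidly build a large-scale, multi-domain, high-quality treebank, we propose an annotation scheme based on chunks defined by phrase function and syntactic role, designed to make multi-level structure easy to annotate. At the discourse level, punctuation, syntactic structure, and expressive function are used together as the criteria for judging sentence boundaries, so that reasonable sentence boundaries and levels are established. Within a sentence, annotation relies mainly on the syntactic function of each chunk, with reference to discourse function and interpersonal function: 4 property tags, 8 function tags, and 4 sentence tags describe the 3 classes and 5 kinds of chunks in a sentence, marking the skeleton of the basic sentence pattern and highlighting head-word information. So far we have built a quality-assured shallow-structure treebank on the scale of ten million Chinese characters, together with a library of more than 9,000 sentence-pattern structures covering over 600,000 clauses. The corpus comprises more than 10,000 texts from application domains such as encyclopedia, news, and patents. We have also explored an efficient crowdsourcing management model for annotation.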
  • Fig. 1  A sample of chunk-style annotation and the resulting shallow chunk-based syntactic tree

    Fig. 2  A sample of the hierarchy of "sentences" in Chinese discourse

    Fig. 3  The interfaces of the annotation platform and the management tool

    Fig. 4  The logarithmic rank-frequency (Zipf) distribution of the basic Chinese sentence patterns, random method

    Fig. 5  The distribution of invalid and missing sentence-dividing punctuation in IP clauses

    Fig. 6  The distribution of the Kappa coefficient over all texts in the treebank, and the mean Kappa for each kind of label
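
    Fig. 6 reports agreement as Kappa values. As a reminder of what such a value measures, the following is a minimal sketch of the standard two-annotator Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e). It is not the authors' procedure: the page does not say which Kappa variant was used, and the function name `cohen_kappa` and the six example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Standard Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label twice.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical function tags assigned by two annotators to the same six chunks.
annotator_1 = ["SBJ", "PRD", "OBJ", "MOD", "PRD", "OBJ"]
annotator_2 = ["SBJ", "PRD", "OBJ", "PRD", "PRD", "MOD"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

    In this toy case the annotators agree on 4 of 6 chunks (p_o = 2/3) while chance agreement is 10/36, so kappa is well below the raw agreement rate.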

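    Table 1  Tags for chunk-tree

    No.  Tag   Tag type
    1    VP    Predicative chunk
    2    NP    Nominal chunk
    3    UNK   Chunk coordinating predicative and nominal elements
    4    NULL  Chunk of other nature
    5    PRD   Predicate
    6    NPRE  Nominal predicate
    7    MOD   Adverbial, complement
    8    SBJ   Subject
    9    OBJ   Object
    10   CON   Connective chunk
    11   AUX   Auxiliary chunk
    12   ROOT  Simple or complex sentence
    13   IP    Complete clause
    14   HLP   One-word sentence or fragment
    15   W     Punctuation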
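
    The tag inventory of Table 1 can be held in a small lookup structure, for instance to validate labels coming out of an annotation tool. The sketch below is an illustration only. The tags and glosses are taken from Table 1, but the grouping into "property", "function", and "sentence" families is an assumption read off the row order and the abstract's wording, and the names `TAGSET` and `tag_family` are hypothetical.

```python
# Tags and glosses from Table 1; the three family names are an assumed grouping.
TAGSET = {
    "property": {
        "VP": "predicative chunk",
        "NP": "nominal chunk",
        "UNK": "chunk coordinating predicative and nominal elements",
        "NULL": "chunk of other nature",
    },
    "function": {
        "PRD": "predicate",
        "NPRE": "nominal predicate",
        "MOD": "adverbial, complement",
        "SBJ": "subject",
        "OBJ": "object",
        "CON": "connective chunk",
        "AUX": "auxiliary chunk",
    },
    "sentence": {
        "ROOT": "simple or complex sentence",
        "IP": "complete clause",
        "HLP": "one-word sentence or fragment",
        "W": "punctuation",
    },
}


def tag_family(tag):
    """Return the family a tag belongs to, or raise KeyError if it is unknown."""
    for family, tags in TAGSET.items():
        if tag in tags:
            return family
    raise KeyError(f"unknown tag: {tag!r}")


all_tags = [tag for tags in TAGSET.values() for tag in tags]
print(len(all_tags), tag_family("SBJ"), tag_family("IP"))
```

    Keeping the glosses next to the tags lets the same structure drive both validation and a legend in an annotation interface.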

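    Table 2  The data distribution of our treebank

    Category      ROOT    IP      HLP    Characters  Files  Notes
    News          132009  359373  50378  4920170     4813   Sina news (2006); Xinhua News Agency news (2012-2018)
    Encyclopedia  76595   149097  14823  2376151     2982   Automatic control systems; electronics and computers; light industry; atmospheric, marine, and hydrological sciences; aerospace; economics
    Patent        69260   166462  16966  2839935     3915   Description and claims sections of 2018 national patent application documents

                  Chinese characters per ROOT        Characters per IP
    Category      Mean  Max  Min  Median  Mode       Mean
    News          37    837  1    33      1          13.69
    Encyclopedia  31    251  0    27      20         15.94
    Patent        40    819  0    29      16         17.06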
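
    The figures in Table 2 can be cross-checked with simple arithmetic. The sketch below assumes the count columns are, in order, ROOT, IP, HLP, characters, and files, and recomputes the mean characters per IP clause as characters divided by IP count, along with corpus-wide totals. The names `COUNTS` and `chars_per_ip` are hypothetical.

```python
# (ROOT, IP, HLP, characters, files) per domain, under the assumed column order.
COUNTS = {
    "news": (132009, 359373, 50378, 4920170, 4813),
    "encyclopedia": (76595, 149097, 14823, 2376151, 2982),
    "patent": (69260, 166462, 16966, 2839935, 3915),
}


def chars_per_ip(domain):
    """Mean number of characters per IP clause for one domain."""
    _root, ip, _hlp, chars, _files = COUNTS[domain]
    return chars / ip


# Column-wise sums over the three domains.
total_ip = sum(row[1] for row in COUNTS.values())
total_chars = sum(row[3] for row in COUNTS.values())
total_files = sum(row[4] for row in COUNTS.values())

for domain in COUNTS:
    print(f"{domain}: {chars_per_ip(domain):.2f} characters per IP")
print(total_ip, total_chars, total_files)
```

    Under this reading, the per-IP means come out at 13.69, 15.94, and 17.06, and the totals are about 675 thousand IP clauses, 10.1 million characters, and 11,710 files, in line with the scale stated in the abstract.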
Publication history
  • Received:  2019-12-05
  • Published online:  2022-11-23
