Abstract: Identifying the language of short texts is an important prerequisite for natural language processing on social media, and it remains a challenging and active research topic. Because of out-of-vocabulary words and interference from words shared across languages, the accuracy of traditional n-gram based methods for short-text language identification (such as Textcat, LIGA, and logLIGA) varies widely from one dataset to another, so their robustness is poor. This paper proposes an improved language identification method based on n-gram frequency. According to the characteristics of the training data, it automatically determines the weights of each language's feature words and of the common words shared between languages, which improves the robustness of the language identification model across datasets. Experimental results demonstrate the effectiveness of the method.
Key words:
- Language identification
- short text
- n-gram frequency
- robustness
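The idea summarized in the abstract, weighting n-grams that are specific to one language differently from n-grams shared by several, can be illustrated with a minimal sketch. This is not the authors' method: the function names, the toy training sentences, and the fixed weights `w_feature` and `w_common` are all assumptions made for illustration, whereas the paper determines the weights automatically from the training data.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one relative-frequency n-gram profile per language.

    samples maps a language code to a list of training texts.
    """
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter(g for t in texts for g in char_ngrams(t, n))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

def identify(text, profiles, n=3, w_feature=2.0, w_common=1.0):
    """Return the language with the highest weighted n-gram score.

    n-grams seen in exactly one language profile count with w_feature,
    n-grams shared by several profiles count with w_common.
    Both weights are hypothetical constants, not values from the paper.
    """
    scores = {lang: 0.0 for lang in profiles}
    for g in char_ngrams(text, n):
        owners = [lang for lang, p in profiles.items() if g in p]
        if not owners:
            continue  # out-of-vocabulary n-gram: contributes nothing
        weight = w_feature if len(owners) == 1 else w_common
        for lang in owners:
            scores[lang] += weight * profiles[lang][g]
    return max(scores, key=scores.get)

# Toy training data (hypothetical, far too small for real use).
samples = {
    "en": ["the quick brown fox", "this is the house"],
    "nl": ["de snelle bruine vos", "dit is het huis"],
}
profiles = train(samples)
print(identify("the fox", profiles))
print(identify("het huis", profiles))
```

With fixed weights the balance between feature and shared n-grams has to be tuned by hand for each dataset, which is the kind of dataset sensitivity the abstract sets out to reduce.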
1) Editorial board member responsible for this paper: Jia Lei
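Table 1 Overview of the four datasets

| Dataset | Languages | Files | Cross-validation training set | Cross-validation test set |
|---|---|---|---|---|
| Europarl | 21 | 21 000 | 18 900 | 2 100 |
| LIGA | 6 | 9 066 | 8 160 | 906 |
| Twituser_21 | 21 | 6 356 | 5 721 | 635 |
| Twituser_7 | 7 | 2 970 | 2 673 | 297 |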
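The training and test sizes in Table 1 are each close to a 9:1 split of the file count, which is consistent with 10-fold cross-validation. The table does not state the number of folds, so the sketch below treats `folds=10` as an assumption and simply checks that taking one tenth of the files (rounded down) as the test set reproduces the table. The helper name `holdout_sizes` is hypothetical.

```python
def holdout_sizes(n_files, folds=10):
    """Return (train, test) sizes when one fold out of `folds` is held out.

    The test fold takes n_files // folds files; the rest are used for training.
    """
    test = n_files // folds
    return n_files - test, test

# File counts copied from Table 1.
datasets = {
    "Europarl": 21000,
    "LIGA": 9066,
    "Twituser_21": 6356,
    "Twituser_7": 2970,
}

for name, n_files in datasets.items():
    train_size, test_size = holdout_sizes(n_files)
    print(f"{name}: train={train_size}, test={test_size}")
```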