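A Review on Authorship Identification Research

ZHANG Yang^1, JIANG Ming-Hu^1

doi: 10.16383/j.aas.c200654 cstr: 32138.14.j.aas.c200654

1. Lab of Computational Linguistics, School of Humanities, Tsinghua University, Beijing 100084

Funds: Supported by National Natural Science Foundation of China (62036001)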

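Author Bio:
ZHANG Yang: Ph.D. candidate in the Department of Chinese Language and Literature, School of Humanities, Tsinghua University. His research interests include authorship identification, text classification, and sentiment analysis. E-mail: yumaoqiuq@163.com

JIANG Ming-Hu: Professor in the Department of Chinese Language and Literature, School of Humanities, Tsinghua University. His research interests include natural language processing, brain and language cognition, pattern recognition, and artificial intelligence. Corresponding author of this paper. E-mail: jiang.mh@mail.tsinghua.edu.cn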

Publication history
- Received: 2020-08-14
- Accepted: 2021-02-09
- Published online: 2021-03-17
- Issue date: 2021-11-18


Abstract

Abstract: Authorship identification is an interdisciplinary field concerned with inferring the author of an unknown text from texts of known authorship. Traditional research generally relies on empirical knowledge from literature or linguistics, whereas modern research mostly uses mathematical methods to quantify an author's writing style. In recent years, with the development of cognitive science, systems science, and information technology, authorship identification has attracted growing attention from researchers. This paper reviews the methods and ideas of modern authorship identification research, mainly from the perspective of computational linguistics. First, the development history of authorship identification is briefly introduced. Then, stylometric features, authorship identification methods, and multi-faceted research in the field are described in detail. Next, evaluation campaigns, datasets, and evaluation metrics related to authorship identification are presented. Finally, open problems in the field are pointed out, and the development trends of authorship identification are analyzed and projected in light of these problems.
- Authorship identification /
- stylometry /
- writing style /
- evaluation metrics
Notes:

1) https://umlt.infotech.monash.edu/?page_id=266

2) http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

3) https://umlt.infotech.monash.edu/?page_id=152

4) https://www.cs.cmu.edu/~./enron/

5) https://drive.google.com/drive/folders/1hlIWVSt0dfy8fz8d4w RzZItl-LCo5BH1?usp=sharing

6) https://archive.ics.uci.edu/ml/datasets/Reuter_50_50

7) https://pan.webis.de

Fig. 1 Flow diagram of authorship identification

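Table 1 Comparison of stylometric features

| Stylistic feature | Subtypes | Ease of extraction | Breadth of use | Remarks |
| --- | --- | --- | --- | --- |
| Character features | Character counts, character n-grams, character errors | Very easy; extracted directly | Very high | Topic-independent; can capture writing errors; feature dimensionality easily becomes too large, causing data sparsity |
| Lexical features | Word length, word frequency, vocabulary richness, word n-grams, spelling errors | Easy; extracted directly or after word segmentation | Very high | Topic-dependent; can capture writing errors |
| Syntactic features | Phrase or sentence structure, part-of-speech n-grams, syntactic n-grams, rewrite rule frequencies | Fairly hard; deep syntactic features require a syntactic parser | Low | Topic-independent; usually not continuous; parsers tend to introduce noise |
| Semantic features | Synonyms, semantic dependencies | Hard; requires semantic analysis tools | Very low | Topic-dependent; usually a supplement to other features, rarely used alone |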
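As an illustration of why the character and lexical rows in Table 1 are rated easy to extract, the following is a minimal sketch that counts character n-grams and computes a simple vocabulary richness score. The function names, the choice of type-token ratio as the richness measure, and the sample sentence are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_profile(text: str, n: int = 3, top_k: int = 5) -> dict:
    """Return the top_k most frequent n-grams with their relative frequencies."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(top_k)}


def type_token_ratio(text: str) -> float:
    """Vocabulary richness as distinct words divided by total words.

    Type-token ratio is one simple richness measure, chosen here as an
    assumption; whitespace splitting stands in for real tokenization.
    """
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical sample text, used only to exercise the functions.
sample = "the cat sat on the mat"
profile = relative_profile(sample, n=2, top_k=3)
ttr = type_token_ratio(sample)
```

Character n-grams need no tokenizer at all, which matches the "extracted directly" rating, while the number of distinct n-grams grows quickly with n, which is the sparsity problem noted in the Remarks column.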

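Table 2 Comparison of unsupervised methods

| Method | Model | Strategy | Algorithm |
| --- | --- | --- | --- |
| k-means clustering | k-center clustering | Minimize the distance between samples and their cluster centers | Iterative algorithm |
| Hierarchical clustering | Clustering tree | Minimize the distance between samples within a class | Heuristic algorithm |
| Gaussian mixture clustering | Gaussian mixture model | Maximize the likelihood function | Expectation-maximization algorithm |
| Latent semantic analysis (LSA) | Matrix factorization model | Minimize squared loss | Singular value decomposition |
| Latent Dirichlet allocation (LDA) | LDA model | Posterior probability estimation | Gibbs sampling, variational inference |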
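The k-means row of Table 2 (distance to the cluster center minimized by an iterative algorithm) can be sketched as follows. This is a bare-bones version of Lloyd's iteration under stated assumptions: the two-dimensional feature vectors are invented stand-ins for per-document style features, and the centroids are initialized from the first k points purely to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    # Naive initialization from the first k points (an assumption for
    # reproducibility; practical systems use better seeding).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical style vectors for four documents by two unknown authors.
points = [(1.0, 1.0), (5.0, 5.1), (1.2, 0.8), (4.8, 5.3)]
labels, centroids = kmeans(points, k=2)
```

In an authorship clustering setting, each point would be a document's feature vector (for example, normalized n-gram frequencies), and documents that share a label would be treated as candidates for a common author.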

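Table 3 Comparison of supervised methods

| Method | Model type | Model characteristics | Learning strategy | Stability | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Naive Bayes (NB) | Generative | Joint probability distribution of features and classes; conditional independence assumption | Maximum likelihood estimation, maximum a posteriori estimation | High | Low |
| Support vector machine (SVM) | Discriminative | Separating hyperplane; kernel trick | Minimize the regularized hinge loss; soft-margin maximization | Medium | High |
| Decision tree (DT) | Discriminative | Classification tree, regression tree | Regularized maximum likelihood estimation | Medium | Medium |
| k-nearest neighbors (KNN) | Discriminative | Feature space, sample points | None | Low | Medium |
| Neural network (NN) | Discriminative | Topology of neurons | Minimize the objective function | Medium | Fairly high |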
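To make the NB row of Table 3 concrete, here is a minimal sketch of a multinomial naive Bayes attributor over word counts. It reflects the conditional independence assumption by summing per-token log probabilities. The class name, the add-one (Laplace) smoothing parameter `alpha`, the whitespace tokenization, and the toy training texts are all assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesAuthor:
    """Multinomial naive Bayes over word tokens (illustrative sketch)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.doc_counts = Counter()              # documents per author
        self.token_counts = defaultdict(Counter) # token counts per author
        self.vocab = set()

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            tokens = text.lower().split()
            self.doc_counts[author] += 1
            self.token_counts[author].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_author, best_score = None, -math.inf
        for author, n in self.doc_counts.items():
            counts = self.token_counts[author]
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per token.
            score = math.log(n / n_docs)
            for tok in tokens:
                score += math.log(
                    (counts[tok] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Hypothetical toy corpus: author "A" prefers "whilst"/"upon", "B" "while"/"on".
texts = [
    "whilst the rain fell upon the moor",
    "upon the hill whilst the wind blew",
    "while the rain fell on the street",
    "on the road while the wind blew",
]
authors = ["A", "A", "B", "B"]
model = NaiveBayesAuthor().fit(texts, authors)
guess_a = model.predict("whilst we walked upon the shore")
guess_b = model.predict("while they sat on the road")
```

The toy corpus separates the two authors by function-word choice, so only a handful of tokens drive the decision; on realistic data the same structure applies but needs far more text per author.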
