A Novel Document Representation Method for Author Profiling using Auto-Encoders
Abstract
Author profiling identifies demographic characteristics of the author of a given text, such as age, gender, religion, language, and nationality. It has applications in areas such as marketing, security, education, and forensics. Most research has concentrated on predicting an author's age and gender by analyzing the author's writing. Early work on author profiling relied on different types of stylistic features. Researchers later realized that the same set of stylistic features is used across the different age and gender classes. The research community then observed that the writing styles of different authors are best differentiated by content-based features, such as the words used in a text. Some successful works used feature selection methods to identify the most relevant words in the datasets, while others used term weight measures, which denote the importance of a term within a document, to represent documents as vectors. The performance of author profiling approaches depends mainly on the type of information used to represent the documents as vectors. In this work, we develop a novel document representation method that combines different types of information. The proposed representation considers three kinds of information: the compressed feature representation of the document identified by the auto-encoder, the contextualized information of word embeddings, and the importance of a word within a document. We conducted experiments on two standard datasets, a reviews dataset and a Twitter dataset, provided in the PAN 2014 and PAN 2016 competitions respectively.
In these datasets, the gender data consists of two classes of documents, female and male, and the age data consists of five classes of documents: 18-24, 25-34, 35-49, 50-64, and 65-xx. The proposed document representation method achieves the best performance for age and gender prediction on both datasets when compared with several popular author profiling methods.
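The abstract does not specify how the three kinds of information are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes TF-IDF as the term weight measure, a tiny one-hidden-layer auto-encoder trained by plain gradient descent, and concatenation as the way to combine the parts. The embeddings are random static placeholders standing in for the contextualized embeddings the abstract mentions. The function names (`tfidf_matrix`, `train_autoencoder`, `represent`) and the toy documents are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Build a (n_docs, vocab_size) TF-IDF matrix from tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    df = Counter(t for d in docs for t in set(d))
    X = np.zeros((len(docs), len(vocab)))
    for row, d in enumerate(docs):
        for term, count in Counter(d).items():
            # The +1 keeps terms that occur in every document from dropping to zero.
            idf = math.log(len(docs) / df[term]) + 1.0
            X[row, index[term]] = (count / len(d)) * idf
    return X, vocab


def train_autoencoder(X, code_dim, steps=300, lr=0.5, seed=0):
    """Fit X ~ tanh(X @ W_enc) @ W_dec by gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    W_enc = 0.1 * rng.standard_normal((X.shape[1], code_dim))
    W_dec = 0.1 * rng.standard_normal((code_dim, X.shape[1]))
    losses = []
    for _ in range(steps):
        H = np.tanh(X @ W_enc)               # compressed code per document
        err = H @ W_dec - X                  # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        d_out = 2.0 * err / err.size         # gradient of the MSE w.r.t. the output
        d_hidden = (d_out @ W_dec.T) * (1.0 - H ** 2)
        W_dec -= lr * (H.T @ d_out)
        W_enc -= lr * (X.T @ d_hidden)
    return W_enc, losses


def represent(docs, code_dim=2, emb_dim=4, seed=0):
    """Concatenate the auto-encoder code with a TF-IDF-weighted embedding average."""
    X, vocab = tfidf_matrix(docs)
    W_enc, _ = train_autoencoder(X, code_dim, seed=seed)
    codes = np.tanh(X @ W_enc)
    # Assumption: random static vectors stand in for real contextualized embeddings.
    rng = np.random.default_rng(seed + 1)
    E = rng.standard_normal((len(vocab), emb_dim))
    weighted = (X @ E) / X.sum(axis=1, keepdims=True)
    return np.hstack([codes, weighted])


# Hypothetical toy corpus of tokenized reviews.
docs = [
    ["great", "hotel", "lovely", "staff"],
    ["bad", "hotel", "rude", "staff"],
    ["great", "food", "lovely", "view"],
]
vectors = represent(docs)
```

Each row of `vectors` is one document vector of length `code_dim + emb_dim`, which could then be passed to any standard classifier for age or gender prediction. Concatenation is only one possible design choice here; the source does not say which combination the proposed method uses.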