=========== HEADER ===========
Readme.txt for "Fifty Victorian Era Novelists Authorship Attribution Data"
Item Handle: https://dataworks.iupui.edu/handle/11243/23
DOI: http://dx.doi.org/10.7912/D2N65J
Documentation written on 2018-05-17 and 2018-06-01
By Heather Coates, DataWorks Repository Manager
Revised by Abdulmecit Gungor, investigator

=========== SUGGESTED DATA CITATION ===========
Please cite this data in the references of any publication that uses it.

Gungor, A. (2018). Fifty Victorian Era Novelists Authorship Attribution Data. IUPUI University Library. http://dx.doi.org/10.7912/D2N65J

=========== PRIMARY STUDY INFORMATION ===========
ACKNOWLEDGEMENTS
Project title: Fifty Victorian Era Novelists Authorship Attribution Data
Investigator Name: Abdulmecit Gungor
Investigator Institution: Indiana University Purdue University Indianapolis (IUPUI), School of Science, Department of Computer & Information Science
Investigator Email: mgungor@iu.edu
Investigator Role (related to this dataset): graduate student investigator

DATA SOURCE
Google BigQuery: https://cloud.google.com/bigquery/public-data/gdelt-books
The data were extracted through https://blog.gdeltproject.org/ using Google BigQuery.
This dataset is publicly available for anyone to use under the terms provided by the dataset source (http://gdeltproject.org/about.html): "The GDELT Project is an open platform for research and analysis of global society and thus all datasets released by the GDELT Project are available for unlimited and unrestricted use for any academic, commercial, or governmental use of any kind without fee."

DATA PREPARATION
Workflow: https://github.com/agungor2/Authorship_Attribution
Data were prepared for processing using the procedures described in Section 3.1 of the thesis "Benchmarking Authorship Attribution Over a Thousand Books by Victorian Era Authors".

DATE(S) OF DATA COLLECTION
Data were retrieved in November 2016.

FILE DESCRIPTION: DATA DICTIONARY
Name             Size         Bytes      Class
WW               50x3500      1400000    double
aid              93600x1      748800     double
bid              93600x1      748800     double
ind              93600x1      748800     double
shortened_vocab  1x10000      1254644    cell
test_ind         93600x1      93600      logical
tfidf            1113x50920   453391680  double
train_ind        93600x1      93600      logical
txt_pieces       93600x1000   748800000  double
vocab            1x50920      6387934    cell

VARIABLE NAMING CONVENTIONS
Name             Description
WW               Author word list
aid              Author ID
bid              Book ID
ind              Index numbers
shortened_vocab  Shortened vocabulary list (top 10,000 words)
test_ind         Testing indexes
tfidf            TF-IDF scores
train_ind        Training indexes
txt_pieces       All text pieces, encoded as word indices (one 1,000-word fragment per row)
vocab            Full vocabulary list
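The Size/Bytes/Class values above correspond to MATLAB-style arrays. As a minimal illustration, and not part of the original distribution, the sketch below shows how the variables could be loaded and split into training and testing sets in Python, assuming they are stored together in a MATLAB .mat file; the file name "victorian_authors.mat" is hypothetical.

    # Minimal loading sketch. Assumes the variables listed above are stored in a
    # MATLAB .mat file; the file name "victorian_authors.mat" is hypothetical.
    import scipy.io as sio

    data = sio.loadmat("victorian_authors.mat")

    txt_pieces = data["txt_pieces"]                     # 93600 x 1000 word-index matrix
    aid = data["aid"].ravel()                           # author id per 1,000-word fragment
    bid = data["bid"].ravel()                           # book id per fragment
    train_ind = data["train_ind"].ravel().astype(bool)  # training mask
    test_ind = data["test_ind"].ravel().astype(bool)    # testing mask

    X_train, y_train = txt_pieces[train_ind], aid[train_ind]
    X_test, y_test = txt_pieces[test_ind], aid[test_ind]
    print(X_train.shape, X_test.shape)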
=========== THESIS PROJECT ===========
ABSTRACT
Authorship attribution is the process of identifying the author of a given text; from the machine learning perspective, it can be seen as a classification problem. To create the largest publicly available authorship attribution dataset, we extracted the works of 50 well-known Victorian-era authors. All of the extracted works are novels. To create a non-exhaustive learning problem, the training data contains 45 authors, while the testing data contains all 50; the 5 authors missing from training account for 34% of the testing set. Each instance is a 1,000-word text piece, and there are 93,600 such pieces in total. To make the problem more challenging, different books were used for training and testing.
We performed 5 main feature extraction techniques on this data and compared the performance of these features across different classifiers and deep learning architectures. The use of Word2Vec in the authorship attribution problem is also introduced with two main approaches: author-based Word2Vec training and treating each author's text pieces individually. Support vector machine classifiers of the nu-SVC type were observed to give the best success rates on the stacked set of useful features.

RESEARCH AIMS
The main purpose of this work is to lay the foundations of feature extraction techniques for authorship attribution problems: lexical, character-level, syntactic, semantic, and application-specific features. To showcase each of these feature extraction techniques, we aimed to offer a new data resource for the authorship attribution research community and demonstrated the techniques with examples. These examples can be found at https://github.com/agungor2/Authorship_Attribution. The dataset we introduce consists of works by Victorian-era authors, together with demonstrations of the main feature extraction techniques.

METHODS
To decrease bias and create a reliable authorship attribution dataset, the following criteria were used to filter authors in the GDELT database: authors writing in English, authors with enough books available (at least 5), and 19th-century authors. With these criteria, 50 authors were selected and their books were queried through the BigQuery GDELT database. The next task was cleaning the dataset, since the original raw form suffers from OCR reading problems. To achieve that, all books were first scanned to obtain the overall number of unique words and each word's frequency. While scanning the texts, the first 500 words and the last 500 words of each book were removed to strip identifying features such as the author's name, the book's title, and other book-specific words that could make the classification task easier. After this step, the top 10,000 words occurring in the whole 50-author corpus were selected. Words not in the top 10,000 were removed while keeping the rest of the sentence structure intact. The remaining words were then represented with numbers from 1 to 10,000, reverse ordered according to their frequencies. Each book was split into text fragments of 1,000 words each, and the author and book identification numbers for each fragment were maintained in separate arrays. Fragments with fewer than 1,000 words were padded with zeros to keep them in the dataset. 1,000 words correspond to roughly 2 pages of writing, which is long enough to extract a variety of features from the document. The top 10,000 words were represented with numbers to keep the texts anonymous and to allow researchers to run feature extraction techniques faster; dealing with large amounts of raw text can be more challenging than working with numerical data for some feature extraction techniques. An illustrative sketch of this preprocessing pipeline is included at the end of this file.

=========== CREDITS ===========
Template provided by Indiana University UITS Research Storage, Indiana University Bloomington Libraries, and IUPUI University Library
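The following Python sketch illustrates the preprocessing pipeline described under METHODS. It is not the investigator's original code (the original workflow is at https://github.com/agungor2/Authorship_Attribution); the function and variable names are hypothetical, and the mapping used for "reverse ordered according to their frequencies" is one possible reading of that description.

    # Illustrative preprocessing sketch, following the METHODS description above.
    # Not the original workflow code; names are hypothetical.
    from collections import Counter

    def preprocess(books, vocab_size=10000, fragment_len=1000, trim=500):
        """books: iterable of (author_id, book_id, list_of_words) tuples."""
        # Drop the first and last 500 words of each book, then count word frequencies.
        trimmed = [(a, b, words[trim:-trim]) for a, b, words in books]
        counts = Counter(w for _, _, words in trimmed for w in words)

        # Keep the top 10,000 words. "Reverse ordered by frequency" is read here as:
        # the most frequent word receives the largest index (10,000), and the least
        # frequent kept word receives 1.
        top_words = [w for w, _ in counts.most_common(vocab_size)]
        word_to_id = {w: vocab_size - rank for rank, w in enumerate(top_words)}

        fragments, aid, bid = [], [], []
        for a, b, words in trimmed:
            # Remove out-of-vocabulary words, keeping the remaining word order intact.
            ids = [word_to_id[w] for w in words if w in word_to_id]
            # Split into 1,000-word fragments; pad the last fragment with zeros.
            for start in range(0, len(ids), fragment_len):
                piece = ids[start:start + fragment_len]
                piece = piece + [0] * (fragment_len - len(piece))
                fragments.append(piece)
                aid.append(a)
                bid.append(b)
        return fragments, aid, bid, word_to_id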