=========== HEADER ===========

Readme.txt for Data from Classifying the Unknown: Insect Identification by Deep Zero-shot Bayesian Learning
Item Handle: http://hdl.handle.net/11243/41
Documentation written on October 18th, 2021
By Sarkhan Badirli
Updated 2022-10-25, by Heather Coates, Raneem Hijazi

=========== SUGGESTED DATA CITATION ===========

Please cite this data in the references for any publication which uses it.

S Badirli, CJ Picard, G Mohler, Z Akata, M Dundar (2021). Data from Classifying the Unknown: Insect Identification by Deep Zero-shot Bayesian Learning. https://doi.org/10.7912/D2/27

=========== PRIMARY STUDY INFORMATION ===========

ACKNOWLEDGEMENTS
Project title: Classifying the Unknown: Insect Identification by Deep Zero-shot Bayesian Learning
Funding agency: M.D. and S.B. were sponsored by the National Science Foundation (NSF) grant IIS-1252648 (CAREER). G.M. was sponsored by NSF grant ATD-2124313.

Investigator Name: Sarkhan Badirli
Investigator Institution: Indiana University - Purdue University Indianapolis
Investigator Address: Computer and Information Science Department, IUPUI, IN 46202, USA
Investigator Email: s.badirli@iu.edu
Investigator Role (related to this dataset): data collection, data processing/cleaning, data analysis, data visualization
Investigator ID (ORCID): 0000-0001-8440-6830

Investigator Name: Christine J. Picard
Investigator Institution: Department of Biology, Indiana University - Purdue University Indianapolis
Investigator Email: cpicard@iupui.edu
Investigator Address: Department of Biology, IUPUI, IN 46202, USA
Investigator Role (related to this dataset): data analysis, supervised the data cleaning process, paper writing
Investigator ID (ORCID): N/A

Investigator Name: George Mohler
Investigator Institution: Indiana University - Purdue University Indianapolis
Investigator Email: gmohler@iupui.edu
Investigator Address: Computer and Information Science Department, IUPUI, IN 46202, USA
Investigator Role (related to this dataset): data processing, supervised ML perspective
Investigator ID (ORCID): N/A

Investigator Name: Zeynep Akata
Investigator Institution: University of Tübingen and Max Planck Institute for Intelligent Systems
Investigator Email: zeynep.akata@uni-tuebingen.de
Investigator Address: University of Tübingen, BW, Germany
Investigator Role (related to this dataset): theory development, paper editing
Investigator ID (ORCID): 0000-0002-1432-7747

Investigator Name: Murat Dundar
Investigator Institution: Indiana University - Purdue University Indianapolis
Investigator Email: mdundar@iupui.edu
Investigator Address: Computer and Information Science Department, IUPUI, IN 46202, USA
Investigator Role (related to this dataset): senior author, data processing, data analysis, supervised ML perspective, paper writing
Investigator ID (ORCID): 0000-0001-5752-468X

DATE(S) of DATA COLLECTION
The data were retrieved approximately between 2021-01-01 and 2021-01-09.

GEOGRAPHIC LOCATION(S) of DATA COLLECTION
The data were collected in the Computer and Information Science Department, IUPUI, IN 46206, USA.

DIRECTORY/FILE NAMING CONVENTIONS
The data are stored under the `data` directory. This directory contains 2 folders: `INSECT-Images` and `INSECT`.

FILE INFORMATION
The raw insect images can be accessed from the `data\INSECTS-Images` folder. The directory contains folders with species names.
Each folder contains images from that species.

Note that there are 2 data files under `data\INSECTS`: `data.mat` and `splits.mat`. `.mat` is the MATLAB format for storing data, but it can be read from many programming languages, such as Python. The variables in these 2 files and their explanations are listed below:

`data.mat`
* `embeddings_dna`: Vector embeddings for DNA barcode data
* `embeddings_img`: Vector embeddings for image data
* `labels`: Numeric labels for species
* `species`: Species names
* `G`: Genus labels of species
* `nucleotides`: DNA barcode of the species
* `bold_ids`: IDs of the samples from the BOLD system. You may use these IDs to see the full details of the specimen in the BOLD system.
* `ids`: Image names of the samples in our dataset.

`splits.mat`
* `train_loc`: Indices of training data points for tuning
* `trainval_loc`: Indices of training data points for final inference
* `test_seen_loc`: Indices of test data from seen classes
* `test_unseen_loc`: Indices of test data from unseen classes
* `val_seen_loc`: Indices of validation data from seen classes
* `val_unseen_loc`: Indices of validation data from unseen classes

ACCESS & SHARING
1. Licenses/restrictions placed on the data (who is allowed to access and use these data?):
   CC BY-NC 4.0, https://creativecommons.org/licenses/by-nc/4.0/
2. Links to thesis, dissertation, reports, or publications that cite or use the data (include DOI):
   Sarkhan Badirli, Christine J. Picard, George Mohler, Zeynep Akata, Murat Dundar. (2021). Classifying the Unknown: Identification of Insects by Deep Zero-shot Bayesian Learning. (preprint). https://doi.org/10.21203/rs.3.rs-1099185/v1
   Sarkhan Badirli, Christine J. Picard, George Mohler, Zeynep Akata, Murat Dundar. (2021). Classifying the Unknown: Identification of Insects by Deep Zero-shot Bayesian Learning. (preprint). https://doi.org/10.1101/2021.09.15.460492
3. Links to publicly accessible locations of the data (if applicable): N/A
4. Links/relationships to other data files/sets: N/A
5. Was data derived from another source?
   A. List source(s): The dataset is collected from the Public Portal of Barcode of Life Database (BOLD): http://www.boldsystems.org/

The data were retrieved using MATLAB. The code is provided below:

    % insect_species: cell array of species names to query (must be defined beforehand)
    for i = 1:numel(insect_species)
        % Local output file for this species
        fname = ['C:\Users\sbadirli\OneDrive - Indiana University\Desktop\Insect ZSL Project\INSECT DNA\', insect_species{i}, '.txt'];
        % BOLD public API query for this taxon, returned in TSV format
        url = ['http://v3.boldsystems.org/index.php/API_Public/combined?taxon=', insect_species{i}, '&format=tsv'];
        outfilename = websave(fname, url);
    end

=========== OPTIONAL ===========

If this information is described in your thesis, you do not have to complete this section.

Abstract
Insects represent a large majority of biodiversity on Earth, yet so few species are described. Describing new species typically requires specific taxonomic expertise to identify morphological characters that distinguish it from other known species, and DNA-based methods have aided in providing additional evidence of separate species. Machine learning (ML) provides a powerful method for identifying new species, given that its analytical processing is more sensitive to subtle physical differences in images that humans may not process. We develop a Bayesian deep learning method for zero-shot classification of species. The proposed approach forms a Bayesian hierarchy of species around corresponding genera and uses deep embeddings of images and DNA barcodes to identify insects to the lowest taxonomic level possible. To demonstrate this proof of concept, we use a database of 32,848 insect images from 1,040 described species split into training and test data, wherein the test data includes 243 species not present in the training data. Our results demonstrate that, using DNA sequences and images together, known insects can be classified with 96.66% accuracy, while unknown (to the database) insects are assigned to the correct genus with 81.39% accuracy.
The proposed deep zero-shot Bayesian model demonstrates a powerful new approach that can be used for the gargantuan task of identifying new insect species.

For more information, please refer to the paper. The code can be accessed at https://github.com/sbadirli/Zero-shot-Insect-Discovery

=========== CREDITS ===========

Template provided by Indiana University UITS Research Storage, Indiana University Bloomington Libr
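
The `data.mat` and `splits.mat` files described under FILE INFORMATION can be read from Python. The following is a minimal sketch, not the authors' loading code. It assumes that SciPy (a third-party package) is installed, that the files are saved in a MATLAB format that `scipy.io.loadmat` can read (it does not read v7.3 files), that the split indices are 1-based as is usual for MATLAB, and that the embeddings are stored with one row per sample. The function names `to_zero_based`, `load_insect_data`, and `select_split` are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path


def _flatten(values):
    """Yield the scalar items of an arbitrarily nested list or tuple."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from _flatten(value)
        else:
            yield value


def to_zero_based(matlab_indices):
    """Flatten index data and shift it from 1-based to 0-based.

    Assumption: the indices in `splits.mat` follow MATLAB's 1-based convention.
    """
    # NumPy arrays (as returned by loadmat) expose ravel(); plain lists do not.
    if hasattr(matlab_indices, "ravel"):
        matlab_indices = matlab_indices.ravel().tolist()
    return [int(i) - 1 for i in _flatten(matlab_indices)]


def load_insect_data(data_dir):
    """Load `data.mat` and `splits.mat` from the folder that contains them."""
    # Imported lazily so the pure-Python helpers above work without SciPy.
    from scipy.io import loadmat

    data_dir = Path(data_dir)
    data = loadmat(str(data_dir / "data.mat"))
    splits = loadmat(str(data_dir / "splits.mat"))
    return data, splits


def select_split(data, splits, split_name="trainval_loc", key="embeddings_img"):
    """Return the embeddings and numeric labels for one split.

    Assumption: `data[key]` holds one row per sample.
    """
    rows = to_zero_based(splits[split_name])
    return data[key][rows], data["labels"].ravel()[rows]
```

A possible usage pattern is to call `load_insect_data` with the path of the folder that holds the two `.mat` files, then call `select_split(data, splits, "test_unseen_loc", "embeddings_dna")` to obtain the DNA embeddings and labels of the unseen-class test samples.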