Babatunde K Olorisade
A critical analysis of studies that address the use of text mining for citation screening in systematic reviews
Olorisade, Babatunde K; de Quincey, Ed; Brereton, Pearl; Andras, Peter
Authors
Ed de Quincey
Pearl Brereton
Prof Peter Andras (P.Andras@napier.ac.uk), Dean of the School of Computing, Engineering and the Built Environment
Abstract
Background: Since the introduction of the systematic review process to Software Engineering in 2004, researchers have investigated a number of ways to reduce the effort and time required to filter through large volumes of literature.
Aim: This study aims to provide a critical analysis of text mining techniques used to support the citation screening stage of the systematic review process.
Method: We critically re-reviewed papers included in a previous systematic review that addressed the use of text mining methods to support the screening of papers for inclusion in a review. The previous review did not provide a detailed analysis of the text mining methods used. We focus on the availability in the papers of information about the text mining methods employed, including the description and explanation of the methods, parameter settings, assessment of the appropriateness of their application given the size and dimensionality of the data used, performance on training, testing and validation data sets, and further information that may support the reproducibility of the included studies.
Results: Support Vector Machines (SVM), Naïve Bayes (NB) and committees of classifiers (ensembles) are the most used classification algorithms. In all of the studies, features were represented with Bag-of-Words (BOW), using either binary features (28%) or term frequency (66%). Five studies experimented with n-grams for n between 2 and 4, but unigrams were the norm. χ², information gain and tf-idf were the most commonly used feature selection techniques. Feature extraction was rare, although LDA and topic modelling appeared in a few studies. Recall, precision, the F-measure and AUC were the most common metrics, and cross-validation was widely applied. More than half of the studies used corpora of fewer than 1,000 documents for their experiments, and around 80% used 3,000 or fewer. The main basis we found for comparing performance across independently replicated studies was the use of the same dataset, but a sound performance comparison could not be established because the studies had little else in common. In most of the studies, insufficient information was reported to enable independent replication. The studies analysed generally did not discuss the statistical appropriateness of the text mining methods they applied. In the case of applications of SVM, none of the studies report the number of support vectors found to indicate the complexity of the prediction engine, making it impossible to judge the extent to which over-fitting might account for the good performance results.
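To make the techniques named above concrete, here is a minimal pure-Python sketch (not taken from any of the reviewed studies) of BOW features with tf-idf weighting, together with the precision, recall and F-measure metrics used to evaluate citation screening, where "relevant to the review" is treated as the positive class:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Bag-of-words with tf-idf weighting: a term's weight grows with its
    frequency in a document and shrinks with its document frequency."""
    n = len(corpus)
    docs = [Counter(doc.lower().split()) for doc in corpus]
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(d.keys())
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]

def precision_recall_f1(actual, predicted):
    """Screening metrics over binary include/exclude decisions."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)             # "sensitivity" in screening terms
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, a term appearing in every abstract of the corpus (such as a shared keyword) receives a tf-idf weight of zero, while corpus-rare terms are up-weighted; in screening, recall is usually the critical metric, since a missed relevant paper is costlier than an extra one to read.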
Conclusions: Concrete evidence of the effectiveness of text mining algorithms for automating citation screening in systematic reviews is still lacking. The studies indicate that options are still being explored, but there is a need for better reporting, more explicit process details and access to datasets to facilitate study replication and strengthen the evidence. In general, the reader often gets the impression that text mining algorithms were applied as magic tools in the reviewed papers, relying on default settings or default optimisation of available machine learning toolboxes without an in-depth understanding of the statistical validity and appropriateness of such tools for text mining purposes.
| Presentation Conference Type | Conference Paper (Published) |
| --- | --- |
| Conference Name | EASE '16: 20th International Conference on Evaluation and Assessment in Software Engineering |
| Start Date | Jun 1, 2016 |
| End Date | Jun 3, 2016 |
| Online Publication Date | Jun 1, 2016 |
| Publication Date | 2016 |
| Deposit Date | Nov 10, 2021 |
| Publisher | Association for Computing Machinery (ACM) |
| Pages | 1-11 |
| Book Title | EASE '16: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering |
| ISBN | 978-1-4503-3691-8 |
| DOI | https://doi.org/10.1145/2915970.2915982 |
| Public URL | http://researchrepository.napier.ac.uk/Output/2809199 |