Automatic detection of cognitive distortions from short written text could support large-scale mental-health screening and digital cognitive-behavioural therapy (CBT). Many recent approaches rely on heavy deep-learning models and large datasets that are difficult to deploy in real time or in resource-constrained settings. This work presents a compact, fully transparent pipeline for binary and multi-class classification of cognitive distortions using a small synthetic corpus of brief statements. We simulate 300 short texts covering 10 canonical distortion types (e.g., all-or-nothing thinking, overgeneralization, labelling) plus neutral statements. Texts are vectorized with a 100-dimensional TF-IDF representation over unigrams and bigrams, and three lightweight classifiers are compared: logistic regression, random forest, and linear SVM. On a stratified 60/20/20 train/validation/test split, logistic regression and linear SVM both achieve perfect test performance on the binary task (Accuracy = Precision = Recall = F1 = 1.00; AUC = 1.00), while random forest reaches 0.98 accuracy and 0.98 F1. A separate multinomial logistic-regression model trained only on distorted texts identifies the specific distortion type with 0.96 accuracy across 10 classes. Five-fold cross-validation indicates that the pipeline is stable on this synthetic dataset (mean accuracy 1.00, SD 0.00). Although these unrealistically high scores are driven by the small, highly patterned synthetic corpus, the results suggest that compact TF-IDF models can deliver ultra-fast, interpretable cognitive-distortion classification and provide a practical blueprint for future work on larger, clinically realistic datasets.
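The feature-extraction and splitting steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code: the toy statements, the function names (`ngrams`, `fit_tfidf`, `transform`, `stratified_split`), the whitespace tokenizer, the smoothed IDF formula, and the rule of keeping the n-grams with the highest document frequency are all assumptions made for illustration. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer(ngram_range=(1, 2), max_features=100)` would likely play this role.

```python
import math
import random
from collections import Counter, defaultdict

def ngrams(text):
    """Lower-cased whitespace tokens plus adjacent-token bigrams."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def fit_tfidf(texts, max_features=100):
    """Choose a capped vocabulary and compute smoothed IDF weights."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(ngrams(text)))
    # Assumption: keep the n-grams with the highest document frequency,
    # breaking ties alphabetically so the vocabulary is deterministic.
    vocab = sorted(doc_freq, key=lambda g: (-doc_freq[g], g))[:max_features]
    n_docs = len(texts)
    idf = {g: math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0 for g in vocab}
    return vocab, idf

def transform(text, vocab, idf):
    """Map one text to an L2-normalised TF-IDF vector over the vocabulary."""
    counts = Counter(ngrams(text))
    vec = [counts[g] * idf[g] for g in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return train/validation/test index lists, split within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    train, val, test = [], [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        train += idx[:n_train]
        val += idx[n_train:n_train + n_val]
        test += idx[n_train + n_val:]
    return train, val, test

# Hypothetical toy corpus (not the paper's 300 synthetic texts).
texts = [
    "i always fail at everything",
    "i never do anything right",
    "i am a total loser",
    "nobody will ever like me",
    "everything is ruined forever",
    "i went to the store today",
    "the meeting starts at noon",
    "we had pasta for dinner",
    "the train was on time",
    "i read a book last night",
]
labels = ["distorted"] * 5 + ["neutral"] * 5

train, val, test = stratified_split(labels)
# Fit the vocabulary on the training portion only, to avoid leakage.
vocab, idf = fit_tfidf([texts[i] for i in train], max_features=100)
vec = transform(texts[train[0]], vocab, idf)
print(len(train), len(val), len(test), len(vocab))
```

The resulting vectors could then be passed to any of the lightweight classifiers named in the abstract. Fitting the vocabulary on the training indices only is a design choice of this sketch; the abstract does not say how the original pipeline handled this.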
Cite this paper
Filippis, R. D. and Foysal, A. A. (2026). Ultra-Fast Cognitive Distortion Classification from Short Text-A Lightweight TF-IDF and Logistic Regression Pipeline on Synthetic Data. Open Access Library Journal, 13, e14924. doi: http://dx.doi.org/10.4236/oalib.1114924.