New algorithm for determining the number of features for the effective sentiment-classification of text documents

Idczak, Adam; Korzeniewski, Jerzy

doi:10.59139/ws.2023.05.3

Article details

Journal

Wiadomości Statystyczne. The Polish Statistician

2023 | Volume 68 | Issue 5 | Pages 40-57

Title variants

PL

Nowy algorytm ustalania liczby zmiennych potrzebnych do klasyfikacji dokumentów tekstowych ze względu na ich wydźwięk emocjonalny

Abstracts

PL

Sentiment analysis, that is, analysis of the emotional tone of text documents, is a very important part of contemporary text mining. The aim of the article is to present a new text sentiment analysis technique that can be applied with any method of classifying documents by sentiment. The proposed technique relies on classifier-independent feature selection, which reduces the size of the feature space. Its advantages are intuitiveness and computational simplicity. The core element of the technique is a novel algorithm for determining the number of terms sufficient for effective classification, based on an analysis of the correlation between individual document features and document sentiment. A statistical approach was used to verify the usefulness of the proposed technique. Two methods were used: the naive Bayes classifier and logistic regression. They were applied to three document sets consisting of 1,169 opinions of customers of a bank operating in Poland, collected in 2020. The documents were written in Polish. The study showed that reducing the number of terms more than tenfold with the proposed technique generally improves classification quality.

EN

Sentiment analysis of text documents is an important part of contemporary text mining. The purpose of this article is to present a new text sentiment analysis technique that can be used with any document sentiment classification method. The proposed technique selects features independently of the classifier, which reduces the size of the feature space. Its advantages include intuitiveness and computational simplicity. The most important element of the technique is a novel algorithm for determining the number of features sufficient for effective classification. The algorithm is based on an analysis of the correlation between single features and document labels. A statistical approach, featuring a naive Bayes classifier and logistic regression, was employed to verify the usefulness of the proposed technique. Both classifiers were applied to three document sets composed of 1,169 opinions of clients of a Poland-based bank, obtained in 2020. The documents were written in Polish. The research demonstrated that reducing the number of terms more than 10-fold by means of the proposed algorithm in most cases improves the effectiveness of classification.
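
The general idea of classifier-independent, correlation-based feature selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify how the number of features is determined, so the cut-off `k` below is a hypothetical, hand-picked parameter. The function names, the binary term-presence encoding, and the toy English reviews are likewise assumptions made only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant (correlation undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_terms(docs, labels):
    """Score every term by |correlation| between its presence and the label."""
    tokenised = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenised))
    scored = []
    for term in vocab:
        presence = [1 if term in tokens else 0 for tokens in tokenised]
        scored.append((term, abs(pearson(presence, labels))))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def select_top_k(docs, labels, k):
    """Keep the k best-scoring terms (k is an assumed, hand-picked cut-off)."""
    return [term for term, _ in rank_terms(docs, labels)[:k]]


# Toy data: 1 = positive opinion, 0 = negative opinion.
docs = [
    "great service friendly staff",
    "terrible service long queue",
    "great app friendly support",
    "terrible app long wait",
]
labels = [1, 0, 1, 0]

selected = select_top_k(docs, labels, k=4)
print(selected)
```

On this toy data the four terms that perfectly separate the classes ("friendly", "great", "long", "terrible") are kept, while terms such as "service" that appear equally often in both classes score zero. Any classifier, for example naive Bayes or logistic regression, could then be trained on the reduced vocabulary.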

Keywords

EN

sentiment analysis; document sentiment classification; text mining; logistic regression; naive Bayes classifier; feature selection; correlation

PL

analiza sentymentu; klasyfikacja dokumentów ze względu na wydźwięk emocjonalny; eksploracja tekstu; regresja logistyczna; naiwny klasyfikator Bayesa; dobór cech; korelacja

Publisher

Główny Urząd Statystyczny

Contributors

author

Adam Idczak

Uniwersytet Łódzki, Wydział Ekonomiczno-Socjologiczny / University of Lodz, Faculty of Economics and Sociology

https://orcid.org/0000-0001-9676-2410

author

Jerzy Korzeniewski

Uniwersytet Łódzki, Wydział Ekonomiczno-Socjologiczny / University of Lodz, Faculty of Economics and Sociology

https://orcid.org/0000-0001-6526-5921

Identifiers

DOI

10.59139/ws.2023.05.3

Biblioteka Nauki

18105028

YADDA identifier

bwmeta1.element.ojs-doi-10_59139_ws_2023_05_3
