Search results

1

Some Remarks on the Data Imputation Using “missForest” Method

100%

Misztal M., University of Lodz D. o. S. M.

Acta Universitatis Lodziensis. Folia Oeconomica | 2013 | vol. 285

EN

Missing data are quite common in practical applications of statistical methods and imputation is a general statistical method for the analysis of incomplete data sets. Stekhoven and Bühlmann (2012) proposed an iterative imputation method (called “missForest”) based on Random Forests (Breiman 2001) to cope with missing values. In the paper a short description of “missForest” is presented and some selected missing data techniques are compared with “missForest” by artificially simulating different proportions and mechanisms of missing data using complete data sets from the UCI repository of machine learning databases.

PL

W pracy Stekhovena i Bühlmanna (2012) zaproponowano nową iteracyjną metodę imputacji (nazwaną „missForest”) opartą na metodzie Random Forests Breimana (2001). W niniejszym artykule omówiono metodę „missForest” i porównano kilka wybranych technik postępowania w sytuacji występowania braków danych z metodą „missForest”. W tym celu wykorzystano podejście symulacyjne generując różne proporcje i mechanizmy powstawania braków danych w zbiorach danych pochodzących głównie z repozytorium baz danych na Uniwersytecie Kalifornijskim w Irvine.

2

APPLICATION OF MIXED MODELS AND FAMILIES OF CLASSIFIERS TO ESTIMATION OF FINANCIAL RISK PARAMETERS

100%

Grzybowska U., Karwański M.

Metody Ilościowe w Badaniach Ekonomicznych | 2015 | vol. 16 | issue 1 | pp. 108-115

EN

Estimating Loss Given Default (LGD) plays an essential role in credit risk modeling. LGD is treated as a random variable with a bimodal distribution. Advanced statistical models such as beta regression can be applied to estimate LGD. Unfortunately, the parametric methods require amendments of the “inflation” type, which lead to a mixed modeling approach. In contrast to classical statistical methods based on probability distributions, families of classifiers such as gradient boosting or random forests operate on information and allow for more flexible model adjustment. The difficulty lies in comparing the results obtained. The aim of the paper is to present and compare the results of LGD modeling using statistical methods and a data mining approach. Calculations were done on real-life data sourced from one of the large Polish banks.

3

FAMILIES OF CLASSIFIERS – APPLICATION IN DATA

75%

Grzybowska U., Karwański M.

Metody Ilościowe w Badaniach Ekonomicznych | 2014 | vol. 15 | issue 2 | pp. 94-101

EN

The economic description of firms and companies is based on a number of indicators. The indicators are related to each other and can be considered only in a specific context. Regression models allow for such an approach. Unfortunately, the problems we deal with are usually nonlinear, and choosing the relevant information is very difficult. The aim of the paper is to present a method of variable selection based on random forests and gradient boosting, and its application to ranking companies with the DEA method. The results are compared with the ordering obtained using an expert-supported approach to variable selection in DEA.

4

Postavení ambipozic v češtině

71%

Sláma J., Štěpánková B.

Slovo a slovesnost: časopis pro otázky teorie a kultury jazyka (Slovo a slovesnost: A journal for the theory of language and language cultivation) | 2023 | vol. 84 | issue 2 | pp. 91-121

EN

The study proposes that some words in Czech, including navzdory ‘despite’ and počínaje ‘starting from,’ should be treated as ambipositions, i.e., adpositions that may either precede or follow their complement. This avoids the awkwardness of the traditional view on which the former, for instance, is a preposition when preceding a complement and a homonymous adverb when following one. Based on 3,234 corpus instances of navzdory ‘despite,’ nevyjímaje ‘including,’ nemluvě o ‘not to mention,’ počínaje ‘starting with,’ konče ‘ending with,’ and počínaje – konče ‘from – to,’ the study examines the factors determining whether ambipositions in Czech precede or follow their complements. Special attention is paid to the length and the syntactic complexity of the complement, but also to the text type and the position of the adpositional phrase in the clause. The study uses the random forests algorithm to gauge the relative importance of the variables for each of the ambipositions examined. The length of the complement is systematically the best predictor of the position of ambipositions: the longer the complement, the more likely the ambiposition is to precede it. This is argued to follow primarily from the limits of the human working memory.

5

Evaluation of resampling methods in the class unbalance problem

63%

Kubus M.

Econometrics. Ekonometria. Advances in Applied Data Analytics | 2020 | vol. 24 | issue 1 | pp. 39-50

EN

The purpose of many real-world applications is the prediction of rare events, and the training sets are then highly unbalanced. In this case, classifiers are biased towards the correct prediction of the majority class and misclassify the minority class, whereas rare events are of greater interest. To handle this problem, numerous techniques have been proposed that balance the data or modify the learning algorithms. The goal of this paper is to compare simple random balancing methods with more sophisticated resampling methods that have appeared in the literature and are available in R. Additionally, the authors ask whether learning on the original dataset and using a shifted classification threshold might be more competitive. The authors provide a survey from the perspective of regularized logistic regression and random forests. The results show that combining random under-sampling with random forests has an advantage over other techniques, while logistic regression can be competitive in the case of highly unbalanced data.

PL

Celem wielu praktycznych zastosowań modeli dyskryminacyjnych jest przewidywanie zdarzeń rzadkich. Zbiory uczące są wówczas niezbilansowane. W tym przypadku klasyfikatory mają tendencję do poprawnego klasyfikowania obiektów klasy większościowej i jednocześnie błędnie klasyfikują wiele obiektów klasy mniejszościowej, która jest przedmiotem szczególnego zainteresowania. W celu rozwiązania tego problemu zaproponowano wiele technik, które bilansują dane lub modyfikują algorytmy uczące. Celem artykułu jest porównanie prostych, losowych metod bilansowania z bardziej wyrafinowanymi, które pojawiły się w literaturze. Dodatkowo postawiono pytanie, czy konkurencyjnym podejściem nie jest budowa modelu na oryginalnym zbiorze danych i przesunięcie progu klasyfikacji. Badanie przedstawiono z perspektywy regularyzowanej regresji logistycznej i lasów losowych. Wyniki pokazują, że kombinacja metody under-sampling z lasami losowymi wykazuje przewagę nad innymi technikami, podczas gdy regresja logistyczna może być konkurencyjna w przypadku silnego niezbilansowania.

6

ZASTOSOWANIE ANALIZY SKUPIEŃ I LASÓW LOSOWYCH W KLASYFIKACJI GMIN W POLSCE NA SKALI POZIOMU ROZWOJU SPOŁECZNO-GOSPODARCZEGO

63%

Perdał R.

Metody Ilościowe w Badaniach Ekonomicznych | 2018 | vol. 19 | issue 3 | pp. 263-273

PL

W artykule przedstawiono algorytm klasyfikacji gmin na skali poziomu rozwoju społeczno-gospodarczego. Algorytm ten obejmuje cztery etapy: (1) dobór i redukcja zmiennych, (2) konstrukcja miernika syntetycznego i uszeregowanie liniowe gmin na skali poziomu rozwoju społeczno-gospodarczego, (3) grupowanie gmin metodą analizy skupień wg algorytmu k-średnich na podstawie wartości miernika syntetycznego, (4) weryfikacja klasyfikacji metodą lasów losowych. W wyniku procedury klasyfikacyjnej zidentyfikowano dywergencję rozwoju społeczno-gospodarczego w Polsce.

EN

The article presents an algorithm for classifying communes on a scale of socio-economic development level. The algorithm includes four steps: (1) selection and reduction of variables, (2) construction of a synthetic measure and linear ordering of communes on the scale of socio-economic development level, (3) grouping of communes by cluster analysis (k-means algorithm) based on the synthetic measure, (4) verification of the classification using the random forests method. The classification procedure identified a progressive divergence of socio-economic development in Poland.

7

Diachronní korpusová analýza: slovosled českých posesivních adjektiv uvnitř nominální fráze

63%

Křivan J., Láznička M.

Studie z aplikované lingvistiky - Studies in Applied Linguistics | 2018 | vol. 9 | issue Special Issue 2018 | pp. 42-65

EN

This paper is concerned with the diachronic development of the placement of Czech possessive adjectives relative to the head noun in Old and Middle Czech. At the same time, the aim of this study is also to introduce a possible way of approaching complex language data. We base our analysis on cross-linguistic synchronic generalizations regarding possessor placement which connect monolexemic possessors (which are high on the nominal animacy hierarchy) to the prenominal position. A sample of 1417 possessive adjectives obtained from available sources of Old and Middle Czech texts was annotated for an array of semantic and syntactic variables. The relationship between these variables and the possessor placement was analysed using classification trees and random forests. The results do not support the synchronic generalizations. We interpret this finding by positing two frequent, lexically partially filled constructions, N Kristův ‘N of Christ’ and syn N-ův ‘son of N’. We conclude that the patterns observed in the data can be explained by the interaction of extralinguistic socio-cultural factors and the effects of frequency and similarity in these two constructions.

8

The Problem of Redundant Variables in Random Forests

51%

Kubus M.

Acta Universitatis Lodziensis. Folia Oeconomica | 2018 | vol. 6 | issue 339 | pp. 7-16

PL

Lasy losowe są obecnie jedną z najchętniej stosowanych przez praktyków metod klasyfikacji wzorcowej. Na jej popularność wpływ ma możliwość jej stosowania bez czasochłonnego, wstępnego przygotowywania danych do analizy. Las losowy można stosować dla różnego typu zmiennych, niezależnie od ich rozkładów. Metoda ta jest odporna na obserwacje nietypowe oraz ma wbudowany mechanizm doboru zmiennych. Można jednak zauważyć spadek dokładności klasyfikacji w przypadku występowania zmiennych redundantnych. W artykule omawiane są dwa podejścia do problemu zmiennych redundantnych. Rozważane są dwa sposoby przeszukiwania w podejściu polegającym na doborze zmiennych oraz dwa sposoby konstruowania zmiennych syntetycznych w podejściu wykorzystującym grupowanie zmiennych. W eksperymencie generowane są liniowo zależne predyktory i włączane do zbiorów danych rzeczywistych. Metody redukcji wymiarowości zwykle poprawiają dokładność lasów losowych, ale żadna z nich nie wykazuje wyraźnej przewagi.

EN

Random forests are currently one of the supervised learning methods most preferred by practitioners. Their popularity stems from the possibility of applying the method without a time-consuming pre-processing step. Random forests can be used for mixed types of features, irrespective of their distributions. The method is robust to outliers, and feature selection is built into the learning algorithm. However, a decrease in classification accuracy can be observed in the presence of redundant variables. In this paper, we discuss two approaches to the problem of redundant variables. We consider two strategies of searching for the best feature subset as well as two formulas for aggregating the features in the clusters. In the empirical experiment, we generate collinear predictors and include them in real datasets. Dimensionality reduction methods usually improve the accuracy of random forests, but none of them clearly outperforms the others.
