Search results

1

Regression analysis for interval-valued symbolic data versus noisy variables and outliers

100%

Pełka M., Dudek A.

Econometrics. Ekonometria. Advances in Applied Data Analytics

|

2016

|

issue 2 (52)

35-42

EN

Regression analysis is perhaps the best known and most widely used method for the analysis of dependence; that is, for examining the relationship between a set of independent variables (X’s) and a single dependent variable (Y). In general, the regression model is a linear combination of independent variables that corresponds as closely as possible to the dependent variable [Lattin, Carroll, Green 2003, p. 38]. The aim of the article is to present two adaptations of regression analysis to symbolic interval-valued data (the centre method and the centre and range method) and to compare their usefulness when dealing with noisy variables and/or outliers. The empirical part of the paper presents the results of simulation studies based on artificial and real data, both without and with noisy variables and outliers. The results are compared according to the values of two coefficients of determination, $R_L^2$ and $R_U^2$. The results show that the centre and range method usually obtains better results, even when the data set contains noisy variables and outliers, but in some cases the centre method obtains better results than the centre and range method.
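The two adaptations named in this abstract can be sketched as follows. This is a minimal illustration with a single predictor and not the authors' implementation. It assumes the usual formulation in which the centre method fits one model on interval midpoints and applies it to the bounds, while the centre and range method fits a second model on half-ranges. The toy data, the helper names (`ols`, `r2`, `centre_method`, `centre_range_method`) and the definition of the two coefficients of determination as 1 - SSE/SST on the lower and upper bounds are all assumptions.

```python
def ols(x, y):
    """Intercept and slope of a one-predictor least-squares fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r2(observed, predicted):
    """Coefficient of determination, computed as 1 - SSE/SST (assumed definition)."""
    mean = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean) ** 2 for o in observed)
    return 1 - sse / sst

def centres(intervals):
    return [(lo + hi) / 2 for lo, hi in intervals]

def half_ranges(intervals):
    return [(hi - lo) / 2 for lo, hi in intervals]

def centre_method(X, Y):
    """Fit on midpoints, then apply the fitted line to the bounds of X."""
    a, b = ols(centres(X), centres(Y))
    ends = [(a + b * lo, a + b * hi) for lo, hi in X]
    # Order the ends so the result is a valid interval even if b < 0.
    return [min(e) for e in ends], [max(e) for e in ends]

def centre_range_method(X, Y):
    """Fit one model on midpoints and a second one on half-ranges."""
    a, b = ols(centres(X), centres(Y))
    c, d = ols(half_ranges(X), half_ranges(Y))
    mid = [a + b * v for v in centres(X)]
    rad = [c + d * v for v in half_ranges(X)]
    return [m - r for m, r in zip(mid, rad)], [m + r for m, r in zip(mid, rad)]

# Hypothetical toy data: the centre of Y depends on the centre of X,
# and the half-range of Y depends on the half-range of X.
X = [(1, 3), (2, 5), (4, 6), (5, 9), (7, 10)]
Y = [(2 * c + 1 - (0.5 * r + 1), 2 * c + 1 + (0.5 * r + 1))
     for c, r in zip(centres(X), half_ranges(X))]
y_lower = [lo for lo, _ in Y]
y_upper = [hi for _, hi in Y]

low_c, up_c = centre_method(X, Y)
low_cr, up_cr = centre_range_method(X, Y)
r2_lower_c, r2_upper_c = r2(y_lower, low_c), r2(y_upper, up_c)
r2_lower_cr, r2_upper_cr = r2(y_lower, low_cr), r2(y_upper, up_cr)
print(r2_lower_c, r2_upper_c, r2_lower_cr, r2_upper_cr)
```

On this constructed example the centre and range method reproduces the bounds exactly because the half-ranges follow their own linear relationship; the centre method cannot capture that relationship and fits the bounds less well.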

2

A method for detecting outliers in fuzzy regression

100%

GŁADYSZ B.

Operations Research and Decisions

|

2010

|

vol. 20

|

issue 2

25-39

EN

In this article we propose a method for identifying outliers in fuzzy regression. Outliers in a sample may have an important influence on the form of the regression equation. For this reason there is great scientific interest in this issue. The method presented is analogous to the method of finding outliers based on the studentized distribution of residuals. In order to identify outliers, regression models are constructed with an additional explanatory variable for each observation. Next, the significance of a fuzzy regression coefficient is analysed considering this additional explanatory variable. Illustrative examples are presented.

3

Developing calibration estimators for population mean using robust measures of dispersion under stratified random sampling

100%

Audu A., Singh R., Khare S.

Statistics in Transition new series

|

2021

|

vol. 22

|

issue 2

125-142

EN

In this paper, two modified, design-based calibration ratio-type estimators are presented. The suggested estimators were developed under stratified random sampling using information on an auxiliary variable in the form of robust statistical measures, including Gini’s mean difference, Downton’s method and probability weighted moments. The properties (biases and MSEs) of the proposed estimators are studied up to the terms of first-order approximation by means of Taylor series approximation. The theoretical results were supported by a simulation study conducted on four bivariate populations generated from normal, chi-square, exponential and gamma distributions. The results of the study indicate that the proposed calibration scheme is more precise than any of the others considered in this paper.

4

Dispersion of estimates of linear regression parameters in case of the deepest regression method

100%

Pruska D., University of Łódź C. o. S. M.

Acta Universitatis Lodziensis. Folia Oeconomica

|

2008

|

vol. 216

EN

The deepest regression method estimates regression parameters so that the resulting model has the maximal regression depth. In this paper the deepest regression method is presented, and a simulation analysis (Monte Carlo experiments) of the dispersion of linear regression parameter estimates is conducted for data sets with different numbers of outliers. On the basis of the Monte Carlo experiments, the characteristics of the distribution of the regression parameter estimates are determined and compared with the results of analogous experiments conducted with the least squares method.

PL

Metoda najgłębszej regresji polega na oszacowaniu parametrów liniowej funkcji regresji w taki sposób, aby uzyskanemu modelowi odpowiadała największa głębia regresyjna. W pracy przedstawiono charakterystykę metody najgłębszej regresji i przeprowadzono symulacyjną analizę (metodami Monte Carlo) zróżnicowania ocen parametrów modelu regresji liniowej uzyskanych tą metodą dla zbiorów danych zawierających różną liczbę obserwacji nietypowych. Na podstawie przeprowadzonych eksperymentów Monte Carlo wyznaczono charakterystyki rozkładu ocen parametrów i dokonano porównania otrzymanych wyników z wynikami analogicznych eksperymentów, w których do estymacji parametrów wykorzystano metodę najmniejszych kwadratów.

5

Impact of outliers on inequality measures – a comparison between Polish Voivodeships

88%

Ostasiewicz K.

Śląski Przegląd Statystyczny

|

2014

|

issue 12(18)

105-120

EN

It is known that outlying (large) incomes strongly influence the results of inequality measurement. This raises the question of how to deal with such observations. In this paper the rule of excluding observations based on the (Q1 – 1.5Q; Q3 + 1.5Q) interval is investigated for data from the 2011 household budget survey for Polish voivodeships. It is shown that although including more observations obviously changes the values of inequality measures, their relative values are surprisingly quite stable, with the rank correlation coefficient never over 0.9.
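The exclusion rule mentioned in this abstract can be sketched in a few lines. The sketch assumes that Q denotes the interquartile range Q3 - Q1, which the abstract does not state explicitly; the income figures and the function names are hypothetical, and the quartiles follow the default method of `statistics.quantiles`, which may differ from the one used in the paper.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (Q1 - k*Q, Q3 + k*Q), where Q = Q3 - Q1 (assumed meaning of Q)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    q = q3 - q1
    return q1 - k * q, q3 + k * q

def drop_outliers(values, k=1.5):
    """Keep only the observations that lie inside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if low <= v <= high]

# Hypothetical monthly household incomes with one very large value.
incomes = [1200, 1500, 1800, 2000, 2100, 2300, 2500, 2700, 3000, 25000]
kept = drop_outliers(incomes)
print(iqr_fences(incomes), kept)
```

Any inequality measure can then be computed on both `incomes` and `kept` to see how strongly the excluded observation drives the result.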

6

Analysis of Total, Direct and Indirect Cost Outliers in a Polish Specialist Hospital

88%

Cygańska M., Thoene M., Silva A.

Olsztyn Economic Journal

|

2017

|

vol. 12

|

issue 4

451-464

EN

The purpose of this study is to analyze the factors facilitating the identification of three categories of cost outliers: total cost outliers (TCO), direct cost outliers (DCO), and indirect cost outliers (ICO). Data on 4,570 patients were analyzed. To evaluate the factors that influence whether a patient is a cost outlier in a hospital, age, length of stay, gender, type of admission, reason for discharge, and type of department were considered. Multivariable logistic regression was used in the study. TCO comprised 9% of the study sample. The percentage of DCO was slightly higher (10%) and that of ICO slightly lower (8%). Total cost outliers accounted for almost 37% of total hospital costs, 40% of direct costs, and 34% of indirect costs. Direct cost outliers accounted for 44.39% of direct costs, and indirect cost outliers accounted for 34.91% of indirect costs. In terms of gender, being male is positively correlated with higher cost utilization. The risk of being a cost outlier is higher when the reason for discharge is death or referral for further treatment. The type of admission is a predictor only of being an ICO. The risk associated with a patient being a length-of-stay outlier increases far more for ICO (more than 580 times) than for DCO (3.81 times) or TCO (13.79 times). The analysis suggests that not only TCO, but also DCO and ICO, should have high priority for hospital managers concerned with variations in the costs of care.

7

Interval shrinkage estimation of the parameter of exponential distribution in the presence of outliers under loss functions

75%

Nasiri P.

Statistics in Transition new series

|

2022

|

vol. 23

|

issue 3

65-78

EN

In this paper, we study interval shrinkage estimators that give equal weights to the point shrinkage estimators for all individual target points θ¯ ∈ (θ0, θ1), for exponentially distributed observations in the presence of outliers drawn from a uniform distribution. Estimators obtained from both shrinkage and interval shrinkage were compared, showing that the estimators obtained via the interval shrinkage method perform better. Symmetric and asymmetric loss functions were also used to calculate the estimators. Finally, a numerical study and illustrative examples were provided to describe the results.

8

Isolation Forests for Symbolic Data as a Tool for Outlier Mining

75%

Pełka M., Dudek A.

Econometrics. Ekonometria. Advances in Applied Data Analytics

|

2024

|

vol. 28

|

issue 1

1-10

PL

Cel: Identyfikacja obserwacji odstających stanowi kluczowy element w analizie danych. Pomimo że w literaturze funkcjonuje wiele różnych definicji, czym są obserwacje odstające, to ogólnie można stwierdzić, że są to obiekty różniące się od pozostałych obserwacji ze zbioru danych. Literatura przedmiotu wskazuje wiele różnorodnych metod, które można wykorzystać w przypadku danych klasycznych. Niestety w przypadku danych symbolicznych brakuje takich analiz. Celem artykułu jest zaproponowanie modyfikacji lasów separujących (isolation forests) dla danych symbolicznych. Metodyka: W artykule wykorzystano lasy separujące dla danych symbolicznych do identyfikacji obserwacji odstających w sztucznych zbiorach danych o znanej strukturze klas i znanej liczbie obserwacji odstających. Wyniki: Otrzymane wyniki wskazują, że lasy separujące dla danych symbolicznych są efektywnym i szybkim narzędziem w identyfikacji obserwacji odstających. Implikacje i rekomendacje: Ponieważ lasy separujące dla danych symbolicznych okazały się skutecznym narzędziem w identyfikacji obserwacji odstających, celem przyszłych badań powinno być przeanalizowanie skuteczności tej metody w przypadku rzeczywistych zbiorów danych (np. zbioru dotyczącego oszustw z użyciem kart kredytowych), a także porównanie tej metody z innymi metodami, które pozwalają odnaleźć obserwacje odstające (np. DBSCAN). Autorzy sugerują, by w przypadku lasów separujących dla danych symbolicznych stosować te same parametry, jakie zwykle stosuje się w przypadku lasów losowych dla danych klasycznych. Oryginalność/wartość: Artykuł nie tylko stanowi ujęcie teorii w zakresie obserwacji odstających, ale jednocześnie proponuje, jak zastosować lasy separujące w przypadku danych symbolicznych.

EN

Aim: Outlier detection is a key part of every data analysis. Although many definitions of outliers can be found in the literature, all of them emphasise that outliers are objects that are in some way different from the other objects in the dataset. Many different approaches have been proposed, compared, and analysed for classical data. However, only a few studies deal with the problem of outlier detection in symbolic data analysis. The paper aims to propose how to adapt the isolation forest to symbolic data. Methodology: An isolation forest for symbolic data is used to detect outliers in four different artificial datasets with a known cluster structure and a known number of outliers. Results: The results show that the isolation forest for symbolic data is a fast and efficient tool for outlier mining. Implications and recommendations: As the isolation forest for symbolic data appears to be an efficient tool for outlier detection in artificial data, further studies should focus on real data sets that contain outliers (e.g. a credit card fraud dataset), and this approach should be compared with other outlier mining tools (e.g. DBSCAN). The authors recommend using the same initial settings for the isolation forest for symbolic data as the settings that are proposed for the isolation forest for classical data. Originality/value: This paper is the first of its kind, focusing not only on the problem of outlier detection in general, but also extending the well-known isolation forest model to symbolic data. Keywords: symbolic data analysis, isolation forest, outliers
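For orientation, the classical isolation forest that the paper extends can be sketched compactly. This is the classical algorithm only and not the authors' symbolic-data adaptation, which the abstract does not describe. As a loudly labelled assumption, each interval-valued observation is flattened into a (lower, upper) pair so that the classical method can be run on it. The data, the function names and the reduced `sample_size` are hypothetical.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary search tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth, max_depth):
    """Recursively split the rows on a random dimension at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    dim = rng.randrange(len(rows[0]))
    low = min(r[dim] for r in rows)
    high = max(r[dim] for r in rows)
    if low == high:
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, rng, depth + 1, max_depth),
            "right": build_tree(right, rng, depth + 1, max_depth)}

def path_length(row, node, depth=0):
    """Depth at which the row reaches a leaf, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if row[node["dim"]] < node["split"] else node["right"]
    return path_length(row, child, depth + 1)

def anomaly_scores(rows, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1]; values close to 1 indicate easily isolated points."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(rows, size), rng, 0, max_depth)
             for _ in range(n_trees)]
    norm = c_factor(size)
    return [2 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm))
            for r in rows]

# Hypothetical interval data stored as (lower, upper); the last row is an outlier.
data_rng = random.Random(42)
intervals = []
for _ in range(40):
    lower = data_rng.gauss(10, 1)
    intervals.append((lower, lower + data_rng.uniform(1, 2)))
intervals.append((30.0, 45.0))

scores = anomaly_scores(intervals)
print(max(range(len(scores)), key=scores.__getitem__), round(scores[-1], 3))
```

The planted interval needs far fewer random splits to be separated from the rest, so its average path length is short and its score is the highest in the set.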

9

On the method of identification of atypical observations in time series

63%

Oesterreich M.

Econometrics. Ekonometria. Advances in Applied Data Analytics

|

2020

|

vol. 24

|

issue 2

1-16

EN

The paper presents a method of detecting atypical observations in time series with or without seasonal fluctuations. Unlike classical methods of identifying outliers and influential observations, it examines the impact of individual observations on both the fitted values of the model and the forecasts. The theoretical considerations are illustrated with an empirical example of modelling and forecasting daily sales of liquid fuels at gas station X in the period 2012-2014. As a predictor, a classical time series model was used, in which 7-day and 12-month seasonality was described using dummy variables. The data from 01.01.2012 to 30.06.2014 formed the estimation period, and the second half of 2014 was the period of empirical verification of the forecasts. The obtained results were compared with other classical methods used to identify influential observations and outliers, i.e. standardized residuals, Cook distances and DFFIT. The calculations were carried out in the R environment and the Statistica package.

PL

W pracy zaproponowano metodę wykrywania obserwacji nietypowych w szeregach czasowych z wahaniami sezonowymi oraz bez tych wahań. Jej istota polega na badaniu wpływu poszczególnych obserwacji szeregu na wartości teoretyczne modelu oraz wielkości prognoz zbudowanych na jego podstawie. Egzemplifikacją rozważań o charakterze teoretycznym jest przykład empiryczny dotyczący modelowania i prognozowania dziennej sprzedaży paliw płynnych na stacji paliw X w latach 2012-2014. Dane za okres od 1.01.2012 do 30.06.2014 stanowią okres estymacyjny, a za II półrocze 2014 r. okres empirycznej weryfikacji prognoz. Wyniki otrzymane za jej pomocą zostały porównane z wynikami uzyskanymi innymi metodami służącymi do identyfikacji obserwacji wpływowych oraz odstających, w tym m.in.: reszt standaryzowanych, odległości Cooka oraz DFFIT. Obliczenia przeprowadzono w środowisku R oraz pakiecie Statistica.
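Two of the classical diagnostics used as a benchmark in this paper, standardized residuals and Cook distances, can be sketched for a simple linear regression. This is a simplified illustration: the paper uses a time series model with seasonal dummy variables, whereas the sketch assumes a single predictor, and the data and the function name are hypothetical.

```python
def influence_measures(x, y):
    """Standardized residuals and Cook distances for a one-predictor OLS fit."""
    n, p = len(x), 2  # p = number of estimated parameters (intercept, slope)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)            # residual variance
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]  # hat-matrix diagonal
    std_res = [e / (s2 * (1 - h)) ** 0.5 for e, h in zip(resid, leverage)]
    cooks = [r * r * h / (p * (1 - h)) for r, h in zip(std_res, leverage)]
    return std_res, cooks

# Hypothetical series: roughly y = 2x, with one atypical value at index 7.
x = list(range(1, 11))
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 25.0, 18.1, 19.9]

std_res, cooks = influence_measures(x, y)
flagged = cooks.index(max(cooks))
print(flagged, round(std_res[flagged], 2), round(cooks[flagged], 2))
```

Both measures judge an observation by its effect on the fit itself; the method proposed in the paper additionally looks at the effect on the forecasts.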

10

Outliers vs Robustness in Nonparametric Methods of Regression

63%

Trzęsiok J.

Acta Universitatis Lodziensis. Folia Oeconomica

|

2018

|

vol. 4

|

issue 337

99-109

PL

Artykuł poświęcony jest zagadnieniu odporności metod regresji na obserwacje odstające występujące w zbiorze danych. W pierwszej części przedstawiono wybrane metody identyfikacji obserwacji nietypowych. Następnie badano odporność trzech nieparametrycznych metod regresji: PPR, POLYMARS i RANDOM FORESTS. Analiz dokonano za pomocą procedur symulacyjnych na zbiorach danych, w których wykryto obserwacje odstające. Mimo dosyć powszechnych przekonań o odporności regresji nieparametrycznej okazało się, że modele zbudowane na całych zbiorach danych mają istotnie mniejsze zdolności predykcyjne niż modele uzyskane na zbiorach, z których usunięto obserwacje nietypowe.

EN

The article addresses the question of how robust methods of regression are against outliers in a given data set. In the first part, we presented the selected methods used to detect outliers. Then, we tested the robustness of three nonparametric methods of regression: PPR, POLYMARS, and RANDOM FORESTS. The analysis was conducted applying simulation procedures to the data sets where outliers were detected. Contrary to a relatively common conviction about the robustness of nonparametric regression, the study revealed that the models built on the basis of complete data sets represent a significantly lower predictive capability than models based on the sets from which outliers were discarded.

11

Problem wartości odstających w badaniu kondycji finansowej przedsiębiorstw budowlanych w Polsce

63%

Kostrzewska J., Pawełek B., Lipieta A.

Zeszyty Naukowe Uniwersytetu Ekonomicznego w Krakowie

|

2016

|

issue 1(949)

23-41

PL

Wyniki analizy kondycji finansowej przedsiębiorstw są wykorzystywane m.in. w badaniach dotyczących zagrożenia upadłością. Do oceny kondycji finansowej przedsiębiorstw wykorzystuje się wskaźniki finansowe, podstawą badań są zatem dane pochodzące ze sprawozdań finansowych. Ocena jakości tych danych obejmuje m.in. wykrywanie wartości odstających. Celem artykułu jest przedstawienie wyników badań empirycznych nad wpływem wyboru metody wykrywania i eliminacji wartości odstających na skuteczność klasyfikacyjną modelu logitowego, budowanego na podstawie zbiorów uwzględniających lub pomijających wykryte wartości odstające. W badaniach empirycznych wykorzystano jedno- i wielowymiarowe metody wykrywania wartości odstających. Metody te dodatkowo połączono z analizą mocy dyskryminacyjnej wskaźników finansowych. Ocenę skuteczności modelu logitowego oparto na miernikach wrażliwości i specyficzności. Badaniem objęto przedsiębiorstwa budowlane w Polsce w latach 2005, 2007 i 2009.

EN

The results of an analysis of financial standing can be used to study the threat of bankruptcy. Financial indicators are used to evaluate enterprises’ financial standing. Thus, data from financial statements are the basis for examining the financial position. The evaluation of data quality includes, among other things, the identification of outliers. This article presents the results of an empirical study of how the choice of method for detecting and eliminating outliers influences the classification effectiveness of a logit model constructed on samples that either include or omit the detected outliers. The research employed one- and multi-dimensional methods of detecting outliers and combined them with an analysis of the discriminatory power of the financial indicators. The classification effectiveness of the logit model was assessed by sensitivity and specificity measures. The research covered construction enterprises in Poland in the years 2005, 2007 and 2009.

12

Identyfikacja i znaczenie obserwacji nietypowych w modelach konwergencji dochodowej

63%

Batóg J.

Zeszyty Naukowe Uniwersytetu Ekonomicznego w Krakowie

|

2015

|

issue 5(941)

5-15

PL

Badanie zjawiska konwergencji dochodowej znajduje szerokie odzwierciedlenie w dotychczasowym dorobku nauki i praktyce gospodarczej. Otrzymywane rezultaty charakteryzują się jednak stosunkowo dużym zróżnicowaniem. Wielu autorów wskazuje na silne uzależnienie uzyskiwanych wyników od zakresu czasowego i przekrojowego prowadzonych analiz oraz stosowanych metod badawczych. Mało uwagi poświęca się jednak roli obserwacji nietypowych, które mogą być wynikiem błędnego pomiaru, wystąpienia zdarzenia losowego, niestandardowych warunków lub działań o charakterze celowym. Weryfikacji poddana została hipoteza o istotnym wpływie tych obserwacji na uzyskiwane wyniki procesu estymacji. Głównym celem pracy było ustalenie, czy występowanie obserwacji uznanych za nietypowe istotnie zmienia jakość modeli oraz szybkość procesu konwergencji dochodowej.

EN

Research on income convergence is widely reflected in the existing literature and in economic practice. The results obtained, however, vary considerably. Many authors underline the strong dependence of the results on the time span and cross-sectional scope of the sample as well as on the methods applied. Little attention is paid to the role of atypical observations (outliers), which can occur as a result of incorrect measurement, a random event, non-standard circumstances or intentional action. The hypothesis verified was that outliers exert a significant influence on estimation results. The main objective of the analysis was to determine whether the occurrence of such observations significantly changes the quality of the models built and the speed of the process of income convergence.

13

M-estymacja w badaniu małych przedsiębiorstw

63%

Dehnel G., Gołata E.

Zeszyty Naukowe Uniwersytetu Ekonomicznego w Krakowie

|

2016

|

issue 1(949)

5-21

PL

W wielu badaniach z zakresu statystyki gospodarczej liczebność próby jest na tyle duża, że obserwacje odstające mają stosunkowo niewielki wpływ na wartości szacowanych parametrów. W badaniach prowadzonych na niskim poziomie agregacji w ramach statystyki krótkookresowej obecność obserwacji odstających może być jednak znacząca. Z tego powodu w przypadku populacji takich jak populacja przedsiębiorstw obok podejścia klasycznego w badaniach powinien być uwzględniany nurt metod odpornych na występowanie jednostek nietypowych. W literaturze przedmiotu zaproponowano wiele alternatywnych metod estymacji mniej wrażliwych na wartości odstające. W opracowaniu weryfikacji empirycznej poddano jedną z nich – M-estymację. Celem analizy była ocena jej użyteczności w odniesieniu do badania małych przedsiębiorstw.

EN

In many business surveys, sample sizes are large enough to compensate for the presence of outliers, which then have a relatively small impact on estimates. However, at low levels of aggregation, the impact of outliers might be significant. Therefore, in the case of a population such as the population of enterprises, the classical approach should be accompanied by methods that are robust to the occurrence of outliers. To deal with this problem, several alternative estimation techniques, less sensitive to outliers, have been proposed in the statistics literature. In this paper we look at one of them – M-estimation – and assess its usefulness in the survey of small businesses.
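The idea behind M-estimation can be illustrated with the simplest case, a Huber M-estimator of location computed by iteratively reweighted averaging. This is only a sketch of the general technique; the abstract does not say which M-estimator or which model the authors applied. The tuning constant 1.345, the MAD-based scale, the revenue figures and the function name are assumptions.

```python
import statistics

def huber_location(values, c=1.345, tol=1e-9, max_iter=200):
    """Huber M-estimate of location with a fixed MAD-based scale."""
    mu = statistics.median(values)
    mad = statistics.median([abs(v - mu) for v in values])
    scale = 1.4826 * mad  # makes the MAD consistent for normal data
    if scale == 0:
        return mu
    for _ in range(max_iter):
        weights = []
        for v in values:
            r = abs(v - mu) / scale
            # Full weight near the centre, down-weighting beyond c scale units.
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical revenues of small enterprises with one extreme unit.
revenues = [10, 12, 11, 13, 12, 11, 14, 12, 13, 200]
robust = huber_location(revenues)
print(statistics.mean(revenues), statistics.median(revenues), round(robust, 2))
```

The extreme unit pulls the arithmetic mean far away from the bulk of the data, while the M-estimate stays close to the median because that unit receives a very small weight.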

14

Dobór modelu a obciążenie szacunku na przykładzie estymatora GREG w badaniu małych przedsiębiorstw

63%

Dehnel G.

Zeszyty Naukowe Uniwersytetu Ekonomicznego w Krakowie

|

2017

|

issue 11(971)

5-25

PL

Estymacja dotycząca populacji charakteryzujących się silną asymetrią i obecnością obserwacji odstających jest zagadnieniem trudnym, zwłaszcza gdy prowadzona jest na niskim poziomie agregacji. Zastosowanie klasycznych, bezpośrednich metod estymacji nie pozwala na otrzymanie wiarygodnych szacunków. Potrzeba uzyskania szczegółowych informacji oraz szerszych możliwości wykorzystania danych pochodzących z rejestrów administracyjnych skłania do poszukiwania innych, nieklasycznych metod szacunku. Przykładem może być estymacja typu GREG. W artykule podjęto próbę zbadania wpływu wyboru modelu uwzględnionego w ramach estymatora GREG na jakość szacunku parametru populacji przedsiębiorstw. Analizę przeprowadzono na podstawie danych pochodzących z badania małych przedsiębiorstw. Badaną zmienną był przeciętny przychód przedsiębiorstwa. Jako zmienne pomocnicze wykorzystano zmienne opóźnione pochodzące z rejestrów administracyjnych. Badanie prowadzono w przekroju województw z uwzględnieniem rodzaju prowadzonej działalności gospodarczej.

EN

Estimation for a very skewed population containing extreme values is problematic, especially at a low level of aggregation. Traditional direct estimation methods do not provide satisfactory results. The growing demand for detailed information and the wider possibility of using data from administration registers has increased the importance of recognising more sophisticated estimation methods. Generalised Regression (GREG) estimation is an example of one such type. The paper examines the importance of the model chosen in GREG estimation in dealing with highly variable and outlier-prone populations. The model-assisted GREG estimator is applied to a real business survey. Lagged variables from administrative registers were used as the auxiliary variables. The variable of interest – mean revenue of small companies – was estimated for provinces cross-classified by categories of economic activity.
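The general form of a GREG estimator of a total can be sketched as the Horvitz-Thompson estimate plus a regression adjustment based on a known auxiliary total. The working model here, a regression through the origin on one auxiliary variable, is an assumption chosen for brevity; the abstract is precisely about how the choice of this model matters, and the authors' models are not given. The data and the function name are hypothetical.

```python
def greg_total(y, x, weights, x_total):
    """GREG estimate of the total of y using one auxiliary variable x.

    Assumed working model: y = B * x + error (regression through the origin).
    """
    ht_y = sum(w * yi for w, yi in zip(weights, y))   # Horvitz-Thompson total of y
    ht_x = sum(w * xi for w, xi in zip(weights, x))   # Horvitz-Thompson total of x
    slope = (sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
             / sum(w * xi * xi for w, xi in zip(weights, x)))
    # Correct the direct estimate by the gap between known and estimated x totals.
    return ht_y + slope * (x_total - ht_x)

# Hypothetical sample of 4 enterprises from a population of 20 (design weight 5).
x_sample = [2, 4, 6, 8]          # lagged register variable (assumed)
y_sample = [6, 12, 18, 24]       # current revenue, here exactly 3 * x
design_weights = [5, 5, 5, 5]
known_x_total = 110              # register total of x for the whole population

estimate = greg_total(y_sample, x_sample, design_weights, known_x_total)
print(estimate, estimate / 20)   # estimated total and mean revenue
```

When the working model fits perfectly, as in this constructed case, the estimator reproduces the model-implied total exactly; with a poorly chosen model or with outliers the adjustment term can introduce bias, which is the issue the paper examines.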

15

OCENA ZMIAN STOPNIA ZANIECZYSZCZANIA ŚRODOWISKA W POLSCE W LATACH 2004-2014 PRZY WYKORZYSTANIU PODSTAWOWYCH NARZĘDZI ANALITYCZNYCH

51%

Koszela G., Szczesny W.

Metody Ilościowe w Badaniach Ekonomicznych

|

2016

|

vol. 17

|

issue 3

95-107

PL

W artykule podjęto próbę oceny zmian stopnia zanieczyszczenia środowiska na poziomie województw w latach 2004-2014. Ocenę tę przeprowadzono przy pomocy budowy rankingów województw. Rankingi te utworzono na podstawie zmiennych syntetycznych powstałych w wyniku normalizacji zmiennych metodą unitaryzacji zerowanej oraz przekształcenia ilorazowego. Zwrócono również uwagę na problem obserwacji odstających. Okazuje się, że w zależności od podejścia do tego problemu, można uzyskać znacząco różniące się wyniki dotyczące grupowania województw w klasy.

EN

The aim of the paper was to evaluate changes in the degree of environmental pollution at the level of voivodeships in the years 2004-2014. The assessment was carried out by constructing voivodeship rankings. These rankings were created on the basis of synthetic variables obtained by normalizing the variables with the zero unitarisation method and the quotient transformation. Attention was also paid to the problem of outliers. It turns out that, depending on the approach to this problem, significantly different results can be obtained for the grouping of voivodeships into classes.
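The ranking procedure described here can be sketched with the zero unitarisation method, which rescales each variable to the range [0, 1]. The sketch treats pollution variables as destimulants (higher values are worse) and builds the synthetic variable as a plain average of the normalized values; both choices, the region labels and the figures are assumptions, and the quotient transformation mentioned in the abstract is not shown.

```python
def zero_unitarise(values, stimulant=True):
    """Rescale to [0, 1]; for destimulants the largest value maps to 0."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        raise ValueError("variable is constant and cannot be normalized")
    if stimulant:
        return [(v - low) / span for v in values]
    return [(high - v) / span for v in values]

# Hypothetical emissions for four regions; region D has an outlying dust value.
regions = ["A", "B", "C", "D"]
dust = [10, 20, 30, 110]
gas = [5, 15, 10, 20]

normalized = [zero_unitarise(dust, stimulant=False),
              zero_unitarise(gas, stimulant=False)]
# Synthetic variable: average of the normalized variables for each region.
synthetic = [sum(col) / len(col) for col in zip(*normalized)]
ranking = [r for _, r in sorted(zip(synthetic, regions), reverse=True)]
print(normalized[0], ranking)
```

The outlying dust value for region D compresses the normalized values of the other regions into a narrow band near 1, which shows how the handling of outliers can change the resulting ranking and grouping.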
