Original Article

Investigation of the Change in the Correct Classification Ratios by Using the Richard Link Function in Logistic Regression: A Research on the Determination of Risk Factors in COPD

10.4274/hamidiyemedj.galenos.2022.15238

  • Kürşad Nuri Baydili
  • Mustafa Çörtük
  • Ahmet Dirican

Received Date: 06.09.2022 Accepted Date: 26.09.2022 Hamidiye Med J 2022;3(3):162-170

Background:

Regression analyses are used to explain the relationship between a dependent variable and independent variables using mathematical models. Logistic regression, which is used in cases where the dependent variable is categorical, is often used in analyzing health-related data.

Materials and Methods:

It is well known that the inflection point of the logistic regression curve sometimes corresponds to smaller or larger values on the horizontal axis, resulting in incorrect classifications. The present study aimed to increase correct classification rates by using the Richards link function to determine the most suitable inflection point for data.

Results:

To evaluate the performance of the Richards link function, four different simulation scenarios and applications were carried out with a total of 1,005 individuals, of whom 505 did not have chronic obstructive pulmonary disease (COPD) and 500 did. The data were divided into learning and test data. A logistic regression model was obtained from the learning data, and an increase in the correct classification rates was observed when the Richards link function was used in this model. The model was applied to the test data with the m-value determined for the learning data set and achieved a higher correct classification rate than the current method.

Conclusion:

The present study indicated that certain percentage increases can be achieved in correct classification rates by using the Richards link function. However, it would be beneficial to conduct studies in which applications are made with data sets containing fewer and more independent variables, different sample sizes, and combinations of independent variable types.

Keywords: Logistic regression, Richards link function, correct classification

Introduction

Establishing a cause-effect relationship between the dependent variable and the independent variable(s) is one of the aims of scientific research (1). Univariate methods used to investigate causality make comparisons by assuming that all factors other than the variable of interest are homogeneous or constant. However, it is not always possible to achieve homogeneity or stability in the real world, and a variable often changes together with one or more other variables. This problem can be addressed by accounting for such co-variation through multivariate statistical analysis (2), which aims to predict an outcome from multiple independent variables (3).

Multivariate statistical analysis includes many techniques, and the choice among them depends on the purpose of the study, the type of dependent and independent variable(s), and the fulfillment of certain conditions (2). Regression analysis is used to estimate relationships between a dependent variable and a set of independent variables using mathematical models (4). Researchers in the field of health generally aim to classify their observations and make inferences about future observations based on existing observations (5). The first known classification methods were cluster analysis, originated by Driver and Kroeber (6) in the social and life sciences, and two-group discriminant analysis, proposed by Fisher (7). The primary techniques used to classify observations are cluster analysis, discriminant analysis, and logistic regression analysis (8). In cluster analysis, the number of groups is unknown, and data are assigned to groups according to certain criteria (9). In discriminant analysis and logistic regression, the number of groups is known, and data are assigned to groups by using this information (8). Logistic regression does not require the assumptions of normality and homogeneity of variances that discriminant analysis requires (4).

Berkson (10) was the first to publish an application of the logistic model in the field of biology. Logistic regression aims to reveal the model that has the highest fit with the least number of variables (9). Logistic regression can be used to estimate and summarize data, as well as for classification by examining the relationship between the dependent variable and the independent variable(s). Logistic regression is used when the dependent variable is in the form of qualitative data (4). There are three different types of logistic regression: Binary logistic regression is used when the dependent variable has two categories, multinomial logistic regression is used when the dependent variable has more than two categories, and ordinal logistic regression is used when the dependent variable is measured at the ordinal level (2,4,11).

Logistic regression is commonly used in fields such as economics, education, and health (12). Binary logistic regression has become an increasingly employed statistical tool in medical research, where the dependent variable generally indicates whether a risk, such as a disease, is present (13) and is coded as 1 or 0. An odds ratio is used for risk estimation in retrospective studies. The significance of the odds ratio is determined by examining its confidence interval. If the confidence interval for the odds ratio does not include 1, the calculated odds ratio is considered statistically significant. When the odds ratio is significant, a value greater than 1 indicates that the factor is a risk factor, while a value less than 1 indicates that it is a protective factor (4). In binary logistic regression, the model π(x) gives the probability that Y equals 1 when the value of the independent variable X is known, that is, π(x) = P(Y=1|X=x).
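The decision rule for odds ratios described above can be sketched in Python. This is a minimal illustration rather than the authors' analysis code: the function names, the Wald-type 95% interval (z = 1.96), the symbol alpha for the linear predictor, and the coefficient and standard error values are all assumptions chosen for the example.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Odds ratio and Wald-type confidence limits for one logistic coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def interpret(odds_ratio, lower, upper):
    """Rule from the text: the interval must exclude 1 for significance."""
    if lower <= 1.0 <= upper:
        return "not significant"
    return "risk factor" if odds_ratio > 1.0 else "protective factor"

def logistic_probability(alpha):
    """pi(x) = P(Y=1 | X=x), with alpha the linear predictor b0 + b1*x1 + ..."""
    return 1.0 / (1.0 + math.exp(-alpha))

# Hypothetical coefficients and standard errors, for illustration only.
examples = [("factor A", 0.7, 0.2), ("factor B", -1.0, 0.3), ("factor C", 0.1, 0.2)]
for name, beta, se in examples:
    or_, lo, hi = odds_ratio_with_ci(beta, se)
    print(f"{name}: OR={or_:.3f}, 95% CI=({lo:.3f}, {hi:.3f}) -> {interpret(or_, lo, hi)}")
```

An odds ratio below 1 can also be reported through its reciprocal; for example, an odds ratio of 0.04 corresponds to a 25-fold protective factor.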

The logistic regression graph is an S-shaped sigmoid curve (14,15). The logistic curve was first used by Verhulst (16) to describe the growth of a population. The inflection point of this curve may sometimes correspond to smaller or larger x-values than it should. Gürcan et al. (17) stated that in such cases, using various link functions may yield a more suitable x-value for the inflection point of the curve and thus increase the rate of correct classification. They found the inflection point by analyzing the second derivative of the curve proposed by Richards (18), and aimed to increase the correct classification rate of the model by applying the inflection points separately for the misclassified observations.


Material and Methods

Ethics committee approval with the number 18/1 was obtained from the Hamidiye Scientific Research Ethics Committee at the meeting numbered 2022/18 for the research. The present study aimed to increase the correct classification rate by using an alternative link function to the existing method used in binary logistic regression. In logistic regression, instead of the formula $P = \frac{e^{\alpha}}{1+e^{\alpha}}$, the probability values were calculated using the Richards link function with the formula $P = \left(1+(m-1)\,e^{-\alpha}\right)^{\frac{1}{1-m}}$, where m was varied in 0.01 increments over the interval (1, 3), and the estimated classification values were obtained according to these probability values. Data were collected through face-to-face interviews using a questionnaire for a total of 1005 individuals, 505 without chronic obstructive pulmonary disease (COPD) and 500 with COPD, to evaluate the performance of the Richards link function. The demographic characteristics of the study participants are presented in Table 1. Applications were carried out in two different ways. In the first method, all the data were included in the logistic regression model and the probability and class values were obtained with the proposed method. Then, probability calculations were made using the Richards link function with the same coefficients. Subsequently, the m-value that maximizes the correct classification percentage was determined. In the second method, 74.6% (n=750) of the data were included in the logistic regression model, and the correct classification numbers and ratios for all m-values were presented with the coefficients. Next, probability and classification values were obtained for the remaining 25.4% (n=255) of the data using the m-value that maximizes the correct classification percentage.
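The two probability formulas and the search over m described above can be sketched as follows. This is an illustrative sketch, not the study's code. It assumes that α is the fitted linear predictor, that an observation is assigned to class 1 when its probability is at least 0.5, and that the grid runs from 1.01 to 2.99. The function names and the toy values of alpha and y are hypothetical.

```python
import numpy as np

def logistic_prob(alpha):
    """Standard logistic probability, e^alpha / (1 + e^alpha)."""
    return 1.0 / (1.0 + np.exp(-alpha))

def richards_prob(alpha, m):
    """Richards-type probability (1 + (m - 1) e^{-alpha})^{1 / (1 - m)}, m > 1."""
    return (1.0 + (m - 1.0) * np.exp(-alpha)) ** (1.0 / (1.0 - m))

def correct_rate(prob, y, cutoff=0.5):
    """Share of observations whose predicted class matches y (cutoff assumed)."""
    return float(np.mean((prob >= cutoff).astype(int) == y))

def search_m(alpha, y, m_grid=None):
    """Return every m on the grid that attains the best rate, plus that rate."""
    if m_grid is None:
        m_grid = np.arange(101, 300) / 100.0  # 1.01, 1.02, ..., 2.99
    rates = np.array([correct_rate(richards_prob(alpha, m), y) for m in m_grid])
    return m_grid[rates == rates.max()], float(rates.max())

# Toy linear predictors and observed classes, for illustration only.
alpha = np.array([-2.0, -0.5, 0.1, 0.15, 0.3, 1.2, 2.5])
y = np.array([0, 0, 0, 0, 1, 1, 1])

best_ms, best_rate = search_m(alpha, y)
print("logistic rate:", correct_rate(logistic_prob(alpha), y))
print("best Richards rate:", best_rate)
print("m range attaining it:", best_ms.min(), "to", best_ms.max())
```

At m = 2 the Richards expression reduces to the logistic formula, so the grid always contains the current method as a special case and the best rate on the fitted data cannot be lower than the logistic rate.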


Results

In the data set in which all observations were included, the variables were individually included in the logistic regression model to select the variables suitable for the logistic regression model. It was concluded that the variables of gender (p<0.001), age (p<0.001), body mass index (p<0.001), duration of exposure to wood, dung or coal smoke (p<0.001), smoking status over 10 packs/year (p<0.001), having a relative with COPD (p<0.001), having a recent lung disease other than COPD (p<0.001), place of residence (p<0.001), and daily exercise for more than 1 hour (p<0.001) should be included in the model (Table 2).

Logistic regression in which all variables were included in the model showed that the variables of gender (p=0.946) and body mass index (p=0.307) had no effect on COPD status. A 1-unit increase in age was associated with a 1.148-fold greater risk (p<0.001), and a 1-unit increase in exposure time to wood, dung or coal smoke was associated with a 1.027-fold greater risk (p=0.011). Smoking more than 10 packs/year was associated with a 7.832-fold greater risk (p<0.001), having COPD in first-degree relatives with a 2.792-fold greater risk (p<0.001), having individuals with lung disease other than COPD in first-degree relatives with a 4.068-fold greater risk (p<0.001), and living in a metropolis with a 7.664-fold greater risk (p<0.001). Exercising for more than 1 hour a day had an odds ratio of 0.04, that is, it was a 25-fold protective factor (p<0.001) (Table 3).

The correct classification rate obtained with the available variables was 93% (n=935) (Table 4). By using the Richards link function in probability calculations, it was found that the correct classification rate for the 10 values of m in the range (1.42, 1.51) was higher than the correct classification rates for the other m-values. The results of the observations with changes in the estimated classification values for m=1.42 are presented in Table 5.

By using the Richards link function, for 6 observations with a change in classification values for m=1.42, the m-values that gave the highest correct classification rate for values varying in the range (1,6) were determined. It was determined that the current method gave the correct result in two of these 6 observations, while the proposed method gave the correct result in four of them. With the proposed method, it was observed that an increase of approximately 0.2% (n=2) occurred in the correct classification rates (Table 5).

In the selection of the variables suitable for the logistic regression model, the variables were examined one by one by including them in the logistic regression model. It was concluded that the following variables should be included in the model: Gender (p<0.001), age (p<0.001), body mass index (p<0.001), duration of exposure to wood, dung or coal smoke (p<0.001), smoking status over 10 packs/year (p<0.001), having a relative with COPD (p<0.001), having a relative with lung disease other than COPD (p<0.001), place of residence (p<0.001), and exercising for more than 1 hour daily (p<0.001) (Table 6).

The logistic regression model showed that gender (p=0.727) and body mass index (p=0.643) had no effect on having COPD. A 1-unit increase in age was associated with a 1.195-fold greater risk (p<0.001), and a 1-unit increase in exposure time to wood, dung or coal smoke was associated with a 1.069-fold greater risk of having COPD (p<0.001). Smoking more than 10 packs/year was associated with a 16.446-fold greater risk (p<0.001), having COPD in first-degree relatives with a 3.348-fold greater risk (p=0.002), having individuals with lung disease other than COPD in first-degree relatives with a 9.797-fold greater risk, and living in a metropolitan area with a 17.288-fold greater risk (p<0.001). Exercising for more than 1 hour a day was a 71.43-fold protective factor (p<0.001) (Table 7).

The correct classification rate of the equation with the available variables was 94.5% (n=709) (Table 8). By using the Richards link function in probability calculations, it was determined that the correct classification rate for 23 different values of m in the (1.33; 1.44) and (1.56; 1.66) ranges was higher than the correct classification rates for the other m-values. The results of the observations with changes in the estimated classification values for m=1.66 are presented in Table 9.

By using the Richards link function, 6 observations with changes in classification values for m=1.66 were determined from 23 different m-values, which gave the highest correct classification rate for values varying in the (1,3) range. The current method gave the correct result in one of these six observations, while the proposed method gave the correct results in five of these six observations. It was observed that there was an increase of approximately 0.67% (n=5) in the correct classification rate with the proposed method (Table 9).

Applying the models obtained from the training data to the test data gave the following results: 221 observations were classified correctly with the current method and 222 with the proposed method. For the one observation whose classification changed, the probability calculated with the current method was above 0.5, so the observation was estimated to have COPD, whereas the probability calculated with the proposed method was below 0.5, so the observation was correctly classified as not having COPD. As a result, the model, which increased the rate of correct classification by 0.67% in the training data, also provided an increase in correct classification of approximately 0.4% in the test data (Table 10).
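The training and test workflow described above can be sketched end to end on synthetic data. This is a hedged illustration only. The data are simulated, the Newton-Raphson fitting routine stands in for whatever statistical software was actually used, the 0.5 cutoff is assumed, and taking the first maximizing m is an arbitrary tie-break, since several m-values can reach the same training rate. All names are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson fit of a logistic model; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

def richards_prob(alpha, m):
    """Richards-type probability; m = 2 gives the ordinary logistic curve."""
    return (1.0 + (m - 1.0) * np.exp(-alpha)) ** (1.0 / (1.0 - m))

def correct_rate(prob, y):
    """Share of correct classifications with an assumed 0.5 cutoff."""
    return float(np.mean((prob >= 0.5).astype(int) == y))

# Simulated data set with the same size and split as in the text.
rng = np.random.default_rng(0)
n = 1005
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([0.5, 1.5, -1.0])  # arbitrary illustrative coefficients
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(int)

idx = rng.permutation(n)
train, test = idx[:750], idx[750:]

# Coefficients come from the training data only.
beta_hat = fit_logistic(X[train], y[train])
alpha_train, alpha_test = X[train] @ beta_hat, X[test] @ beta_hat

# Choose m on the training data, then reuse it unchanged on the test data.
m_grid = np.arange(101, 300) / 100.0
train_rates = np.array(
    [correct_rate(richards_prob(alpha_train, m), y[train]) for m in m_grid]
)
m_best = m_grid[np.argmax(train_rates)]

print("chosen m:", m_best)
print("train: logistic", correct_rate(richards_prob(alpha_train, 2.0), y[train]),
      "| Richards", train_rates.max())
print("test:  logistic", correct_rate(richards_prob(alpha_test, 2.0), y[test]),
      "| Richards", correct_rate(richards_prob(alpha_test, m_best), y[test]))
```

Because m = 2 lies on the grid, the training rate at the chosen m cannot fall below the logistic rate. No such guarantee holds for the test data, which is why the held-out comparison is the informative one.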


Discussion

Human beings have been trying for centuries to instill some human skills in inanimate beings (19). In the 20th century, interest in this subject increased among scientists. At the beginning of the second half of the 20th century, the Turkish scientist Arf (20) raised the question “Can Machines Think and How Can They Think?”. One of the tasks undertaken by artificial intelligence is to give machines the ability to make inferences and decisions based on past experiences (21). As in many fields, researchers in the field of health aim to make inferences about future observations from the data of existing observations by classifying these observations (5).

Discriminant analysis, cluster analysis, and logistic regression are some of the methods used for classification (8). Logistic regression is mostly used when the classes of observations are known (22). Many studies have been carried out to improve the predictions made with logistic regression and increase the rate of correct classification (12,23,24). The logistic regression plot is an S-shaped sigmoid curve (25). The inflection point of this curve may sometimes take smaller or larger values than it should in logistic regression. Gürcan et al. (17) stated that if the inflection point of the logistic curve corresponds to smaller or larger x-values than it should, a more suitable inflection point can be determined by using various link functions. They found the inflection point by analyzing the second derivative of the curve proposed by Richards (18), and aimed to increase the correct classification rate of the model by applying the inflection points separately for the misclassified observations.

This study aimed to make the inflection point of the logistic regression curve more suitable by using the link function equation proposed by Richards (18). Probability values were calculated separately for all observations, and the observations were assigned to classes according to these values. Then, the m-values that maximized the correct classification rate of the model were determined. Application of the model with all data indicated that the percentage of correct classification, which was 93% with the current method, increased by approximately 0.2% for m=1.42 using the Richards link function. Then, the data were split into training (n=750) and test (n=255) data sets. In the training data set, the correct classification rate, which was 94.5% with the current method, was found to be 95.07% for 23 different m-values in the (1.33; 1.44) and (1.56; 1.66) intervals using the Richards link function. For these values, the same model was applied to the test data for m=1.66. Correctly classifying 1 observation that was misclassified by the current method provided an increase of approximately 0.4% in the correct classification rate for the test data.
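How the choice of m moves the classification boundary can be made concrete with a short sketch. It assumes the Richards form P = (1 + (m-1)e^(-α))^(1/(1-m)), with α the linear predictor and a 0.5 cutoff. The function names are hypothetical, and the calculation is an illustration rather than part of the study. Under this form, differentiating twice with respect to α places the inflection at α = 0 for every m in (1, 3). What changes with m is the probability at that point, m^(1/(1-m)), and therefore the value of α at which the probability crosses 0.5.

```python
import math

def richards_prob(alpha, m):
    """Richards-type probability for linear predictor alpha and shape m > 1."""
    return (1.0 + (m - 1.0) * math.exp(-alpha)) ** (1.0 / (1.0 - m))

def prob_at_inflection(m):
    """Probability at alpha = 0, where the curvature changes sign."""
    return m ** (1.0 / (1.0 - m))

def alpha_at_half(m):
    """Linear predictor at which the Richards probability equals 0.5."""
    return math.log((m - 1.0) / (2.0 ** (m - 1.0) - 1.0))

# m-values taken from the text, plus m = 2 (the ordinary logistic curve).
for m in (1.42, 1.66, 2.0):
    print(f"m={m}: P at inflection={prob_at_inflection(m):.4f}, "
          f"alpha where P=0.5: {alpha_at_half(m):.4f}")
```

For m below 2 the boundary moves to a positive α, so an observation whose logistic probability is only slightly above 0.5 can fall below 0.5 under the Richards form, which matches the direction of the reclassification reported for the test data.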


Conclusion

The present study was carried out to make predictions with a higher percentage of correct classification in logistic regression. It can be concluded that certain percentage increases in the correct classification rates can be achieved by using the Richards link function. However, it would be beneficial to carry out studies in which simulation scenarios are made with data sets containing fewer and more independent variables, different sample sizes, and combinations of independent variable types.

Information: This study is derived from the thesis study of Kürşad Nuri Baydili, PhD student of İstanbul University-Cerrahpaşa, Cerrahpaşa Faculty of Medicine, Department of Biostatistics.

Ethics

Ethics Committee Approval: Ethics committee approval with the number 18/1 was obtained from the Hamidiye Scientific Research Ethics Committee at the meeting numbered 2022/18 for the research.

Informed Consent: Retrospective study.

Peer-review: Internally and externally peer-reviewed.

Authorship Contributions

Surgical and Medical Practices: A.D., Concept: K.N.B., M.Ç., A.D., Design: K.N.B., M.Ç., A.D., Data Collection or Processing: K.N.B., M.Ç., A.D., Analysis or Interpretation: K.N.B., A.D., Literature Search: K.N.B., A.D., Writing: K.N.B., M.Ç.

Conflict of Interest: No conflict of interest was declared by the authors.

Financial Disclosure: The authors declared that this study received no financial support.


  1. Karagöz Y. SPSS AMOS META Uygulamalı Nitel-Nicel Karma Bilimsel Araştırma Yöntemleri ve Yayın Etiği, 2. Basım, Ankara: Nobel Yayıncılık; 2019.
  2. Özdamar K. Paket Programlar ile İstatistiksel Veri Analizi 2, Kaan Kitabevi, Eskişehir; 2010.
  3. Katz MH. Multivariable analysis: a practical guide for clinicians and public health researchers. Cambridge university press, 2011.
  4. Alpar R. Uygulamalı çok değişkenli istatistiksel yöntemler. Detay Yayıncılık, 2017.
  5. Karakoyun M, Hacıbeyoğlu M. Biyomedikal Veri Kümeleri İle Makine Öğrenmesi Sınıflandırma Algoritmalarinin İstatistiksel Olarak Karşılaştırılması. Mühendislik Bilimleri Dergisi. 2014;16:30-42.
  6. Driver HE, Kroeber AL. Quantitative expression of cultural relationships (vol. 31, no. 4). Berkeley: University of California Press; 1932.
  7. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of eugenics, 1936;7:179-188.
  8. Tatlıdil H. Uygulamalı Çok Değişkenli İstatistiksel Analiz. Cem Web Ofset, 1996.
  9. Bircan H. Lojistik regresyon analizi: Tıp verileri üzerine bir uygulama. Kocaeli Üniversitesi Sosyal Bilimler Enstitüsü Dergisi. 2004;2:185-208.
  10. Berkson J. Application of the logistic function to bio-assay. Journal of the American Statistical Association. 1944;39:357-365.
  11. Şenel S, Alatlı B. Lojistik regresyon analizinin kullanıldığı makaleler üzerine bir inceleme. Journal of Measurement and Evaluation in Education and Psychology. 2014;5:35-52.
  12. Sancar N, Inan D. A new alternative estimation method for Liu-type logistic estimator via particle swarm optimization: an application to data of collapse of Turkish commercial banks during the Asian financial crisis. J Appl Stat. 2021;48:2499-2514.
  13. Tabachnick BG, Fidell LS, Ullman JB. Using multivariate statistics. Boston, MA: Pearson; 2007;481-498.
  14. Seber GAF, Wild CJ. Nonlinear regression, 1989.
  15. Başarır G. Çok Değişkenli Verilerde ayrımsama sorunu ve lojistik regresyon Analizi. Hacettepe Üniversitesi Sosyal Bilimler Enstitüsü, Yayınlanmamış Doktora Tezi, 1990.
  16. Verhulst PF. Notice on the law that the population follows in its growth. Corresp Math Phys. 1838;10:113-126.
  17. Gürcan M, Kaya MO, Halisdemir N. Lojistik İncelemede Ayrımsama Performansının Değerlendirilmesi. Avrupa Bilim ve Teknoloji Dergisi. 2019;1008-1013.
  18. Richards FJ. A flexible growth function for empirical use. Journal of experimental Botany. 1959;10:290-301.
  19. Öztürk K, Şahin ME. Yapay sinir ağları ve yapay zekâ’ya genel bir bakış. Takvim-i Vekay. 2018;6:25-36.
  20. Arf C. Makineler Düşünebilir mi ve Nasıl Düşünebilir? Atatürk Üniversitesi 1958-1959 Öğretim Yılı Halk Konferansları. 1959;91-103.
  21. Demirhan A, Kılıç YA, İnan G. Tıpta yapay zeka uygulamaları. Yoğun Bakım Dergisi. 2010;9:31-41.
  22. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied Regression Analysis and Other Multivariable Methods, 4th ed. Thomson Learning. Inc., Belmont: California; 2007.
  23. Kibria BG, Shukur G. On Liu estimators for the logit regression model. Economic Modelling. 2012;29:1483-1488.
  24. Kang K, Gao F, Feng J. A new multi-layer classification method based on logistic regression. In 2018 13th International Conference on Computer Science & Education (ICCSE) (pp. 1-4). IEEE. 2018.
  25. Eberhardt LL, Breiwick JM. Models for population growth curves. International Scholarly Research Notices. 2012.