Purpose: The purpose of this study is to develop and compare model choice strategies in the context of logistic regression. Model choice means the choice of the covariates to be included in the model. Design/methodology/approach: The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison. Findings: The choice of method depends on how much the user emphasizes sensitivity over specificity. It also depends on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, BIC, BICc, or Lasso seems to be optimal. Research limitations: Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed. Practical implications: Researchers can refer to these results if they believe that their data-generating process is similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function. Originality/value: This is a systematic comparison of model choice algorithms and heuristics in the context of logistic regression. The distinction between two types of sensitivity and a comparison based on a loss function are methodological novelties.
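As a rough illustration of how a covariate selection could be scored, the sketch below computes one sensitivity and one specificity for a selected set of covariates and combines them in a weighted loss. The abstract does not define its two kinds of sensitivity or the form of its loss function, so the function names, the single sensitivity measure, and the `weight` parameter are all assumptions made for this example.

```python
def selection_accuracy(true_support, selected, n_covariates):
    """Sensitivity and specificity of a covariate selection.

    true_support: indices of covariates with a nonzero true effect
    selected:     indices chosen by the model choice method
    Assumes at least one relevant and one irrelevant covariate exist.
    """
    true_support, selected = set(true_support), set(selected)
    noise = set(range(n_covariates)) - true_support
    # Sensitivity: share of truly relevant covariates that were selected.
    sensitivity = len(true_support & selected) / len(true_support)
    # Specificity: share of irrelevant covariates that were left out.
    specificity = len(noise - selected) / len(noise)
    return sensitivity, specificity


def selection_loss(sensitivity, specificity, weight=0.5):
    """Hypothetical loss: weighted shortfall in sensitivity and specificity.

    weight near 1 emphasizes sensitivity, weight near 0 specificity.
    """
    return weight * (1.0 - sensitivity) + (1.0 - weight) * (1.0 - specificity)


# 12 candidate covariates, the first 4 are truly relevant;
# the method picked three of them plus one irrelevant covariate.
sens, spec = selection_accuracy({0, 1, 2, 3}, {0, 1, 2, 7}, n_covariates=12)
loss = selection_loss(sens, spec, weight=0.5)
print(sens, spec, loss)
```

In a simulation study of this kind, such a loss would be averaged over many Monte Carlo replications for each method, and changing `weight` shows how the ranking of methods shifts with the emphasis placed on sensitivity.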
For high-dimensional models with a focus on classification performance, ℓ1-penalized logistic regression is becoming important and popular. However, the Lasso estimates can be problematic when the penalties on different coefficients are all the same and unrelated to the data. We propose two types of weighted Lasso estimates, depending upon covariates determined by the McDiarmid inequality. Given sample size n and covariate dimension p, the finite-sample behavior of our proposed method with a diverging number of predictors is illustrated by non-asymptotic oracle inequalities such as the ℓ1-estimation error and the squared prediction error of the unknown parameters. We compare the performance of our method with that of former weighted estimates on simulated data, then apply it to real data analysis.
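For orientation, a weighted ℓ1-penalized logistic regression estimate is usually written in the generic form below. The notation (responses y_i in {0, 1}, covariate vectors x_i, tuning parameter λ, weights w_j) is assumed here for illustration; the abstract does not give the specific data-dependent weights derived from the McDiarmid inequality, so they are left as generic w_j.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^{p}}
    \left\{
      -\frac{1}{n} \sum_{i=1}^{n}
        \left[ y_i \, x_i^{\top}\beta
               - \log\!\left(1 + e^{x_i^{\top}\beta}\right) \right]
      + \lambda \sum_{j=1}^{p} w_j \, \lvert \beta_j \rvert
    \right\}
```

Setting every w_j = 1 recovers the ordinary Lasso, in which all coefficients receive the same penalty regardless of the data.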
On the basis of newly developed regression diagnostic analysis, a diagnostic method for assessing outliers in the logistic regression model was set up and used to analyze the prognosis of patients with acute lymphatic leukemia.
Recently, regression diagnostics have been used not only in general linear models but also in generalized linear models (such as the logistic regression model). However, some problems remain to be solved owing to the differences between the general lin...
In 2016 alone, around 4000 people died in crashes involving trucks in the USA, with 21% of these fatalities involving only single-unit trucks. Much research has identified the underlying factors for truck crashes. However, few studies have detected the factors unique to single and multiple crashes, and none have examined the underlying factors of severe truck crashes in conjunction with violation data. The current research assessed all of these factors using two approaches to improve truck safety. The first approach used ordinal logistic regression to investigate the contributory factors that increased the odds of severe single-truck crashes and multiple-vehicle crashes involving at least one truck. The literature has indicated that past violations can be used to predict future violations and crashes. Therefore, the second approach used risky violations related to truck crashes to identify the contributory factors to the risky violations and truck crashes. The driver actions of failure to keep proper lane, following too close, and driving too fast for conditions accounted for about 40% of all the truck crashes. Therefore, the violations corresponding to these driver actions were included in the analysis. Based on ordinal logistic regression, the analysis for the first approach indicated that being under non-normal conditions at the time of the crash, driving on a dry road, and having a distraction in the cabin are some of the factors that increase the odds of severe single-truck crashes. On the other hand, speed compliance, alcohol involvement, and posted speed limits are some of the variables that affected the severity of multiple-vehicle, truck-involved crashes. With the second approach, the violations related to risky driver actions, which were underlying causes of severe truck crashes, were identified, and an analysis was run to identify the groups at increased risk of truck-involved crashes. The violation results indicated that being a nonresident, driving during off-peak hours, and driving on weekends could increase the risk of truck-involved crashes. This paper offers insight into the capability of using violation data, in addition to crash data, to identify possible countermeasures to reduce crash frequency.
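To make the phrase "increase the odds of severe crashes" concrete, one common formulation of ordinal logistic regression is the proportional-odds model sketched below. The abstract does not state which link or parameterization the study used, so this form and its symbols (severity level Y_i with K ordered categories, thresholds α_k, coefficients β) are assumptions for illustration.

```latex
\operatorname{logit} P(Y_i \le k \mid x_i)
  = \log \frac{P(Y_i \le k \mid x_i)}{P(Y_i > k \mid x_i)}
  = \alpha_k - x_i^{\top}\beta,
  \qquad k = 1, \dots, K-1
```

Under this sign convention, the odds of a crash being more severe than level k equal exp(x_i^T β − α_k), so exp(β_j) is the factor by which those odds change for a one-unit increase in covariate j, and a positive β_j marks a factor that raises the odds of a severe outcome.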
Funding: Supported by the National Natural Science Foundation of China (61877023) and the Fundamental Research Funds for the Central Universities (CCNU19TD009).