• Papadopoulos, A (2021). “Accounting for endogeneity in regression models using Copulas: A step-by-step guide for empirical studies.” Journal of Econometric Methods, https://doi.org/10.1515/jem-2020-0007 . Download the pre-print incl. the on-line supplement.
  • Abstract : We provide a detailed presentation and guide for the use of Copulas in order to account for endogeneity in linear regression models without the need for instrumental variables. We start by developing the model from first principles of likelihood inference, and then focus on the Gaussian Copula. We discuss its merits and propose diagnostics to assess its validity. We analyze in detail and provide solutions to the various issues that may arise in empirical applications for applying the method. We treat the cases of both continuous and discrete endogenous regressors. We present simulation evidence for the performance of the proposed model in finite samples, and we illustrate its application by a short empirical study. A supplementary file contains additional simulations and another empirical illustration.

    This has just been accepted in the European Journal of Operational Research, and the Author’s accepted version has been uploaded here. It was written together with Christopher Parmeter of the University of Miami.

    There are some nice theoretical results related to Skewness and Excess Kurtosis for the composite distributions used in Stochastic Frontier Analysis, but to me the main contribution is a specification test that uses only OLS residuals and appears to be the most powerful such test to date. With this test, one can first test the error specification after just an OLS regression, and only then code the maximum likelihood estimator.

    Abstract. The distributional specifications for the composite regression error term most often used in stochastic frontier analysis are inherently bounded as regards their skewness and excess kurtosis coefficients. We derive general expressions for the skewness and excess kurtosis of the composed error term in the stochastic frontier model based on the ratio of standard deviations of the two separate error components as well as theoretical ranges for the most popular empirical specifications. While these simple expressions can be used directly to assess the credibility of an assumed distributional pair, they are likely to over-reject. Therefore, we develop a formal test based on the implied ratio of standard deviations for the skewness and the kurtosis. This test is shown to have impressive power compared with other tests of the specification of the composed error term. We deploy this test on a range of well-known datasets that have been used across the efficiency community. For many of them we find that the classic distribution assumptions cannot be rejected.
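
    To make the "inherently bounded" statement of the abstract concrete, here is a minimal Python sketch for the Normal-Half Normal pair only. It is not the paper's test. It assumes a production-frontier composed error eps = v - u, and it parameterizes by lam = sigma_u/sigma_v, the ratio of the scale parameters, which may differ from the ratio of standard deviations used in the paper. The function name is illustrative.

```python
import math


def composed_error_moments(lam):
    """Skewness and excess kurtosis of eps = v - u, where
    v ~ Normal(0, sigma_v^2), u ~ Half-Normal with scale sigma_u,
    v and u independent, and lam = sigma_u / sigma_v (assumed parameterization).
    """
    # Work with sigma_v = 1 and sigma_u = lam; both measures are scale-free.
    k2_u = lam ** 2 * (1.0 - 2.0 / math.pi)                                 # variance of u
    k3_u = lam ** 3 * math.sqrt(2.0 / math.pi) * (4.0 / math.pi - 1.0)      # 3rd cumulant of u
    k4_u = lam ** 4 * 8.0 * (math.pi - 3.0) / math.pi ** 2                  # 4th cumulant of u
    var_eps = 1.0 + k2_u            # cumulants of independent terms add up
    skewness = -k3_u / var_eps ** 1.5   # minus sign because u enters with a minus
    excess_kurtosis = k4_u / var_eps ** 2   # the Normal part contributes nothing
    return skewness, excess_kurtosis


if __name__ == "__main__":
    for lam in (0.0, 0.5, 1.0, 2.0, 5.0, 1e6):
        s, k = composed_error_moments(lam)
        print(f"lam = {lam:>9}: skewness = {s: .4f}, excess kurtosis = {k: .4f}")
```

    As lam grows, the skewness approaches the Half-Normal value of about -0.995 and the excess kurtosis about 0.869, so OLS residuals with sample values well outside these ranges are hard to reconcile with this distributional pair.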

    UPDATE: The paper went on-line on February 2, 2021, https://link.springer.com/article/10.1007/s11123-020-00591-9

    The paper “Stochastic frontier models using the Generalized Exponential distribution” has just been approved for publication in the Journal of Productivity Analysis.

    Abstract: We present a new, single-parameter distributional specification for the one-sided error components in single-tier and two-tier stochastic frontier models. The distribution has its mode away from zero, and can represent cases where the most likely outcome is non-zero inefficiency. We present the necessary formulas for estimating production, cost and two-tier stochastic frontier models in logarithmic form. We pay particular attention to the use of the conditional mode as a predictor of individual inefficiency. We use simulations to assess the performance of existing models when the data include an inefficiency term with non-zero mode, and we also contrast the conditional mode to the conditional expectation as measures of individual (in)efficiency.

    Download the pre-print here.

    This survey has just been published in the collection Parmeter, C. F., & Sickles, R. C. (2020) Advances in Efficiency and Productivity Analysis. Springer. Naturally, it is based on my PhD, and it is a comprehensive survey of the state of the art of the Two-tier Stochastic Frontier Framework, covering theoretical foundations, estimation tools, and the large variety of applications this modeling framework has been used for. Indicatively, it has been used to measure the impact of informational asymmetry in wage negotiations, in the housing market and in the Health Services market, the impact of asymmetric bargaining power in international donor-recipient relationships but also in tourist shopping, and the effects of “optimism” and “pessimism” in self-reported quality of life. And many more situations, economic and not-so-economic.

    Anywhere we can conceive of opposing latent forces operating on the outcome, this model can be applied. This is why my pet name for it is the “noisy Tug-of-War” model: “noisy” because there is also a “noise” component in the composed error specification.



    This paper is a joint effort with Prof. Mike Tsionas. It has just been accepted for publication in Econometric Reviews. It proposes a genuinely new least-squares method that reduces the variance of the estimator in linear regression, and it is very easy to implement.

    ABSTRACT. In pursuit of efficiency, we propose a new way to construct least squares estimators, as the minimizers of an augmented objective function that takes explicitly into account the variability of the error term and the resulting uncertainty, as well as the possible existence of heteroskedasticity. We initially derive an infeasible estimator which we then approximate using Ordinary Least Squares (OLS) residuals from a first-step regression to obtain the feasible “HOLS” estimator. This estimator has negligible bias, is consistent and outperforms OLS in terms of finite-sample Mean Squared Error, but also in terms of asymptotic efficiency, under all skedastic scenarios, including homoskedasticity. Analogous efficiency gains are obtained for the case of Instrumental Variables estimation. Theoretical results are accompanied by simulations that support them.

    Download the pre-print and the on-line Appendix.


    This has just been approved for publication in Empirical Economics.

    ABSTRACT. We revisit the production frontier of a firm and we examine the effects that the firm’s management has on output. In order to estimate these effects using a cross-sectional sample while avoiding the costly requirement of obtaining data on management as a production factor, we develop a two-tier stochastic frontier (2TSF) model where management is treated as a latent variable. The model is consistent with the microeconomic theory of the firm, and it can estimate the effect of management on the output of a firm in monetary terms from different angles, separately from inefficiency. The approach can thus contribute to the cost-benefit analysis related to the management system of a company, and it can facilitate research related to management pay and be used in studies of the determinants of management performance. We also present an empirical application, where we find that the estimates from our latent-variable model align with the results obtained when we use the World Management Survey scores that provide a measure of management.


    This came out of nowhere, but it led to a very interesting journey over 70 years of research in very different fields, and a 50-page survey on the many different ways scholars have attempted to define, measure and assess the effects of management on production, output, productivity and efficiency.

    TL;DR: we are still at the beginning of measuring the effects of management on production reliably.

    Here is the survey, destined to be a chapter in the Handbook of Production Economics vol.2


    A paper I wrote together with Christine Amsler and Peter Schmidt (yes, I cannot resist saying, the Peter Schmidt of the KPSS time series stationarity test, and one of the founders of Stochastic Frontier Analysis) has just been approved for publication in a special issue of Empirical Economics that will be dedicated to efficiency and productivity analysis. The paper is

    Amsler C, A Papadopoulos and P Schmidt (2020). “Evaluating the CDF of the Skew Normal distribution.” Forthcoming in Empirical Economics. Download the full paper incl. the supplementary file.

    ABSTRACT. In this paper we consider various methods of evaluating the cdf of the Skew Normal distribution. This distribution arises in the stochastic frontier model because it is the distribution of the composed error, which is the sum (or difference) of a Normal and a Half Normal random variable. The cdf must be evaluated in models in which the composed error is linked to other errors using a Copula, in some methods of goodness of fit testing, or in the likelihood of models with sample selection bias. We investigate the accuracy of the evaluation of the cdf using expressions based on the bivariate Normal distribution, and also using simulation methods and some approximations. We find that the expressions based on the bivariate Normal distribution are quite accurate in the central portion of the distribution, and we propose several new approximations that are accurate in the extreme tails. By a simulated example we show that the use of approximations instead of the theoretical exact expressions may be critical in obtaining meaningful and valid estimation results.


    The paper computes values of the Skew Normal distribution using 17 different mathematical formulas (approximations or exact), and/or algorithms and different software, with particular focus on the accuracy of computation of the Skew Normal CDF by the use of the Bivariate standard Normal CDF, since the latter is readily available, but also on what happens deep into the tails. There, the CDF values are so close to zero or unity that it would appear not to matter for empirical studies if one simply imposed a non-zero floor and a non-unity ceiling. It is not ok. In Section 7 of the paper we show by a simulated example that using the Bivariate standard Normal CDF only (with or without floor/ceiling) may lead to failed estimation, while inserting an approximate expression in its place for the left tail solves the problem. This is a result we did not anticipate: it says that approximate mathematical expressions may perform better than exact formulas due to computational limitations related to the latter.
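
    The flavor of the tail problem can be seen with a much simpler object than the Skew Normal CDF. The Python sketch below is only an analogy and is not one of the formulas examined in the paper: it evaluates the plain standard Normal CDF in two mathematically identical ways (function names are illustrative). The textbook expression loses all precision deep in the left tail, while the rewritten one does not, and a likelihood needs the logarithm of such values.

```python
import math


def norm_cdf_naive(x):
    # Textbook expression: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    # Deep in the left tail erf() rounds to -1 and the sum cancels to zero.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_cdf_tail_safe(x):
    # Same function written with the complementary error function,
    # which keeps relative accuracy in the left tail.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def log_or_minus_inf(p):
    # math.log(0.0) raises ValueError, so map a zero probability to -inf.
    return math.log(p) if p > 0.0 else float("-inf")


if __name__ == "__main__":
    for x in (-2.0, -6.0, -10.0):
        naive = norm_cdf_naive(x)
        safe = norm_cdf_tail_safe(x)
        print(x, naive, safe, log_or_minus_inf(naive), log_or_minus_inf(safe))
```

    At x = -10 the naive version returns exactly zero, so its log-likelihood contribution is minus infinity. Replacing the zero with an arbitrary floor gives a finite number, but one unrelated to the true log-probability, which is the kind of distortion that can derail a maximum likelihood routine.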

    It only took 15 months and 3 revisions, but the paper

    Papadopoulos, A and Roland B. Stark (2019). “Does Home Health Care increase the probability of 30-day hospital re-admissions? Interpreting coefficient sign reversals, or their absence, in binary logistic regression analysis”.

    has now been accepted for publication in The American Statistician

    …and is now (Dec 17, 2019) on-line at https://doi.org/10.1080/00031305.2019.1704873

    The paper is very light on technical stuff, but heavy on concepts. The abstract reads : Data for 30-day readmission rates in American hospitals often show that patients that receive Home Health Care (HHC) have a higher probability of being readmitted to hospital than those that did not receive such services, but it is expected that when control variables are included in a regression we will obtain a “sign reversal” of the treatment effect. We map the real-world situation to the binary logistic regression model, and we construct a counterfactual probability metric that leads to necessary and sufficient conditions for the sign reversal to occur, conditions that show that logistic regression is an appropriate tool for this research purpose. This metric also permits us to obtain evidence related to the criteria used to assign HHC treatment. We examine seven data samples from different USA hospitals for the period 2011-2017. We find that in all cases the provision of HHC increased the probability of readmission of the treated patients. This casts doubt on the appropriateness of the 30-day readmission rate as an indicator of hospital performance and a criterion for hospital reimbursement, as it is currently used for Medicare patients.

    The main contributions of the paper can be distilled down to the following two: first, we show how the familiar binary logistic regression model can be reliably used to glean information as to whether assignment of Home Health Care (HHC) treatment, to patients that are discharged from the hospital, depends positively on the seriousness of their health status, or not (in which case we would have statistical evidence that administrators go for an “easy win” by assigning HHC to less needy patients).

    Second, we provide the theoretical framework to explain an ongoing “puzzle” in Healthcare, that HHC appears to increase the probability of hospital readmissions, even after risk-adjustment: in other words, we explain why the statement “Home Health Care is beneficial to the health of patients and it increases their probability of hospital readmission” is not a contradiction in terms.

    In (counterfactual) Treatment Effects Analysis, we learn that a fundamental condition in order to be able to estimate treatment effects reliably is that the treatment variable is “ignorable conditional on the control variables” (see Rosenbaum and Rubin 1983). When ignorability does not hold, as it happens with most cases of observational, non-randomized data, various methods have been developed to obtain ignorability, or in more precise words, to construct a sample (through “risk adjustment”, “balancing on propensity scores”, etc) that “imitates” a randomized one.

    We are also told that ignorability is analogous to regressor exogeneity in the linear regression setup, so that when ignorability does not hold we essentially have endogeneity, and the estimation will produce inconsistent and therefore unreliable estimates; see e.g. Imbens (2004), or Guo and Fraser “Propensity Score Analysis” (2010), 1st ed., pp 30-35.

    This is simply wrong. The treatment variable may not be ignorable and yet the estimator can be consistent. This means that we can estimate consistently the treatment effect even if the treatment is non-ignorable. We illustrate that non-ignorability does not necessarily imply inconsistency of the estimator, through the widely used Binary Logistic Regression model (BLR).

    The BLR model starts properly with a latent-variable regression, usually linear,

    y^{\ell}_i = \beta_0 + \beta_1T_i + \mathbf z'_i \gamma + u_i,\;\;\; i=1,...,n  \;\;\;\;(1)

    where y^{\ell}_i is the unobservable (latent) variable, T_i is the treatment variable, \mathbf z_i is the vector of controls and u_i is the error term. We obtain the BLR model if we assume that the error term follows the standard Logistic distribution conditional on the regressors (mean zero, variance \pi^2/3), u_i | \{T_i, \mathbf z_i\} \sim \Lambda (0, \pi^2/3). Then we define the indicator variable y_i \equiv I\{y^{\ell}_i >0\}, which is observable, and we ask what the probability distribution of y_i is, conditional on the regressors. We obtain

    \Pr\left (y_i = 1 | \{T_i, \mathbf z_i\}\right) = \Lambda\left (\beta_0 + \beta_1T_i + \mathbf z'_i \gamma\right)\;\;\;\;(2)

    and in general,

    \Pr\left (y_i  | \{T_i, \mathbf z_i\}\right) = \left[\Lambda\left (\beta_0 + \beta_1T_i + \mathbf z'_i \gamma\right)\right]^{y_i}\cdot \left[1-\Lambda\left (\beta_0 + \beta_1T_i + \mathbf z'_i \gamma\right)\right]^{1-y_i}\;\;\;\;(3)

    The unknown coefficients are estimated by maximizing the likelihood built from (3), i.e. by the maximum likelihood estimator (MLE).
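
    Two steps are left implicit above, and they can be sketched as follows (the shorthand g_i is introduced here only for brevity). First, (2) follows from (1) because the standard Logistic distribution is symmetric around zero:

    g_i \equiv \beta_0 + \beta_1T_i + \mathbf z'_i \gamma, \;\;\; \Pr\left (y_i = 1 | \{T_i, \mathbf z_i\}\right) = \Pr\left (u_i > -g_i | \{T_i, \mathbf z_i\}\right) = 1-\Lambda\left(-g_i\right) = \Lambda\left(g_i\right)

    Second, assuming an i.i.d. sample, the log-likelihood that the MLE maximizes is the sum of the logarithms of (3):

    \ln L(\beta_0, \beta_1, \gamma) = \sum_{i=1}^n \left[ y_i \ln \Lambda\left(g_i\right) + (1-y_i) \ln \left(1-\Lambda\left(g_i\right)\right)\right]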

    Turning to ignorability, it can be expressed as

    \Pr \left (y_i | \{T_i, \mathbf z_i\}\right) = \Pr \left (y_i |\mathbf z_i\right)\;\;\;\;(4)

    Essentially, ignorability means that the treatment variable is totally determined by the controls or, if it is only partly determined by them, that its other “part” is independent from the dependent variable/outcome.

    Comparing (4) with (3) we see that ignorability of treatment in the context of the BLR model, is equivalent to the assumption \beta_1=0.
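
    To spell out the comparison: setting \beta_1=0 in (3) gives

    \Pr\left (y_i | \{T_i, \mathbf z_i\}\right) = \left[\Lambda\left (\beta_0 + \mathbf z'_i \gamma\right)\right]^{y_i}\cdot \left[1-\Lambda\left (\beta_0 + \mathbf z'_i \gamma\right)\right]^{1-y_i}

    which no longer depends on T_i, so it coincides with \Pr \left (y_i |\mathbf z_i\right) and (4) holds. Conversely, since \Lambda is strictly increasing, if \beta_1 \neq 0 the right-hand side of (3) changes with T_i for given \mathbf z_i, and (4) fails (assuming that T_i is not a deterministic function of \mathbf z_i, in which case conditioning on it would add nothing).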

    “Great”, you could say. “So run the model and let the data decide whether ignorability holds or not”. Well, the issue is whether, when ignorability does not hold, the MLE remains a consistent estimator so that we can have confidence in the estimates that we will obtain. And the assertion that we find in the literature, is that non-ignorability destroys consistency.

    Does it? Let’s see: in order for the MLE to be inconsistent, it must be the case that the regressors in the latent-variable regression (eq. 1) are correlated with the error term. The controls are assumed independent from the error term from the outset. What is argued is that if T_i is non-ignorable, then it is associated with u_i.

    We have just seen that ignorability implies that \beta_1 =0. So if non-ignorability is the case, we have that \beta_1 \neq 0. How does this imply the inconsistency condition “T_i is not independent from u_i”?

    It doesn’t. The (informal) argument is that if the treatment variable is not fully determined by the controls, it “must” be statistically associated with the unmodeled/random factors represented by u_i. But there is nothing here to support a priori this assertion. Whether the treatment variable is endogenous or not, must be argued per case, with respect to the actual situation that we analyze and model. Certainly, if the argument is that the treatment is ignorable, then, if the controls are exogenous to the error term (which is the maintained assumption), so will be the treatment variable also. But if it is non-ignorable, it does not follow automatically that it is endogenous.

    Therefore, depending on the real-world phenomenon under study and the available sample, we may very well have a consistent MLE in the BLR model, and so

    a) be able to test validly the ignorability assumption, and

    b) estimate treatment effects reliably even if the treatment is non-ignorable.


    Following a request in a comment, here is a quick Gretl script that simulates a situation where the Treatment is not ignorable but is independent from the error term, so that its effect can be consistently estimated. Play around with the sample size (now n=5000), or embed the script into a simple index loop (with matrices to hold the estimates for each run, then fill a series with the estimates from the matrix, then take basic statistics to see that the estimator is consistent).


    nulldata 5000

    set hc_version 2 #uses HC2 robust standard errors

    #Data generation

    genr U1 = randgen(U,0,1) #auxiliary variable
    genr Er = -log((1-U1)/U1) #standard Logistic error term (location 0, scale 1)
    genr X1 = randgen(G,1,2) # continuous regressor following Exponential
    genr N1 = randgen (N,0,1) # codetermines the assignment of treatment
    genr T = (X1+N1 >0) #Bernoulli treatment
    genr yL = -0.5 + 0.5*T + X1 + Er # latent dependent variable

    #The Treatment is not ignorable because it influences directly the latent dependent variable.
    genr Depvar = (yL >0) #observable dependent variable


    list Reglist = const T X1  #list of regressors
    ols Depvar Reglist --quiet #OLS estimation for starting values

    matrix bcoeff = $coeff  #starting values for the coefficient vector

    #This is so that the names of the variables appear in the estimation output
    string varblnames = varname(Reglist)
    string allcoefnames = varblnames

    #command for maximum likelihood estimation
    catch mle logl = Depvar*log(CDF) + (1-Depvar)*log(1-CDF)
    series g = lincomb(Reglist,bcoeff)

    series CDF = 1/(1+ exp(-g)) #correct specification of the distribution of the error term

    params bcoeff
    param_names allcoefnames
    end mle --hessian
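
    For readers without Gretl, the repeated-sampling check suggested above can be sketched in Python. This is a translation under assumptions, not the script above: it uses only the standard library, reads the Gamma(1,2) regressor as an Exponential with mean 2, replaces the `mle` block with a hand-written Newton-Raphson logit fit, and uses a smaller sample (n=2000) to keep the pure-Python run short. All function names are illustrative.

```python
import math
import random

TRUE_BETA = (-0.5, 0.5, 1.0)  # const, T, X1: the values used in the Gretl script


def sigmoid(g):
    # Logistic CDF, written to avoid overflow in exp() for large |g|
    if g >= 0.0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)


def simulate(n, rng):
    """Draw one sample: T is non-ignorable (it enters the latent equation)
    but it is generated independently of the Logistic error."""
    X, y = [], []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                      # guard against log of infinity
            u = rng.random()
        er = -math.log((1.0 - u) / u)        # standard Logistic error
        x1 = rng.expovariate(0.5)            # Exponential with mean 2
        n1 = rng.gauss(0.0, 1.0)             # co-determines treatment assignment
        t = 1.0 if x1 + n1 > 0.0 else 0.0    # Bernoulli treatment
        y_latent = TRUE_BETA[0] + TRUE_BETA[1] * t + TRUE_BETA[2] * x1 + er
        X.append((1.0, t, x1))
        y.append(1.0 if y_latent > 0.0 else 0.0)
    return X, y


def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    k = len(v)
    M = [list(A[i]) + [v[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = M[r][k] - sum(M[r][c] * x[c] for c in range(r + 1, k))
        x[r] = s / M[r][r]
    return x


def fit_logit(X, y, max_iter=50, tol=1e-8):
    """Logit MLE by Newton-Raphson, starting from a zero coefficient vector."""
    k = len(X[0])
    b = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]   # minus the Hessian
        for xi, yi in zip(X, y):
            p = sigmoid(sum(bj * xj for bj, xj in zip(b, xi)))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += w * xi[a] * xi[c]
        step = solve(info, grad)
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < tol:
            break
    return b


def monte_carlo(reps=20, n=2000, seed=12345):
    """Average of the logit estimates over `reps` independent samples."""
    rng = random.Random(seed)
    sums = [0.0] * len(TRUE_BETA)
    for _ in range(reps):
        X, y = simulate(n, rng)
        sums = [s + bj for s, bj in zip(sums, fit_logit(X, y))]
    return [s / reps for s in sums]


if __name__ == "__main__":
    print("true values      :", TRUE_BETA)
    print("average estimates:", monte_carlo())
```

    If the argument of the post is right, the averages should sit close to the true values (-0.5, 0.5, 1), and the dispersion of the individual estimates should shrink as n is increased, even though the coefficient on T is non-zero and the treatment is therefore non-ignorable.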