---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      name:  <unnamed>
       log:  /Users/kleintob/Documents/GitHub/econometricsResearchMaster/notes for lectures in class/stata 1 binary choice.log
  log type:  text
 opened on:   9 May 2026, 10:04:33

. 
. 
. 
. 
. /*
> preface: 80% of all packages for papers published in AEA
> journals use Stata, see 
> https://aeadataeditor.github.io/aea-supplement-migration/programs/aea201910-migration.html
> 
> That's why we want to also use Stata in class.
> 
> This file is about binary choice, i.e. about the probit, logit, and linear probability model,
> a few "nonparametric plots", and the Klein and Spady estimator
> */
. 
. 
. 
. 
. /*
> see e.g. "help logit" for the syntax (or "help regress" or "help probit"; 
> also "help logit postestimation", etc.)
> see also https://www.stata.com/manuals/rlogit.pdf
> */
. 
. 
. 
. 
. ** example taken from https://www.stata.com/manuals/rlogitpostestimation.pdf#rlogitpostestimation
. 
. * load data
. webuse lbw, clear
(Hosmer & Lemeshow data)

. 
. 
. 
. 
. ** logit model
. 
. * estimation
. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -101.28644  
Iteration 2:  Log likelihood = -100.72617  
Iteration 3:  Log likelihood =   -100.724  
Iteration 4:  Log likelihood =   -100.724  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      Black  |   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      Other  |   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             |
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
------------------------------------------------------------------------------

. 
. * average marginal effects (see slide 26)
. margins, dydx(*)

Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0048315   .0064747    -0.75   0.456    -.0175217    .0078587
         lwt |  -.0027011   .0011835    -2.28   0.022    -.0050207   -.0003816
             |
        race |
      Black  |   .2326941   .0995698     2.34   0.019     .0375409    .4278473
      Other  |   .1511004   .0760619     1.99   0.047     .0020217     .300179
             |
       smoke |   .1646164   .0681744     2.41   0.016     .0309971    .2982358
         ptl |   .0966001   .0602536     1.60   0.109    -.0214948    .2146951
          ht |   .3267063   .1148706     2.84   0.004     .1015641    .5518485
          ui |   .1352299   .0797297     1.70   0.090    -.0210375    .2914972
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

. 
. * manually calculate average marginal effect for change in age
. * see slide 36
. predict pHat, pr

. gen avMarginalEffectAge = pHat*(1-pHat)*(-.0271003)

. sum avMarginalEffectAge

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avMarginal~e |        189   -.0048315    .0016661  -.0067749  -.0007185
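
A minimal sketch, not part of the executed log, of the same calculation without typing the coefficient by hand. It assumes the logit model above and uses _b[age] to read the stored estimate at full precision; the variable names ending in Sketch are hypothetical. The mean should reproduce the dy/dx for age reported by margins (-.0048315 in the log above).

```stata
* re-estimate quietly so that _b[] refers to the logit coefficients
webuse lbw, clear
quietly logit low age lwt i.race smoke ptl ht ui

* logit marginal effect of a continuous regressor: p*(1-p)*b
predict double pHatSketch, pr
generate double ameAgeSketch = pHatSketch*(1 - pHatSketch)*_b[age]

* the sample mean is the average marginal effect
summarize ameAgeSketch
```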

. 
. * compare to marginal effect at average x
. margins, dydx(*) atmeans

Conditional marginal effects                               Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui
At: age    =  23.2381 (mean)
    lwt    = 129.8201 (mean)
    1.race = .5079365 (mean)
    2.race = .1375661 (mean)
    3.race = .3544974 (mean)
    smoke  = .3915344 (mean)
    ptl    = .1957672 (mean)
    ht     = .0634921 (mean)
    ui     = .1481481 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0054264   .0072688    -0.75   0.455     -.019673    .0088201
         lwt |  -.0030337   .0013669    -2.22   0.026    -.0057129   -.0003546
             |
        race |
      Black  |   .2643163     .11724     2.25   0.024     .0345301    .4941025
      Other  |   .1679337   .0863596     1.94   0.052    -.0013279    .3371953
             |
       smoke |   .1848858   .0790024     2.34   0.019      .030044    .3397276
         ptl |   .1084946   .0695517     1.56   0.119    -.0278243    .2448134
          ht |   .3669339   .1380704     2.66   0.008      .096321    .6375469
          ui |   .1518808   .0919254     1.65   0.098    -.0282897    .3320514
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

. 
. 
. 
. 
. ** probit model
. 
. * estimation
. probit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -100.73961  
Iteration 2:  Log likelihood = -100.56113  
Iteration 3:  Log likelihood = -100.56095  
Iteration 4:  Log likelihood = -100.56095  

Probit regression                                       Number of obs =    189
                                                        LR chi2(8)    =  33.55
                                                        Prob > chi2   = 0.0000
Log likelihood = -100.56095                             Pseudo R2     = 0.1430

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0175445   .0216292    -0.81   0.417     -.059937    .0248481
         lwt |  -.0088205   .0039734    -2.22   0.026    -.0166082   -.0010327
             |
        race |
      Black  |   .7475256   .3166405     2.36   0.018     .1269216     1.36813
      Other  |   .5144711   .2555824     2.01   0.044     .0135389    1.015403
             |
       smoke |   .5627601   .2357783     2.39   0.017     .1006432    1.024877
         ptl |   .3178267   .2001253     1.59   0.112    -.0744117     .710065
          ht |   1.099451   .4192793     2.62   0.009     .2776784    1.921223
          ui |   .4627944   .2756093     1.68   0.093    -.0773899    1.002979
       _cons |   .2682753   .7015254     0.38   0.702    -1.106689     1.64324
------------------------------------------------------------------------------

. 
. * average marginal effects are similar to those of the logit model
. margins, dydx(*)

Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0052632    .006454    -0.82   0.415    -.0179129    .0073865
         lwt |  -.0026461   .0011521    -2.30   0.022    -.0049041    -.000388
             |
        race |
      Black  |   .2314958   .1012071     2.29   0.022     .0331334    .4298581
      Other  |   .1526453   .0752693     2.03   0.043     .0051203    .3001704
             |
       smoke |   .1688236   .0676518     2.50   0.013     .0362285    .3014186
         ptl |   .0953455    .058933     1.62   0.106     -.020161     .210852
          ht |   .3298265   .1190813     2.77   0.006     .0964315    .5632216
          ui |   .1388347   .0808244     1.72   0.086    -.0195782    .2972475
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
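
A minimal sketch, not part of the executed log, of the manual calculation for the probit model. Here the derivative of the probability with respect to the index is the standard normal density, so the marginal effect of age is normalden(xb)*b. Variable names ending in Sketch are hypothetical; the mean should agree with the dy/dx for age from margins above.

```stata
webuse lbw, clear
quietly probit low age lwt i.race smoke ptl ht ui

* index xb and probit marginal effect of age: phi(xb)*b
predict double xbProbitSketch, xb
generate double ameAgeProbitSketch = normalden(xbProbitSketch)*_b[age]

* the sample mean is the average marginal effect
summarize ameAgeProbitSketch
```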

. 
. 
. 
. 
. ** linear probability model
. 
. * here, the coefficients are directly the marginal effects (see slide 37)
. 
. * estimate with robust standard errors (see slide 37)
. regress low age lwt i.race smoke ptl ht ui, vce(robust)

Linear regression                               Number of obs     =        189
                                                F(8, 180)         =       5.97
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1635
                                                Root MSE          =     .43427

------------------------------------------------------------------------------
             |               Robust
         low | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0034688   .0057272    -0.61   0.545    -.0147699    .0078323
         lwt |  -.0025213   .0010885    -2.32   0.022    -.0046691   -.0003735
             |
        race |
      Black  |   .2214043   .1035048     2.14   0.034     .0171654    .4256431
      Other  |   .1436247   .0736272     1.95   0.053    -.0016589    .2889082
             |
       smoke |   .1595568   .0696973     2.29   0.023      .022028    .2970857
         ptl |   .1153871   .0869949     1.33   0.186    -.0562739     .287048
          ht |   .3635326   .1360481     2.67   0.008     .0950783     .631987
          ui |   .1560515   .1078706     1.45   0.150     -.056802     .368905
       _cons |   .5074597   .2010367     2.52   0.012     .1107678    .9041516
------------------------------------------------------------------------------

. 
. * compare to usual standard errors
. * (these are not correct because of heteroskedasticity, see slide 28; 
. * note: the point estimates are the same)
. regress low age lwt i.race smoke ptl ht ui

      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(8, 180)       =      4.40
       Model |    6.636136         8     .829517   Prob > F        =    0.0001
    Residual |  33.9458746       180  .188588192   R-squared       =    0.1635
-------------+----------------------------------   Adj R-squared   =    0.1263
       Total |  40.5820106       188  .215861758   Root MSE        =    .43427

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0034688   .0063194    -0.55   0.584    -.0159384    .0090008
         lwt |  -.0025213   .0011532    -2.19   0.030    -.0047969   -.0002457
             |
        race |
      Black  |   .2214043   .1001543     2.21   0.028     .0237767    .4190318
      Other  |   .1436247   .0765303     1.88   0.062    -.0073873    .2946366
             |
       smoke |   .1595568   .0710842     2.24   0.026     .0192912    .2998224
         ptl |   .1153871     .06806     1.70   0.092     -.018911    .2496852
          ht |   .3635326    .134455     2.70   0.008     .0982219    .6288433
          ui |   .1560515   .0927102     1.68   0.094    -.0268872    .3389902
       _cons |   .5074597   .2085242     2.43   0.016     .0959933    .9189261
------------------------------------------------------------------------------
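
A minimal sketch, not part of the executed log, that puts the two sets of standard errors side by side. It assumes the same specification as above; the stored-estimate names lpmRobust and lpmUsual are hypothetical.

```stata
webuse lbw, clear

* linear probability model with robust standard errors
quietly regress low age lwt i.race smoke ptl ht ui, vce(robust)
estimates store lpmRobust

* same model with the usual (homoskedastic) standard errors
quietly regress low age lwt i.race smoke ptl ht ui
estimates store lpmUsual

* coefficients are identical; only the standard errors differ
estimates table lpmRobust lpmUsual, b(%9.4f) se(%9.4f)
```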

. 
. 
. 
. 
. ** Klein and Spady estimator
. 
. * recall that the estimator performs a nonparametric regression
. * of low on xb for a candidate parameter value b and then constructs
. * the likelihood function
. * to illustrate: nonparametric regression for the parameter values from the logit model
. 
. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -101.28644  
Iteration 2:  Log likelihood = -100.72617  
Iteration 3:  Log likelihood =   -100.724  
Iteration 4:  Log likelihood =   -100.724  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      Black  |   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      Other  |   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             |
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
------------------------------------------------------------------------------

. predict xbLogit, xb

. lpoly low xbLogit, degree(1) ci noscatter
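
A rough sketch, not part of the executed log, of how a likelihood could be built from such a nonparametric regression at one candidate parameter value (here the logit estimates). This only illustrates the idea and is not the sml implementation: it uses a local-constant fit from lpoly with its default kernel and bandwidth, without the leave-one-out and trimming steps of the actual estimator. Variable names ending in Sketch are hypothetical.

```stata
webuse lbw, clear
quietly logit low age lwt i.race smoke ptl ht ui
predict double xbSketch, xb

* nonparametric estimate of Pr(low = 1 | xb), evaluated at each observation
lpoly low xbSketch, degree(0) at(xbSketch) generate(pNpSketch) nograph

* keep fitted probabilities strictly inside (0,1) so that the logs are defined
replace pNpSketch = min(max(pNpSketch, 1e-6), 1 - 1e-6) if !missing(pNpSketch)

* Bernoulli log-likelihood contribution of each observation
generate double llSketch = low*ln(pNpSketch) + (1 - low)*ln(1 - pNpSketch)
quietly summarize llSketch
display "quasi log likelihood at the logit index: " r(sum)
```

The estimator would repeat this for different candidate values of b and pick the one with the highest value.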

. 
. * first install package
. * see https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800800203
. 
. net install st0144.pkg, from(http://www.stata-journal.com/software/sj8-2/)
checking st0144 consistency and verifying not already installed...
all files already exist and are up to date.

. 
. * now implement the estimator
. 
. tab race, gen(raceInd) // generate dummies

       Race |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |         96       50.79       50.79
      Black |         26       13.76       64.55
      Other |         67       35.45      100.00
------------+-----------------------------------
      Total |        189      100.00

. 
. sml low age lwt raceInd2 raceInd3 ptl ht ui, offset(smoke)


Iteration 0:  Log likelihood = -104.85961  (not concave)
Iteration 1:  Log likelihood = -104.35909  (not concave)
Iteration 2:  Log likelihood = -104.25205  (not concave)
Iteration 3:  Log likelihood = -103.63407  
Iteration 4:  Log likelihood =  -103.5144  
Iteration 5:  Log likelihood = -103.27984  
Iteration 6:  Log likelihood = -103.27601  
Iteration 7:  Log likelihood = -103.27599  
Iteration 8:  Log likelihood = -103.27599  

SML Estimator - Klein & Spady (1993)                    Number of obs =    189
                                                        Wald chi2(7)  =  23.50
Log likelihood = -103.27599                             Prob > chi2   = 0.0014

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0500978   .0299194    -1.67   0.094    -.1087387    .0085432
         lwt |  -.0102034   .0041902    -2.44   0.015     -.018416   -.0019908
    raceInd2 |   1.440672    .445042     3.24   0.001     .5684062    2.312939
    raceInd3 |    .732146    .283806     2.58   0.010     .1758964    1.288396
         ptl |   .3543276   .2167198     1.63   0.102    -.0704355    .7790907
          ht |   1.994203   .4731155     4.22   0.000     1.066913    2.921492
          ui |   .5342482   .3714303     1.44   0.150    -.1937419    1.262238
       smoke |          1  (offset)
------------------------------------------------------------------------------

. predict ttt, xb

. lpoly low ttt, noscatter ci

. 
. 
. 
. 
. ** now take a step back and look at case of only one covariate: age
. 
. * histogram of age
. hist age
(bin=13, start=14, width=2.3846154)

. 
. * make plots for probit model
. probit low age

Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -115.92724  
Iteration 2:  Log likelihood = -115.92448  
Iteration 3:  Log likelihood = -115.92448  

Probit regression                                       Number of obs =    189
                                                        LR chi2(1)    =   2.82
                                                        Prob > chi2   = 0.0929
Log likelihood = -115.92448                             Pseudo R2     = 0.0120

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0315483   .0190565    -1.66   0.098    -.0688985    .0058018
       _cons |   .2358955   .4461298     0.53   0.597    -.6385028    1.110294
------------------------------------------------------------------------------

. predict pHatSimpleCaseProbit, pr        // predicted probability

. predict xbSimpleCaseProbit, xb  // fitted value of xb

. scatter pHatSimpleCaseProbit xbSimpleCaseProbit // shows that the fitted probability is the standard normal c.d.f. evaluated at xb

. scatter pHatSimpleCaseProbit xbSimpleCaseProbit, msize(tiny) jitter(2)  // jitter so that we see the data points

. scatter pHatSimpleCaseProbit age        // fitted probability by age
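
A minimal sketch, not part of the executed log, that checks numerically what the first scatter plot shows: the fitted probit probability equals the standard normal c.d.f. evaluated at the fitted index. Variable names ending in Sketch are hypothetical, and the tolerance is an arbitrary choice.

```stata
webuse lbw, clear
quietly probit low age
predict double pSketch, pr
predict double xbSketch, xb

* assert stops with an error if the two differ for any observation
assert reldif(pSketch, normal(xbSketch)) < 1e-6
```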

. 
. * predicted probability for logit model
. logit low age

Iteration 0:  Log likelihood =   -117.336  
Iteration 1:  Log likelihood = -115.96259  
Iteration 2:  Log likelihood = -115.95598  
Iteration 3:  Log likelihood = -115.95598  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(1)    =   2.76
                                                        Prob > chi2   = 0.0966
Log likelihood = -115.95598                             Pseudo R2     = 0.0118

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0511529   .0315138    -1.62   0.105    -.1129188    .0106129
       _cons |   .3845819   .7321251     0.53   0.599    -1.050357    1.819521
------------------------------------------------------------------------------

. predict pHatSimpleCaseLogit, pr

. scatter pHatSimpleCaseLogit age // fitted probability by age

. 
. * linear probability model
. regress low age, vce(robust)

Linear regression                               Number of obs     =        189
                                                F(1, 187)         =       3.46
                                                Prob > F          =     0.0644
                                                R-squared         =     0.0141
                                                Root MSE          =     .46254

------------------------------------------------------------------------------
             |               Robust
         low | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0104291   .0056068    -1.86   0.064    -.0214897    .0006316
       _cons |   .5545211   .1392388     3.98   0.000     .2798403    .8292018
------------------------------------------------------------------------------

. predict pHatSimpleCaseLpm, xb   // note: now xb

. scatter pHatSimpleCaseLpm age   // fitted probability by age

. 
. * can always make the model more flexible by adding squared terms etc. to xb
. gen age2 = age^2

. regress low age age2    // shows that this is too flexible: estimates become imprecise, reason: near multicollinearity

      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(2, 186)       =      1.67
       Model |  .716128028         2  .358064014   Prob > F        =    0.1909
    Residual |  39.8658826       186  .214332702   R-squared       =    0.0176
-------------+----------------------------------   Adj R-squared   =    0.0071
       Total |  40.5820106       188  .215861758   Root MSE        =    .46296

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0239862   .0427544     0.56   0.575    -.0603596    .1083321
        age2 |  -.0006847   .0008411    -0.81   0.417    -.0023441    .0009747
       _cons |   .1436488   .5270788     0.27   0.786    -.8961725     1.18347
------------------------------------------------------------------------------

. predict pHatSimpleCaseLinQuad
(option xb assumed; fitted values)

. scatter pHatSimpleCaseLinQuad age
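
A minimal sketch, not part of the executed log, of the same quadratic specification written with factor-variable notation. The advantage is that margins then knows that age enters twice and reports the average derivative with respect to age; the joint test indicates whether the two terms together matter even though each is individually imprecise. Robust standard errors are an added assumption here.

```stata
webuse lbw, clear

* c.age##c.age expands to age and age squared
regress low c.age##c.age, vce(robust)

* joint test of the linear and the squared term
test age c.age#c.age

* average marginal effect of age, taking the squared term into account
margins, dydx(age)
```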

. 
. * nonparametric: take average by age
. bys age: egen pHatSimpleCaseAverage = mean(low)

. 
.         * number of observations and standard deviation by age
.         bys age: egen n_age = count(low)

.         bys age: egen sd_low_age = sd(low)
(2 missing values generated)

. 
.         * standard error of the age-specific mean
.         gen se_pHatSimpleCaseAverage = sd_low_age / sqrt(n_age)
(2 missing values generated)

. 
.         * 95% confidence interval
.         gen lb_pHatSimpleCaseAverage = pHatSimpleCaseAverage - ///
>                 1.96 * se_pHatSimpleCaseAverage
(2 missing values generated)

. 
.         gen ub_pHatSimpleCaseAverage = pHatSimpleCaseAverage + ///
>                 1.96 * se_pHatSimpleCaseAverage
(2 missing values generated)

. 
.         ** optional: keep CI inside [0,1]
.         *replace lb_pHatSimpleCaseAverage = 0 if lb_pHatSimpleCaseAverage < 0
.         *replace ub_pHatSimpleCaseAverage = 1 if ub_pHatSimpleCaseAverage > 1
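
A minimal sketch, not part of the executed log, of an alternative way to obtain the age-specific means and confidence intervals with collapse, which leaves one row per age. Variable names ending in Sketch are hypothetical. Note that collapse replaces the data in memory, so this would be run inside preserve/restore or on a fresh copy.

```stata
webuse lbw, clear

* one row per age: mean, standard deviation, and number of observations of low
collapse (mean) pBarSketch=low (sd) sdSketch=low (count) nSketch=low, by(age)

* standard error of the mean and normal-approximation 95% confidence interval
generate double seSketch = sdSketch / sqrt(nSketch)
generate double lbSketch = pBarSketch - invnormal(0.975) * seSketch
generate double ubSketch = pBarSketch + invnormal(0.975) * seSketch

* show only ages with at least 5 mothers
list age nSketch pBarSketch lbSketch ubSketch if nSketch >= 5, noobs
```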
. 
. * alternative to going fully nonparametric: nonparametric regression
. lpoly low age, degree(1) ci noscatter

. 
. * keep one observation per age for plotting
. preserve

. 
.         keep if n_age >= 5
(18 observations deleted)

. 
.         bys age: keep if _n == 1
(155 observations deleted)

.         sort age

.                         
.         * compare
.         twoway ///
>                 (rarea ub_pHatSimpleCaseAverage lb_pHatSimpleCaseAverage age, ///
>                         sort fcolor(gs12) fintensity(40) lcolor(gs12)) ///
>                 (scatter pHatSimpleCaseAverage age, ///
>                         msymbol(O) mcolor(red) msize(small)) ///
>                 (scatter pHatSimpleCaseProbit age, ///
>                         msymbol(X) mcolor(edkblue) msize(small)) ///
>                 (scatter pHatSimpleCaseLogit age, ///
>                         msymbol(Dh) mcolor(edkblue) msize(small)) ///
>                 (scatter pHatSimpleCaseLpm age, ///
>                         msymbol(Sh) mcolor(edkblue) msize(small)) ///
>                 (scatter pHatSimpleCaseLinQuad age, ///
>                         msymbol(Th) mcolor(edkblue) msize(small)), ///
>                 legend(position(2) ring(0) cols(2) ///
>                         order(2 "average data" ///
>                                   3 "probit" ///
>                                   4 "logit" ///
>                                   5 "linear probability model" ///
>                                   6 "linear probability model quadratic" ///
>                                   1 "95% CI")) ///
>                 yscale(range(-0.5 1.5)) ///
>                 ytitle("probability child has low birth weight") ///
>                 xtitle("age") ///
>                 note("plotted for ages with at least 5 mothers")

. 
. restore

.         
. 
.         
. 
. log close
      name:  <unnamed>
       log:  /Users/kleintob/Documents/GitHub/econometricsResearchMaster/notes for lectures in class/stata 1 binary choice.log
  log type:  text
 closed on:   9 May 2026, 10:05:14
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------