------------------------------------------------------------------------------
      name:
       log:  /Users/kleintob/Documents/GitHub/econometricsResearchMaster/notes for lectures in class/stata 1 binary choice.log
  log type:  text
 opened on:  9 May 2026, 10:04:33

.
.
.
.
. /*
> preface: 80% of all packages for papers published in AEA
> journals use Stata, see
> https://aeadataeditor.github.io/aea-supplement-migration/programs/aea201910-migration.html
>
> That's why we want to also use Stata in class.
>
> This file is about binary choice, i.e. about the probit, logit, and linear probability model,
> a few "nonparametric plots", and the Klein and Spady estimator
> */
.
.
.
.
. /*
> see e.g. "help logit" for the syntax (or "help regress" or "help probit";
> also "help logit postestimation", etc.)
> see also https://www.stata.com/manuals/rlogit.pdf
> */
.
.
.
.
. ** example taken from https://www.stata.com/manuals/rlogitpostestimation.pdf#rlogitpostestimation
.
. * load data
. webuse lbw, clear
(Hosmer & Lemeshow data)
.
.
.
.
. ** logit model
.
. * estimation
. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336
Iteration 1:  Log likelihood = -101.28644
Iteration 2:  Log likelihood = -100.72617
Iteration 3:  Log likelihood =   -100.724
Iteration 4:  Log likelihood =   -100.724

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      Black  |   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      Other  |   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             |
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
------------------------------------------------------------------------------

.
. * average marginal effects (see slide 26)
. margins, dydx(*)

Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0048315   .0064747    -0.75   0.456    -.0175217    .0078587
         lwt |  -.0027011   .0011835    -2.28   0.022    -.0050207   -.0003816
             |
        race |
      Black  |   .2326941   .0995698     2.34   0.019     .0375409    .4278473
      Other  |   .1511004   .0760619     1.99   0.047     .0020217     .300179
             |
       smoke |   .1646164   .0681744     2.41   0.016     .0309971    .2982358
         ptl |   .0966001   .0602536     1.60   0.109    -.0214948    .2146951
          ht |   .3267063   .1148706     2.84   0.004     .1015641    .5518485
          ui |   .1352299   .0797297     1.70   0.090    -.0210375    .2914972
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.
. * manually calculate average marginal effect for change in age
. * see slide 36
. predict pHat, pr

. gen avMarginalEffectAge = pHat*(1-pHat)*(-.0271003)

. sum avMarginalEffectAge

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
avMarginal~e |        189   -.0048315    .0016661  -.0067749  -.0007185

.
. * compare to marginal effect at average x
. margins, dydx(*) atmeans

Conditional marginal effects                               Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui
At: age    =  23.2381 (mean)
    lwt    = 129.8201 (mean)
    1.race = .5079365 (mean)
    2.race = .1375661 (mean)
    3.race = .3544974 (mean)
    smoke  = .3915344 (mean)
    ptl    = .1957672 (mean)
    ht     = .0634921 (mean)
    ui     = .1481481 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0054264   .0072688    -0.75   0.455     -.019673    .0088201
         lwt |  -.0030337   .0013669    -2.22   0.026    -.0057129   -.0003546
             |
        race |
      Black  |   .2643163     .11724     2.25   0.024     .0345301    .4941025
      Other  |   .1679337   .0863596     1.94   0.052    -.0013279    .3371953
             |
       smoke |   .1848858   .0790024     2.34   0.019      .030044    .3397276
         ptl |   .1084946   .0695517     1.56   0.119    -.0278243    .2448134
          ht |   .3669339   .1380704     2.66   0.008      .096321    .6375469
          ui |   .1518808   .0919254     1.65   0.098    -.0282897    .3320514
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.
.
.
.
. ** probit model
.
. * estimation
. probit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336
Iteration 1:  Log likelihood = -100.73961
Iteration 2:  Log likelihood = -100.56113
Iteration 3:  Log likelihood = -100.56095
Iteration 4:  Log likelihood = -100.56095

Probit regression                                       Number of obs =    189
                                                        LR chi2(8)    =  33.55
                                                        Prob > chi2   = 0.0000
Log likelihood = -100.56095                             Pseudo R2     = 0.1430

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0175445   .0216292    -0.81   0.417     -.059937    .0248481
         lwt |  -.0088205   .0039734    -2.22   0.026    -.0166082   -.0010327
             |
        race |
      Black  |   .7475256   .3166405     2.36   0.018     .1269216     1.36813
      Other  |   .5144711   .2555824     2.01   0.044     .0135389    1.015403
             |
       smoke |   .5627601   .2357783     2.39   0.017     .1006432    1.024877
         ptl |   .3178267   .2001253     1.59   0.112    -.0744117     .710065
          ht |   1.099451   .4192793     2.62   0.009     .2776784    1.921223
          ui |   .4627944   .2756093     1.68   0.093    -.0773899    1.002979
       _cons |   .2682753   .7015254     0.38   0.702    -1.106689     1.64324
------------------------------------------------------------------------------

.
. * average marginal effects similar to the ones of logit model
. margins, dydx(*)

Average marginal effects                                   Number of obs = 189
Model VCE: OIM

Expression: Pr(low), predict()
dy/dx wrt:  age lwt 2.race 3.race smoke ptl ht ui

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0052632    .006454    -0.82   0.415    -.0179129    .0073865
         lwt |  -.0026461   .0011521    -2.30   0.022    -.0049041    -.000388
             |
        race |
      Black  |   .2314958   .1012071     2.29   0.022     .0331334    .4298581
      Other  |   .1526453   .0752693     2.03   0.043     .0051203    .3001704
             |
       smoke |   .1688236   .0676518     2.50   0.013     .0362285    .3014186
         ptl |   .0953455    .058933     1.62   0.106     -.020161     .210852
          ht |   .3298265   .1190813     2.77   0.006     .0964315    .5632216
          ui |   .1388347   .0808244     1.72   0.086    -.0195782    .2972475
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.

.
.
.
.
. ** linear probability model
.
. * here, we directly estimate marginal effects (see slide 37)
.
. * estimate with robust standard errors (see slide 37)
.
. regress low age lwt i.race smoke ptl ht ui, vce(robust)

Linear regression                               Number of obs     =        189
                                                F(8, 180)         =       5.97
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1635
                                                Root MSE          =     .43427

------------------------------------------------------------------------------
             |               Robust
         low | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0034688   .0057272    -0.61   0.545    -.0147699    .0078323
         lwt |  -.0025213   .0010885    -2.32   0.022    -.0046691   -.0003735
             |
        race |
      Black  |   .2214043   .1035048     2.14   0.034     .0171654    .4256431
      Other  |   .1436247   .0736272     1.95   0.053    -.0016589    .2889082
             |
       smoke |   .1595568   .0696973     2.29   0.023      .022028    .2970857
         ptl |   .1153871   .0869949     1.33   0.186    -.0562739     .287048
          ht |   .3635326   .1360481     2.67   0.008     .0950783     .631987
          ui |   .1560515   .1078706     1.45   0.150     -.056802     .368905
       _cons |   .5074597   .2010367     2.52   0.012     .1107678    .9041516
------------------------------------------------------------------------------

.
. * compare to usual standard errors
. * (these are not correct because of heteroskedasticity, see slide 28;
. * note: the point estimates are the same)
. regress low age lwt i.race smoke ptl ht ui

      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(8, 180)       =      4.40
       Model |    6.636136         8     .829517   Prob > F        =    0.0001
    Residual |  33.9458746       180  .188588192   R-squared       =    0.1635
-------------+----------------------------------   Adj R-squared   =    0.1263
       Total |  40.5820106       188  .215861758   Root MSE        =    .43427

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0034688   .0063194    -0.55   0.584    -.0159384    .0090008
         lwt |  -.0025213   .0011532    -2.19   0.030    -.0047969   -.0002457
             |
        race |
      Black  |   .2214043   .1001543     2.21   0.028     .0237767    .4190318
      Other  |   .1436247   .0765303     1.88   0.062    -.0073873    .2946366
             |
       smoke |   .1595568   .0710842     2.24   0.026     .0192912    .2998224
         ptl |   .1153871     .06806     1.70   0.092     -.018911    .2496852
          ht |   .3635326    .134455     2.70   0.008     .0982219    .6288433
          ui |   .1560515   .0927102     1.68   0.094    -.0268872    .3389902
       _cons |   .5074597   .2085242     2.43   0.016     .0959933    .9189261
------------------------------------------------------------------------------

.
.
.
.
. ** Klein and Spady estimator
.
. * recall that the estimator performs a nonparametric regression
. * of low on xb for a candidate parameter value b and then constructs
. * the likelihood function
. * to illustrate: nonparametric regression for the parameter values from the logit model
.
. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  Log likelihood =   -117.336
Iteration 1:  Log likelihood = -101.28644
Iteration 2:  Log likelihood = -100.72617
Iteration 3:  Log likelihood =   -100.724
Iteration 4:  Log likelihood =   -100.724

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt |  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             |
        race |
      Black  |   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      Other  |   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             |
       smoke |   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl |   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht |   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui |   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons |   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
------------------------------------------------------------------------------

. predict xbLogit, xb

. lpoly low xbLogit, degree(1) ci noscatter

.
. * first install package
. * see https://journals.sagepub.com/doi/pdf/10.1177/1536867X0800800203
.
. net install st0144.pkg, from(http://www.stata-journal.com/software/sj8-2/)
checking st0144 consistency and verifying not already installed...
all files already exist and are up to date.

.
. * now implement the estimator
.
. tab race, gen(raceInd) // generate dummies

       Race |      Freq.     Percent        Cum.
------------+-----------------------------------
      White |         96       50.79       50.79
      Black |         26       13.76       64.55
      Other |         67       35.45      100.00
------------+-----------------------------------
      Total |        189      100.00

.
.
. sml low age lwt raceInd2 raceInd3 ptl ht ui, offset(smoke)

Iteration 0:  Log likelihood = -104.85961  (not concave)
Iteration 1:  Log likelihood = -104.35909  (not concave)
Iteration 2:  Log likelihood = -104.25205  (not concave)
Iteration 3:  Log likelihood = -103.63407
Iteration 4:  Log likelihood =  -103.5144
Iteration 5:  Log likelihood = -103.27984
Iteration 6:  Log likelihood = -103.27601
Iteration 7:  Log likelihood = -103.27599
Iteration 8:  Log likelihood = -103.27599

SML Estimator - Klein & Spady (1993)              Number of obs   =        189
                                                  Wald chi2(7)    =      23.50
Log likelihood = -103.27599                       Prob > chi2     =     0.0014

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0500978   .0299194    -1.67   0.094    -.1087387    .0085432
         lwt |  -.0102034   .0041902    -2.44   0.015     -.018416   -.0019908
    raceInd2 |   1.440672    .445042     3.24   0.001     .5684062    2.312939
    raceInd3 |    .732146    .283806     2.58   0.010     .1758964    1.288396
         ptl |   .3543276   .2167198     1.63   0.102    -.0704355    .7790907
          ht |   1.994203   .4731155     4.22   0.000     1.066913    2.921492
          ui |   .5342482   .3714303     1.44   0.150    -.1937419    1.262238
       smoke |          1  (offset)
------------------------------------------------------------------------------

. predict ttt, xb

. lpoly low ttt, noscatter ci

.
.
.
.
. ** now take a step back and look at case of only one covariate: age
.
. * histogram of age
. hist age
(bin=13, start=14, width=2.3846154)

.
. * make plots for probit model
. probit low age

Iteration 0:  Log likelihood =   -117.336
Iteration 1:  Log likelihood = -115.92724
Iteration 2:  Log likelihood = -115.92448
Iteration 3:  Log likelihood = -115.92448

Probit regression                                       Number of obs =    189
                                                        LR chi2(1)    =   2.82
                                                        Prob > chi2   = 0.0929
Log likelihood = -115.92448                             Pseudo R2     = 0.0120

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0315483   .0190565    -1.66   0.098    -.0688985    .0058018
       _cons |   .2358955   .4461298     0.53   0.597    -.6385028    1.110294
------------------------------------------------------------------------------

. predict pHatSimpleCaseProbit, pr // predicted probability

. predict xbSimpleCaseProbit, xb // fitted value of xb

. scatter pHatSimpleCaseProbit xbSimpleCaseProbit // shows that fitted probability is c.d.f. at xb

. scatter pHatSimpleCaseProbit xbSimpleCaseProbit, msize(tiny) jitter(2) // jitter so that we see the data points

. scatter pHatSimpleCaseProbit age // fitted probability by age

.
. * predicted probability for logit model
. logit low age

Iteration 0:  Log likelihood =   -117.336
Iteration 1:  Log likelihood = -115.96259
Iteration 2:  Log likelihood = -115.95598
Iteration 3:  Log likelihood = -115.95598

Logistic regression                                     Number of obs =    189
                                                        LR chi2(1)    =   2.76
                                                        Prob > chi2   = 0.0966
Log likelihood = -115.95598                             Pseudo R2     = 0.0118

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0511529   .0315138    -1.62   0.105    -.1129188    .0106129
       _cons |   .3845819   .7321251     0.53   0.599    -1.050357    1.819521
------------------------------------------------------------------------------

. predict pHatSimpleCaseLogit, pr

. scatter pHatSimpleCaseLogit age // fitted probability by age

.
. * linear probability model
. regress low age, vce(robust)

Linear regression                               Number of obs     =        189
                                                F(1, 187)         =       3.46
                                                Prob > F          =     0.0644
                                                R-squared         =     0.0141
                                                Root MSE          =     .46254

------------------------------------------------------------------------------
             |               Robust
         low | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |  -.0104291   .0056068    -1.86   0.064    -.0214897    .0006316
       _cons |   .5545211   .1392388     3.98   0.000     .2798403    .8292018
------------------------------------------------------------------------------

. predict pHatSimpleCaseLpm, xb // note: now xb

. scatter pHatSimpleCaseLpm age // fitted probability by age

.
. * can always make model more flexible by putting square terms etc. into xb
. gen age2 = age^2

. regress low age age2 // shows that this is too flexible: estimates become imprecise, reason: close multicollinearity

      Source |       SS           df       MS      Number of obs   =       189
-------------+----------------------------------   F(2, 186)       =      1.67
       Model |  .716128028         2  .358064014   Prob > F        =    0.1909
    Residual |  39.8658826       186  .214332702   R-squared       =    0.0176
-------------+----------------------------------   Adj R-squared   =    0.0071
       Total |  40.5820106       188  .215861758   Root MSE        =    .46296

------------------------------------------------------------------------------
         low | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0239862   .0427544     0.56   0.575    -.0603596    .1083321
        age2 |  -.0006847   .0008411    -0.81   0.417    -.0023441    .0009747
       _cons |   .1436488   .5270788     0.27   0.786    -.8961725     1.18347
------------------------------------------------------------------------------

. predict pHatSimpleCaseLinQuad
(option xb assumed; fitted values)

. scatter pHatSimpleCaseLinQuad age

.
. * nonparametric: take average by age
. bys age: egen pHatSimpleCaseAverage = mean(low)

.
. * number of observations and standard deviation by age
. bys age: egen n_age = count(low)

. bys age: egen sd_low_age = sd(low)
(2 missing values generated)

.
. * standard error of the age-specific mean
. gen se_pHatSimpleCaseAverage = sd_low_age / sqrt(n_age)
(2 missing values generated)

.
. * 95% confidence interval
.
. gen lb_pHatSimpleCaseAverage = pHatSimpleCaseAverage - ///
>     1.96 * se_pHatSimpleCaseAverage
(2 missing values generated)

.
. gen ub_pHatSimpleCaseAverage = pHatSimpleCaseAverage + ///
>     1.96 * se_pHatSimpleCaseAverage
(2 missing values generated)

.
. ** optional: keep CI inside [0,1]
. *replace lb_pHatSimpleCaseAverage = 0 if lb_pHatSimpleCaseAverage < 0
. *replace ub_pHatSimpleCaseAverage = 1 if ub_pHatSimpleCaseAverage > 1
.
. * alternative to going fully nonparametric: nonparametric regression
. lpoly low age, degree(1) ci noscatter

.
. * keep one observation per age for plotting
. preserve

.
. keep if n_age >= 5
(18 observations deleted)

.
. bys age: keep if _n == 1
(155 observations deleted)

. sort age

.
. * compare
. twoway ///
>     (rarea ub_pHatSimpleCaseAverage lb_pHatSimpleCaseAverage age, ///
>         sort fcolor(gs12) fintensity(40) lcolor(gs12)) ///
>     (scatter pHatSimpleCaseAverage age, ///
>         msymbol(O) mcolor(red) msize(small)) ///
>     (scatter pHatSimpleCaseProbit age, ///
>         msymbol(X) mcolor(edkblue) msize(small)) ///
>     (scatter pHatSimpleCaseLogit age, ///
>         msymbol(Dh) mcolor(edkblue) msize(small)) ///
>     (scatter pHatSimpleCaseLpm age, ///
>         msymbol(Sh) mcolor(edkblue) msize(small)) ///
>     (scatter pHatSimpleCaseLinQuad age, ///
>         msymbol(Th) mcolor(edkblue) msize(small)), ///
>     legend(position(2) ring(0) cols(2) ///
>         order(2 "average data" ///
>               3 "probit" ///
>               4 "logit" ///
>               5 "linear probability model" ///
>               6 "linear probability model quadratic" ///
>               1 "95% CI")) ///
>     yscale(range(-0.5 1.5)) ///
>     ytitle("probability child has low birth weight") ///
>     xtitle("age") ///
>     note("plotted for ages with at least 5 mothers")

.
. restore

.
.
.
.
.
. log close
      name:
       log:  /Users/kleintob/Documents/GitHub/econometricsResearchMaster/notes for lectures in class/stata 1 binary choice.log
  log type:  text
 closed on:  9 May 2026, 10:05:14
------------------------------------------------------------------------------
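
The manual calculation in the logit section of this log (`gen avMarginalEffectAge = pHat*(1-pHat)*(-.0271003)`) can be written out as a short derivation. This is only a sketch in standard notation: the symbols $\Lambda$, $\hat p_i$, $\bar x$ and $n$ are introduced here for illustration and do not appear in the log itself.

```latex
\begin{align*}
\Pr(\text{low}_i = 1 \mid x_i)
  &= \Lambda(x_i'\beta) = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)}, \\
\frac{\partial \Pr(\text{low}_i = 1 \mid x_i)}{\partial\, \text{age}_i}
  &= \Lambda(x_i'\beta)\,\bigl(1 - \Lambda(x_i'\beta)\bigr)\,\beta_{\text{age}}, \\
\widehat{\text{AME}}_{\text{age}}
  &= \frac{1}{n}\sum_{i=1}^{n} \hat p_i\,(1 - \hat p_i)\,\hat\beta_{\text{age}},
  \qquad \hat p_i = \Lambda(x_i'\hat\beta), \\
\widehat{\text{ME}}_{\text{age}}(\bar x)
  &= \Lambda(\bar x'\hat\beta)\,\bigl(1 - \Lambda(\bar x'\hat\beta)\bigr)\,\hat\beta_{\text{age}}.
\end{align*}
```

With $\hat\beta_{\text{age}} = -.0271003$ and $n = 189$, the mean reported by `sum avMarginalEffectAge` is $-.0048315$, the same value `margins, dydx(*)` reports for age. The last line corresponds to `margins, dydx(*) atmeans`, which evaluates the derivative once at the sample means instead of averaging it over observations; that is why it gives a different number ($-.0054264$).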