
lost-stats.github.io's Introduction

LOST

This is the official repo for the Library of Statistical Techniques (LOST) website: https://lost-stats.github.io/

LOST is a publicly-editable website with the goal of making it easy to execute statistical techniques in statistical software.

Building locally

If you are interested in local development, we use Ruby 2.6.4. Once you have that installed, you can run

bundle install
bundle exec jekyll serve

If you'd like to check for broken links, you can run

bundle exec jekyll build
bundle exec htmlproofer --assume-extension --allow-hash-href ./_site

Testing code samples

We have some facilities for testing to make sure that all the code samples in this repository work, at least for the open source languages. You will need a few extra requirements for this section.

CAVEAT EMPTOR: The following commands run arbitrary code samples from this repository on your machine. They are run inside isolated Docker containers, but currently those containers have no ulimits. Thus, it is possible that they could, e.g., download giant files and slow your machine to a crawl.

Requirements

You will first need to install Docker. You will also need Python 3.8 or above. After this, run the following commands:

python3 -m venv venv
source venv/bin/activate
pip install 'mistune==2.0.0rc1' 'pytest==6.1.1' 'pytest-xdist==2.1.0'

docker pull ghcr.io/lost-stats/lost-docker-images/tester-r
docker pull ghcr.io/lost-stats/lost-docker-images/tester-python

At this point, the docker images will not be updated unless you explicitly repull them.

Running tests

After completing the setup, you can simply run

source venv/bin/activate
py.test test_samples.py

Note that this will take a long time to run. You can reduce the set of tests run using the --mdpath option. For instance, to find and run all the code samples in the Time_Series and Presentation folders, you can run

py.test test_samples.py --mdpath Time_Series --mdpath Presentation

Furthermore, you can run tests in parallel by adding the -n parameter:

py.test test_samples.py -n 3 --mdpath Time_Series

Adding dependencies

The docker images in which these tests run are managed in the lost-docker-images repo. See the instructions there for adding dependencies. Afterwards, you will need to refresh your docker images with:

docker pull ghcr.io/lost-stats/lost-docker-images/tester-r
docker pull ghcr.io/lost-stats/lost-docker-images/tester-python

Connecting code samples

Note that a lot of code samples in this repository are broken up by raw markdown text. If you would like to connect these into a single runtime, you should specify the language as language?example=some_id for each code sample in the chain. For instance, a Python example might be specified as python?example=seaborn, as you can see in the Line Graphs example. A minimal sketch of the pattern is shown below.
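Here is a minimal sketch of what this looks like in the Markdown source (the block contents are illustrative and not taken from the Line Graphs page):

```python?example=seaborn
import seaborn as sns

# First block: load a dataset
df = sns.load_dataset("penguins")
```

Some explanatory text can sit between the blocks.

```python?example=seaborn
# Second block: because it shares the example id, it sees `df` from above
sns.lineplot(data=df, x="bill_length_mm", y="body_mass_g")
```

The test runner stitches together blocks that share an example id and executes them as one script.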

lost-stats.github.io's People

Contributors

aeturrell, aidan-bergsman, avilieber, brockmwilson, clibassi, connortwiegand, dependabot[bot], emmettsaulnier, ethanholdahl, feiyishao, friosavila, gabriel-fallen, gionikola, grantmcdermott, jholste, jonasbowman, khwilson, kim12100, lvought, matt-mccoy, mcwayrm, mwagnon, nickch-k, pmarsceill, pramod-dudhe, rtol, tbishop2git, tjschechter, zengyous, zuzhangjin


lost-stats.github.io's Issues

Build error

I made a few commits and kept getting the following error when it tries to push to Pages:

Error: Unable to process command '::set-env name=DEPLOYMENT_STATUS::success' successfully.
Error: The `set-env` command is disabled. Please upgrade to using Environment Files or opt into unsecure command execution by setting the `ACTIONS_ALLOW_UNSECURE_COMMANDS` environment variable to `true`. For more information see: https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/

Any idea how to fix this?

Add content on best practice for reproducibility?

Given the Twitter conversation that some of us participated in today, and the need for reproducibility in projects, should we add a section/page on best practices for reproducibility, or would that be out of scope? Personally, I think it could be really useful for site users, but I'd like to hear everyone's thoughts.

I guess we might want to cover:

  • reproducing a set of packages (e.g. pip freeze > requirements.txt or conda env export --from-history -f environment.yml, and the R equivalents)

  • reproducing the execution sequence of a series of scripts and commands (e.g. using, and perhaps drawing, a makefile, or using a tool like ploomber)

  • reproducing the operating system (Docker)

  • anything else?

We could draw on the relevant sections of The Turing Way if necessary.

Event study in Stata

First of all, thank you for this wonderful resource!

I am confused by the Stata event study code, and think it might not be totally correct. For reference, here it is

use "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta", clear

* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between `treat` and `time_to_treat` to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)
* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)

* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)

* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)

My problem stems from the line

replace time_to_treat = 0 if missing(_nfd)

This means that states which are never treated are given 0, which codes them as if they were treated in that year. This gives the following tabulation:


time_to_treat |  Freq.  Percent     Cum.
--------------+--------------------------
          -21 |      1     0.06     0.06
          -20 |      2     0.12     0.19
          -19 |      2     0.12     0.31
          -18 |      2     0.12     0.43
          -17 |      2     0.12     0.56
          -16 |      3     0.19     0.74
          -15 |      3     0.19     0.93
          -14 |      3     0.19     1.11
          -13 |      6     0.37     1.48
          -12 |      7     0.43     1.92
          -11 |      9     0.56     2.47
          -10 |     12     0.74     3.22
           -9 |     22     1.36     4.58
           -8 |     25     1.55     6.12
           -7 |     32     1.98     8.10
           -6 |     34     2.10    10.20
           -5 |     36     2.23    12.43
           -4 |     36     2.23    14.66
           -3 |     36     2.23    16.88
           -2 |     36     2.23    19.11
           -1 |     36     2.23    21.34
            0 |    465    28.76    50.09
            1 |     36     2.23    52.32
            2 |     36     2.23    54.55
            3 |     36     2.23    56.77
            4 |     36     2.23    59.00
            5 |     36     2.23    61.22
            6 |     36     2.23    63.45
            7 |     36     2.23    65.68
            8 |     36     2.23    67.90
            9 |     36     2.23    70.13
           10 |     36     2.23    72.36
           11 |     36     2.23    74.58
           12 |     35     2.16    76.75
           13 |     34     2.10    78.85
           14 |     34     2.10    80.95
           15 |     34     2.10    83.06
           16 |     34     2.10    85.16
           17 |     33     2.04    87.20
           18 |     33     2.04    89.24
           19 |     33     2.04    91.28
           20 |     30     1.86    93.14
           21 |     29     1.79    94.93
           22 |     27     1.67    96.60
           23 |     24     1.48    98.08
           24 |     14     0.87    98.95
           25 |     11     0.68    99.63
           26 |      4     0.25    99.88
           27 |      2     0.12   100.00
--------------+--------------------------
        Total |  1,617   100.00

It's possible that, because time_to_treat does not vary across years for control units, the state (stfips) fixed effects "take care" of this. But I can't intuitively reason about what's really happening, given that 0 stands both for untreated states and for treated states in year 0.

I would recommend making the time_to_treat variable 100, or the maximum plus 100, for untreated states to avoid this confusion. The values don't matter, since they are used as fixed effects anyway.

Added as a contributor

Hello, I would like to be added as a contributor to the LOST repo. I want to add a page on color palettes, which I found in the nonexistent pages tab, and I think it would be a really useful resource for others.

data.table variants

Thinking of going through at least the data manipulation pages and adding data.table versions of all the R examples. Is this different enough to be worth it?

Event study in Stata

First, thanks a lot for putting this out for learners like us. I was wondering if I can use the following for repeated cross-sectional data as well?

Code copied and pasted from the diff-in-diff event study section:

use "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta", clear

* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between treat and time_to_treat to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)

* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)

* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)

* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)

* Pull out the coefficients and SEs
g coef = .
g se = .
levelsof shifted_ttt, l(times)
foreach t in `times' {
    replace coef = _b[`t'.shifted_ttt] if shifted_ttt == `t'
    replace se = _se[`t'.shifted_ttt] if shifted_ttt == `t'
}

* Make confidence intervals
g ci_top = coef + 1.96*se
g ci_bottom = coef - 1.96*se

* Limit ourselves to one observation per quarter
* now switch back to time_to_treat to get original timing
keep time_to_treat coef se ci_*
duplicates drop

sort time_to_treat

* Create connected scatterplot of coefficients
* with CIs included with rcap
* and a line at 0 both horizontally and vertically
summ ci_top
local top_range = r(max)
summ ci_bottom
local bottom_range = r(min)

twoway (sc coef time_to_treat, connect(line)) ///
    (rcap ci_top ci_bottom time_to_treat) ///
    (function y = 0, range(time_to_treat)) ///
    (function y = 0, range(`bottom_range' `top_range') horiz), ///
    xtitle("Time to Treatment") caption("95% Confidence Intervals Shown")

nokogiri dependency

I keep getting warnings that there is a security hole in the nokogiri dependency. Is this something to be worried about (or fixed), @khwilson?

Broken links

Didn't realize this wasn't an open issue. Just finished fixing all the broken links (except for the ones that don't lead to existing pages of course). Opening this issue to close it.

Page not updating?

The master branch does not appear to be auto-updating. I made an update to source about six hours ago, and so far there hasn't been an update in master or on the website. Does something need to be turned back on, @khwilson?

Post contribution warning and possible mistake changing new page template

Hello,
I am happy to have contributed to LOST. However, after I edited the new page template into the new Tobit.md, I realized I had proposed erasing the new page template and replacing it with my contribution. Also, I got this message: "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository." I'm not exactly sure how to get the page I wrote into the correct place as a new page. Help please.

Add a test script?

Many of the examples here are broken. For example, in the balance test page, the guide instructs us to run

bal.text(treat ~ foreign, data = mtcars)

despite the fact that neither treat nor foreign exists in the mtcars dataset. Looking through the codebase, there doesn't seem to be any automated testing for this repository. Is this something on the radar?

Adding python code for fixed effect and data.table for reshaping in R

Hi,

I made a few edits in Data Manipulation by adding data.table code for reshaping (both long-to-wide and wide-to-long), and also in OLS by adding Python code for fixed effects regression.

Would you please add me as a contributor so that I can push my changes to the repo and make a pull request? Currently, when I push, it consistently says "The requested URL returned error: 403".

Best,
Wensong

access to the repo

Hi,
I have created a "faceted graphs" R markdown document, but I am not able to create a pull request to submit it. Please give me access to the LOST-STATS repo so that I can push my work. My git handle is pramod-dudhe.

Thanks,
Pramod

Encourage use of same datasets across implementations

In looking over several examples, I've come around to the idea that we should strongly encourage re-use of the same datasets across language implementations (i.e. for a specific page). Advantages include:

  • Common input and output for direct comparisons.
  • Avoids duplicate typing up of the task (we can write that once under the implementation header) and unnecessary in-code commenting (which, frankly, I think we have too much of at the moment and personally find quite distracting).

I recently did this for the Collapse a Dataset page and, personally, think it's a lot easier to read and compare the code examples now. @vincentarelbundock's Rdatasets is a very useful resource in this regard, since it provides a ton of datasets that can be directly read in as CSVs. (Both Julia and Python statsmodels have nice wrappers for it too; see the sketch below.)
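As a minimal sketch of the Python side (using statsmodels' get_rdataset wrapper; the dataset choice is just an illustration):

```python
import statsmodels.api as sm

# Downloads the CSV for R's built-in mtcars dataset from the Rdatasets
# repository and returns it wrapped with its documentation
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data  # pandas DataFrame
print(mtcars.head())
```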

Question: Do others agree? If so, I'll add some text to the Contributing page laying this out.

PS. I also think we should discourage the use of really large files, especially since this is going to start becoming a drag on our GitHub Actions builds. There is one big offender here that I'll try to swap out when I get a sec. (Sorry, that's my student and I should have warned her about it.)

Stylistic (documentation) changes

@NickCH-K and I have discussed some of these over Twitter, but I'm going to jot down a laundry list of stylistic elements that I feel could be improved/changed. These are pretty opinionated, so I'd welcome thoughts from others.

  • Package names: Following JOSS, the R Journal and others, package names in the text should be in bold rather than code. If possible, the package homepage should also be hyperlinked when it is first mentioned. i.e. "...using the lfe package (link) we can...", rather than "...using the lfe package we can...". Code should be reserved for, well, code. Similarly, we can revert to code when referring to a specific function from that package. (E.g. "Use the lfe::felm() function..." is fine.)
  • Use of grandchild pages: We have support for grandchild pages and I think we should use them, or else individual topics are going to become pretty unwieldy. Relatedly, the default alphabetical ordering means that closely related topics might not be indexed in an optimal way (e.g. logit and probit models). Another example might be non-standard errors: HC, HAC, and clustered SEs probably all deserve their own pages. A three-level page hierarchy of Model estimation -> Non-standard errors -> HC (and HAC and Clustered SEs) makes sense to me.
  • Implementation position: Should we move the "Implementation" sections higher up on the page? I feel like this is the main reason people are going to be using LOST, and at present they first have to scroll down past the "Keep in Mind" and "Also consider" sections before hitting pay dirt. Maybe these two sections could be moved below "Implementations" instead?
  • Section TOC: Relatedly...This one is potentially more complicated, but should we have a section TOC at the top of every page? I think this could be automated with git-hooks and doctoc. But I'll let @khwilson weigh in here.

Pushed color palette image files to wrong folder

Hi all, I recently uploaded a page on color palettes and pushed the image files to the figures subdirectory instead of the images subdirectory within the figures subdirectory. Any idea on how I can move them to the proper location so that the graphs can show up on the LOST page?

Add GRETL Lexer

Per #3, we would like to support syntax highlighting for all the relevant languages here. https://github.com/rouge-ruby/rouge doesn't have a Stata or GRETL lexer. I've written a Stata one, but don't want to fix up all the test issues to get it through Rouge's process, so for now it lives in the _plugins directory.

If you'd like to add a GRETL lexer, feel free to follow the same process as for the Stata lexer. :)

Better syntax highlighting

This is fairly minor as issues go, but the code syntax highlighting on the site is subtle and, in some examples, looks like plain text. I wondered if you'd be interested in changing the settings in a way that produces more contrast between different code elements.

I think the site uses rouge for syntax highlighting but it seems like the rouge default is more colourful than what is being currently displayed on the site.

It seems like rouge settings are configured in _config.yml but I'm not sure how one would change them to get the rouge default syntax highlighting to display instead of the current setup.

New name for the Generalized Least Squares category

Having just added quantile regression to the GLS category, I'm realizing that the category name doesn't really work that well. Quantile regression is not a least-squares method but there's not really another good place for it to go. Some of the other pages in that category are pretty tenuous too.

I think the idea behind the GLS category is linear-index models that aren't just extensions of OLS. The name that would probably make the most sense is "generalized linear models", or at least it would if that weren't already a specific term that doesn't actually apply to all of these methods.

"Other Linear Models"?
"Linear-in-Predictors Models" (ew)
"Other Regression Models"?
Drop the category and merge it back with OLS, retitling the whole thing "Regression Models"?
Do the HLM thing and call them Fixed Effects Models, thereby confusing everybody? :P

I'm drawing a blank for anything better than that. Any ideas? Updating every page that links there will be a bit of a chore, so I'm hoping to get it right!

Multilevel Models

I couldn't find any mention of general multilevel models on the Desired Nonexistent Pages page.
Did I miss the right page, or is there a specific reason against providing a basic explanation of multilevel models?

Site Down

It looks like the recent GitHub changes have made the site kick the can. Working on getting it back up, but just FYI, this may cause some more downtime. :-/

Code samples which don't pass tests

There are a few code samples which don't quite work. For Python, they are here. For R they are here.

Likely reasons for breaking

Split blocks

Most likely, this is because code samples are split by text. For example, if the Markdown looked like this:

Set _x_ to 1.

```r
x <- 1
```

Then add 1 to it.

```r
x <- x + 1
```

In this case, x is not defined in the second code block. To fix this, you can name the code block, e.g.,

Set _x_ to 1.

```r?example=simple
x <- 1
```

Then add 1 to it.

```r?example=simple
x <- x + 1
```

The code tester will gather each of these code blocks together and run them sequentially.
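For intuition, here is a minimal sketch of how such a gatherer could work; this is an illustration, not the repository's actual test_samples.py:

```python
import re
from collections import defaultdict

# Matches fenced blocks whose info string carries an example id,
# e.g. ```r?example=simple ... ```
FENCE = re.compile(r"```(\w+)\?example=(\w+)\n(.*?)```", re.DOTALL)

def gather_examples(markdown_text):
    """Group fenced code blocks by (language, example id)."""
    groups = defaultdict(list)
    for lang, example_id, body in FENCE.findall(markdown_text):
        groups[(lang, example_id)].append(body)
    # Blocks are concatenated in document order, so later blocks can
    # rely on state created by earlier ones
    return {key: "\n".join(bodies) for key, bodies in groups.items()}
```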

Examples that aren't meant to work

There are a few cases where the example isn't literally meant to work. For example,

```python
import pandas as pd

df = pd.read_csv("name_of_file")
```

In this case, "name_of_file" is obviously a placeholder for the path to the file. But the code doesn't literally work, so it will cause a failure.

As a solution, you can indicate you want the system to skip it:

```python?skip=true&skipReason=file_does_not_exist
import pandas as pd

df = pd.read_csv("name_of_file")
```

Note that we added ?skip=true, which tells the test runner to ignore the test. We also added &skipReason=file_does_not_exist, which is just an optional explanation for why we are skipping the test.

Rerunning the tests

If you are fixing these locally, you can (provided you have Python, Poetry, and Docker installed) run the tests with:

poetry install
poetry run py.test -n 4

Alternatively, you can rerun the tests on GitHub by:

  • Clicking the Actions tab
  • Choosing Run monthly [python|r] tests
  • Finding the Run workflow button
  • And then clicking Run workflow

Standard Environments

I think it would be great if all the examples here assumed standard environments and we added a page about it. It would also allow #6 to actually happen.

E.g., for R, you should install.packages(c('tidyverse', 'MASS')) or whatnot.

Thoughts?

Did I successfully suggest changes?

I think maybe I am messing up how I am suggesting edits to pages, but I'm not sure. I know I'm a contributor, so I don't think I'm supposed to be doing PRs, but I have been submitting my edits by just clicking the green "Propose changes" button at the bottom of each page. Should I be doing anything else? Over the last few days I made some modifications to the synthetic control page, the density plots page, and the support vector machines page.

Create page on package creation

Hi all,

This might be tangential to #73, but I was thinking it would be useful to have a page on basic package creation and/or namespace management.

I'd be willing to write brief R and Julia tutorials. Also, as a learner I would love to see best practices for Python!

My apologies if this already exists and I'm failing to notice it :~)

Add to KNN page - R walkthrough

R

The simplest way to perform KNN in R is with the class package. It has a knn() function that is rather user friendly and does not require you to compute distances yourself, as it runs everything with Euclidean distance. For more advanced kinds of nearest-neighbors testing, it would be best to use the matchit() function from the MatchIt package. To verify results, this example also uses the confusionMatrix() function from the caret package.
Given how this package is designed, the easiest mistake to make is during normalization: normalizing variables, such as character columns, that should not be normalized. Another common source of error is not including drop = TRUE for your target (y) vector, which will prevent the model from running. Finally, because of the way this example verifies results, it is vital to convert the target into a factor, as the data have to be of the same kind for R to give you an output.


library(tidyverse)
library(readr)

#For KNN
library(class)
library(caret)


#Import the Dataset
df <- read_csv("wdbc.csv")
view(df)

#the first column is an identifier so remove that, anything that does not aid in classifying can be removed
df <- df[-1]


#See the count of the target, either B, benign, or M, malignant
table(df[1])

#Normalize the Dataset

normal <- function(x) { return((x - min(x)) / (max(x) - min(x))) }

#Apply to what needs to be normalized, in this case not the target
df_norm <- as.data.frame(lapply(df[2:31], normal))

#Verify that normalization has occurred
summary(df_norm[1])
summary(df_norm[3])
summary(df_norm[11])
summary(df_norm[23])


#Split the dataframe into test and train datasets - note there are two dataframes
#First test and train from the features, here is an example of about a 70/30 split for testing and training

x_train <- df_norm[1:397,]

x_test <- df_norm[398:568,]


#Now test and train for the target - here it is important that you add ", 1" to indicate only one column
#It will not work unless you use drop = TRUE
y_train <- df[1:397, 1, drop = TRUE]

y_test <- df[398:568, 1, drop = TRUE]


#The purpose of installing those packages was to use these next functions, first KNN
#As the Python example states, best practice for k (unless assigned) is the square root of the number of observations
pred <- knn(train = x_train, test = x_test, cl = y_train, k = 23)

#Confusion Matrix from Caret

#KNN converts to a factor with two levels so we need to make sure the test dataset is similar
y_test <- y_test %>% factor(levels = c("B", "M"))

#See how well the model did
confusionMatrix(y_test, pred)

References for R walkthrough

The dataset used is the Breast Cancer Wisconsin (Diagnostic) Data Set (wdbc.csv) from the UCI Machine Learning Repository. The RDocumentation page for knn was used in building this example, along with Statology's "How to Create a Confusion Matrix".

Fix marginal effects plot (with categorical interactions) page

The fixest code is all broken here: https://lost-stats.github.io/Presentation/Figures/marginal_effects_plots_for_interactions_with_categorical_variables.html

(Likely due to changes to i() introduced around version 7.0.0).

The solution is something like:

library(fixest)

od = read.csv('https://github.com/LOST-STATS/lost-stats.github.io/raw/source/Presentation/Figures/Data/Marginal_Effects_Plots_For_Interactions_With_Categorical_Variables/organ_donation.csv')

od = 
  within(od, {
    Date = as.Date(paste(substr(Quarter, 3, 7), 
                         as.integer(substr(Quarter, 2, 2))*3-2, 
                         1, 
                         sep = "-"))
    Treated = State == 'California'
})

fmod = feols(Rate ~ i(Date, Treated, ref = "2011-04-01") | State + Date, 
             data = od)

coefplot(fmod)
iplot(fmod)

But stepping back, I actually think we should change the dataset for this page. It requires several tedious data cleaning steps across the different languages and ultimately ends up producing an event study plot because of the time dimension (which is confusing in and of itself). Aren't we just looking for something like

mod = lm(mpg ~ factor(vs) * factor(am), mtcars)
summary(mod)
marginaleffects::plot_cap(mod, condition = "am")

? (And obvs the equivalent in other languages)

Broken links for data importing pages?

On my list of planned additions to LOST was some Stata code for importing various kinds of files, but the only importing page I can currently find is the Import a Foreign Data File page, and the "Also consider" links on that page seem to be broken. I may be missing them somewhere else, but I just wanted to flag it.

Event study in Stata with reghdfe

For this code:

use "https://raw.githubusercontent.com/LOST-STATS/LOST-STATS.github.io/master/Model_Estimation/Data/Event_Study_DiD/bacon_example.dta", clear

* create the lag/lead for treated states
* fill in control obs with 0
* This allows for the interaction between treat and time_to_treat to occur for each state.
* Otherwise, there may be some NAs and the estimations will be off.
g time_to_treat = year - _nfd
replace time_to_treat = 0 if missing(_nfd)

* this will determine the difference
* btw controls and treated states
g treat = !missing(_nfd)

* Stata won't allow factors with negative values, so let's shift
* time-to-treat to start at 0, keeping track of where the true -1 is
summ time_to_treat
g shifted_ttt = time_to_treat - r(min)
summ shifted_ttt if time_to_treat == -1
local true_neg1 = r(mean)

* Regress on our interaction terms with FEs for group and year,
* clustering at the group (state) level
* use ib# to specify our reference group
reghdfe asmrs ib`true_neg1'.shifted_ttt pcinc asmrh cases, a(stfips year) vce(cluster stfips)

--> For the regression in the last line, I do not see the interaction term treat##ib`true_neg1'.shifted_ttt. Is it a typo? Or is there a logic behind not including the interaction with the "treat" variable?

Thank you guys!

Build failure

@khwilson I keep getting failed build warnings on all the new security updates, and so haven't been merging them. Is there a place I can look to figure out what's going on? I'm not quite able to figure it out from the logs.

New security advisory

There's a "high severity" security issue listed here. I think I see the fix but don't want to break whatever the GH-actions flow is. @khwilson how should this be fixed?

DiD Event Study Code in Python: Interactions in the Wrong Order

The diff-in-diff event study code in Python generates the interactions in the wrong order: INX_0, INX_1, INX_10, INX_11, INX_12, ... instead of INX_0, INX_1, INX_2, ..., INX_11, ...

The resulting figure also shows evidence of this. (Figure attached in the original issue.)

I think the best place to reorder these variables in the df is right before running the regression, but I have been getting tripped up by the factors.union line. One way to do the reordering is sketched below.
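For illustration, here is a minimal sketch of reordering the dummies numerically before the regression (the df name and the INX_ prefix follow the issue; the rest is a hypothetical example, not the page's actual code):

```python
import pandas as pd

# Hypothetical frame standing in for the regression design matrix
df = pd.DataFrame(columns=["y", "INX_0", "INX_1", "INX_10", "INX_2"])

# Sort the INX_* dummies numerically rather than lexicographically
# (lexicographic order puts INX_10 before INX_2)
inx_cols = sorted(
    (c for c in df.columns if c.startswith("INX_")),
    key=lambda c: int(c.split("_")[1]),
)
other_cols = [c for c in df.columns if not c.startswith("INX_")]
df = df[other_cols + inx_cols]  # columns now: y, INX_0, INX_1, INX_2, INX_10
```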

Nonexistent pages

I've been working through and fixing all the broken links. There are, unsurprisingly, a bunch of links to pages that don't exist yet. This makes it annoying to fix the broken links, since you have to remember which pages exist, and it's also probably annoying for anyone who clicks on them. What would be the best way to make these links "work"?

Having them go to a page that says "this page doesn't exist yet" seems ideal. I assume it wouldn't quite work to do that automatically, Wikipedia-style, right? If not, I can set something up manually.

Convert Absolute Links to Relative Links

Breaking out off-topic discussion from #55 into a new issue....

In order to support better proofing and modularity, we should change all the absolute links (e.g., https://lost-stats.github.io/Time_Series/model.html) to Liquid-friendly relative links, i.e.,

{{ "/Time_Series/model.html" | relative_url }}

This will allow #59 to more easily pick out dead links as well as making TOC migrations easier.

Causal Forest walkthrough is wrong for the Stata-R interface example

Hey,

Really cool resource guys!
There is a mistake on this tutorial explaining how to train causal forests:
https://lost-stats.github.io/Machine_Learning/causal_forest.html

In the Stata part, you split the data into test and training sets using
g split = runiform() > .5

and you send over the testing data with the preserve block by calling
keep if split == 0

But at no point do you call
keep if split == 1

to explicitly keep only the training data for the training!

I believe after the preserve block

preserve
* Keep the predictors from the holding data, send it over, so later we can make an X matrix to predict with
keep if split == 0
keep year prbarr prbconv prbpris avgsen polpc density taxpc regionn smsan pctmin wcon
* R needs that data pre-processed! So using the same variables as in the main model, process the variables
fvrevar year prbarr prbconv prbpris avgsen polpc density taxpc i.regionn i.smsan pctmin wcon
keep `r(varlist)'
* Then send the data to R
rcall: df.hold <- st.data()
restore

You just need to add one line which is
keep if split == 1

So that you train with the training data and not the whole dataset.

Thanks!

Page nesting

See: #5 (comment)

We want to tidy up page hierarchies, starting with Model Estimation, and maybe move on to some of the other sections. The tricky thing is deciding the best categories, although some overlap is probably unavoidable. Here's a first stab... mostly working off the existing pages, but with one or two nonexistent (but obvious) counterparts thrown in. Feel free to edit or make changes.

  • Ordinary Least Squares
    • (Simple?) Linear Regression
    • Fixed Effects
    • Interaction terms and Polys
  • Generalised Least Squares
    • Logit
    • Probit
    • Tobit
    • Heckman
    • McFadden / clogit
  • Multilevel (hierarchical?) models
    • Random effects
    • Linear mixed effects
    • Bayesian hierarchical models
    • etc
  • Inference (maybe "Design"?)
    • DiD
    • RDD
    • IV
    • Linear Hypothesis Tests
  • Nonstandard errors
    • This sub-section is already nested
