/*************************************************************************************;
Empirical p macro (with replacement);
Written by ?????, updated by Jeri Seidman 01/08/2016;
dsetin = original dataset, which will only be changed if dsetout is blank;
you want the dataset to include only observations in the regression, both so that the random pool includes only observations with all the necessary variables
and because fewer observations will speed up the macro processing
dsetout = leave blank to overwrite dsetin;
Regression model is &yvar = &xvars &controls--all xvars and controls will be included in model;
yvar = can only manage one indpendent variable per macro statement;
xvars = independent variables to be randomly assigned;
controls = independent variables that will not be randomly assigned but will remain with their original dependent variable;
leave blank if all independent variables in the model are to be randomly assigned (and so are in the xvars list);
numiter = the number of Monte Carlo procedures desired (i.e., 1000 or 10,000 iterations);
fixed = fixed effect;
the macro can only handle ONE set of fixed effects;
to have additional fixed effects, create binary variables for each category and include the binary variables as controls;
leave blank if no fixed effects desired;
example: %empiricalp(dsetin=mydata, dsetout=mydata2, yvar=ret, xvars=chgTI chgPI, controls=year1 year2 year3, numiter=1000, fixed=);
Will rematch (chgTI and chgPI) as a group with (ret, year1, year2 and year3) 1000 times, with replacement;
Printed output will be the p value, where the p value is calculated as the rank divided by (number of iterations + 1), subtracted from 1.
For example, on 1000 iterations, a p < 0.95 indicates that the original coefficient was ranked 50, or that it was higher than 951 of the random coefficients. A p < 0.05 would also
be considered statistically significant but would indicate that the original coefficient was ranked 950, or that it was lower than 951 of the random coefficients.
The dsetout file include both the ranking of xvars relative to (number+1) and the p value.
The seed number in PROC SURVEYSELECT is not required. If you do not specify the SEED= option, SURVEYSELECT uses the time of day from the computer's clock to obtain the initial seed.
Without an initial seed, the random selection inherent in the empirical p method generates slightly different p values on repeated runs of the same specification. I encourage you to
change the value of the initial seed to check the sensitivity of your specification but the value must always be a positive integer.
WARNING: The macro can take a long time to run. I recommend only 10 iterations for your first run, to test that you have everything specified correctly;
**************************************************************************************/
options mprint ls=80;
%macro empiricalp(dsetin=, dsetout=, yvar=, xvars=, controls=, numiter=, fixed=);
data ytemp (KEEP = &yvar &controls random);
set &dsetin;
random = normal(16);
RUN;
proc sort data=ytemp;
by random;
RUN;
data ytemp;
set ytemp;
count+1;
RUN;
data xtemp (DROP = &yvar &controls);
set &dsetin;
RUN;
proc surveyselect data=work.xtemp out=outp
seed = 3659246
method = urs
samprate = 1
outhits
rep = &numiter;
RUN;
data work.outp;
set work.outp;
by replicate;
count+1;
if first.replicate then count=1;
drop NumberHits;
RUN;
%if &fixed = %then %let sortorder = replicate;
%else %let sortorder = replicate, &fixed;
PROC SQL;
CREATE TABLE work.outp2 AS
SELECT a.*, b.*
FROM work.ytemp a, work.outp b
WHERE a.count=b.count
ORDER by &sortorder;
QUIT;
ods output parameterestimates = fakeparms;
%if &fixed = %then %do;
proc glm data=outp2;
by replicate;
model &yvar = &xvars &controls / solution;
run;
%end;
%else %do;
proc glm data=outp2;
by replicate;
model &yvar = &xvars &controls / solution;
absorb &fixed;
run;
%end;
ods output parameterestimates = realparms;
%if &fixed = %then %do;
proc glm data=&dsetin;
model &yvar = &xvars &controls / solution;
run;
%end;
%else %do;
proc glm data=&dsetin;
model &yvar = &xvars &controls / solution;
absorb &fixed;
run;
%end;
data XXXparms;
set realparms (in=a) fakeparms;
if a then real=1;
drop replicate dependent;
RUN;
proc sort data=XXXparms;
by parameter estimate;
RUN;
data XXXparms;
set XXXparms;
by parameter estimate;
rank+1;
if first.parameter then rank=1;
RUN;
data &dsetout;
set XXXparms;
if real = 1;
*drop rank real;
pvalue = 1 - rank / (&numiter + 1);
RUN;
%mend empiricalp;
%empiricalp(dsetin=, dsetout=, yvar=, xvars=, controls=, numiter=, fixed=);
run;