top of page

SCIRPT

Instruction

 

 

 

Hello, guys. This our final project and we also build the website to display our project and the relationship between statistics and reality well. Well, let's start the video.





 

Explanation 

 

 

 

2a. 

Goal: Fit a linear regression

We use two methods to achieve the goal.

  1. R built-in function lm(y~x+1,data)

  2. Gitbook function fit.linear.model(covariate, outcome)

Construct a matrix using the covariate (treatment.assigned) to calculate the beta hats.

Comparing the results from the two methods, we find that the estimates of betas are the same.

Then, we use ggplot to visualize the data and the regression.

 

2b.

Goal: Construct the confidence intervals for regression coefficients

In 2b, we decide to construct a 98.5% confidence interval.

Thus, alpha=1-0.985=0.015

We use two methods to achieve the goal.

  1. R built-in function confint(object, level)

  2. Gitbook function conf.int(alpha,type,covariate,outcome)

  • The function conf.int takes in four parameters as shown above and outputs the confidence interval.

  • A series of functions are embedded within it, including:

    • fit.linear.model(covariate, outcome) => outputs beta hats

    • estimate.coef.sd(beta, covariate, outcome) => standard errors of betas

      • estimate.coef.var(beta,covariate,outcome) => variance of betas

        • estimate.sigma.sq(beta,covariate,outcome) => sigma squares

          • sum.of.squares(beta, covariate, outcome) => sum of the square of each residual

            • linear.model(beta, covariate) => a simple linear model

    • conf.int.quantile(alpha, type) => quantiles of confidence intervals

      • Our type is ‘t’ because we use qt(p, df) to calculate the quantiles for a t distribution model

Then, we display the results from both methods.

Comparing the results, we find that the confidence intervals we constructed using the two methods are the same.


 

2c. 

Beta1  refers to the slope of the linear regression model

Beta 0 refers to the intercept of the linear regression model 

Null hypothesis: beta1  is equal to zero test assuming there is no association between x and y 

Alternative hypothesis: beta1 is not equal to zero assuming there is association between x and y 

Rejecting hypothesis: if the null is not true or unlikely to be true if the observed data is very extreme under the null hypothesis

 We use z - test for large sample 

Calculate the standard error :

  1. residuals = y-y_hat 

  2. Mse = estimator of the sigma 

  3. Variance 

  4. Standard error

t= (beta-0 )/s.e.{beta1} ~N(0,1)

covariate= beta 1 hat 

P- value is the probability of more extreme t test results assuming null hypothesis is correct. Smaller p-value indicating stronger evidence against the null hypothesis 

Since beta 1 is not equal to 0 we have a 2 tailed test 

P - Value= 2*probability( z< -|t|)  

p-value =  2*pnorm(-abs(t-stat)

Since p- value is 0.192 which is greater than the significant level of 0.05, thus we cannot reject the null hypothesis . 






 

2d.

The reason why we set m to be 1000 is we want to test 1000 hypotheses and we can set whatever amount we want to test since the question doesn’t mention the requirement. 

 

And then we use numeric() which creates a double-precision vector of the specified length with each element equal to 0.

 

And then, we use the for loop from 1 to m which is the amount of the hypotheses. And we need to calculate the p values by using the calculate.pvalue() function.

 

Then we set the alpha.Bonf to be alpha/m, our purpose is letting the alpha to be divided by the number of hypotheses. 

 

The rej.flag = pvalues < alpha.Bonf will return a vector of True or False. In our case, all the the elements returned TRUE.

 

In the end, we can use what we get there to figure out all the cases of our Bonferroni Correction, true positive, false positive, true negative, false negative by using sum(). Sum function returns the sum of all the values present in its arguments.

 

For the true positive, we got 100 and we got 900 for the false positive and the true negative and the false negative are 0. The reason why we only got the numbers for the first 2 lines of sum() is our rej,flag returns all True there so when we put !rej.flag in the parenthesis it will always be false and then sum() will return 0.


 

2d.

For the 2d, we need to graph the raw data and the fitted values here and I made them into 2 graphs. I first  input the library tidyverse which includes lots of functions likes ggplot() and etc. 

 

And then, I input the library to show you the head and tail of the data frame.

 

We use the ggplot to visualize the raw data here. There are only 4 points here if we only use geom_point() so geom_jitter() is better to see.

 

And then, we just need to plot the fitted value--beta.hat.

 

3a

Now I'm going to explain how to write the third question.

The first part is my friend's method, and then this part is what I'm going to explain. Of course, the answer is the same. Then this method I get is from the last assignment. First, I created three variables: age, gender, and flu receive

Then we can see what is x2 dummy, you can show all the numbers with 0 and 1 in order. Then 3 and 4 are the same. Then I will use a fitted model to find the regression line, and then we can get four beta values. Then we can get this equation from these jobs.:

an=0.1433463876-0.0225112650*x2-0.0006683895*x3+0.0001501796*x4

 

# We say that  We estimate that if two groups of people with the same age and same gender, and if the people have vaccinated before, then the probability that he gets flu (the hospital visit) again will increase 0.0001501796.

Go to previous, we know, when we only consider the flu outcome or flu receive, we get formula,

 

y= -0.01351732*x+0.09168443

 We estimate that without considering the age and gender, if the people have vaccinated before, then the probability that he gets the flu (the hospital visit) again will increase 0.0001501796.

So, what the equation represents.

So, it is interesting, without considering the age and gender, flu receive can lead to a decrease in flu-outcome.

However, with considering the age and gender ( keep the same), flu receive can lead to an increase of flu-outcome.

 

 In addition,from code, we know the p value of coefficient of x4, flu receive, is 0.98, which is larger than 0.05. so fail To reject h0. In the other word, the coefficient can be zero, so the variable of flu receive is not a very important factor when we consider the age and the gender at the same time in the equation

bottom of page