Using computer simulation
Replace all the instances of ???. These are answers on your moodle quiz.
Run all the individual code chunks to make sure the answers in this file correspond with your quiz answers
After you check all your code chunks run then you can knit it. It won’t knit until the ??? are replaced
Save a plot to be your preview plot
Question: t-test The data this quiz is a subset of HR Look at the variable definitions Note that the variables evaluation and salary have been recoded to be represented as words instead of numbers Set random seed generator to 123set.seed(123)
SEE QUIZ is the name of your data subset
Read it into and assign to hr
Note: col_types = “fddfff” defines the column types factor-double-double-factor-factor-factorhr <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv",
col_types = "fddfff")
skim(hr)
Name | hr |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 4 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender | 0 | 1 | FALSE | 2 | fem: 260, mal: 240 |
evaluation | 0 | 1 | FALSE | 4 | bad: 153, fai: 142, goo: 106, ver: 99 |
salary | 0 | 1 | FALSE | 6 | lev: 93, lev: 92, lev: 91, lev: 84 |
status | 0 | 1 | FALSE | 3 | fir: 185, pro: 162, ok: 153 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 40.60 | 11.58 | 20.2 | 30.37 | 41.00 | 50.82 | 59.9 | ▇▇▇▇▇ |
hours | 0 | 1 | 49.32 | 13.13 | 35.0 | 37.55 | 45.25 | 58.45 | 79.7 | ▇▂▃▂▂ |
The mean hours worked per week is: 49.3
Q: Is the mean number of hours worked per week 48? specify that hours is the variable of interestResponse: hours (numeric)
# A tibble: 500 x 1
hours
<dbl>
1 36.5
2 55.8
3 35
4 52
5 35.1
6 36.3
7 40.1
8 42.7
9 66.6
10 35.5
# ... with 490 more rows
hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48)
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500 x 1
hours
<dbl>
1 36.5
2 55.8
3 35
4 52
5 35.1
6 36.3
7 40.1
8 42.7
9 66.6
10 35.5
# ... with 490 more rows
hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap")
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 500,000 x 2
# Groups: replicate [1,000]
replicate hours
<int> <dbl>
1 1 33.7
2 1 34.9
3 1 46.6
4 1 33.8
5 1 61.2
6 1 34.7
7 1 37.9
8 1 39.0
9 1 62.8
10 1 50.9
# ... with 499,990 more rows
The output has 500,000 rows calculate the distribution of statistics from the generated data
Assign the output null_t_distribution
Display null_t_distributionnull_t_distribution <- hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "t")
null_t_distribution
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 -0.222
2 2 -0.912
3 3 1.61
4 4 0.318
5 5 -0.915
6 6 -0.538
7 7 0.307
8 8 -0.147
9 9 -0.520
10 10 -0.238
# ... with 990 more rows
visualize(null_t_distribution)
calculate the statistic from your observed data
Assign the output observed_t_statistic
Display observed_t_statisticobserved_t_statistic <- hr %>%
specify(response = hours) %>%
hypothesize(null = "point", mu = 48) %>%
calculate(stat = "t")
observed_t_statistic
Response: hours (numeric)
Null Hypothesis: point
# A tibble: 1 x 1
stat
<dbl>
1 2.25
null_t_distribution %>%
get_p_value(obs_stat = observed_t_statistic , direction = "two-sided")
# A tibble: 1 x 1
p_value
<dbl>
1 0.028
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_statistic, direction = "two-sided")
If the p-value < 0.05? yes (yes/no)
Does your analysis support the null hypothesis that the true mean number of hours worked was 48? ??? (yes/no)
Question: 2 sample t-test SEE QUIZ is the name of your data subset
Read it into and assign to hr_2
Note: col_types = “fddfff” defines the column types factor-double-double-factor-factor-factorhr_2 <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv",
col_types = "fddfff")
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | gender |
Variable type: factor
skim_variable | gender | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
evaluation | female | 0 | 1 | FALSE | 4 | fai: 81, bad: 71, ver: 57, goo: 51 |
evaluation | male | 0 | 1 | FALSE | 4 | bad: 82, fai: 61, goo: 55, ver: 42 |
salary | female | 0 | 1 | FALSE | 6 | lev: 54, lev: 50, lev: 44, lev: 41 |
salary | male | 0 | 1 | FALSE | 6 | lev: 52, lev: 47, lev: 46, lev: 39 |
status | female | 0 | 1 | FALSE | 3 | fir: 96, pro: 87, ok: 77 |
status | male | 0 | 1 | FALSE | 3 | fir: 89, ok: 76, pro: 75 |
Variable type: numeric
skim_variable | gender | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | female | 0 | 1 | 41.78 | 11.50 | 20.5 | 32.15 | 42.35 | 51.62 | 59.9 | ▆▅▇▆▇ |
age | male | 0 | 1 | 39.32 | 11.55 | 20.2 | 28.70 | 38.55 | 49.52 | 59.7 | ▇▇▆▇▆ |
hours | female | 0 | 1 | 50.32 | 13.23 | 35.0 | 38.38 | 47.80 | 60.40 | 79.7 | ▇▃▃▂▂ |
hours | male | 0 | 1 | 48.24 | 12.95 | 35.0 | 37.00 | 42.40 | 57.00 | 78.1 | ▇▂▂▁▂ |
Females worked an average of 50.3 hours per week
Males worked an average of 48.2 hours per week
Use geom_boxplot to plot distributions of hours worked by genderhr_2 %>%
ggplot(aes(x = gender, y = hours)) +
geom_boxplot()
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 36.5 female
2 55.8 female
3 35 male
4 52 female
5 35.1 male
6 36.3 female
7 40.1 female
8 42.7 female
9 66.6 male
10 35.5 male
# ... with 490 more rows
hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours gender
<dbl> <fct>
1 36.5 female
2 55.8 female
3 35 male
4 52 female
5 35.1 male
6 36.3 female
7 40.1 female
8 42.7 female
9 66.6 male
10 35.5 male
# ... with 490 more rows
hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours gender replicate
<dbl> <fct> <int>
1 36.4 female 1
2 35.8 female 1
3 35.6 male 1
4 39.6 female 1
5 35.8 male 1
6 55.8 female 1
7 63.8 female 1
8 40.3 female 1
9 56.5 male 1
10 50.1 male 1
# ... with 499,990 more rows
The output has 500000 rows
calculate the distribution of statistics from the generated data
Assign the output null_distribution_2_sample_permute
Display null_distribution_2_sample_permutenull_distribution_2_sample_permute <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "t", order = c("female", "male"))
null_distribution_2_sample_permute
Response: hours (numeric)
Explanatory: gender (factor)
Null Hypothesis: independence
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 -0.208
2 2 -0.328
3 3 -2.28
4 4 0.528
5 5 1.60
6 6 0.795
7 7 1.24
8 8 -3.31
9 9 0.517
10 10 0.949
# ... with 990 more rows
visualize(null_distribution_2_sample_permute)
calculate the statistic from your observed data
Assign the output observed_t_2_sample_stat
Display observed_t_2_sample_statobserved_t_2_sample_stat <- hr_2 %>%
specify(response = hours, explanatory = gender) %>%
calculate(stat = "t", order = c("female", "male"))
observed_t_2_sample_stat
Response: hours (numeric)
Explanatory: gender (factor)
# A tibble: 1 x 1
stat
<dbl>
1 1.78
null_t_distribution %>%
get_p_value(obs_stat = observed_t_2_sample_stat , direction = "two-sided")
# A tibble: 1 x 1
p_value
<dbl>
1 0.072
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_t_2_sample_stat, direction = "two-sided")
If the p-value < 0.05? ??? (yes/no)
Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? ??? (yes/no)
Question: ANOVA SEE QUIZ is the name of your data subset
Read it into and assign to hr_anova
Note: col_types = “fddfff” defines the column types factor-double-double-factor-factor-factorhr_anova <- read_csv("https://estanny.com/static/week13/data/hr_1_tidy.csv",
col_types = "fddfff")
Q: Is the average number of hours worked the same for all three status (fired, ok and promoted) ? use skim to summarize the data in hr_anova by status
Name | Piped data |
Number of rows | 500 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 2 |
________________________ | |
Group variables | status |
Variable type: factor
skim_variable | status | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|---|
gender | fired | 0 | 1 | FALSE | 2 | fem: 96, mal: 89 |
gender | ok | 0 | 1 | FALSE | 2 | fem: 77, mal: 76 |
gender | promoted | 0 | 1 | FALSE | 2 | fem: 87, mal: 75 |
evaluation | fired | 0 | 1 | FALSE | 4 | bad: 65, fai: 63, goo: 31, ver: 26 |
evaluation | ok | 0 | 1 | FALSE | 4 | bad: 69, fai: 59, goo: 15, ver: 10 |
evaluation | promoted | 0 | 1 | FALSE | 4 | ver: 63, goo: 60, fai: 20, bad: 19 |
salary | fired | 0 | 1 | FALSE | 6 | lev: 41, lev: 37, lev: 32, lev: 32 |
salary | ok | 0 | 1 | FALSE | 6 | lev: 40, lev: 37, lev: 29, lev: 23 |
salary | promoted | 0 | 1 | FALSE | 6 | lev: 37, lev: 35, lev: 29, lev: 23 |
Variable type: numeric
skim_variable | status | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
age | fired | 0 | 1 | 38.64 | 11.43 | 20.2 | 28.30 | 38.30 | 47.60 | 59.6 | ▇▇▇▅▆ |
age | ok | 0 | 1 | 41.34 | 12.11 | 20.3 | 31.00 | 42.10 | 51.70 | 59.9 | ▆▆▆▆▇ |
age | promoted | 0 | 1 | 42.13 | 10.98 | 21.0 | 33.40 | 42.95 | 50.98 | 59.9 | ▆▅▆▇▇ |
hours | fired | 0 | 1 | 41.67 | 7.88 | 35.0 | 36.10 | 38.90 | 43.90 | 75.5 | ▇▂▁▁▁ |
hours | ok | 0 | 1 | 48.05 | 11.65 | 35.0 | 37.70 | 45.60 | 56.10 | 78.2 | ▇▃▃▂▁ |
hours | promoted | 0 | 1 | 59.27 | 12.90 | 35.0 | 51.12 | 60.10 | 70.15 | 79.7 | ▆▅▇▇▇ |
Employees that were fired worked an average of 41.7 hours per week
Employees that were ok worked an average of 48.0 hours per week
Employees that were promoted worked an average of 59.3 hours per week
Use geom_boxplot to plot distributions of hours worked by statushr_anova %>%
ggplot(aes(x = status, y = hours)) +
geom_boxplot()
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# ... with 490 more rows
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500 x 2
hours status
<dbl> <fct>
1 36.5 fired
2 55.8 ok
3 35 fired
4 52 promoted
5 35.1 ok
6 36.3 ok
7 40.1 promoted
8 42.7 fired
9 66.6 promoted
10 35.5 ok
# ... with 490 more rows
hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute")
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 500,000 x 3
# Groups: replicate [1,000]
hours status replicate
<dbl> <fct> <int>
1 40.3 fired 1
2 40.3 ok 1
3 37.3 fired 1
4 50.5 promoted 1
5 35.1 ok 1
6 67.8 ok 1
7 39.3 promoted 1
8 35.7 fired 1
9 40.2 promoted 1
10 38.4 ok 1
# ... with 499,990 more rows
The output has 500,000 rows
calculate the distribution of statistics from the generated data
Assign the output null_distribution_anova
Display null_distribution_anovanull_distribution_anova <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "F")
null_distribution_anova
Response: hours (numeric)
Explanatory: status (factor)
Null Hypothesis: independence
# A tibble: 1,000 x 2
replicate stat
<int> <dbl>
1 1 0.667
2 2 2.78
3 3 1.24
4 4 0.330
5 5 2.08
6 6 1.95
7 7 0.243
8 8 0.312
9 9 0.440
10 10 0.0281
# ... with 990 more rows
visualize(null_distribution_anova)
calculate the statistic from your observed data
Assign the output observed_f_sample_stat
Display observed_f_sample_statobserved_f_sample_stat <- hr_anova %>%
specify(response = hours, explanatory = status) %>%
calculate(stat = "F")
observed_f_sample_stat
Response: hours (numeric)
Explanatory: status (factor)
# A tibble: 1 x 1
stat
<dbl>
1 115.
null_distribution_anova %>%
get_p_value(obs_stat = observed_f_sample_stat , direction = "greater")
# A tibble: 1 x 1
p_value
<dbl>
1 0
null_t_distribution %>%
visualize() +
shade_p_value(obs_stat = observed_f_sample_stat, direction = "greater")
If the p-value < 0.05? yes (yes/no)
Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? no (yes/no)