-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproposal.qmd
200 lines (112 loc) · 10.6 KB
/
proposal.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
---
title: "Team 2cool4school"
subtitle: "Project Proposal"
format: html
editor: visual
editor_options:
chunk_output_type: console
---
```{r}
#| label: load-pkgs
#| message: false
library(tidyverse)
```
# Data 1
## Introduction and data
- The dataset `lemur_data.csv` comes from [Kaggle.com](https://www.kaggle.com/datasets/jessemostipak/duke-lemur-center-data) and was released by Jesse Mostipak.
- The dataset was orginally collected from the [2019 data release from the Duke Lemur Center Database](https://lemur.duke.edu/duke-lemur-center-database/) and further compiled by Zehr, SM, Roach RG, Haring D, Taylor J, Cameron FH, Yoder AD in their [research paper](https://www.nature.com/articles/sdata201419).
- Animal data have been collected and entered by Duke Lemur Center staff according to standard operating procedures and USDA, AZA, and IACUC guidelines throughout the history of the center (United States Department of Agriculture, Association of Zoos and Aquariums, Institutional Animal Care and Use Committee respectively). Births, deaths, weights, enclosure moves, behaviors, and other significant events are recorded daily by animal care, veterinary, and research staff and subsequently entered into the permanent records by the DLC Registrar.
- This dataset contains information on over 3,500 observations. Each observation represent a lemur, including lemur-information such as ancestry, reproduction, longevity, and body mass (in total 54 columns).
## Research question
- What are the top 3 factors that influence the lifespan of lemurs?
- Lemurs are the most endangered group of mammals. In fact, 98% of lemur species are endangered, and 31% of species are critically endangered nowadays. Therefore, it is crucial to conduct researches on the lifespan of lemurs to save the endangered species. Moreover, Duke Lemur Center has been the world leader in the study, care, and protection of lemurs since it was founded in 1966. As Duke students, we are able to utilize the resources at DLC to research on lemurs and potentially contribute to their studies.
- Through our preliminary investigation, we found that the death age of lemurs varies a lot ranging from 0 to 35. Thus, we are interested in researching on what are the determining factors of lemurs' lifespan. Our hypotheses is that taxon, sex, weight are the top 3 factors that would affect the lifespan of lemurs.
- We plan to research on all the variables within the dataset that could potentially affect lemur's lifespan. There are both categorical and quantitative variables involved in our research questions since categorical variables such as taxon and quantitative variables such as weight can all play a role in their lifespan.
## Glimpse of data
```{r}
#| label: load-lemur-data
lemur <- read_csv("data/Miscellaneous/lemur_data.csv")
glimpse(lemur)
```
# Data 2
## Introduction and data
- The data comes from CORGIS (Collection of Really Great, Interesting, Situated Datasets) website.
- The earthquake data was originally collected on 6/7/2016 by simply collecting information from the United States Geological Survey by Ryan Whitcomb.
- Each observation contains a unique earthquake and all the information surrounding the circumstances of the earthquake, including but not limited to: coordinates of where the earthquake occurred, its magnitude, time of earthquake, and the depth of the earthquake.
- There are no ethical concerns in this data.
## Research question
- Does the frequency of earthquakes at a certain location have any relationship to their magnitude or significance?
- This question is important because by determining if there are relationships between the variables, we can then consider the living conditions of people that live in those locations. There might be certain locations in which earthquakes are at their strongest and people should not be living there, or maybe locations in which we have to consider what kinds of infrastructure buildings should have to prevent them from collapsing as a result of multiple strong earthquakes.
- A rough look at the first 50 observations showed that a lot of earthquakes happened in Alaska and California. This made us wonder why this was the case and to see if those patterns are consistent throughout the rest of the data. Furthermore, because earthquakes happen more often in those states, we wonder if they are stronger than earthquakes that only occur once at a certain location and if they overall have a higher significance. Our hypothesis is that frequency and magnitude will have a positive correlation, and that the average significance of the more frequent locations will be higher than the average significance of less frequent locations.
- Our research question focuses on four variables: location, frequency, magnitude, and significance. For location, we can consider the nominal categorical variable `location.name` which gives us the state/country in which the earthquake occurred. We can also consider `location.longitude` and `location.latitude` which gives us the exact coordinates of an earthquake's location, which would be continuous numerical variables. `impact.magnitude` displays the magnitude of an earthquake which is a continuous numerical variable. We will also consider `impact.significance`, which is a discrete numerical variable and gives the significance of each earthquake based on a range of factors including estimated impact, magnitude, and felt reports. Lastly, we will calculate frequency, a discrete numerical variable, based on the location data.
## Glimpse of data
```{r}
#| label: load-earthquake-data
earthquakes <- read.csv("data/Miscellaneous/earthquakes.csv")
glimpse(earthquakes)
```
# Data 3
## Introduction and data
To answer our question, we want to join two data sets. Both are listed below.
**Data Set #1: Niche**
- The first data set comes from Niche's "2023 Best Colleges in America" list
(https://www.niche.com/colleges/search/best-colleges/)
- Niche aggregates data from a variety of sources, including the US Department of Education and reviews from students and alumni, to build their list of college rankings. The rankings list is updated monthly to reflect new data that Niche receives; although, data from the US Department o Education is only received on an annual basis. The Niche data was scraped by Maia on October 17-19 2022.
- There are 500 observations, representing the top 500 schools in the United States. Each observation has two variables: the name of the school and the rank of the school.
**Data Set #2: US Department of Education**
- The second data set comes from the US Department of Education's College Scorecard, which is an exhaustive summary of characteristics and statistics for all colleges and universities in the United States. There were 2,989 variables in the original data set, many of which we don't need to answer our question, and since this data set was too large to load into RStudio, we used Excel to select the variables we needed to import.
(https://collegescorecard.ed.gov/data/)
- The College Scorecard is updated by the Education Department as it collects new data. Data used in the scorecard comes comes from data reported by the institutions, data on federal financial aid, data from taxes, and data from other federal agencies.
- There are 6681 observations in the data set, representing all of the colleges and universities in the United States. There are 63 variables in the data set, which give information about each different school.
- There are no obvious ethical or privacy concerns with this data. The schools did not report data on individual students but rather summary data of all students.
The Department of Education data set was left joined to Niche dataset by school name to make a new data set with 500 observations and 64 variables.
## Research question
- Here are three potential research questions:
1. How do rankings compare across different types of schools (by region, by racial makeup, by school type)?
2. Which three variable have the greatest impact on a college's ranking?
3. Do "input variables" like ACT/SAT, GPA, and type of high school matter more to a college's ranking than "output variable" like median income and graduation rate?
- These questions are important because students applying to college trust these rankings and weigh them into their college decisions. Due to large impact that a college has on a student's life, it is important to know where these ranking come from and what they actually measure.
- In general, we want to examine how different variables affect a school's ranking on the Niche College Ranking List. We plan to look at variables that are typically thought to influence school rank such as average SAT score and acceptance rate, but we also want to look at variables that aren't typically thought of such as geographic region or endowment size.
We hypothesize that:
1. schools that are from the northeast, are majority-white, and are private will have higher rankings on average.
2. ACT/SAT test score, acceptance rate, and median earnings 10 years after graduation will have the biggest impact on college rank.
3. input variables will matter more because they are more important to students applying to college, and the ranking list is geared towards them.
- Here are the types of variables involved in our research questions:
- Categorical (nominal):
- city
- state
- zip
- accreditor institution
- control (public vs. private)
- Carnegie classification
- Numerical (continuous):
- latitude, longitude
- admission rate
- Total share of enrollment by racial groups
- average net price
- average cost of attendance
- average faculty salary
- percentage of undergraduates who receive a Pell grant
- completion rate
- average age of entry
- share of female students
- share of married students
- share of first generation students
- median & mean family income
- median & mean earning 10 years after entry
- Value of school's endowment
- Numerical (discrete)
- 25th/50th/75th/mean ACT/SAT scores for sub-tests and composite
- Enrollment of undergraduate certificate/degree-seeking students
## Glimpse of data
```{r}
#| label: load-college-data
niche_data_500 <- read_csv("data/niche_data_500.csv")
glimpse(niche_data_500)
us_dep_of_ed <- read_csv("data/us_dep_of_ed.csv")
glimpse(us_dep_of_ed)
colleges <- niche_data_500 |>
left_join(us_dep_of_ed, by = c("college" = "INSTNM"))
write_csv(colleges, "data/colleges.csv")
glimpse(colleges)
```