# Workflow {#workflow}
> This is a good time to review and organize the data analysis workflow in RStudio.
```{r}
library(tidyverse)
library(WDI)
library(readxl)
```
## EDA Step 0
1. Choose and clarify a topic to study.
2. List the questions you want to study.
3. Find data:
    - link to data with a URL (uniform resource locator) on a web page
    - download data as CSV, Excel, etc.

Repeat this process throughout your EDA.

## EDA by R Studio: Step 1
In RStudio:

1.1. Project

* Create a new project: File > New Project; or
* Open a project: File > Open Project, Open Project in New Session, or Recent Projects
    - It is often easier to find an existing project under File > Recent Projects
* _Check that there is a file `project_name.Rproj` in your project folder (directory)_

1.2. Data folder (directory) `data`

* Create a data folder: click New Folder in the bottom-right pane; or
* Confirm that the data folder was created earlier: click Files in the bottom-right pane
* _If you followed 1.1, the data folder is inside your project folder_

1.3. Move (or copy) the data for the project to the data folder

* If you downloaded the data, it is in your Downloads folder. Move it to `data`.
* _Check in RStudio that your data is in `data`: click Files in the bottom-right pane, then click `data`, the data folder._

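
The checks in 1.1 to 1.3 can also be done from the R console. The following is a minimal sketch, assuming the working directory is the project folder (which is normally the case when the project is opened through its `.Rproj` file). It uses only base R functions.

```{r eval=FALSE}
# Where am I? With a project open, this should be the project folder
getwd()

# 1.1: is there an .Rproj file in the project folder?
list.files(pattern = "\\.Rproj$")

# 1.2: create the data folder only if it does not exist yet
if (!dir.exists("data")) {
  dir.create("data")
}

# 1.3: list the files that are currently in the data folder
list.files("data")
```

If `list.files("data")` returns `character(0)`, the folder exists but is still empty.
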
## EDA by R Studio: Step 2
2.1. Project Notebook: Memo

- Create an R Notebook: File > New File > R Notebook
    + You can use the R Notebook template in Moodle: move the template file (template.Rmd or template.nb.Rmd) into your project folder, or copy and paste its text into your new R Notebook.
    + If you use template.nb.Rmd (R Notebook File), choose Open in Editor.
- Add a descriptive title.

2.2. Setup Code Chunk

- Create a code chunk, add the packages to use in the project, and run the code.
    + `library(tidyverse)`
    + `library(WDI)`
    + or any other packages

2.3. Choose `Source` or `Visual` editor mode, and start editing the Project Notebook

- Set up headings such as: About, Data, Analysis and Visualizations, Conclusions
- Under About or Data, paste the URLs of the sites and/or the data
    + e.g. World Development Indicators: https://datatopics.worldbank.org/world-development-indicators/
    + e.g. Public expenditure on education: https://data.un.org/_Docs/SYB/CSV/SYB65_245_202209_Public%20expenditure%20on%20education.csv

2.4. To write a report, save the notebook under a new name and edit the copy

- File > Save As...

## EDA by R Studio: Step 3 - Importing Data
When you import data, assign a name that you can recall easily. You may need to reload the data later with different options.

3.1. Use a package:

* WDI, wir, eurostat, etc.
* `wdi_shortname <- WDI(indicator = "indicator's name", ... )`
* Store the data and use it: `write_csv(wdi_shortname, "./data/wdi_shortname.csv")`
* `wdi_shortname <- read_csv("./data/wdi_shortname.csv")`

3.2. Use `readr` to read from `data`, your data folder

* `df1_shortname <- read_csv("./data/file_name.csv")`

3.3. Use `readr` to read using the URL of the data

* `df2_shortname <- read_csv("url_of_the_data")`
* Store the data and use it: `write_csv(df2_shortname, "./data/df2_shortname.csv")`
* `df2_shortname <- read_csv("./data/df2_shortname.csv")`

3.5. Use `readxl` to read Excel data. Add `library(readxl)` to the setup chunk and run it.

* `df4 <- read_excel("./data/file_name.xlsx", sheet = 1)`

References: Cheat Sheet - `readr`, [readr](https://readr.tidyverse.org), [readxl](https://readxl.tidyverse.org)

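
The "store the data and use it" pattern in 3.1 and 3.3 can be wrapped in a small helper, so that the download happens only the first time. This is a sketch under assumptions: `read_cached()` is a hypothetical helper written for this illustration and is not part of `readr`, and the `show_col_types` argument requires readr 2.0 or later.

```{r eval=FALSE}
library(readr)

# Hypothetical helper: read the local copy if it exists;
# otherwise call `fetch()` once, store the result, and read it back
read_cached <- function(path, fetch) {
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    write_csv(fetch(), path)
  }
  read_csv(path, show_col_types = FALSE)
}

# Usage with the placeholder names from 3.3:
# df2_shortname <- read_cached("./data/df2_shortname.csv",
#                              function() read_csv("url_of_the_data"))
```

The design choice is that the analysis always reads from `data`, so it can be rerun offline and does not change if the online source is updated.
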
## EDA by R Studio: Step 4 - Data Transformation
4.1. Look at the data: suppose `df` is the data frame

* Converting it to a tibble is a good option: `dt <- as_tibble(df)`
* `head(df)`, `str(df)`, `summary(df)`, `dt`, `glimpse(dt)`

4.2. Look at each variable

* categorical? numerical?
* factor? - [forcats](https://forcats.tidyverse.org)

4.3. Variation of each variable: suppose `x1` is a column name.

* `df %>% ggplot() + geom_histogram(aes(x1), bins = 30)`
* `df %>% drop_na(x1)`: shows only the rows that have a value in `x1`; rows where `x1` is NA are dropped.
    - `df_wo_na <- df %>% drop_na(x1)` if you want to use only the rows without NA in `x1`

4.4. Use `dplyr` and `tidyr` to change column names, tidy data, and/or summarize data

* `rename`, `select`, `filter`, `arrange`, `mutate`, `pivot_longer()`, `pivot_wider()`, `group_by` and `summarize`

References: Cheat Sheet - `dplyr` and `tidyr`, [dplyr](https://dplyr.tidyverse.org), [tidyr](https://tidyr.tidyverse.org)

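
As an illustration of 4.3 and 4.4, the following sketch tidies and summarizes a small made-up table. The data, the column names (`country`, `year`, `x1`), and the object names are invented for this example and do not come from any real source.

```{r}
library(tidyverse)

# Toy data (made up for illustration), in wide form: one column per year
df <- tibble(
  country = c("A", "B", "C"),
  `2019` = c(3.1, NA, 5.0),
  `2020` = c(3.4, 4.2, NA)
)

# Tidy: one row per country and year, then drop the rows where x1 is NA
df_long <- df %>%
  pivot_longer(!country, names_to = "year", values_to = "x1") %>%
  mutate(year = as.integer(year)) %>%
  drop_na(x1)

# Summarize: mean of x1 and number of observations per country
df_summary <- df_long %>%
  group_by(country) %>%
  summarize(mean_x1 = mean(x1), n = n())

glimpse(df_long)
df_summary
```

Of the six country-year combinations, two are NA, so `df_long` has four rows.
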
## EDA by R Studio: Step 5 - Visualize Data
5.1. In combination with Step 4 (data transformation), try various data visualizations.

* What type of variation occurs within my variables?
* What type of covariation occurs between my variables?

5.2. Keep a record of what you observe in each visualization

5.3. Edit the list of questions by adding new ones or refining existing ones

5.4. Select several informative charts and add options

5.5. Look at examples in the textbooks or on the teaching site to improve your visualizations

References: Cheat Sheet - `ggplot2`, [ggplot2](https://ggplot2.tidyverse.org), [ggplot2 book](https://ggplot2-book.org)

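
The two questions in 5.1 map onto different chart types. The following is a minimal sketch using `mpg`, a dataset that ships with `ggplot2`. The choice of dataset and variables is an assumption made for illustration only.

```{r}
library(ggplot2)

# Variation within one numerical variable: distribution of highway mileage
p_variation <- ggplot(mpg) +
  geom_histogram(aes(x = hwy), bins = 30)

# Covariation between two numerical variables, with a categorical one as color
p_covariation <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = drv))

# Covariation between a categorical and a numerical variable
p_by_class <- ggplot(mpg) +
  geom_boxplot(aes(x = class, y = hwy))

p_variation
p_covariation
p_by_class
```

Storing each chart in an object makes it easier to revisit it later and add options (5.4), for example `p_covariation + labs(title = "...")`.
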
## EDA by R Studio: Step 6 - Conclusions and Questions for Further Study
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:

1. Generate questions about your data
2. Search for answers by visualising, transforming, and/or modeling your data
3. Use what you learn to refine your questions and/or generate new questions

EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
## Example: WDI
* Government expenditure on education, total (% of GDP)
    - https://data.worldbank.org/indicator/SE.XPD.TOTL.GD.ZS
* ID: SE.XPD.TOTL.GD.ZS

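
With this indicator ID, Steps 3 to 5 might look as follows. This is a sketch under assumptions: the short name `edu_gdp`, the three countries, the year range, and the file name `wdi_edu.csv` are illustrative choices. The code needs an internet connection and an existing `data` folder.

```{r eval=FALSE}
library(tidyverse)
library(WDI)

# Step 3: download the indicator; a named vector renames the column to `edu_gdp`
wdi_edu <- WDI(
  indicator = c(edu_gdp = "SE.XPD.TOTL.GD.ZS"),
  country = c("JP", "US", "DE"),  # ISO2 country codes, chosen for illustration
  start = 2000, end = 2020
)

# Keep a local copy so that the analysis can be rerun without downloading again
write_csv(wdi_edu, "./data/wdi_edu.csv")
wdi_edu <- read_csv("./data/wdi_edu.csv")

# Steps 4 and 5: drop the years without a value, then draw one line per country
wdi_edu %>%
  drop_na(edu_gdp) %>%
  ggplot(aes(x = year, y = edu_gdp, color = country)) +
  geom_line() +
  labs(title = "Government expenditure on education, total",
       x = "", y = "% of GDP")
```
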
## Example: WIR2022
```{r}
df_f8 <- read_excel("./data/WIR2022s.xlsx", sheet = "data-F8")
df_f8
```
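
Before choosing a sheet, as `sheet = "data-F8"` does in the chunk above, it can help to list the sheet names in the workbook. The sketch below uses an example workbook that ships with `readxl` as a stand-in for `./data/WIR2022s.xlsx`, so the sheet names shown are those of the example file, not of the WIR2022 file.

```{r}
library(readxl)

# Path to an example workbook bundled with readxl (stand-in for the WIR2022 file)
path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which sheet to import
excel_sheets(path)

# Read one sheet by name
df_example <- read_excel(path, sheet = "mtcars")
head(df_example)
```
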
```{r warning=FALSE, echo=FALSE}
# Rename the columns to "<country>_<type>", reshape to long form,
# and draw one smoothed line per country and wealth type
df_f8 %>%
  select(year, Germany_public = Germany, Germany_private = 'Germany (private)',
         Spain_public = Spain, Spain_private = 'Spain (private)',
         France_public = France, France_private = 'France (private)',
         UK_public = UK, UK_private = 'UK (private)',
         Japan_public = Japan, Japan_private = 'Japan (private)',
         Norway_public = Norway, Norway_private = 'Norway (private)',
         USA_public = USA, USA_private = 'USA (private)') %>%
  pivot_longer(!year, names_to = c("country", ".value"), names_sep = "_") %>%
  pivot_longer(3:4, names_to = "type", values_to = "value") %>%
  ggplot() +
  stat_smooth(aes(x = year, y = value, color = country, linetype = type),
              formula = y ~ x, method = "loess", span = 0.25, se = FALSE,
              linewidth = 0.75) +  # `linewidth` replaces `size` for lines in ggplot2 >= 3.4.0
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Figure 8. The rise of private versus the decline of public wealth \nin rich countries, 1970-2020",
       x = "", y = "wealth as % of national income", color = "", linetype = "")
```
## The Week Five Assignment (in Moodle)
**`tidyr` and WIR2022**

* Create an R Notebook of a data analysis containing the following, and submit the rendered HTML file (e.g. `a3_123456.nb.html`, replacing 123456 with your ID)
    1. create an R Notebook using the R Notebook Template in Moodle and save it as `a3_123456.Rmd`,
    2. write your name, your ID, and the contents,
    3. run each code chunk,
    4. preview to create `a3_123456.nb.html`,
    5. submit `a3_123456.nb.html` to Moodle.

1. Choose a dataset with at least two categorical variables and at least two numerical variables.
    - Information about the data: Name, Indicator, Description, Source, etc.
    - Explain why you chose the indicator
    - List the questions you want to study
2. Explore the data by visualizing it with `ggplot2`
    - Create various charts
    - Create at least one chart with at least two categorical variables and at least one numerical variable.
    - Create at least one chart with at least two numerical variables and at least one categorical variable.
3. Record your observations based on the data visualization, and any difficulties or questions you encountered.

**Due:** 2023-01-23 23:59:00. Submit your R Notebook file in Moodle (The Fourth Assignment). Due on Monday!