---
title: "Shark manual"
subtitle: "Version 4"
author: "Jeffrey Durieux, Elise Dusseldorp, Wouter van Loon & Juan Claramunt"
date: "14/01/2022"
output:
  pdf_document:
    number_sections: yes
params:
  working_directory: "C:/Users/jclaramunt/Documents/Downloads/Shark-master/Shark-master"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
In this document, we describe how to set up a simulation study and how to run it on a cluster computer (SHARK). We also introduce basic knowledge of how to use the SHARK cluster.
This version describes the changes (settings, commands, etc.) required by the cluster's new workload manager (SLURM).
An example is included to illustrate each step of the simulation study and its computation on a cluster computer.
The document is structured as follows:
* Chapter 2: How to connect to the SHARK cluster;
* Chapter 3: File management on a cluster;
* Chapter 4: Specification of a simulation study;
* Chapter 5: Preparation of a simulation study;
* Chapter 6: Performance of a simulation study on a cluster computer;
* Chapter 7: Collecting the results from the cluster computer in one large matrix and evaluation of the results by ANOVA;
* Chapter 8: Email notifications;
* Chapter 9: Installing R packages on Shark;
* Chapter 10: Warning for writing shell scripts on Windows;
* Chapter 11: Useful Linux commands;
* Chapter 12: Matlab.
**Note:** *You can find all functions required for the example on the MS-Github page [click here](https://github.com/Github-MS).*
# How to connect to the SHARK cluster
There are several options to connect to SHARK. They depend on your operating system (Windows, Linux, Mac OS, etc.).
## Connection using a device that is connected to the FSW network by cable (not wifi).
\underline{Details of the connection:}
\noindent\fbox{%
\parbox{\textwidth}{%
\underline{Server information of 1st login node:}\\
\textbf{IP address:} 145.88.76.217\\
\textbf{Hostname:} res-hpc-lo02\\
\textbf{Port:} 22}\\
}
\noindent\fbox{%
\parbox{\textwidth}{%
\underline{Server information of 2nd login node:}\\
\textbf{IP address:} 145.88.76.219\\
\textbf{Hostname:} res-hpc-lo04\\
\textbf{Port:} 22}%
}
### Windows users
In order to connect to a cluster computer using Windows, it is necessary to install SSH client software. The most common options are Putty and MobaXterm (the recommended option).
**Connect with Putty**
1. Open putty.
2. Introduce IP address or hostname (see details above).
3. Select ssh.
4. (Optional) Set a name under saved sessions and save.
5. (Optional) In the Connection settings, fill in 60 for 'Seconds between keepalives'.
If needed, check 'Enable TCP keepalives'.
6. (Optional) In the X11 settings, you can check 'Enable X11 forwarding'.
Do this if you need graphical output, but to get this working
you need to install an X11 server for Windows separately.
7. Left click on Open.
8. Introduce your SHARKusername.
9. Introduce your password.
10. Congratulations! You are now in SHARK.
**Connect with MobaXterm** (recommended)
First option (the session is saved for future connections):
1. Open MobaXterm.
2. Click on sessions.
3. Click on ssh.
4. Fill in the 'Remote host' (res-hpc-lo02 or 145.88.76.217, or res-hpc-lo04 or 145.88.76.219).
5. Select specify username and introduce your SHARKusername, and set port to 22.
6. (Optional) Check the 'Follow SSH path' in the 'Advanced SSH settings'-TAB.
7. (Optional) Un-check the 'Display reconnection message at session end' in
the 'Bookmark settings'-TAB.
8. Click on OK.
9. Introduce your password.
10. Congratulations! You are now in SHARK.
Second option (faster):
1. Open MobaXterm
2. Click on start local terminal.
3. Type: ssh <SHARKusername>@145.88.76.217 or ssh <SHARKusername>@145.88.76.219 .
4. Introduce your password.
5. Congratulations! You are now in SHARK.
### Linux users
First option (recommended):
1. Open a terminal and, if needed, create a .ssh directory in your home directory.
2. Create a config file inside your .ssh directory (only the first time): vi ~/.ssh/config
3. Add to the config file: (only the first time)
Host res-hpc-lo02
Hostname 145.88.76.217
User <SHARKUsername>
ServerAliveInterval 60
4. Set the permissions for the user for the .ssh directory: chmod 700 ~/.ssh (only the first time)
5. Set the permissions for the user on all the files in your .ssh directory: chmod 600 ~/.ssh/*
(only the first time)
6. Type ssh <SHARKUsername>@res-hpc-lo02
7. Type your password.
8. Congratulations! You are now in SHARK.
Second option (faster):
1. Open the terminal.
2. Type: ssh -XY <SHARKusername>@res-hpc-lo02
3. Type your password.
4. Congratulations! You are now in SHARK.
### Mac users
First option (recommended):
1. Open a terminal and, if needed, create a .ssh directory in your home directory.
2. Create a config file inside your .ssh directory (only the first time): vi ~/.ssh/config
3. Add to the config file: (only the first time)
Host res-hpc-lo02
Hostname 145.88.76.217
User <SHARKUsername>
ServerAliveInterval 60
4. Set the permissions for the user for the .ssh directory: chmod 700 ~/.ssh (only the first time)
5. Set the permissions for the user on all the files in your .ssh directory: chmod 600 ~/.ssh/*
(only the first time)
6. Type ssh <SHARKUsername>@res-hpc-lo02
7. Type your password.
8. Congratulations! You are now in SHARK.
Second option (faster):
1. Open the terminal.
2. Type: ssh -XY <SHARKusername>@res-hpc-lo02
3. Type your password.
4. Congratulations! You are now in SHARK.
## Connection using any device.
For any device at any location, follow these steps.
\underline{Details of the connection:}
\noindent\fbox{%
\parbox{\textwidth}{%
\underline{Research Jump server (Proxy server):}\\
\textbf{IP address:} 145.88.35.10\\
\textbf{Hostname:} res-ssh-alg01.researchlumc.nl\\
\textbf{Port:} 22}%
}
### Windows users
**Connect with Putty**
1. Open putty.
2. Introduce IP address or hostname (see details above).
3. Select ssh.
4. (Optional) Set a name under saved sessions and save.
5. (Optional) In the Connection settings, fill in 60 for 'Seconds between keepalives'.
If needed, check 'Enable TCP keepalives'.
6. (Optional) In the X11 settings, you can check 'Enable X11 forwarding'.
Do this if you need graphical output, but to get this working
you need to install an X11 server for Windows separately.
7. Left click on open.
8. Introduce your SHARKusername.
9. Introduce your password.
10. Type: ssh shark or ssh 145.88.76.217
11. Introduce your password again.
12. Congratulations! You are now in SHARK.
**Connect with MobaXterm** (recommended)
First option (the session is saved for future connections):
1. Open MobaXterm.
2. Click on sessions.
3. Click on ssh.
4. Type 'res-hpc-lo02' or '145.88.76.217' in 'Remote host'.
5. Select specify username and introduce your SHARKusername, and set port to 22.
6. (Optional) Check the 'Follow SSH path' in the 'Advanced SSH settings'-TAB.
7. Check the 'Connect through SSH gateway (jump host)' in the 'Network settings'-TAB.
8. Fill the gateway SSH server (145.88.35.10), the port (22) and the user (SHARKusername)
in the 'Network settings'-TAB.
9. (Optional) Un-check the 'Display reconnection message at session end' in
the 'Bookmark settings'-TAB.
10. Click on OK.
11. Introduce your password.
12. Once in the terminal, introduce your password again.
13. Congratulations! You are now in SHARK.
Second option (faster)
1. Open MobaXterm
2. Click on start local terminal.
3. Type: ssh SHARKusername@145.88.35.10 .
4. Introduce your password.
5. Type: ssh res-hpc-lo02 or ssh 145.88.76.217 .
6. Introduce your password again.
7. Congratulations! You are now in SHARK.
### Linux users
Option recommended by the maintainers of SHARK:
Before connecting for the first time, follow these steps:
1. Open a terminal and, if needed, create a .ssh directory in your home directory.
2. Create a config file inside your .ssh directory: vi ~/.ssh/config
3. Add also:
Host resshark
Hostname 145.88.76.217
Localforward 6001 145.88.76.217:22
User <SHARKUsername>
ProxyCommand ssh -q -X -C SHARKUsername@145.88.35.10 nc %h %p
4. Set the permissions for the user for the .ssh directory: chmod 700 ~/.ssh
5. Set the permissions for the user on all the files in your .ssh directory: chmod 600 ~/.ssh/*
Once the previous steps are finished, follow the next ones to connect:
1. Type ssh resshark
2. Introduce your password.
3. Introduce your password again.
4. Congratulations! You are now in SHARK.
Faster option (does not require the config file):
1. Open a Terminal.
2. Type ssh -X SHARKusername@res-ssh-alg01.researchlumc.nl
3. Type ssh res-hpc-lo02.
4. Congratulations! You are now in SHARK.
You can also install X2Go (similar to MobaXterm). You can obtain more information on how to set up X2Go [here.](https://git.lumc.nl/shark/shark-centos-slurm-user-guide/-/wikis/home#a-remote-desktop)
### Mac users
Option recommended by the maintainers of SHARK.
Before connecting for the first time, follow these steps:
1. Open a terminal and, if needed, create a .ssh directory in your home directory.
2. Create a config file inside your .ssh directory: vi ~/.ssh/config
3. Add also:
Host resshark
Hostname 145.88.76.217
Localforward 6001 145.88.76.217:22
User <SHARKUsername>
ProxyCommand ssh -q -X -C SHARKUsername@145.88.35.10 nc %h %p
4. Set the permissions for the user for the .ssh directory: chmod 700 ~/.ssh
5. Set the permissions for the user on all the files in your .ssh directory: chmod 600 ~/.ssh/*
Once the previous steps are finished, follow the next ones to connect:
1. Type ssh resshark
2. Introduce your password.
3. Introduce your password again.
4. Congratulations! You are now in SHARK.
Fast option (does not require the config file):
1. Open a Terminal.
2. Type ssh -Y SHARKusername@res-ssh-alg01.researchlumc.nl
3. Type ssh res-hpc-lo02.
4. Congratulations! You are now in SHARK.
You can find more information [here](https://git.lumc.nl/mwjhhuijberts/Shark4FSW/wikis/how_to_connect).
# File management in a cluster
The file management (copy, move, upload, download files/folders) also depends on your operating system.
## Windows users
Windows users need SSH File Transfer Protocol (SFTP) client software. MobaXterm (the recommended SSH client) includes an SFTP client; Putty does not. Therefore, if you use Putty, additional software such as Filezilla or WinSCP is needed.
### MobaXterm users
The SFTP in **MobaXterm** can be found in the TABs on the left side (see Figure 1). In order to use it, you must connect to SHARK first. Once in SHARK, the SFTP tab contains icons to:
- Move to the parent directory
- Download files
- Upload files
- Refresh the folder
- Create a new directory
- Create a new file
- Delete a file
*(Figure 1: the SFTP tab in MobaXterm.)*
Below these icons there is the path corresponding to your current folder (directory) in SHARK.
Below the path you can find the files and folders in the current path.
### Putty users
If you are using **Putty** to connect to SHARK, the process is more involved. It is necessary to install software such as Filezilla or WinSCP and to create an SSH tunnel (in case you are not connected to the FSW network by cable). Unfortunately, creating these tunnels is not possible with Filezilla.
\underline{Using cable connection:}
First, install **Filezilla**. Fill in the host (145.88.76.217), the username (SHARKusername), the password (SHARKpassword) and the port (22) as in Figure 2 (watch out: the host name in the image is not the correct one), and press the ENTER key.
*(Figure 2: the Filezilla connection fields.)*
On the left side of the screen you will have the files/folders on your computer and on the right side of the screen the files/folders on the SHARK cluster. You can manage your files in a similar way as in the Windows environment.
\underline{Not using cable connection:}
First, install **WinSCP**. Click on New Session. Go to the advanced settings and select Tunnel on the left side (see Figure 3). Select "connect through SSH tunnel". Fill in the hostname (145.88.35.10), port number (22), user name (SHARKusername) and password (SHARKpassword) and click on OK. Then, in the new window, fill in the hostname (145.88.76.217), port number (22), user name (SHARKusername) and password (SHARKpassword).
Once connected, it works similarly to Filezilla and Windows.
*(Figure 3: the WinSCP tunnel settings.)*
**NOTE**: It is important to realize that the first hostname is not the same as the second host name!
## Linux and Mac users
Open the terminal and use the command **scp** to transfer files to/from the cluster.
The basic syntax of scp is:
**scp [from] [to]**
where "from" can be a filename or a folder and "to" contains your netid, the hostname of the cluster login node, and the destination directory if you are tranfering files to SHARK and vice versa if you are downloading files from SHARK. For example, to \underline{upload} files:
**scp myfile.txt SHARKusername@res-hpc-lo01:/exports/fsw**
In this example, "myfile.txt" is copied to the directory /exports/fsw of SHARK.
If you want to transfer the whole folder with its content you need to add -r, for example:
scp -r mydirectory SHARKusername@res-hpc-lo02:/exports/fsw
To \underline{download} files/folders from SHARK, just interchange the paths:
**scp SHARKusername@res-hpc-lo02:/exports/fsw/myfile.txt MyDocuments**
In this example, myfile.txt is downloaded from SHARK to your local folder MyDocuments.
**Remark**: If you are not using a device that is connected to the FSW network by cable (not wifi), the files must be transferred through the jump host. This means that you need to transfer the files from your PC/laptop to the jump host using scp, and then from the jump host to SHARK (or vice versa if you want to download files from SHARK). Let's see how to do it with two examples:
\underline{Transfer the file text.txt from your laptop/PC to SHARK.}
First, transfer the file to the jump host:
**scp text.txt SHARKusername@145.88.35.10:/home/SHARKusername**
Second, connect to the jump host by typing
**ssh SHARKusername@145.88.35.10**
Third, transfer the file to SHARK (home folder):
**scp text.txt SHARKusername@res-hpc-lo02:/home/SHARKusername**
Optional: Remove the files from the jump host.
\underline{Transfer the file text.txt from SHARK to your laptop/PC.}
First, connect to the jump host:
**ssh SHARKusername@145.88.35.10**
Second, transfer the file from SHARK (home directory) to the jump host:
**scp SHARKusername@res-hpc-lo02:/home/SHARKusername/text.txt /home/SHARKusername**
Third, exit the jump host. Go back to your own PC/laptop terminal.
Fourth, transfer the file from the jump host to your laptop by typing on your terminal:
**scp SHARKusername@145.88.35.10:/home/SHARKusername/text.txt MyDocuments**
Optional: Remove the files from the jump host.
Alternatively, if you have created the **resshark** config file to connect it is possible to transfer the files without using the scp command twice:
\underline{Transfer the file text.txt from your laptop/PC to SHARK.}
Open a terminal and type:
**scp text.txt resshark:/home/SHARKusername**
\underline{Transfer the file text.txt from SHARK to your laptop/PC.}
Open a terminal and type:
**scp resshark:/home/SHARKusername/text.txt MyDocuments/text.txt**
### Transfer the files using SFTP client for Mac (Fugu) and Linux (Filezilla, X2Go)
You can also transfer files between your local computer and a cluster using an SFTP client software, such as Fugu (OSX) or FileZilla (Linux).
For more information on how to transfer files from or to a cluster, [*click here*](https://research.computing.yale.edu/support/hpc/user-guide/transfer-files-or-cluster).
**\underline{ Fugu (Mac users) }**
Start by installing Fugu. Then, open Fugu and in the menu go to SSH and then New SSH Tunnel (cmd T). Fill in Create Tunnel to (145.88.76.217), Service or Port (22), Tunnel Host (145.88.35.10), Username (SHARKusername) and Port (22).
Next, type your password.
An additional window appears that displays your tunnel. The local port is set automatically; note this number, you will need it.
In the main window, fill in the field Connect to: with "localhost", type your SHARKusername, and the number of your local port in the field "Port:". Then, press connect and enter your password.
On the left side of the screen you will have the files/folders on your computer and on the right side of the screen the files/folders on the SHARK cluster. You can now transfer files by dragging them between the left and right screen.
**\underline{ Filezilla (Linux users) }**
*Warning: Filezilla only works if you are connected via cable to the FSW network.*
First, install **Filezilla**. Fill in the host (145.88.76.217), the username (SHARKusername), the password (SHARKpassword) and the port (22) as in Figure 2, and press the ENTER key.
On the left side of the screen you will have the files/folders on your computer and on the right side of the screen the files/folders on the SHARK cluster. You can manage your files in a similar way as in the Windows environment.
**\underline{ X2Go (Linux users) }**
When installing X2Go, you can configure the protocol used to transfer files. For more information on how to configure it, [click here.](https://wiki.x2go.org/doku.php/doc:howto:x2goclient-file-sharing)
## Update your password
It is important to take into account that SHARK passwords expire. Therefore, they should be periodically updated.
To find out when your password expires, type AD-pwd-expiration.sh; SHARK will return the number of days until the expiration date.
To change the password, you should type in the command line *passwd*. Then, enter your old password once and your new password twice.
**Note**: Remember to update your password in all SHARK-related software, otherwise you will probably get the warning message *REMOTE HOST IDENTIFICATION HAS CHANGED!* on your screen. This warning also appears if the remote machine has had a new update.
To solve this problem, open the file where your keys are saved and remove the old key. The warning message indicates which key has to be changed; remove the corresponding line from the file "known_hosts".
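On Linux and Mac (and in a MobaXterm local terminal), the old entry can usually be removed in one step with the standard OpenSSH command `ssh-keygen -R <hostname>` (for example, `ssh-keygen -R res-hpc-lo02`), which edits the known_hosts file for you.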
# Specification of a simulation study
In a full factorial simulation study, we consider the following 4 steps:
* Step 1: Research Question
* Step 2: Gauge and Design
* Step 3: Statistical Analysis
* Step 4: Evaluation Criteria
## Step 1: Research Question
First, you formulate the research question you would like to answer by means of a simulation study.
In our example, the research questions are: Does the Mann-Whitney U test show higher power (i.e., lower Type II error) than the independent samples t-test? And does the difference in power between the two methods depend on the total sample size and the mean difference between the two groups?
We hypothesize that the power of the Mann-Whitney test is higher than that of the independent samples t-test, especially in smaller samples.
## Step 2: Gauge and design
Start with the formulation of:
* The true scenario(s)/model(s) from which you will generate your artificial data, and specify from which distributions you generate the variables that are included in the model.
- In our example, we have a continuous, standardized outcome variable and one grouping variable with two categories (i.e., two groups).
* Specify the design factors (i.e. the factors you are going to vary systematically).
In our example, there are two design factors:
- Samp (sample size, having 4 levels: N = 10, 20, 40, and 80 in the example);
- Es (effect size, having 3 levels: d = 0.2, 0.5 and 0.8 in the example). In this example, the effect size is the standardized difference in means between the two groups.
* Number of replications within each cell: *k*. In our small example, k = 10.
Then, our design contains in total 4 (Samp) x 3 (Es) = 12 cells, and for each cell we will generate k = 10 data sets. Thus, in total, we analyze 120 data sets. The design matrix can be generated with expand.grid(), as shown in the sketch below.
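As an illustration, the following sketch builds the design matrix of the example with expand.grid(); each of its 12 rows corresponds to one cell of the design (the same code appears in the main simulation script of Chapter 6):
```{r, eval=FALSE}
# Design factors of the example: 4 sample sizes and 3 effect sizes
samp <- c(10, 20, 40, 80)
es <- c(0.2, 0.5, 0.8)
Design <- expand.grid(samp = samp, es = es)
nrow(Design) # 12 cells of the design
# With k = 10 replications per cell, 12 * 10 = 120 data sets are analyzed
```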
## Step 3: Statistical Analysis
Specify the type of analysis you want to perform for each dataset (of each cell of the design). Often you want to compare two types of analysis methods (a new method compared to an old method).
The two methods used in the statistical analysis of the example are the t-test and the Mann-Whitney test, which can be found in the *Method_old.R* and *Method_new.R* files, respectively.
## Step 4: Evaluation Criteria
Specify the evaluation criteria which you want to use to compare the two methods.
In our example, the evaluation criterion used to compare both methods (the t-test and the Mann-Whitney test) is the number of times that the test is significant using $\alpha = 0.05$. If it is significant, it means that the test correctly rejects the null hypothesis of no difference in means between the two groups. A small sketch of this criterion is given below.
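As an illustration of what this criterion amounts to: for one cell of the design, the estimated power of a method is the proportion of its k replications with a significant result. The p-values below are hypothetical and only serve to show the computation.
```{r, eval=FALSE}
alpha <- 0.05
# Hypothetical p-values of one method for the k = 10 replications of one cell
pvalues <- c(0.01, 0.20, 0.03, 0.04, 0.60, 0.02, 0.07, 0.01, 0.04, 0.03)
significant <- pvalues < alpha # TRUE if the null hypothesis is rejected
power_estimate <- mean(significant) # proportion of significant replications
```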
# Preparation of a simulation study
Write the following functions in \texttt{R}:
* A function that generates one data set for each true scenario in the design. The input arguments of this function are the type of true model and the levels of the factors of the simulation design. In our example, we refer to this function as *MyDataGeneration.R*. The output of this function is a simulated data set (SimData);
* One or two functions that analyse the data according to one or two methods. In our example, we refer to these functions as *Method_old.R* and *Method_new.R*. The input argument of these functions is a dataset (and possibly more input arguments). The output of this function contains the relevant results. In our example, we save the output of these functions as MyAnalysisResult (see *MainSimulationScriptSLURM.R* below);
* A function that applies the evaluation criteria. In our example, we refer to this function as *MyEvaluation.R*. The input of this function is MyAnalysisResult, that is, the output of the methods. A minimal sketch of these four functions is given below.
**Note:** *If you are a Matlab user, go to chapter 12.*
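Below is a minimal sketch of what these four functions could look like for the example (independent samples t-test versus Mann-Whitney test). It is only an illustration; the actual files on the MS-Github page may differ in their details.
```{r, eval=FALSE}
# MyDataGeneration.R: generate one data set for a given sample size and effect size
MyDataGeneration <- function(samp, es){
  group <- rep(c(0, 1), each = samp / 2)
  outcome <- rnorm(samp, mean = es * group, sd = 1) # standardized mean difference = es
  data.frame(group = factor(group), outcome = outcome)
}
# Method_old.R: independent samples t-test
Method_old <- function(SimData){
  t.test(outcome ~ group, data = SimData)
}
# Method_new.R: Mann-Whitney U test
Method_new <- function(SimData){
  wilcox.test(outcome ~ group, data = SimData)
}
# MyEvaluation.R: is the test significant at alpha = 0.05?
MyEvaluation <- function(MyAnalysisResult, alpha = 0.05){
  as.numeric(MyAnalysisResult$p.value < alpha)
}
```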
# Performance of a simulation study on a cluster computer
## The main simulation script:
Below you can see the main simulation script called *MainSimulationScriptSLURM.R*. This is the script that contains your whole simulation study. Here, your design is coded, analyses are specified and evaluation criteria are computed. Moreover, saving your output to your directory is specified.
Some useful and sometimes necessary functions:
* commandArgs()
- captures supplied command line arguments (see section 6.2 and 6.3)
* expand.grid()
- useful function that creates a data frame from all combinations of factor variables
* set.seed()
- important for replication of your simulation study.
* do.call()
- neat function to execute a function with a list of supplied arguments
```{#numCode1 .R .numberLines}
#!/usr/bin/env Rscript
# MainSimulationScriptSLURM.R
# args is a string vector with the supplied command line arguments
args <- commandArgs(TRUE)
args <- as.numeric(args)
RowOfDesign <- args[1]
Replication <- args[2]
### Simulation design example
samp <- c(10, 20, 40, 80)
es <- c(0.2, 0.5, 0.8)
# Design is a data.frame with all possible combinations of the factor levels
# Each row of the design matrix represents a cell of your simulation design
Design <- expand.grid(samp = samp, es = es)
# set a random number seed to be able to replicate the result exactly
set.seed((Replication + 1000)*RowOfDesign)
###Preparation:
##Install R packages first in separate job file (see chapter 9)
#Always use library() here to activate the non-standard R-packages, for example
#library(ica)
#By the way: this is just to illustrate, we do not use this package for our example
source("/home/jclaramuntganzalez/sharktutorial/MyDataGeneration.R")
source("/home/jclaramuntganzalez/sharktutorial/Method_new.R")
source("/home/jclaramuntganzalez/sharktutorial/Method_old.R")
source("/home/jclaramuntganzalez/sharktutorial/MyEvaluation.R")
#An alternative to the previous lines is to create your own directory
#and upload the files to it. For example:
#source("/home/SHARKusername/yourdirectory/MyDataGeneration.R")
#source("/home/SHARKusername/yourdirectory/Method_new.R")
#source("/home/SHARKusername/yourdirectory/Method_old.R")
#source("/home/SHARKusername/yourdirectory/MyEvaluation.R")
#IMPORTANT WARNING!!!
#When creating your own simulation study,
#make sure to create your own directory on your Home directory.
#For example: /home/SHARKusername/projectName
######### simulation ###########
# Generate data
SimData <- do.call(MyDataGeneration, Design[RowOfDesign, ] )
# Analyze data set with Method_new
tmp <- proc.time()
MyAnalysisResult1 <- Method_new(SimData)
#Analyze data set with Method_old
MyAnalysisResult2 <- Method_old(SimData)
time <- proc.time() - tmp #save the time to run the analyses of one cell of the design.
# Also possible to time both analyses separately
#Combine relevant results of the analysis by the two methods in a vector (optional)
MyAnalysisResult <- c(MyAnalysisResult1$p.value, MyAnalysisResult2$p.value)
#Evaluate the analysis results of Method_new and Method_old
MyResult1 <- MyEvaluation(MyAnalysisResult1)
MyResult2 <- MyEvaluation(MyAnalysisResult2)
#combine the results in a vector:
MyResult <- c(MyResult1, MyResult2)
# save stuff on export
setwd("/exports/fsw/claramuntj/sharktutorial")
#Alternatively, you can also create your folder in the export/fsw directory
#and set it as your working directory.
#IMPORTANT WARNING!!!
#When creating your own simulation study, make sure to create your own directory
#on the /exports/fsw directory. First create a folder for you,
#and within that folder, create a project folder
#For example: /exports/fsw/SHARKusername/projectName
# Write output (also possible to first save everything in a list object)
#optional to save the data
save(SimData, file =paste("Simdata","Row", RowOfDesign,
"Rep", Replication ,".Rdata" , sep =""))
#optional to save the full analysis result
save(MyAnalysisResult, file =paste("Analysis","Row", RowOfDesign,
"Rep", Replication ,".Rdata" , sep =""))
save(MyResult, file =paste("MyResult", "Row", RowOfDesign,
"Rep", Replication ,".Rdata" , sep =""))
#optional to save timing of analyses
save(time, file =paste("Time", "Row", RowOfDesign,
"Rep", Replication ,".Rdata" , sep =""))
```
## The bash script used to run Jobs
Below you can find a bash script (let's call it *RunMySimulationSLURM.sh*) that is used on Shark to run an \texttt{R}-script.
The bash script 'starts' the \texttt{R}-script *'MainSimulationScriptSLURM.R'* with the command Rscript and includes some provided arguments: \$i and \$j. The \$ sign indicates that it is a replacement character; with a for loop you can change those values on the fly (see section 6.3). The values provided to these replacement characters will be collected with the commandArgs() function in our \texttt{R}-script.
```{#numCode2 .sh .numberLines}
#!/bin/env bash
#SBATCH -J Sim_study
#SBATCH -N 2
#SBATCH --mem=512MB
#SBATCH --output=sim_study_%J.out
#SBATCH --error=sim_study_%J.err
#SBATCH --mail-user j.claramunt.gonzalez@fsw.leidenuniv.nl
#SBATCH --mail-type=BEGIN,END,FAIL
module load statistical/R/4.0.2/gcc.8.3.1
Rscript MainSimulationScriptSLURM.R $i $j
```
For the small simulation example we are going to run, requesting 1GB of memory is enough (512MB for each of the 2 nodes).
For your own simulation study, we advise you to start by running only the row of your design which you expect to require the most memory. If 1GB is enough, there is no need to modify the script to run all the jobs. In case the memory is not enough (SHARK will return an error), increase it gradually, for example with increments of 0.5GB. Asking for too much memory will require a node with more memory, and consequently your job will spend more time in the queue.
## Start your embarrassingly parallel jobs
We have an \texttt{R}-script with simulation code, *MainSimulationScriptSLURM.R*. We also have a bash script *RunMySimulationSLURM.sh* which 'starts' the \texttt{R}-script. But how do we actually start our simulation? We simply use the *sbatch* command in a double for-loop. Thus, our jobs are submitted to shark in a parallel fashion.
Before submitting multiple jobs, it is a good practice to submit only one first, in order to check that everything works properly. You can do it in the following way:
```{#numCode3aExtra .sh .numberlines}
i=1 j=1 sbatch RunMySimulationSLURM.sh
```
In your project folder of the home directory you will observe two new files: one containing the R console output and one containing possible errors. If the second one is empty, there were no errors; otherwise, use the error file to spot the errors and solve them before moving to the next step.
You can also check your saved output (different from the R console output) on the project folder you have created on the exports/fsw directory.
Now, it is time to submit multiple jobs. Let's suppose that we want to submit all the replications for the first row of our design matrix. The code should be as follows:
```{#numCode3a .sh .numberlines}
for replication in {1..10}; do
i=1 j=$replication sbatch RunMySimulationSLURM.sh; done
```
Remember, for the current example the simulation design consists of two factors: sample size (Samp) and effect size (Es). The factors have 4 and 3 levels, respectively, so we have a 4x3 factorial design, and we want to replicate this design *k* = 10 times. Thus, we will generate, analyze and evaluate a total of 4 x 3 x 10 = 120 data sets. This can be done as follows:
```{#numCode3 .sh .numberlines}
for replication in {1..10}; do for rownumber in {1..12}; do
i=$rownumber j=$replication sbatch RunMySimulationSLURM.sh; done; done
```
**Note 1:** *Remove the email arguments (lines 7 and 8) from RunMySimulationSLURM.sh. Otherwise you will receive 240 emails.*
**Note 2:** *In order to use R on Shark, you first need to 'load' it. This can simply be done on the command line by typing module load statistical/R/4.0.2/gcc.8.3.1 (or select another R version available on SHARK).*
You can submit the double for loop on the command line in Shark. This double for loop will start 10 x 12 = 120 jobs that will run in parallel. The jobs have no dependencies on each other, meaning that each row of the design can be run completely separately. This is known as embarrassingly parallel computing and it saves you a lot of computation time!
### Job Arrays (alternative to the for-loop)
A downside of using for-loops to submit jobs is that each job is submitted to the cluster individually. This is taxing for the head node, especially if the number of jobs is large. An alternative to the use of the for-loop for submitting jobs to SHARK is to use a job array.
A job array is a special job which consists of similar tasks that you want the cluster to execute. The primary advantage of using a job array is that the entire set of tasks is submitted to the queue as a single entity. The array as a whole is assigned a job identifier (column 'JOBID' in the output of \texttt{squeue}), and each task within the array is assigned an individual task identifier (shown in the 'JOBID' column after the job identifier). In the \texttt{squeue} output, waiting tasks are collapsed into a single row.
For example, if we have 1000 tasks queued this will appear as a single job with JOBID_[1-1000] (i.e. "tasks 1 through 1000 are waiting"), rather than as 1000 individual jobs. Running tasks are still displayed individually, so you can see on what node they are being executed, etc. The presence of both a job and task identifier makes it easy to request information on, or make changes to, both the individual tasks and the array as a whole. \par
Creating a job array is very simple: just include the '--array' flag in your bash script, for example by modifying the script *'RunMySimulationSLURM.sh'* into *'RunMySimulationJobArraySLURM.sh'*. An example of how to create a job array is as follows:
```{#numCode4 .sh .numberlines}
#!/bin/env bash
#SBATCH -J my_array
#SBATCH -N 2
#SBATCH --mem=512MB
#SBATCH --output=sim_study_%a.out
#SBATCH --error=sim_study_%a.err
#SBATCH --array=1-10
module load statistical/R/4.0.2/gcc.8.3.1
i=1
Rscript MainSimulationScriptSLURM.R $i $SLURM_ARRAY_TASK_ID
```
%a in the output and error files is substituted by the SLURM_ARRAY_TASK_ID value. %A can be used if you prefer to include the SLURM_ARRAY_JOB_ID value.
Submitting this shell script to the cluster will create a job array, which consists of 10 tasks (--array=1-10). Each task executes the file 'MainSimulationScriptSLURM.R'. In this small example we are running the 10 repetitions of the first combination of parameters in the design.
It can be useful to save not only the results of a script, but also what is printed in the R console during execution. When using the Rscript command, the console output goes to standard output, which SLURM writes to the file specified with --output (sim_study_%a.out in the example below); alternatively, you can redirect it yourself to a .Rout file. For example:
```{#numCode5 .sh .numberlines}
#!/bin/env bash
#SBATCH -J my_array
#SBATCH -N 2
#SBATCH --mem=512MB
#SBATCH --output=sim_study_%a.out
#SBATCH --error=sim_study_%a.err
#SBATCH --array=1-40
module load statistical/R/4.0.2/gcc.8.3.1
Rscript my_script.R $SLURM_ARRAY_TASK_ID
```
This job array will save the R console output of each task to its sim_study_%a.out file, a simple text file that can be read using e.g. \texttt{cat}.
In general, when running a job array, we do not want each task to do exactly the same thing: we want to vary experimental parameters, etc. Fortunately, the identifier for each task is stored in an environment variable called *SLURM_ARRAY_TASK_ID* (as we can observe in the previous example). We can obtain the value of this variable in R by using e.g. `as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))`, as in the following example:
```{#numCode6 .R .numberLines}
# Experimental factors
f1 <- c(10, 20)
f2 <- c(0.2, 0.5)
# Replications
rep <- 1:10
# Matrix containing parameters for all individual tasks to perform
Design <- expand.grid(rep, f1, f2)
# Generate seeds
set.seed(123)
seed.list <- sample(.Machine$integer.max/2, size = nrow(Design))
# Obtain the Task ID
TID <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
# Apply some function to obtain some result
set.seed(seed.list[TID])
result <- my_function(Design[TID,])
# Save the result
save(result, file=paste0("<path_to_folder>/my_result_", TID, ".RData"))
```
Submitting a job array like the one in the previous example (--array=1-40) will execute this script once for each of the 40 rows of Design (10 replications for each of the 4 combinations of f1 and f2), each task with its own seed.
# Collecting the results
After all submitted jobs are done, you should collect all results from your output folder. Below you can find an R script (*ResultsCollecting.R*) that collects the right output and automatically stores the results in a matrix. This matrix can be used for further analyses such as an ANOVA.
Note that collecting the results can be done on Shark. However, sometimes it is easier to first download the results to your own device and then collect the final results.
Note that this script uses the package gtools. You first need to install this package.
```{#numCode7 .R .numberLines}
# Collect all results and put it in a matrix
# set directory with results
setwd("~/Desktop/shark/Data/")
library(gtools)
## original design
samp <- c(10, 20, 40, 80)
es <- c(0.2, 0.5, 0.8)
Design <- expand.grid(samp = samp, es = es)
# pattern matching with grep
files <- dir()
idx <- grep(pattern = "MyResult", files)
files <- files[idx]
# test to see that mixedsort() does its job
scram <- sample(1:120, 120, replace = F)
files <- files[scram]
files <- mixedsort(files)
###
RepIdxList <- list()
for(rep in 1:10){
RepIdxList[[rep]] <- grep(pattern = paste("Rep",rep,sep = "" ), files )
}
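# Note: grep() with pattern "Rep1" also matches file names containing "Rep10";
# the next two lines remove those duplicate indices from RepIdxList[[1]]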
idx_to_remove <- RepIdxList[[1]] %in% RepIdxList[[10]]
RepIdxList[[1]] <- RepIdxList[[1]][!idx_to_remove]
idx <- unlist(RepIdxList)
files <- files[idx]
# rbind design k times and cbind files names
#(so that you can check whether the right values are taken)
Results <- do.call(what = rbind, args = replicate(10, Design, simplify = F))
Results <- cbind(Results, files)
# the sapply function loads the results and preserves the file name (in the rows).
# This is useful for checking whether you did everything right
Res <- t( sapply(files, function(x) get(load(x) ) ) )
Results <- cbind(Results, Res)
colnames(Results) <- c("samp", "es", "files", "res1", "res2")
save(Results, file = "AllResults.Rdata")
```
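Once the Results matrix is saved, the evaluation by ANOVA mentioned in the introduction can be carried out on it. The code below is only a minimal sketch: it assumes the column names assigned in the script above and treats the significance indicators (res1 for the new method, res2 for the old method) as numeric 0/1 outcomes.
```{r, eval=FALSE}
load("AllResults.Rdata")
# Treat the design factors as factors
Results$samp <- factor(Results$samp)
Results$es <- factor(Results$es)
# Estimated power per cell of the design for both methods
aggregate(cbind(res1, res2) ~ samp + es, data = Results, FUN = mean)
# ANOVA of the new method's significance indicator on the design factors
fit <- aov(res1 ~ samp * es, data = Results)
summary(fit)
```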
# Email notifications
When submitting a job you can use the --mail-user and --mail-type flags to send yourself email notifications. The --mail-user flag is used to specify your email address, while the --mail-type flag is used to specify when you want to be notified: at the beginning of the job (BEGIN) and/or at the end (END), or when the job fails (FAIL).
You will notice that the job array examples above do not include email notifications: this is because these flags apply to each task individually. It is generally not recommended to use begin and end notifications with large job arrays; otherwise you will get an email at the start and end of each individual task, which is not pleasant for either the mail server or your inbox (the same applies to the for-loop).
# Installing R packages on Shark
**Option 1:**
Start an interactive R session on Shark:
```
$ salloc -N1 -n1
$ module add statistical/R/4.0.2 (or another version)
$ R
```
Once R is open install the package you want as usual in R, using:
install.packages("<package_name>").
Choose a CRAN mirror (if you have not included it as an argument of the install.packages function), and say yes when Shark asks whether you want to create a personal folder to install the package in (this will happen the first time you install a package).
*Note:* Once you have finished, remember to exit R using q(), and then log out from the SHARK node by typing 'exit'. Otherwise, the job will remain in the queue.
**Option 2 (only works when you already have a personal folder):**
The first step is to create a personal folder (if you do not have one yet). To do so, type on the command line:
mkdir -p /home/YOUR_DIRECTORY/R/x86_64-pc-linux-gnu-library/4.0 (or the corresponding R version)
Then, create the R script to install the packages, 'InstallPackages.R':
```{r, eval=F}
install.packages(c("ica","Matrix"), Sys.getenv("R_LIBS_USER"), repos = "http://cran.case.edu")
```
Finally, generate the bash script 'InstallPackages.sh' to run the R script in SHARK.
```{#numCode9 .sh .numberlines}
#!/bin/env bash
#SBATCH -J installpackages
#SBATCH -N 2
#SBATCH --mem=512MB
#SBATCH --output=sim_study.out
#SBATCH --error=sim_study.err
module load statistical/R/4.0.2/gcc.8.3.1
Rscript InstallPackages.R
```
How to run this script on Shark:
On the commandline type *sbatch InstallPackages.sh*.
# Warning for writing shell scripts on Windows: line endings
If you are a Microsoft Windows user and are writing shell scripts (.sh) for execution on Shark, make sure that your files have the correct Unix-style line endings, rather than the default Windows-style (see figure 4), otherwise your scripts may not execute properly.
*(Figure 4: Windows-style versus Unix-style line endings.)*
Most decent text editors (e.g. Notepad++, Sublime Text) have an option to set the line endings or to convert them to Unix format. Figure 5 shows how to correct this in Notepad++.
*(Figure 5: converting line endings to Unix format in Notepad++.)*
# Useful Linux commands
* pwd: It returns your working directory.
* ls: It displays the content of your working directory.
* ln -s: It creates a symbolic link (shortcut) to a file or folder.
* cd: It changes your working directory to the one given as parameter.
* man: It opens the manual page of the command given as parameter.
* mkdir: It creates a new directory.
* rmdir: It removes an empty directory. Use rm -r to delete a directory together with the files in it.
* nano: It opens nano, a text editor.
* cp: It copies a file. Add -r to copy a folder with the files in it.
* exit: It exits the terminal or ends the connection to SHARK.
# Matlab
In the previous distribution of SHARK, Matlab was not available; instead, Octave was available. Octave is free software, primarily intended for numerical computations, that is highly compatible with Matlab. This means that many Matlab scripts also work in Octave (it is not even necessary to transform the .m files).
In the current SLURM distribution of SHARK, both Matlab and Octave are available.
In order to perform the simulation study with Octave/Matlab, we first need to write functions analogous to *MyDataGeneration.R*, *Method_old.R*, *Method_new.R*, and *MyEvaluation.R*, as well as a script analogous to *MainSimulationScriptSLURM.R* (let's call it *MainSimulationScript.m*).
Next, we need to slightly modify the .sh file. An example of it is the following:
```{#numCode10 .sh .numberlines}
#!/bin/bash
#SBATCH -J matlab_example
#SBATCH -N 2
#SBATCH --mem=512MB
#SBATCH --output=matlab_example.out
#SBATCH --error=matlab_example.err
module load statistical/MATLAB/2019b
matlab MainSimulationScript.m
```
For running in parallel or using job arrays to run your script go to chapter 6.
**Note:** *To run Octave, substitute the last two lines of the previous .sh file: first load the Octave module instead of Matlab, and second, substitute the command matlab with the command octave.*