\documentclass[12pt]{article}
\renewcommand{\baselinestretch}{1.0}
\begin{document}
\begin{titlepage}
\begin{center}
{\LARGE DRAFT: Specification for the Aggregate Remote Memory Copy Interface (ARMCI)}
\end{center}
\textbf{Abstract:}
\end{titlepage}
\tableofcontents
\newpage
\pagestyle{plain}
\section{Introduction}
The Aggregate Remote Memory Copy Interface (ARMCI) grew out of the development
of the Global Arrays (GA) Toolkit. Global Arrays, in turn, was originally
developed to support the NWChem Quantum Chemistry package but it has since
proved useful in a large number of other application areas. Early in the
development of GA it was realized that it would be useful to separate the GA
functionality from the underlying native libraries that could directly
communicate with the network. Implementing GA on top of these libraries was
desirable because of the performance gains that could be achieved, but doing
this created significant porting and code management issues. Separating the bulk
of the GA implementation from the network and device dependent code was an
obvious solution and led to the development of the ARMCI interface.
Since that time ARMCI has been ported to many of the major networks used in modern
high performance computing platforms, including most of Department of Energy's
flagship machines. Recently there has been interest on the part of groups
outside of the original development team in implementing the ARMCI interface and
this specification document is a response to that interest. The functions
described herein are the minimum set of functions required to support the
current GA libraries, starting with the 5.2 release.
\section{Header Files}
Code that uses any of the standard ARMCI functions (any function whose name
begins with the character string ARMCI\_) must include the header file
\texttt{armci.h} at the top of the file. Any code that uses the ARMCI wrappers
on top of the runtime system (any function whose name begins with the lower case
character string armci\_) must also include the file \texttt{message.h}.
\section{Initialization and Termination}
This section will discuss initialization and termination of the ARMCI library.
\subsection{ARMCI\_Init}
\begin{verbatim}
int ARMCI_Init(void)
\end{verbatim}
This function initializes the ARMCI library. An initialization function must be
called before most other ARMCI calls can be made. This function returns a value of
0 if successful.
\subsection{ARMCI\_Init\_args}
\begin{verbatim}
int ARMCI_Init_args(int *argc,
                    char ***argv)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{argc}: pointer to number of arguments
\item (IN) \texttt{argv}: pointer to a list of character strings representing
the arguments
\end{itemize}
This function initializes the ARMCI library. An initialization function must be
called before most other ARMCI calls can be made. This function returns a value of
0 if successful. This initialization function takes arguments that will be
passed on to the \texttt{MPI\_Init} routine, if it has not already been called.
QUESTION: Do we want to include a function that is so closely tied to MPI?
\subsection{ARMCI\_Finalize}
\begin{verbatim}
void ARMCI_Finalize(void)
\end{verbatim}
This function terminates the ARMCI library. It should be called before
terminating the underlying runtime (e.g. MPI).
\subsection{ARMCI\_Initialized}
\begin{verbatim}
int ARMCI_Initialized(void)
\end{verbatim}
This function can be used to determine whether the ARMCI library has been
initialized. It returns 0 if ARMCI has not been initialized and a non-zero
value if ARMCI has been initialized.
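As an illustration of how these routines fit together, a minimal program
skeleton might look like the following sketch (this assumes that MPI is the
underlying runtime, as discussed above; the actual work is elided):
\begin{verbatim}
#include <mpi.h>
#include "armci.h"

int main(int argc, char **argv)
{
    /* Start the underlying runtime first, then ARMCI */
    MPI_Init(&argc, &argv);
    ARMCI_Init();

    /* ... allocate memory and perform one-sided operations ... */

    /* Terminate ARMCI before the underlying runtime */
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}
\end{verbatim}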
\section{Memory Management}
ARMCI has its own memory allocation and deallocation mechanisms. This is
necessary because reading and writing to remote memory locations requires that
the memory be registered with the network and that the processor originating the
read, write or accumulate has the network address of the remote memory location.
The network address is then used in the one-sided communications calls described
below.
Allocation of a distributed block of memory that will be used in communication
operations is a collective operation that results in each process having an
array of pointers that point to the allocated memory on every process. For each
processor, the entry in this array corresponding to its own processor ID is just
a conventional pointer; for remote processors the entry is a network address. For all
processors, the entry in the ``pointer'' array, or an appropriate offset from it,
can be used in one-sided communication calls to specify the location of data on
a remote processor. All one-sided communication calls must use memory allocated
with the ARMCI memory allocation functions for the remote memory location. Local
buffers can be allocated using either ARMCI or other conventional
allocation protocols.
The memory allocation functions in ARMCI also provide a mechanism for allocating
local buffers that can be accessed remotely. These allocations create both the
requested memory segment and an associated meta-data object
\texttt{armci\_meminfo\_t} that provides both local and network addresses and the
size of the allocated segment. This data object is defined in the \texttt{armci.h}
header file. The meta-data descriptor contains the following members:
\begin{itemize}
\item{\texttt{char *armci\_addr}}: a network address that can be used to
access the data remotely if the accessing processor is on another SMP node
\item{\texttt{char *addr}}: the local address of the allocated memory
\item{\texttt{size}}: total size of the allocated segment
\item{\texttt{cpid}}: processor ID on which data is allocated
\end{itemize}
\subsection{ARMCI\_Set\_shm\_limit}
\begin{verbatim}
void ARMCI_Set_shm_limit(unsigned long limit)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{limit}: Amount of memory requested (in bytes)
\end{itemize}
This function is used to set the maximum amount of memory that can be allocated
using \texttt{ARMCI\_Malloc}, \texttt{ARMCI\_Malloc\_local} and
\texttt{ARMCI\_Memget}. Because of fragmentation and related issues, the actual
amount of memory available may be less than this limit.
\subsection{ARMCI\_Malloc}
\begin{verbatim}
int ARMCI_Malloc(void *ptr_arr[],
                 armci_size_t bytes)
\end{verbatim}
\begin{itemize}
\item (OUT) \texttt{ptr\_arr}: array of pointers to allocated memory
\item (IN) \texttt{bytes}: amount of memory in bytes to be allocated on calling
processor
\end{itemize}
This function is a collective operation across all processors. Each process
specifies the amount of memory in bytes that it wants to allocate locally, and
the function fills in an array of pointers indicating the location of the
allocated memory on all processors. These pointers can be used to reference the allocated memory on
all processors in ARMCI one-sided communication calls. The array
\texttt{ptr\_arr} should be previously allocated with a size that is equal to
the number of processors times the size of a void pointer. The function returns
0 if the allocation is successful and a non-zero value otherwise.
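As an illustrative sketch (not part of the specification), the following code
allocates the pointer array, performs a collective allocation of 1024 bytes on
each process, and frees the memory again; the use of MPI to obtain the number
of processes is an assumption of the example:
\begin{verbatim}
#include <stdlib.h>
#include <mpi.h>
#include "armci.h"

void example_malloc(void)
{
    int nproc;
    void **ptr_arr;

    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* One pointer entry per process */
    ptr_arr = (void **)malloc(nproc * sizeof(void *));

    /* Collective allocation of 1024 bytes on each process */
    if (ARMCI_Malloc(ptr_arr, 1024) != 0) {
        ARMCI_Error("ARMCI_Malloc failed", 1);
    }

    /* ptr_arr[p] can now be used as the remote address of the
       segment on process p in one-sided communication calls */

    ARMCI_Free(ptr_arr);   /* collective deallocation */
    free(ptr_arr);
}
\end{verbatim}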
\subsection{ARMCI\_Free}
\begin{verbatim}
int ARMCI_Free(void *ptr_arr[])
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{ptr\_arr}: array of pointers to allocated memory
\end{itemize}
This function is a collective operation across all processors and frees the
memory allocated in an \texttt{ARMCI\_Malloc} call. It returns 0 if the function is
successful and a non-zero value otherwise.
\subsection{ARMCI\_Malloc\_local}
\begin{verbatim}
void* ARMCI_Malloc_local(armci_size_t bytes)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{bytes}: amount of memory in bytes to be allocated
\end{itemize}
This function will allocate a local segment of memory out of registered memory on
the calling processor. This memory may increase performance if used as the local
buffer in an ARMCI one-sided communication call. The return value is a pointer
to the allocated memory.
\subsection{ARMCI\_Free\_local}
\begin{verbatim}
int ARMCI_Free_local(void *ptr)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{ptr}: pointer to local memory
\end{itemize}
This function can be used to free a block of memory allocated using the
\texttt{ARMCI\_Malloc\_local} function. It returns 0 if successful.
\subsection{ARMCI\_Uses\_shm}
\begin{verbatim}
int ARMCI_Uses_shm(void)
\end{verbatim}
This function can be used to query the underlying implementation of ARMCI. It
returns 0 if shared-memory was not used in implementing ARMCI and a non-zero
value otherwise. This information can be used to increase performance in
applications built on ARMCI.
\subsection{ARMCI\_Memget}
\begin{verbatim}
void ARMCI_Memget(size_t bytes,
                  armci_meminfo_t *meminfo,
                  int memflg)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{bytes}: number of bytes requested
\item (OUT) \texttt{meminfo}: meta-data structure containing information on
allocated memory
\item (IN) \texttt{memflg}: flag for future optimizations
\end{itemize}
This function allocates a segment of memory that can be accessed remotely by
other processors. It returns the meta-data required to access this memory in the
structure meminfo.
\subsection{ARMCI\_Memat}
\begin{verbatim}
void* ARMCI_Memat(armci_meminfo_t *meminfo,
                  long offset)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{meminfo}: meta-data structure containing information on
allocated memory
\item (IN) \texttt{offset}: offset (in bytes) for shifting value of memory
location
\end{itemize}
This function returns a pointer for a segment of memory allocated using the
\texttt{ARMCI\_Memget} function. If the value of \texttt{offset} is 0, this function
returns either a valid local pointer or the network address of the allocated
memory segment; for non-zero values of \texttt{offset} the value is shifted by
\texttt{offset} bytes. The returned pointer can be used in any ARMCI one-sided
call.
\subsection{ARMCI\_Memctl}
\begin{verbatim}
void ARMCI_Memctl(armci_meminfo_t *meminfo)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{meminfo}: meta-data structure containing information on
memory segment
\end{itemize}
This function will free up a segment of memory allocated using the
\texttt{ARMCI\_Memget}
command. The \texttt{meminfo} input must contain the same information as the
\texttt{armci\_meminfo\_t} struct that was used to create the segment.
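As an illustrative sketch (not part of the specification), the three routines
above might be combined as follows on the allocating process; in a real
application the meta-data would typically be communicated to other processes,
which would then call \texttt{ARMCI\_Memat} themselves:
\begin{verbatim}
#include "armci.h"

void example_memget(void)
{
    armci_meminfo_t meminfo;
    void *buf;

    /* Allocate 4096 bytes of remotely accessible memory */
    ARMCI_Memget(4096, &meminfo, 0);

    /* Obtain a usable pointer to the start of the segment */
    buf = ARMCI_Memat(&meminfo, 0);

    /* ... use buf in one-sided communication calls ... */

    /* Release the segment described by the meta-data */
    ARMCI_Memctl(&meminfo);
}
\end{verbatim}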
\section{Blocking Communication of Structured Data}
Functions that can be used to read data on other processors, write data to other
processors and modify data on other processors are described here. These
functions can be broken up into three different classes.
\begin{enumerate}
\item Functions that work on contiguous blocks of data in both the local and
remote buffers.
\item Functions that work on non-contiguous blocks of data in both the local and
remote buffers. However, the local and remote data is strided in regular
patterns. This is frequently encountered when operating on rectangular
subsections of data blocks that are part of regular arrays.
\item Functions that work on randomly located data elements for both the local
and remote buffers.
\end{enumerate}
This section will describe functions that work on contiguous and strided data.
The basic pattern for describing strided data is used in all the strided calls
and will be summarized here. In general, strided data is used when
working with a rectangular sub-block of a higher dimensional data array. Suppose
one is starting with an array of dimension $N$. The individual axes have
magnitudes $D_1$, $D_2$,..., $D_N$ and the size of the individual array elements
in bytes is $E$. The fastest dimension is indexed by 1, followed by
consecutively slower dimensions. The total array will be represented by a contiguous
block of memory of size $T=E\prod_{i=1}^{N}D_i$. A rectangular sub-block of this
array will form a regular pattern in memory within the segment representing the
entire array.
Suppose the sub-block is characterized by the upper and lower indices $\{U_i\}$
and $\{L_i\}$. The length of each dimension of the sub-block is then
$\{S_i=U_i-L_i+1\}$. If the first element is scaled by the size of the
individual elements $E$ then $S_1^{\prime} = E\times S_1$ is equal to the size of individual
contiguous memory segments in the sub-block. The values of the $S_i$, with $S_1$
scaled by $E$ go into the \texttt{count} array described in the strided
functions below.
The strides for the sub-block are determined from the dimensions of the original
array. For an array of $N$ dimensions, $N-1$ strides are needed to complete the
description of the data layout. The first stride $M_1$ is given by the product
of $D_1$ and $E$. The remaining $N-2$ strides can be obtained from the recursion
formula $M_i = M_{i-1}D_i$. The location of any byte in the selected sub-block can
be indexed by the set of integers $\{I_i\}$ where $I_i$ lies between 0 and the
corresponding value in the \texttt{count} array minus 1. The offset of this byte
from the first byte in the sub-block is given by the formula
\[ \Delta_{offset} = I_1 + \sum_{i=2}^{N}I_i M_{i-1} \]
The strides are located in the \texttt{dst\_stride\_arr} and
\texttt{src\_stride\_arr} arrays described below. The dimension of the array is
stored in the \texttt{stride\_levels} variable.
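As a concrete illustration (the dimensions here are chosen purely for the
example), consider a two-dimensional array of C-doubles ($E=8$) with
$D_1=100$ and $D_2=200$, and a sub-block with $S_1=10$ and $S_2=20$. Then
\texttt{count[0]}$\,=E\times S_1=80$ bytes, \texttt{count[1]}$\,=S_2=20$,
\texttt{stride\_levels}$\,=1$, and the single stride is $M_1=E\times D_1=800$
bytes, so the byte indexed by $(I_1,I_2)$ lies at offset $I_1+800\,I_2$ from
the first byte of the sub-block.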
The ARMCI interface also supports remote accumulate operations for both
structured and unstructured data. These operations can be used to add data in a
local buffer to existing data on a remote processor. These operations are atomic
so only one process can modify a given data location at any one time. Because of
the commutativity of addition, this guarantees that the result of multiple
accumulates at a given processor location will give the same value (modulo
possible roundoff errors) regardless of the order in which they occur. The types
of accumulate operations that are supported
by ARMCI are addition of C-integers, C-doubles, C-floats, single precision
complex numbers, double precision complex numbers, and C-longs. Addition of
single precision complex and double precision complex numbers correspond to two
consecutive C-floats and C-doubles, respectively. In the accumulate operations
described in the rest of this document, the accumulates on different types
of data are distinguished by the C-preprocessor macros
\newline
\begin{center}
\begin{tabular}{l}
\texttt{ARMCI\_ACC\_INT} \\
\texttt{ARMCI\_ACC\_DBL} \\
\texttt{ARMCI\_ACC\_FLT} \\
\texttt{ARMCI\_ACC\_CPL} \\
\texttt{ARMCI\_ACC\_DCP} \\
\texttt{ARMCI\_ACC\_LNG}
\end{tabular}
\newline
\end{center}
corresponding
to C-integers, C-doubles, C-floats, single precision complex numbers, double
precision complex numbers, and C-longs. These macros are defined in the header
file \texttt{armci.h}.
Accumulate operations take the values in the local buffer, scale them by a
specified amount and add them to the corresponding contents on the remote
processor. Arithmetically, the operation performed is
$a_{dst} = a_{dst} + s\times a_{src}$ for each element. The fact that data is
locked on the remote processor while the accumulate operation is being performed
may result in serialization of the code if many processors are trying to
accumulate to the same processor simultaneously.
QUESTION: Do we require remote buffers to be registered?
QUESTION: Do we make any general statements about the order in which operations
are executed?
QUESTION: Are remote buffers locked while put/get operations are being
implemented?
\subsection{ARMCI\_Put}
\begin{verbatim}
int ARMCI_Put(void *src,
              void *dst,
              int bytes,
              int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{dst}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{bytes}: number of bytes to be moved from local buffer to
remote buffer.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This function copies a contiguous block of data from a local buffer to a buffer
on a remote processor. The local buffer on the calling process is assumed to be
available for reuse when the function returns. The function returns a value of 0
if successful.
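To illustrate, the following sketch (assuming that \texttt{ptr\_arr} was
previously filled in by \texttt{ARMCI\_Malloc} and that \texttt{local\_buf}
holds at least 100 doubles) writes 100 doubles into the segment owned by the
next higher-ranked process:
\begin{verbatim}
#include <mpi.h>
#include "armci.h"

void example_put(void **ptr_arr, double *local_buf)
{
    int me, nproc, neighbor;

    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    neighbor = (me + 1) % nproc;

    /* Copy 100 doubles into the start of the neighbor's segment */
    if (ARMCI_Put(local_buf, ptr_arr[neighbor],
                  100 * sizeof(double), neighbor) != 0) {
        ARMCI_Error("ARMCI_Put failed", 1);
    }
}
\end{verbatim}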
\subsection{ARMCI\_PutS}
\begin{verbatim}
int ARMCI_PutS(void *src_ptr,
               int src_stride_arr[],
               void *dst_ptr,
               int dst_stride_arr[],
               int count[],
               int stride_levels,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{src\_stride\_arr}: array of strides for local source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for remote destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This function is used to send a strided data set from a local buffer to a
strided data set on a remote processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer on the calling process is assumed to be
available for reuse when the function returns. The function returns a value of 0
if successful.
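Continuing the two-dimensional example from the discussion of strided data
above, the following sketch writes a $10\times 20$ patch of doubles from a
local array with leading dimension 50 into a remote array with leading
dimension 100; the pointers and the remote rank are assumptions of the
example, and \texttt{dst} must lie in memory allocated with the ARMCI
allocation routines:
\begin{verbatim}
#include "armci.h"

int put_patch(double *src, double *dst, int proc)
{
    int src_stride_arr[1], dst_stride_arr[1], count[2];
    int stride_levels = 1;

    src_stride_arr[0] = 50 * sizeof(double);   /* 400 bytes */
    dst_stride_arr[0] = 100 * sizeof(double);  /* 800 bytes */
    count[0] = 10 * sizeof(double);            /* 80 bytes per segment */
    count[1] = 20;                             /* 20 segments */

    return ARMCI_PutS(src, src_stride_arr, dst, dst_stride_arr,
                      count, stride_levels, proc);
}
\end{verbatim}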
\subsection{ARMCI\_Get}
\begin{verbatim}
int ARMCI_Get(void *src,
              void *dst,
              int bytes,
              int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src}: pointer to location of first byte in remote buffer
from which data will be copied.
\item (IN) \texttt{dst}: pointer to first byte in local buffer that
receives the data.
\item (IN) \texttt{bytes}: number of bytes to be moved from remote buffer to
local buffer.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being received.
\end{itemize}
This function copies a contiguous block of data from a remote processor to a buffer
on the calling processor. The local buffer on the calling process is assumed to
contain the requested data and be available for use when the function returns. The
function returns a value of 0 if successful.
\subsection{ARMCI\_GetS}
\begin{verbatim}
int ARMCI_GetS(void *src_ptr,
               int src_stride_arr[],
               void *dst_ptr,
               int dst_stride_arr[],
               int count[],
               int stride_levels,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in the remote buffer
from which data will be copied.
\item (IN) \texttt{src\_stride\_arr}: array of strides for the remote source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in the local buffer that
will receive the data.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for the local destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being sent.
\end{itemize}
This function is used to collect a strided data set from a remote processor to a
strided data set on the calling processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer on the calling process is assumed to
contain the requested data and be available for use when the function returns. The
function returns a value of 0 if successful.
\subsection{ARMCI\_Acc}
\begin{verbatim}
int ARMCI_Acc(int optype,
              void *scale,
              void *src,
              void *dst,
              int bytes,
              int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described above and defined in
\texttt{armci.h}.
\item (IN) \texttt{scale}: value of scale factor by which values in local buffer
should be multiplied before adding them to the contents of the remote buffer.
The value in scale should correspond to the value given in \texttt{optype}.
\item (IN) \texttt{src}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{dst}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{bytes}: number of bytes to be moved from local buffer to
remote buffer.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This function is used to accumulate a contiguous block of data on the calling
processor into a contiguous block of data on a
remote processor. The local buffer on the calling process is assumed to be
available for reuse when the function returns. The function returns a value of 0
if successful.
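For example, the following sketch adds 100 doubles from a local buffer,
scaled by 2.0, to the corresponding doubles in a remote buffer; the buffers
and the remote rank are assumptions of the example, and \texttt{remote\_buf}
must lie in ARMCI-allocated memory on process \texttt{proc}:
\begin{verbatim}
#include "armci.h"

int acc_doubles(double *local_buf, double *remote_buf, int proc)
{
    double scale = 2.0;

    /* remote_buf[i] += scale * local_buf[i] for i = 0..99 */
    return ARMCI_Acc(ARMCI_ACC_DBL, &scale,
                     local_buf, remote_buf,
                     100 * sizeof(double), proc);
}
\end{verbatim}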
\subsection{ARMCI\_AccS}
\begin{verbatim}
int ARMCI_AccS(int optype,
               void *scale,
               void *src_ptr,
               int src_stride_arr[],
               void *dst_ptr,
               int dst_stride_arr[],
               int count[],
               int stride_levels,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described above and defined in
\texttt{armci.h}.
\item (IN) \texttt{scale}: value of scale factor by which values in local buffer
should be multiplied before adding them to the contents of the remote buffer.
The value in scale should correspond to the value given in \texttt{optype}.
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{src\_stride\_arr}: array of strides for local source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for remote destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This function is used to accumulate a strided data set from a local buffer to a
strided data set on a remote processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer on the calling process is assumed to be
available for reuse when the function returns. The function returns a value of 0
if successful.
\section{Blocking Communication of Unstructured Data}
The operations described in this section are designed to move random segments of
data on the calling processor to and from random locations on the remote
processor. These routines are completely general, but may sacrifice performance
that is available when manipulating more structured data. These routines all
take arrays of the \texttt{armci\_giov\_t} data structure as an argument.
This data structure is defined as
\begin{verbatim}
typedef struct {
    void **src_ptr_array;
    void **dst_ptr_array;
    int ptr_array_len;
    int bytes;
} armci_giov_t;
\end{verbatim}
It describes the memory locations of the data on the local and remote processor,
the size of the data elements, and the number of data elements. All data
elements described by a single instance of this structure are assumed to be the
same size. The source and destination arrays are defined by the direction of
data movement and may be on the local or remote processor, depending on what
kind of operation is being performed. For a gather operation, the source array
refers to data on the remote processor and the destination array refers to the
local process; for scatter operations the roles are reversed.
The \texttt{src\_ptr\_array} is an array of void pointers describing the
locations of the source data, the \texttt{dst\_ptr\_array} array describes the
destination locations of the data, \texttt{ptr\_array\_len} is the number of
data elements that are to be moved and \texttt{bytes} is the size of the data
elements. If data elements of different sizes need to be moved, then multiple
structures need to be created, one for each size data element.
QUESTION: Does \texttt{armci\_giov\_t} data structure have to be defined this
way or is it only required to have these members? Implementations may be free to
add additional elements for internal use?
\subsection{ARMCI\_PutV}
\begin{verbatim}
int ARMCI_PutV(armci_giov_t darray[],
               int len,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, between which data
will be moved. Each array element corresponds to a different size of data
element that will be moved between processors.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This operation is designed to move a random collection of elements from the
calling process to a set of random locations on the remote process. The
function returns 0 if successful. After the function returns, it is safe to
modify the memory locations on calling process that contained the data being
moved.
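As an illustration, the sketch below scatters four individual doubles from
the calling process to four arbitrary locations in remote memory; the source
and destination addresses are assumptions of the example, and each
destination must lie in ARMCI-allocated memory on process \texttt{proc}:
\begin{verbatim}
#include "armci.h"

int scatter_doubles(void *src[4], void *dst[4], int proc)
{
    armci_giov_t desc;

    /* Four data elements, each one double in size */
    desc.src_ptr_array = src;
    desc.dst_ptr_array = dst;
    desc.ptr_array_len = 4;
    desc.bytes         = sizeof(double);

    /* A single descriptor suffices, since all elements are the same size */
    return ARMCI_PutV(&desc, 1, proc);
}
\end{verbatim}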
\subsection{ARMCI\_GetV}
\begin{verbatim}
int ARMCI_GetV(armci_giov_t darray[],
               int len,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, between which data
will be moved. Each array element corresponds to a different size of data
element that will be moved between processors.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being received.
\end{itemize}
This operation is designed to move a random collection of elements from a
remote process to a set of random locations on the calling process. The
function returns 0 if successful. After the function returns, the data is
located on the local process and it is safe to use it.
\subsection{ARMCI\_AccV}
\begin{verbatim}
int ARMCI_AccV(int optype,
               void *scale,
               armci_giov_t darray[],
               int len,
               int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described in the section on
structured communication and defined in \texttt{armci.h}.
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, representing the
elements that will be used in the accumulate operation. The operation specified
in the \texttt{optype} variable must apply to all data elements, which
implicitly requires that all data elements must be the same size, so only 1
element is needed in \texttt{darray}. However, for some applications, it may be
convenient to use multiple array elements to describe the data layout.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\end{itemize}
This operation is designed to accumulate a random collection of elements from the
calling process to a set of random elements on the remote process. The function
returns 0 if successful. After the function returns, it is safe to
modify the memory locations on calling process that contained the data being
moved.
\section{Non-Blocking Communication}
The routines in this section are very similar to the routines described in the
sections on blocking communication of structured and unstructured data. Refer to
these sections for a description of data structures and data layouts. Blocking
routines assume that local buffers are immediately available for use or reuse as
soon as the operation is completed on the calling process. For non-blocking operations,
in contrast, the state of the buffer is undetermined after the function returns
on the calling process. Completion of the operation can be forced by calling a
non-blocking wait function, after which the local buffers can safely be used or
reused. For platforms that support non-blocking communication, a non-blocking
call will complete in the interval between when the non-blocking function returns
and the non-blocking wait is called. This interval can be used to
do useful computational work on the calling process. This will allow communication
to overlap with computation and reduce the impact of communication latency and
transfer times on scalability. Note that the non-blocking protocols can be
implemented on all platforms regardless of whether the underlying hardware and
OS supports non-blocking communication. In the event that non-blocking
communication is not supported, the non-blocking data transfer calls are
effectively blocking and the non-blocking wait call is a null operation.
The non-blocking calls potentially interact with synchronization and data
consistency operations (see below) differently from blocking communication. For
blocking communication, an operation such as \texttt{ARMCI\_AllFence} guarantees that all
communication has been completed on remote processors. For non-blocking
communication, there is no such guarantee unless the wait function has been
called on the non-blocking handle before the fence operation. If no wait
function has been called on the one-sided operation, then the status of the data
on remote processors remains unknown after a call to fence.
The non-blocking communication operations are distinguished from the blocking
communication operations by the addition of an extra argument that keeps track
of non-blocking operations. The argument is a pointer to a data structure
defined as
\begin{verbatim}
typedef struct {
    int data[4];
    double dummy[72];
} armci_hdl_t;
\end{verbatim}
The sizes of 4 and 72 are chosen to reflect maximums on existing platforms; they
can be increased as necessary for new systems.
WE NEED SOME DISCUSSION OF HOW THIS DATA STRUCTURE IS USED AND WHAT HAPPENS TO
IT IN A NON-BLOCKING CALL
Apart from the non-blocking handle, the non-blocking calls are entirely
analogous to the blocking calls, and the reader is referred to the documentation
on those calls for additional information on other arguments.
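For example, a typical latency-hiding pattern might look like the following
sketch, in which a non-blocking get is overlapped with local computation; the
compute routine and the buffer addresses are assumptions of the example, and
\texttt{remote\_buf} must lie in ARMCI-allocated memory on process
\texttt{proc}:
\begin{verbatim}
#include "armci.h"

void overlap_get(double *local_buf, double *remote_buf,
                 int nbytes, int proc, void (*do_local_work)(void))
{
    armci_hdl_t handle;

    /* Start the transfer; local_buf is not yet safe to use */
    ARMCI_NbGet(remote_buf, local_buf, nbytes, proc, &handle);

    /* Do useful work that does not touch local_buf while the
       data is in flight */
    do_local_work();

    /* Force local completion; local_buf now holds the data */
    ARMCI_Wait(&handle);
}
\end{verbatim}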
\subsection{ARMCI\_NbPut}
\begin{verbatim}
int ARMCI_NbPut(void *src,
                void *dst,
                int bytes,
                int proc,
                armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{dst}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{bytes}: number of bytes to be moved from local buffer to
remote buffer.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function copies a contiguous block of data from a local buffer to a buffer
on a remote processor. The local buffer on the calling process is not available for
reuse when the function returns and cannot safely be used until a non-blocking
wait is called on the non-blocking handle. The function returns a value of 0
if successful.
\subsection{ARMCI\_NbPutS}
\begin{verbatim}
int ARMCI_NbPutS(void *src_ptr,
                 int src_stride_arr[],
                 void *dst_ptr,
                 int dst_stride_arr[],
                 int count[],
                 int stride_levels,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{src\_stride\_arr}: array of strides for local source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for remote destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function is used to send a strided data set from a local buffer to a
strided data set on a remote processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer on the calling process is not available for reuse
when the function returns and cannot safely be used until a non-blocking wait is
called on the non-blocking handle. The function returns a value of 0 if successful.
\subsection{ARMCI\_NbGet}
\begin{verbatim}
int ARMCI_NbGet(void *src,
                void *dst,
                int bytes,
                int proc,
                armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src}: pointer to location of first byte in remote buffer
from which data will be copied.
\item (IN) \texttt{dst}: pointer to first byte in local buffer that
receives the data.
\item (IN) \texttt{bytes}: number of bytes to be moved from remote buffer to
local buffer.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being received.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function copies a contiguous block of data from a remote processor to a buffer
on the calling processor. The local buffer is not available for use until a
non-blocking wait is called on the non-blocking handle. The function returns a value
of 0 if successful.
\subsection{ARMCI\_NbGetS}
\begin{verbatim}
int ARMCI_NbGetS(void *src_ptr,
                 int src_stride_arr[],
                 void *dst_ptr,
                 int dst_stride_arr[],
                 int count[],
                 int stride_levels,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in the remote buffer
from which data will be copied.
\item (IN) \texttt{src\_stride\_arr}: array of strides for the remote source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in the local buffer that
will receive the data.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for the local destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function is used to collect a strided data set from a remote processor to a
strided data set on the calling processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer is not available for use until a
non-blocking wait is called on the non-blocking handle. The
function returns a value of 0 if successful.
\subsection{ARMCI\_NbAcc}
\begin{verbatim}
int ARMCI_NbAcc(int optype,
                void *scale,
                void *src,
                void *dst,
                int bytes,
                int proc,
                armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described in the section on
blocking structured communication and defined in \texttt{armci.h}.
\item (IN) \texttt{scale}: value of scale factor by which values in local buffer
should be multiplied before adding them to the contents of the remote buffer.
The value in scale should correspond to the value given in \texttt{optype}.
\item (IN) \texttt{src}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{dst}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{bytes}: number of bytes to be moved from local buffer to
remote buffer.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function is used to accumulate a contiguous block of data on the calling
processor into a contiguous block of data on a remote processor. The local buffer on
the calling process is not available for reuse when the function returns and cannot
be used until a non-blocking wait is called on the non-blocking handle. The
function returns a value of 0 if successful.
\subsection{ARMCI\_NbAccS}
\begin{verbatim}
int ARMCI_NbAccS(int optype,
                 void *scale,
                 void *src_ptr,
                 int src_stride_arr[],
                 void *dst_ptr,
                 int dst_stride_arr[],
                 int count[],
                 int stride_levels,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described in the section on
blocking structured communication and defined in \texttt{armci.h}.
\item (IN) \texttt{scale}: value of scale factor by which values in local buffer
should be multiplied before adding them to the contents of the remote buffer.
The value in scale should correspond to the value given in \texttt{optype}.
\item (IN) \texttt{src\_ptr}: pointer to location of first byte in local buffer
containing data to be copied to the remote buffer.
\item (IN) \texttt{src\_stride\_arr}: array of strides for local source data.
\item (IN) \texttt{dst\_ptr}: pointer to first byte in remote buffer into which local
data will be copied.
\item (IN) \texttt{dst\_stride\_arr}: array of strides for remote destination data.
\item (IN) \texttt{count}: number of elements at each stride level.
\texttt{count[0]} equals the number of bytes in each individual contiguous data segment.
\item (IN) \texttt{stride\_levels}: number of stride levels.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function is used to accumulate a strided data set from a local buffer to a
strided data set on a remote processor. The size and layout of the local and
remote buffers may be different, hence the stride arrays for the local and
remote buffers need to be specified separately. The \texttt{count} and
\texttt{stride\_levels} are the same for both buffers and only need to be
specified once. The local buffer on the calling process is not available for reuse
when the function returns and cannot be used until a non-blocking wait has been
called on the non-blocking handle. The function returns a value of 0
if successful.
\subsection{ARMCI\_NbPutV}
\begin{verbatim}
int ARMCI_NbPutV(armci_giov_t darray[],
                 int len,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, between which data
will be moved. Each array element corresponds to a different size of data
element that will be moved between processors.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This operation is designed to move a random collection of elements from the
calling process to a set of random locations on the remote process.
The local buffer on the calling process is not available for reuse when
the function returns and cannot safely be used until a non-blocking wait is
called on the non-blocking handle. The function returns a value of 0 if
successful.
\subsection{ARMCI\_NbGetV}
\begin{verbatim}
int ARMCI_NbGetV(armci_giov_t darray[],
                 int len,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, between which data
will be moved. Each array element corresponds to a different size of data
element that will be moved between processors.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor from which data is
being received.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This operation is designed to move a random collection of elements from a
remote process to a set of random locations on the calling process. The local
buffer is not available for use until a non-blocking wait is called on the
non-blocking handle. The function returns a value of 0 if successful.
\subsection{ARMCI\_NbAccV}
\begin{verbatim}
int ARMCI_NbAccV(int optype,
                 void *scale,
                 armci_giov_t darray[],
                 int len,
                 int proc,
                 armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{optype}: value specifies what type of data is being operated
on. Values are mapped to C-preprocessor macros described in the section on
blocking structured communication and defined in \texttt{armci.h}.
\item (IN) \texttt{darray}: an array of structures describing the
locations in memory on both the local and remote processes, representing the
elements that will be used in the accumulate operation. The operation specified
in the \texttt{optype} variable must apply to all data elements, which
implicitly requires that all data elements must be the same size, so only 1
element is needed in \texttt{darray}. However, for some applications, it may be
convenient to use multiple array elements to describe the data layout.
\item (IN) \texttt{len}: number of elements in \texttt{darray}.
\item (IN) \texttt{proc}: network rank of remote processor to which data is
being sent.
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This operation is designed to accumulate a random collection of elements from the
calling process to a set of random elements on the remote process.
The local buffer on the calling process is not available for reuse
when the function returns and cannot be used until a non-blocking wait has been
called on the non-blocking handle. The function returns a value of 0
if successful.
\subsection{ARMCI\_Wait}
\begin{verbatim}
int ARMCI_Wait(armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function forces local completion of a non-blocking communication operation.
For gets, this implies that the data has arrived in the local buffer and is
ready for use; for puts and accumulates, it implies that data in the local
buffer has been injected into the communication network and the local buffer is
ready for reuse. This call does not guarantee completion of the operation on the
remote processor. The function returns a value of 0 if successful.
\subsection{ARMCI\_Test}
\begin{verbatim}
int ARMCI_Test(armci_hdl_t *nbhandle)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{nbhandle}: non-blocking request handle
\end{itemize}
This function returns the current status of the non-blocking operation. If it
returns 0, the operation has not completed locally; if it returns a non-zero
value, the operation has completed locally. This can be used to determine if a subsequent
call to \texttt{ARMCI\_Wait} will result in a significant delay until the non-blocking
operation completes or whether \texttt{ARMCI\_Wait} will return immediately. It can be
used to further optimize latency-hiding strategies.
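For instance, a simple polling loop might look like the following sketch,
where \texttt{do\_other\_work} is a placeholder for any computation that does
not depend on the outstanding transfer:
\begin{verbatim}
#include "armci.h"

void poll_until_done(armci_hdl_t *handle, void (*do_other_work)(void))
{
    /* ARMCI_Test returns 0 while the operation is still
       incomplete locally */
    while (ARMCI_Test(handle) == 0) {
        do_other_work();
    }

    /* The operation has completed locally, so this wait
       returns immediately and clears the handle */
    ARMCI_Wait(handle);
}
\end{verbatim}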
\subsection{ARMCI\_WaitAll}
\begin{verbatim}
int ARMCI_WaitAll(void)
\end{verbatim}
This function forces all outstanding non-blocking operations originating from
the calling processor to complete. All non-blocking handles will be cleared. It
has the same effect as looping over all outstanding non-blocking calls on the
processor and calling \texttt{ARMCI\_Wait}. This function returns a value of 0 if
successful.
\subsection{ARMCI\_WaitProc}
\begin{verbatim}
int ARMCI_WaitProc(int proc)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{proc}: remote processor whose outstanding operations must complete
\end{itemize}
This function forces all outstanding non-blocking operations between the calling
processor and the process indicated in the \texttt{proc} argument to complete.
This function returns a value of 0 if successful.
\section{Errors}
The ARMCI interface only contains one function for handling errors. Calling this
function will result in the code exiting execution on all processors and
reporting an error message to standard I/O.
\subsection{ARMCI\_Error}
\begin{verbatim}
void ARMCI_Error(char *msg,
                 int code)
\end{verbatim}
\begin{itemize}
\item (IN) \texttt{msg}: descriptive error message
\item (IN) \texttt{code}: integer code describing error
\end{itemize}
This function can be called by any processor to halt execution of the code and
stop all processes. The user can supply a descriptive error message in the