
Add optional DUCC FFT support #463

Merged: 33 commits, Jul 11, 2024

Conversation

mreineck (Collaborator)

No description provided.

@mreineck (Collaborator Author)

I think I managed to fix all the build problems. Please double-check very carefully; many of my cmake-related changes are pure guesswork and can definitely be done more elegantly!

@mreineck (Collaborator Author)

OK, I give up on the Windows build ... the rest seems to be OK.

@mreineck (Collaborator Author)

@DiamonDinoia, I think this is ready for a first round of feedback!

@mreineck mreineck marked this pull request as ready for review June 23, 2024 07:23
@mreineck (Collaborator Author) left a comment

Added some explanations and questions.

CMakeLists.txt Outdated
@@ -1,6 +1,7 @@
cmake_minimum_required(VERSION 3.19)

project(finufft VERSION 2.2.0 LANGUAGES C CXX)
set(CMAKE_CXX_STANDARD 17)
Collaborator Author:

If we use this here, we can remove the C++17 flags everywhere else. Not sure if this is worth it ... just a suggestion.

Collaborator:

It is better to set it per target, since a global setting will influence projects that include this as a dependency.

@@ -225,8 +231,12 @@ typedef struct FINUFFT_PLAN_S { // the main plan object, fully C++
FLT *phiHat2; // " y-axis.
FLT *phiHat3; // " z-axis.

FFTW_CPX *fwBatch; // (batches of) fine grid(s) for FFTW to plan
#ifdef FINUFFT_USE_DUCC0
Collaborator Author:

If we don't need backwards compatibility in memory layout, these members can of course just go away.

Collaborator:

Backward compatibility with FFTW?

Collaborator Author:

Not to FFTW, but maybe for interfacing with other languages. I'm thinking about the Matlab interface, for example: if you have to describe the memory layout of the FINUFFT plan type there, it would not be good if it changed depending on which FFT library is in use.
Maybe I'm just being overly cautious here; should we just remove the entries when ducc is used and see if anything breaks?

p->fwBatch = FFTW_ALLOC_CPX(p->nf * p->batchSize); // the big workspace
#ifndef FINUFFT_USE_DUCC0
timer.restart();
p->fwBatch = (CPX *)FFTW_ALLOC_CPX(p->nf * p->batchSize); // the big workspace
Collaborator Author:

This cast is guaranteed to work correctly, since C++ guarantees the memory layout of std::complex to be compatible with the FFTW type.
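
(For illustration only, not code from this PR: the guarantee referred to is that std::complex<double> has the same {real, imag} layout as fftw_complex, i.e. double[2], so pointers to one may be reinterpreted as the other.)

#include <complex>
#include <cstddef>
#include <fftw3.h>

// std::complex<double> is guaranteed (since C++11) to be layout-compatible
// with double[2], which is exactly what fftw_complex is.
void fill_ones(fftw_complex *buf, std::size_t n) {
  auto *cbuf = reinterpret_cast<std::complex<double> *>(buf);
  for (std::size_t i = 0; i < n; ++i) cbuf[i] = {1.0, 0.0};
}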

@ahbarnett (Collaborator), Jul 11, 2024:

It's interesting that for DUCC0 you don't do the big workspace allocation (fwBatch) here at the plan stage, but seem to do it in execute. Wouldn't it be faster to do the allocation here just once (like we had)? Any explanation we're missing here?

src/finufft.cpp Outdated
timer.restart();
#ifdef FINUFFT_USE_DUCC0
Collaborator Author:

Question: I'm using opts.nthreads in the following, and if it is >1 but OpenMP is switched off, the code will still run in parallel. Do I need to safeguard against this case?

Collaborator:

Yes. If OpenMP is switched off, no parallelization should happen; it might conflict with users who parallelize by hand.

@DiamonDinoia (Collaborator), Jun 24, 2024:

If I build without OpenMP I get:

libfinufft.so: undefined reference to `pthread_create'
libfinufft.so: undefined reference to `pthread_getaffinity_np'
libfinufft.so: undefined reference to `pthread_setaffinity_np'
libfinufft.so: undefined reference to `pthread_join'

I can have CMake link against pthread. One question: are you using pthread directly in ducc0 (it is not Windows compatible...)? Shall we wrap this in ifdefs, so that if pthread is not linked there is no parallelism? (We can move this discussion to ducc0.)

Collaborator Author:

OK, I'll take extra care to switch off multithreading if OpenMP is not active!

I am using C++'s std::thread class for multithreading, which should in principle work out of the box; on Linux and macOS this is built on pthreads.
The best way to get multithreading when ducc is compiled in is to use -pthread for both compilation and linking; that should take care of everything. Not sure how it works on Windows, but it's possible that no flags are needed at all.

Collaborator Author:

OK, I'm now using min<size_t>(MY_OMP_GET_MAX_THREADS(), p->opts.nthreads) as the number of threads; that should work as expected in all circumstances, as far as I can see.
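
(For illustration only, not this PR's code: a self-contained sketch of that clamp. MY_OMP_GET_MAX_THREADS() is finufft's wrapper around omp_get_max_threads() that falls back to 1 without OpenMP; the helper name below is made up.)

#include <algorithm>
#include <cstddef>
#ifdef _OPENMP
#include <omp.h>
#define MY_OMP_GET_MAX_THREADS() omp_get_max_threads()
#else
#define MY_OMP_GET_MAX_THREADS() 1   // no OpenMP: never ask for extra threads
#endif

// hypothetical helper: clamp the requested thread count to what OpenMP allows
inline std::size_t ducc_nthreads(std::size_t requested) {
  return std::min<std::size_t>(MY_OMP_GET_MAX_THREADS(), requested);
}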

@DiamonDinoia (Collaborator)

Hi Martin,

This looks good. I will merge this into my branch and change the cmakefile. I did not pass -march=native and other flags to ducc, as I was thinking of writing a cmake for ducc and opening a PR; I am waiting for the right time for it.

I would like one change to happen: an fft.h (and an fft.cpp) that wraps all the FFT functionality, so that inside finufft.cpp|h there are no ifdefs (unless strictly necessary). That way, if in the future users want to use a third FFT, they can do so.

Inside fft.h, a wrapper function could look something like this:

void fft_execute(/* arg1, ..., argN: union of the FFTW and DUCC parameters */) {
#ifndef FINUFFT_USE_DUCC0
  fftw(arg1, arg2);                        // FFTW uses its subset of the arguments
#else
  ducc_fft(arg1, arg3 /*, ..., argN */);   // DUCC uses its subset
#endif
}

Basically we pass the union of the parameters and rely on the compiler to throw away the unused ones.
The other advantage is that the NUFFT logic becomes clearer: all the FFTW initialization and handling is under the hood, and the reader can see that we are simply doing an FFT of this vector.

Cheers,
Marco

@DiamonDinoia (Collaborator)

Hi Martin. I pushed some changes to https://github.com/DiamonDinoia/finufft/tree/switchable-fft; let me know how I can help.

@DiamonDinoia (Collaborator)

> OK, I give up on the Windows build ... the rest seems to be OK.

I would not worry about CI for now. I will handle the cmake-related changes at the end. Some of the CI uses makefiles, so I would need to update that too.

@mreineck (Collaborator Author)

> This looks good. I will merge this into my branch and change the cmakefile. I did not pass -march=native and other flags to ducc as I was thinking of writing a cmake for ducc and opening a PR. I am waiting for the right time for it.

Thank you for that! I'm happy to include this ... I just might have to come back to you for help from time to time, whenever something nontrivial changes :-)

Concerning the separation of the FFT into a new file: I think this is a good plan, but I'm not absolutely sure whether the timing is optimal. Assuming that the switch from FFTW #define magic to templates happens in the near future, it might be easier to do this separation afterwards, as the code should have become much simpler by then (as well as the compilation ... no more compiling files twice etc.). What do you think?
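
(A toy illustration of that direction, not finufft code: explicit instantiation of a templated core is what would replace compiling each file twice with FLT set to float or double.)

#include <cstdio>

// stand-in for a precision-dependent routine currently written in terms of FLT
template <typename TF> void report_tolerance(TF eps) {
  std::printf("eps = %g\n", static_cast<double>(eps));
}

// one explicit instantiation per precision replaces the dual compilation
template void report_tolerance<float>(float);
template void report_tolerance<double>(double);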

@mreineck (Collaborator Author)

Sorry, false alarm ... I had actually taken care of batched FFTs, phew!

ker is real-valued FLT array of length nf1/2+1.

Single thread only, but shouldn't matter since mostly data movement.

It has been tested that the repeated floating division in this inner loop
only contributes at the <3% level in 3D relative to the fftw cost (8 threads).
only contributes at the <3% level in 3D relative to the FFT cost (8 threads).
This could be removed by passing in an inverse kernel and doing mults.

todo: rewrite w/ C++-complex I/O, check complex divide not slower than
Collaborator:

remember to remove this todo post-merge

src/finufft.cpp Outdated
@@ -1092,6 +1073,13 @@ int FINUFFT_EXECUTE(FINUFFT_PLAN p, CPX *cj, CPX *fk) {
CNTime timer;
timer.start();

#ifdef FINUFFT_USE_DUCC0
std::vector<CPX> fwBatch_(p->nf * p->batchSize); // the big workspace
Collaborator:

See above comment - could you explain why this isn't better done in the plan stage?

Collaborator Author:

The motivation is to reduce memory consumption of plans. I remember the HVOX people complaining at some point that finufft plans are huge, and with FFTW this is not so easy to avoid. With DUCC it's easier and I think it's worth the slight performance penalty.

@DiamonDinoia (Collaborator), Jul 11, 2024:

I guess we could always use a caching allocator to avoid the issue. I'd rather keep the allocation in the plan for now and maybe switch to this with a flag or something. I do prefer having the same behavior for now.
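
(One possible shape of the caching-allocator idea, purely a sketch and not code proposed in this PR; it is not thread-safe as written.)

#include <complex>
#include <cstddef>
#include <vector>

// Reuse a single workspace across execute calls instead of allocating each time.
// A real version would need locking or per-plan storage.
std::complex<double> *cached_workspace(std::size_t n) {
  static std::vector<std::complex<double>> cache;
  if (cache.size() < n) cache.resize(n);   // grow only when a bigger grid is needed
  return cache.data();
}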

Collaborator Author:

If you are really sure ... for me this was one of the two big selling points of the patch (the other being not waiting for the FFTW planner).

Collaborator Author:

It's reverted now ... but I sincerely hope that you will change your mind again! :-)

Collaborator:

OK, interesting that this was a selling point. Let's think about where this big malloc is done in the future, as a separate issue. We would like to understand how the HVOX people are using the plan interface. The big malloc has to be done somewhere, because the FFT is about to write into it! But if one were to keep multiple plans around that were not going to be executed simultaneously, then, yes, it could be wasteful of RAM - but that seems like a weird use-case. Maybe people are using FINUFFT in a way I hadn't anticipated.

Collaborator Author:

A very common situation would be iterative approximate inversion of a NUFFT via conjugate gradient or similar. For that you need (I think) two FINUFFT plans that are used alternatingly, and therefore there is always an overhead of a fine grid (and a copy of the coordinates of nonuniform points, but that's a different topic again).
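
(To make that use-case concrete, a rough sketch using the public guru interface with toy data; both plans, each holding its own fine grid, stay alive for the whole iteration.)

#include <complex>
#include <cstdint>
#include <vector>
#include <finufft.h>

int main() {
  const int64_t M = 1000, N = 256;              // toy sizes
  int64_t nmodes[3] = {N, 1, 1};
  std::vector<double> x(M, 0.5);                // nonuniform points (placeholder data)
  std::vector<std::complex<double>> c(M), f(N, {1.0, 0.0});

  finufft_plan fwd, adj;                        // "forward" (type 2) and "adjoint" (type 1)
  finufft_makeplan(2, 1, nmodes, +1, 1, 1e-9, &fwd, nullptr);
  finufft_makeplan(1, 1, nmodes, -1, 1, 1e-9, &adj, nullptr);
  finufft_setpts(fwd, M, x.data(), nullptr, nullptr, 0, nullptr, nullptr, nullptr);
  finufft_setpts(adj, M, x.data(), nullptr, nullptr, 0, nullptr, nullptr, nullptr);

  for (int it = 0; it < 5; ++it) {              // stand-in for CG iterations
    finufft_execute(fwd, c.data(), f.data());   // c = A f   (modes -> NU points)
    finufft_execute(adj, c.data(), f.data());   // f = A^H c (NU points -> modes)
  }
  finufft_destroy(fwd);
  finufft_destroy(adj);
  return 0;
}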

Contributor:

> The motivation is to reduce memory consumption of plans. I remember the HVOX people complaining at some point that finufft plans are huge, and with FFTW this is not so easy to avoid. With DUCC it's easier and I think it's worth the slight performance penalty.

@mreineck Just to clarify, we don't complain that FINUFFT plans are huge in HVOX, but rather that exploiting sparsity in HVOXv1 requires creating multiple plans of different sizes, which is time-consuming when planning 100+ NUFFTs since planning cannot be done in parallel due to it not being thread-safe. Also, depending on what your (x,z) point distribution is, the FFTs may indeed be huge.

That being said, we don't need to do this anymore: a single plan suffices now in HVOXv2. [There is no HVOXv2 paper at the moment, but you can read about this in my thesis soon. We'll draft a v2 paper proper afterwards.]

Collaborator Author:

Thank you very much for the clarification! I'm definitely looking forward to reading about the improved algorithm!


add_library(finufft_f64 OBJECT ${FINUFFT_PRECISION_DEPENDENT_SOURCES})
target_compile_definitions(finufft_f64 PRIVATE)
set_finufft_options(finufft_f64)
target_link_libraries(finufft_f64 PUBLIC ${FINUFFT_FFTLIBS})
target_link_libraries(finufft_f64 PRIVATE xsimd)
Member:

@DiamonDinoia Marco, the reason Windows cannot find xsimd seems to be here: this line was moved out of the set_finufft_options function, and the if(WIN32) below does not link xsimd.

Collaborator:

Great, thanks for fixing it!

@mreineck (Collaborator Author)

BTW, if you are running benchmarks on your hardware comparing the two FFTs within finufft I'd be extremely interested to see the results!

@ahbarnett (Collaborator)

Did anyone run basic timing tests on this? I just ran benchmarks on a 5700U AMD laptop, 1 thread, but I must be doing something wrong, since ducc0 is 50x slower than the fftw version. This appears to be mostly due to spreading (which shouldn't be different!) but also the fft. I am confused!

I cloned via

git fetch origin pull/463/head:ducc
git checkout ducc

Here's using the makefile build (hence fftw), showing just the debug=2 for a small 3d1:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6 2
test 3d type 1:
[finufft_makeplan] new plan: FINUFFT version 2.2.0 .................
[finufft_makeplan] set auto upsampfac=2.00
setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
[finufft_makeplan] 3d1: (ms,mt,mu)=(100,100,100) (nf1,nf2,nf3)=(200,200,200)
               ntrans=1 nthr=1 batchSize=1 
[finufft_makeplan] kernel fser (ns=7):		7.9e-05 s
[finufft_makeplan] fwBatch 0.13GB alloc:   	1.6e-05 s
[finufft_makeplan] FFTW plan (mode 64, nthr=1):	0.000795 s
[finufft_setpts] spreadcheck (1):	1e-06 s
	sorted (1 threads):	0.0146 s
[finufft_setpts] sort (didSort=1):		0.0147 s
[finufft_execute] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute] start batch 0 (size 1):
	spread 3D (M=1000000; N1=200,N2=200,N3=200), nthr=1
	zero output array	0.0575 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.239 s (10 subprobs)
	FFT exec:		0.121 s
[finufft_execute] done. tot spread:		0.296 s
               tot FFT:				0.121 s
               tot deconvolve:			0.00954 s
     1000000 NU pts to (100,100,100) modes in 0.455 s 	2.2e+06 NU pts/s
	one mode: rel err in F[37,26,-39] is 4.43e-07

Typical 1-thread perf. Same as for master. Now for ducc I have to use cmake, so did:

mkdir build
cd build/
cmake .. -D FINUFFT_BUILD_TESTS=ON -D FINUFFT_USE_DUCC0=ON
cmake --build . -j
ctest

I see ducc0 is built (no error msgs). Did I get the flags right?
Then:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 build/test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6 2
test 3d type 1:
[finufft_makeplan] new plan: FINUFFT version 2.2.0 .................
[finufft_makeplan] set auto upsampfac=2.00
setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
[finufft_makeplan] 3d1: (ms,mt,mu)=(100,100,100) (nf1,nf2,nf3)=(200,200,200)
               ntrans=1 nthr=1 batchSize=1 
[finufft_makeplan] kernel fser (ns=7):		0.000184 s
[finufft_makeplan] fwBatch 0.13GB alloc:   	2.3e-05 s
[finufft_setpts] spreadcheck (1):	0 s
	sorted (1 threads):	0.0589 s
[finufft_setpts] sort (didSort=1):		0.059 s
[finufft_execute] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute] start batch 0 (size 1):
	spread 3D (M=1000000; N1=200,N2=200,N3=200), nthr=1
	zero output array	0.0694 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	12.7 s (10 subprobs)
	FFT exec:		2.68 s
[finufft_execute] done. tot spread:		12.7 s
               tot FFT:				2.68 s
               tot deconvolve:			0.0131 s
     1000000 NU pts to (100,100,100) modes in 15.5 s 	6.45e+04 NU pts/s
	one mode: rel err in F[37,26,-39] is 4.43e-07

So, I'm hoping it's just bad math flags in this cmake build, or something? (The error is fine, but not the 50x slow-down...) :(
I will check cmake non-ducc build now...

@@ -36,6 +36,7 @@ option(FINUFFT_USE_CUDA "Whether to build CUDA accelerated FINUFFT library (libc
option(FINUFFT_USE_CPU "Whether to build the ordinary FINUFFT library (libfinufft)." ON)
option(FINUFFT_STATIC_LINKING "Whether to link the static FINUFFT library (libfinufft_static)." ON)
option(FINUFFT_BUILD_DEVEL "Whether to build development executables" OFF)
option(FINUFFT_USE_DUCC0 "Whether to use DUCC0 for CPU fft library" ON)
Collaborator:

Ah, this has to be OFF by default, since users will need to be eased into this over time, if it turns out to be useful for them.

@ahbarnett (Collaborator)

I now used fresh

cmake .. -D FINUFFT_BUILD_TESTS=ON  
cmake --build . -j

Weirdly, it again downloads and builds ducc (it should not, since I assumed fftw was the default).
Oh, I see ducc0 is the default!! (We have to make fftw the default so as not to shock the users...)

cmake .. -D FINUFFT_BUILD_TESTS=ON -D FINUFFT_USE_DUCC0=OFF

Now I get the FFTW version, which shows the FFTW timing as expected, but again a crippled spreader (50x slower):

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=1 build/test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6 2
test 3d type 1:
[finufft_makeplan] new plan: FINUFFT version 2.2.0 .................
[finufft_makeplan] set auto upsampfac=2.00
setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
[finufft_makeplan] 3d1: (ms,mt,mu)=(100,100,100) (nf1,nf2,nf3)=(200,200,200)
               ntrans=1 nthr=1 batchSize=1 
[finufft_makeplan] kernel fser (ns=7):		0.00018 s
[finufft_makeplan] fwBatch 0.13GB alloc:   	1.4e-05 s
[finufft_makeplan] FFTW plan (mode 64, nthr=1):	0.00079 s
[finufft_setpts] spreadcheck (1):	0 s
	sorted (1 threads):	0.0591 s
[finufft_setpts] sort (didSort=1):		0.0591 s
[finufft_execute] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute] start batch 0 (size 1):
	spread 3D (M=1000000; N1=200,N2=200,N3=200), nthr=1
	zero output array	0.0695 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	12.7 s (10 subprobs)
	FFT exec:		0.12 s
[finufft_execute] done. tot spread:		12.8 s
               tot FFT:				0.12 s
               tot deconvolve:			0.0132 s
     1000000 NU pts to (100,100,100) modes in 13 s 	7.71e+04 NU pts/s
	one mode: rel err in F[37,26,-39] is 4.43e-07

I just rechecked master, and the cmake build there has an equally crippled spreader. So there is a cmake problem in master (the makefile is, as usual, fine).

@mreineck (Collaborator Author)

Just curious: how do you build the ducc0 version using make? I haven't figured this out yet...

@ahbarnett (Collaborator)

Finally:

cmake .. -D FINUFFT_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Release

gives:

(base) alex@ross /home/alex/numerics/finufft/build> OMP_NUM_THREADS=1 test/finufft3d_test 1e2 1e2 1e2 1e6 1e-6 2
test 3d type 1:
[finufft_makeplan] new plan: FINUFFT version 2.2.0 .................
[finufft_makeplan] set auto upsampfac=2.00
setup_spreader (kerevalmeth=1) eps=1e-06 sigma=2: chose ns=7 beta=16.1
[finufft_makeplan] 3d1: (ms,mt,mu)=(100,100,100) (nf1,nf2,nf3)=(200,200,200)
               ntrans=1 nthr=1 batchSize=1 
[finufft_makeplan] kernel fser (ns=7):		7.6e-05 s
[finufft_makeplan] fwBatch 0.13GB alloc:   	2.4e-05 s
[finufft_setpts] spreadcheck (1):	1e-06 s
	sorted (1 threads):	0.0141 s
[finufft_setpts] sort (didSort=1):		0.0141 s
[finufft_execute] start ntrans=1 (1 batches, bsize=1)...
[finufft_execute] start batch 0 (size 1):
	spread 3D (M=1000000; N1=200,N2=200,N3=200), nthr=1
	zero output array	0.0575 s
	capping subproblem sizes to max of 100000
	t1 fancy spread: 	0.239 s (10 subprobs)
	FFT exec:		0.0481 s
[finufft_execute] done. tot spread:		0.297 s
               tot FFT:				0.0481 s
               tot deconvolve:			0.00965 s
     1000000 NU pts to (100,100,100) modes in 0.381 s 	2.62e+06 NU pts/s
	one mode: rel err in F[37,26,-39] is 4.43e-07

Spreading back to normal, and ducc0 fft w/ 3d-zero-patterns is 2.5x as fast as fftw - good!

@mreineck (Collaborator Author)

Nice - that means the FFT must be intrinsically faster (for this particular setup) than FFTW, since the savings can never give you more than a factor of 2.

@ahbarnett (Collaborator)

One thread, double precision, using M=1e2 so as to make spread/interp negligible; thus essentially all the time is FFT:

OMP_NUM_THREADS=1 test/finufft3d_test 300 300 300 1e2 1e-6 0 2 2.0

etc. Thus FFT-dominated. (Since the timing of zeroing the array vs. the fft could be mixed up, I just compared total times.)

3d1, upsampfac=1.25: ducc0 with zero-pattern is 3x faster than fftw (or 7x if looking at the fft time in the debug=1 output, but I don't know if I believe that..)
upsampfac=2.0: 2x (weird, since this ratio should be better for larger upsampfac...)

8-threads (no HT):
upsampfac = 1.25: 1.6x
upsampfac = 2.0: 1.25x

Excellent!

@mreineck (Collaborator Author)

The particular timings depend very strongly on the prime factors of the array sizes ... there won't be a uniform runtime ratio between ducc0.fft and FFTW, since they are internally really different.

@ahbarnett (Collaborator)

2D, double-prec: OMP_NUM_THREADS=1 test/finufft2d_test 5e3 5e3 1e2 1e-6 0 2 2.0

1 thread:
up=1.25: 2x
up=2.0: 1.8x

8 threads:
up=1.25: 1.1x (not much diff)
up=2.0: 1.3x

1D, double-prec: OMP_NUM_THREADS=1 test/finufft1d_test 1e7 1e2 1e-6 0 2 2.0

1 thread:
up=1.25: 1.0x (no diff)
up=2.0: 1.4x

8 threads:
up=1.25: 0.55x (ducc0 slower than fftw) - and looking at debug=1, it seems ducc0=0.3s but fftw=0.1s, i.e. 3x slower :(
up=2.0: 1.1x (but debug=1 shows that is because the fftw plan is slow, even though its transform is faster than ducc's; need to test multiple transforms to amortize this)

So, a pretty good story for ducc0 so far, for the case of a single transform with the FFTW_ESTIMATE planning time included.
Need to test batched transforms now...

@ahbarnett (Collaborator)

Hi Martin,
so, could we edit next235even to be more favorable to ducc0? We get to tweak the nf{1,2,3} sizes :)
Best, Alex

@mreineck (Collaborator Author)

Don't expect miracles from ducc in 1D; this is where it's worst. Batched 1D might be better though.

@mreineck (Collaborator Author) commented Jul 11, 2024

The output of next235even should already be pretty good for ducc; I don't think there's much to be gained there.
(By saying that the two libraries are quite different internally I don't mean that the principles are different. But for example, FFTW goes to great lengths to optimize lengths that are a large power of 2, spending thousands of lines of code on that particular case; ducc just is not that ambitious...)
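
(For readers not familiar with it: next235even picks the smallest even fine-grid length, at or above the requested one, whose prime factors are all 2, 3 or 5. A sketch of the idea, not the actual finufft source:)

#include <cstdint>
#include <initializer_list>

static bool only235(int64_t n) {
  for (int64_t p : {2, 3, 5})
    while (n % p == 0) n /= p;
  return n == 1;
}

int64_t next235even(int64_t n) {           // same name as finufft's helper; body is a sketch
  int64_t m = (n < 2) ? 2 : n + (n & 1);   // round up to an even number
  while (!only235(m)) m += 2;
  return m;
}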

@ahbarnett (Collaborator)

batched 1d (8-threads, batchsize=8):

(base) alex@ross /home/alex/numerics/finufft/build> OMP_NUM_THREADS=8 ../test/finufft1dmany_test 32 1e7 1e2 1e-6 0 0 0 2 1.25
test 1d1 many vs repeated single: ------------------------------------
ntr=32: 100 NU pts to 10000000 modes in 3.38 s  	948 NU pts/s
	one mode: rel err in F[3700000] of trans#31 is 1.3e-07
^C
(base) alex@ross /home/alex/numerics/finufft/build> OMP_NUM_THREADS=8 test/finufft1dmany_test 32 1e7 1e2 1e-6 0 0 0 2 1.25
test 1d1 many vs repeated single: ------------------------------------
ntr=32: 100 NU pts to 10000000 modes in 4.9 s  	654 NU pts/s
	one mode: rel err in F[3700000] of trans#31 is 1.3e-07
^C
(base) alex@ross /home/alex/numerics/finufft/build> em ../src/finufft.cpp 
(base) alex@ross /home/alex/numerics/finufft/build> OMP_NUM_THREADS=8 ../test/finufft1dmany_test 32 1e7 1e2 1e-6 0 0 0 2 2.0
test 1d1 many vs repeated single: ------------------------------------
ntr=32: 100 NU pts to 10000000 modes in 13.1 s  	244 NU pts/s
	one mode: rel err in F[3700000] of trans#31 is 1.02e-07
^C
(base) alex@ross /home/alex/numerics/finufft/build> OMP_NUM_THREADS=8 test/finufft1dmany_test 32 1e7 1e2 1e-6 0 0 0 2 2.0
test 1d1 many vs repeated single: ------------------------------------
ntr=32: 100 NU pts to 10000000 modes in 10.4 s  	308 NU pts/s
	one mode: rel err in F[3700000] of trans#31 is 1.02e-07
^C

This shows that for nf1=12500000 ducc0 is 0.7x the fftw speed (i.e. slower), but at nf1=20000000 ducc0 is 1.3x (i.e. faster).
Strange!

But it is quite exciting for 2d,3d.

@ahbarnett (Collaborator)

I will have a shot at adding ducc0 to my makefile, but I see no reason not to merge this branch now. I will set the cmake default to fftw for now. We also need to update docs to reflect the ducc0 option. Best, Alex

@ahbarnett ahbarnett merged commit ad02250 into flatironinstitute:master Jul 11, 2024
9 checks passed
@DiamonDinoia DiamonDinoia added this to the 3.0 milestone Jul 17, 2024