Add optional DUCC FFT support #463
Conversation
I think I managed to fix all build problems. Please double-check very carefully; many of my cmake-related changes are pure guesswork and can definitely be done more elegantly!
OK, I give up on the Windows build ... the rest seems to be OK.
@DiamonDinoia, I think this is ready for a first round of feedback!
Added some explanations and questions.
CMakeLists.txt (Outdated)
@@ -1,6 +1,7 @@
cmake_minimum_required(VERSION 3.19)

project(finufft VERSION 2.2.0 LANGUAGES C CXX)
set(CMAKE_CXX_STANDARD 17)
If we use this here, we can remove the C++17 flags everywhere else. Not sure if this is worth it ... just a suggestion.
It is better to set it by target as this will influence projects that include this as a dependency.
include/finufft/defs.h (Outdated)
@@ -225,8 +231,12 @@ typedef struct FINUFFT_PLAN_S { // the main plan object, fully C++
FLT *phiHat2; // " y-axis.
FLT *phiHat3; // " z-axis.

FFTW_CPX *fwBatch; // (batches of) fine grid(s) for FFTW to plan
#ifdef FINUFFT_USE_DUCC0
If we don't need backwards compatibility in memory layout, these members can of course just go away.
backward compatibility to fftw?
Not to FFTW, but maybe for interfacing with other languages; I'm thinking about the Matlab interface, for example. If you have to describe the memory layout of the FINUFFT plan type there, it would not be good if it changes depending on which FFT library is used.
Maybe I'm just overly cautious here; should we just remove the entries when ducc is used and see if anything breaks?
p->fwBatch = FFTW_ALLOC_CPX(p->nf * p->batchSize); // the big workspace
#ifndef FINUFFT_USE_DUCC0
timer.restart();
p->fwBatch = (CPX *)FFTW_ALLOC_CPX(p->nf * p->batchSize); // the big workspace
This cast is guaranteed to work correctly, since C++ guarantees the memory layout of std::complex to be compatible with the FFTW type.
It's interesting that for DUCC0 you don't do the big workspace allocation (fwBatch) here at the plan stage, but seem to do it in execute. Wouldn't it be faster to do the allocation here once (like we had)? Any explanation we're missing here?
src/finufft.cpp (Outdated)
timer.restart();
#ifdef FINUFFT_USE_DUCC0
Question: I'm using opts.nthreads in the following, and if it is >1 but OpenMP is switched off, the code will still run in parallel. Do I need to safeguard against this case?
Yes, if OpenMP is switched off no parallelization should happen. It might conflict with users who parallelize by hand.
If I build without openMP I get:
libfinufft.so: undefined reference to `pthread_create'
libfinufft.so: undefined reference to `pthread_getaffinity_np'
libfinufft.so: undefined reference to `pthread_setaffinity_np'
libfinufft.so: undefined reference to `pthread_join'
I can have CMake link against pthread. I have a question: are you using pthread directly in ducc0 (it is not Windows compatible...)? Shall we have this wrapped in ifdefs, so that if pthread is not linked there is no parallelism? (We can move this discussion to ducc0.)
OK, I'll take extra care to switch off multithreading if OpenMP is not active!
I am using C++'s std::thread class for multithreading, which should in principle work out of the box. On Linux and MacOS this is built on pthreads.
The best way to get multithreading for ducc compiled in is to use -pthread for both compilation and linking; that should take care of everything. Not sure how it works on Windows, but it's possible that no flags are needed at all.
OK, I'm now using min<size_t>(MY_OMP_GET_MAX_THREADS(), p->opts.nthreads) as the number of threads; that should work as expected in all circumstances, as far as I can see.
Hi Martin, This looks good. I will merge this into my branch and change the cmakefile. I did not pass -march=native and other flags to ducc as I was thinking of writing a cmake for ducc and opening a PR; I am waiting for the right time for it. I would like one change to happen: I would like to have an fft.h (and an fft.cpp) that wraps all the fft functionality, so that inside finufft.cpp|h there are no FFT-specific #ifdefs; inside fft.h, when writing a wrapper function, we can have functions that do something like the following.
Basically passing the union of the parameters and relying on the compiler to throw away the unused ones. Cheers,
Hi Martin. I pushed some changes to https://github.com/DiamonDinoia/finufft/tree/switchable-fft; let me know how I can help.
I would not worry about CI for now. I will handle the cmake-related changes at the end. Some of the CI uses makefiles, so I would need to update those too.
Thank you for that! I'm happy to include this ... I just might have to come back to you for help from time to time, whenever something nontrivial changes :-) Concerning the separation of the FFT into a new file: I think this is a good plan, but I'm not absolutely sure whether the timing is optimal. Assuming that the switch from FFTW ...
Sorry, false alarm ... I had actually taken care of batched FFTs, phew!
ker is real-valued FLT array of length nf1/2+1.

Single thread only, but shouldn't matter since mostly data movement.

It has been tested that the repeated floating division in this inner loop
only contributes at the <3% level in 3D relative to the FFT cost (8 threads).
This could be removed by passing in an inverse kernel and doing mults.

todo: rewrite w/ C++-complex I/O, check complex divide not slower than
Remember to remove this todo post-merge.
src/finufft.cpp (Outdated)
@@ -1092,6 +1073,13 @@ int FINUFFT_EXECUTE(FINUFFT_PLAN p, CPX *cj, CPX *fk) {
CNTime timer;
timer.start();

#ifdef FINUFFT_USE_DUCC0
std::vector<CPX> fwBatch_(p->nf * p->batchSize); // the big workspace
See above comment - could you explain why this isn't better done in the plan stage?
The motivation is to reduce memory consumption of plans. I remember the HVOX people complaining at some point that finufft plans are huge, and with FFTW this is not so easy to avoid. With DUCC it's easier and I think it's worth the slight performance penalty.
I guess we could always use a caching allocator to avoid the issue. I'd rather keep the allocation in the plan for now and maybe switch to this with a flag or something. I do prefer having the same behavior for now.
If you are really sure ... for me this was one of the two big selling points of the patch (the other being not waiting for the FFTW planner).
It's reverted now ... but I sincerely hope that you will change your mind again! :-)
Ok, interesting that this was a selling point. Let's think about where this big malloc is done in the future, as a separate issue. We would like to understand how HVOX people are using the plan interface. The big malloc has to be done somewhere because the FFT is about to write into it! But if one were to keep multiple plans around that were not going to be done simultaneously, then, yes, it could be wasteful of RAM - but that seems like a weird use-case. Maybe people are using FINUFFT in a way I hadn't anticipated.
A very common situation would be iterative approximate inversion of a NUFFT via conjugate gradient or similar. For that you need (I think) two FINUFFT plans that are used alternatingly, and therefore there is always an overhead of a fine grid (and a copy of the coordinates of nonuniform points, but that's a different topic again).
The motivation is to reduce memory consumption of plans. I remember the HVOX people complaining at some point that finufft plans are huge, and with FFTW this is not so easy to avoid. With DUCC it's easier and I think it's worth the slight performance penalty.
@mreineck Just to clarify, we don't complain that FINUFFT plans are huge in HVOX, but rather that exploiting sparsity in HVOXv1 requires creating multiple plans of different sizes, which is time-consuming when planning 100+ NUFFTs since planning cannot be done in parallel due to it not being thread-safe. Also, depending on what your (x,z) point distribution is, the FFTs may indeed be huge.
That being said we don't need to do this anymore: a single plan suffices now in HVOXv2. [There is no HVOXv2 paper at the moment, but you can read this in my thesis soon. We'll draft a v2 paper proper afterwards.]
Thank you very much for the clarification! I'm definitely looking forward to reading about the improved algorithm!
add_library(finufft_f64 OBJECT ${FINUFFT_PRECISION_DEPENDENT_SOURCES})
target_compile_definitions(finufft_f64 PRIVATE)
set_finufft_options(finufft_f64)
target_link_libraries(finufft_f64 PUBLIC ${FINUFFT_FFTLIBS})
target_link_libraries(finufft_f64 PRIVATE xsimd)
@DiamonDinoia Marco, the reason Windows cannot find xsimd seems to be here: this line was moved out of the set_finufft_options function, and the if(WIN32) block below does not link xsimd.
Great, thanks for fixing it!
BTW, if you are running benchmarks on your hardware comparing the two FFTs within finufft I'd be extremely interested to see the results!
Did anyone run basic timing tests on this? I just ran benchmarks on a 5700U AMD laptop, 1 thread, but I must be doing something wrong, since ducc0 is 50x slower than the fftw version. This appears to be mostly due to spreading (which shouldn't be different!) but also the fft. I am confused! I cloned via
Here's using the makefile build (hence fftw), showing just the debug=2 output for a small 3d1:
Typical 1-thread performance, same as for master. Now for ducc I have to use cmake, so I did:
I see ducc0 is built (no error messages). Did I get the flags right?
So, I'm hoping bad math flags are in this cmake, or something? (The error is fine, but not the 50x slow-down...) :(
@@ -36,6 +36,7 @@ option(FINUFFT_USE_CUDA "Whether to build CUDA accelerated FINUFFT library (libc
option(FINUFFT_USE_CPU "Whether to build the ordinary FINUFFT library (libfinufft)." ON)
option(FINUFFT_STATIC_LINKING "Whether to link the static FINUFFT library (libfinufft_static)." ON)
option(FINUFFT_BUILD_DEVEL "Whether to build development executables" OFF)
option(FINUFFT_USE_DUCC0 "Whether to use DUCC0 for CPU fft library" ON)
Ah, this has to be OFF by default, since users will need to be eased into this over time, if it turns out to be useful for them.
I now used fresh
weirdly, it again downloads ducc and builds it (it should not, since fftw is default).
Now I get the FFTW version, which shows FFTW timing as expected, but again a crippling spread (50x slower):
I just rechecked master, and the cmake build also has an equally crippled spreader. So there is a cmake problem in master (the makefile is, as usual, fine).
Just curious: how do you build the ducc0 version using ...
Finally:
gives:
Spreading back to normal, and ducc0 fft w/ 3d-zero-patterns is 2.5x as fast as fftw - good!
Nice - that means the FFT must be intrinsically faster (for this particular setup) than FFTW, since the savings can never give you more than a factor of 2.
one thread, double-prec. using M=1e2 so as to make spread/interp negligible. Thus all time is FFT
etc. Thus FFT-dominated. (Since the timing of zeroing the array vs the fft could be mixed, I just did total times.) 3d1, upsampfac=1.25: ducc0_with_zero-pattern 3x faster than fftw (or 7x if looking at fft time in the debug=1 output, but I don't know if I believe that..) 8 threads (no HT): Excellent!
The particular timings depend very strongly on the prime factors of the array sizes ... there won't be a uniform runtime ratio between ducc0.fft and FFTW, since they are internally really different.
2D, double-prec: 1 thread: 8 threads: 1D, double-prec: 1 thread: 8 threads: So, pretty good story for ducc0 so far, in the case of a single fftw run with the FFTW_ESTIMATE plan time included.
Hi Martin,
Don't expect miracles from ducc in 1D; this is where it's worst. Batched 1D might be better though.
The output of next235even should already be pretty good for ducc; I don't think there's much to be gained there.
batched 1d (8 threads, batchsize=8):
Shows that for nf1=12500000 ducc0 is 0.7x the speed of fftw (i.e. slower), but at nf1=20000000 ducc0 is 1.3x (faster). But it is quite exciting for 2d, 3d.
I will have a shot at adding ducc0 to my makefile, but I see no reason not to merge this branch now. I will set the cmake default to fftw for now. We also need to update the docs to reflect the ducc0 option. Best, Alex