Perform spectral deconvolution on a CUDA enabled machine.
Three external libraries are required to build sdcuda: CCfits, CUDA and CULA.
CULA is a proprietary library that is free to use for academic purposes.
To build from scratch, first switch the active build to x64:
Build -> Configuration Manager and switch Active Solution Platform to x64.
Then add the libraries and their headers to the relevant build directories:
In Project -> sdcuda Properties -> Configuration Properties -> C/C++ -> General -> Additional Include Directories, add the necessary include paths, e.g.
C:\cfitsio\cfitsio;C:\CCfits\CCfits\..;C:\CCfits\CCfits;C:\CULA\include;C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5common\inc;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5\include\
and similarly, add the library paths to Project -> sdcuda Properties -> Configuration Properties -> Linker -> General -> Additional Library Directories, e.g.
$(CudaToolkitLibDir);C:\CCfits\CCfits.build\Release;C:\cfitsio\cfitsio.build\Release;C:\CULA\lib
then add the specific libraries required to Project -> sdcuda Properties -> Configuration Properties -> Linker -> Input -> Additional Dependencies, e.g.
cublas.lib;cufft.lib;cudart_static.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;comdlg32.lib;advapi32.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;odbc32.lib;odbccp32.lib;kernel32.lib;user32.lib;gdi32.lib;winspool.lib;shell32.lib;ole32.lib;oleaut32.lib;uuid.lib;comdlg32.lib;advapi32.lib;C:\CCfits\CCfits.build\Release\CCfits.lib;C:\cfitsio\cfitsio.build\Release\cfitsio.lib;C:\CULA\lib\culapack.lib;C:\CULA\lib\culapack_link.lib
The CULA and CUDA .dlls need to be placed in the executable directory (cublas64_XX, cudart64_XX, cula_sparse, culapack, culapack_link).
The program's configuration parameters are stored in the distribution's config.xml file.
There are three distinct sections under the XML root: <host>, <device> and <process>.
The host tag defines attributes related to the GPU processing. A host item is encapsulated with a <param> tag, each of which should have <name>, <value> and <description> tags. The description tag is purely aesthetic.
Recognised parameter name/values are:
- nCPUCORES (int) - The number of CPU cores to use in multiprocessing. Setting this parameter too high will move the bottleneck to CPU memory.
The device tag defines attributes related to the GPU processing. A device item is encapsulated with a <param> tag, each of which should have <name>, <value>
and <description> tags. The description tag is purely aesthetic. Recognised parameter names (and their expected value types) are:
- nCUDABLOCKS (int) - The number of CUDA blocks.
- nCUDATHREADSPERBLOCK (int) - The number of threads per CUDA block.
The process tag defines the sequence of stages in the pipeline. Each process item is encapsulated with a <stage> tag, each of which should have a <name> tag.
Recognised stage names can be found in the enum process_stages in cprocess.h. Stages will be actioned in the order they are placed in the file.
The binary is called with the following command line parameters (with corresponding flags) required:
- The input FITS file path (-i)
- The simulation parameters file path (-p)
- The configuration file path (-c)
- The output file path (-o)
e.g. sdcuda -i "C:\Users\barnsley\Desktop\HARMONI-HC_data\in.fits" -p "C:\Users\barnsley\Desktop\HARMONI-HC_data\parameters.xml" -c "C:\Users\barnsley\Documents\Visual Studio 2013\Projects\sdcuda\config.xml" -o "C:\Users\barnsley\Desktop\HARMONI-HC_data\out.fits"
On program execution, a clparser instance is invoked to parse the command line parameters. On success, an input instance is then spawned. This instance reads
in the input FITS file, required simulation parameters and builds the process chain.
Following this, a loop is entered whereby new process instances are spawned while the number of concurrent running processes < nCPUCORES. Each process
instance reduces a single detector integration time in the input 4D cube (x, y, DIT, wavelength). The separate integration times correspond to different
rotator positions (as required by ADI).
The primary constructs for the program are derived from the cube class type, namely hcube and dcube. The prefixes of these classes denote
where the data will physically reside, either on the host (h) or on the device (d), and are common nomenclature elsewhere too.
A cube instance is defined as a datacube in the traditional sense, with two spatial axes and one spectral, and contains a vector container of spslice type.
The spslice class has two derived classes, hspslice and dspslice, both of which house a pointer, p_data, to the block of memory containing the data for
the corresponding slice.
To create a new process stage, the following sequence of events should be followed:
- Add a unique process identifier to
process_stagesincprocess.h. - Add a new case to
process::stepincprocess.cpp. This may require adding additional functions and prototypes tocprocess.cppandcprocess.h. - Add the corresponding XML stage name -> process_stage mapping to
process_stages_mappingincinput.h.
If a call to a function to be performed on the device is required, it will be necessary to define a new device function. A device function is called via:
cudacalls.cuh function > _global_ cdevice.cuh function -> async _device_ cdevice.cuh function
The following sequence of events should be followed:
- Add the function prototype to
cudacalls.cuh. The first two parameters of the function call will always be nCUDABLOCKS and nCUDATHREADSPERBLOCK, so they should always haveinttype. - Define the corresponding function call in
cudacalls.cu. - Add a new
__global__function prototype tocdevice.cuh. - Define the corresponding function call in
cdevice.cu. - (optional) Add necessary
__device__prototypes and defintions tocdevice.cuhandcdevice.curespectively. These are the functions that will be called asynchronously for each thread on the GPU.