forked from open-mpi/ompi
Cuda rebase #6
Open
eddy16112 wants to merge 68 commits into ICLDisco:main from eddy16112:cuda-rebase
Conversation
Force-pushed from 44d59cf to 42944b2.
add cuda stream for submitting multiple kernels. add support for predefined datatypes. Conflicts: opal/datatype/opal_datatype_unpack.c test/datatype/ddt_test.c
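Not the PR's actual code, but a minimal sketch of the technique this commit names: queuing several kernels on one non-default CUDA stream so the host submits them all without blocking between launches. The kernel and buffer names here are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical pack kernel: copies one fragment byte-by-byte.
__global__ void pack_kernel(const char *src, char *dst, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

void pack_fragments(const char *src, char *dst, size_t frag, int nfrags) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int f = 0; f < nfrags; f++) {
        // All launches go to the same stream: they execute in order on the
        // device, but the host does not block between submissions.
        pack_kernel<<<(frag + 255) / 256, 256, 0, stream>>>(
            src + f * frag, dst + f * frag, frag);
    }
    cudaStreamSynchronize(stream);   // wait once, after all kernels are queued
    cudaStreamDestroy(stream);
}
```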
Add support for iovec and for pipeline iovec. A new way to compute nb_block and thread_per_block. Conflicts: test/datatype/Makefile.am
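The commit mentions a new formula for nb_block and thread_per_block without showing it; below is one conventional way to derive a launch geometry (ceiling divide, capped grid), offered only as an assumption of the general shape, not the PR's actual computation.

```cuda
#include <cuda_runtime.h>

// THREADS_PER_BLOCK and MAX_BLOCKS are assumed tunables, not the PR's values.
#define THREADS_PER_BLOCK 256
#define MAX_BLOCKS        1024

static void compute_geometry(size_t nb_elements,
                             unsigned int *nb_block,
                             unsigned int *thread_per_block) {
    *thread_per_block = THREADS_PER_BLOCK;
    // Ceiling divide: enough blocks to cover every element once.
    size_t blocks = (nb_elements + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    // Cap the grid; a capped launch would loop (grid-stride) over the rest.
    *nb_block = (unsigned int)(blocks < MAX_BLOCKS ? blocks : MAX_BLOCKS);
}
```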
Conflicts: test/datatype/Makefile.am
Improve the GPU memory management. Conflicts: opal/mca/mpool/gpusm/mpool_gpusm.h opal/mca/mpool/gpusm/mpool_gpusm_module.c. fix gpu memory and vector datatype
…device 0; we now use the devices already opened.
…issues when 2 peers were doing a send/recv, or when multiple senders were targeting the same receiver. Rolf provided a patch to solve this issue by moving the IPC communication index from a global location onto each endpoint.
…and will be populated with all the known information. Beware: one still has to manually set the CUDA lib and path, as they are not available after configure (unlike the include, which is). Conflicts: opal/datatype/cuda/Makefile. This file was certainly not supposed to be here; there is NO valid reason to have a copy of a locally generated file in the source. Add the capability to install the generated library and other minor cleanups. Open the datatype CUDA library from a default install location. Various other minor cleanups.
1. The free code did not work right because we were computing the amount we freed after merging the list. 2. We need to store the original malloc'ed GPU buffer in an extra place because the one in the convertor gets changed over time. Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu. clean up code in pack and unpack. Conflicts: ompi/mca/pml/ob1/pml_ob1_cuda.c opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/mca/btl/smcuda/btl_smcuda.c. fix a bug when the buffer is not big enough for the whole ddt. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu. if data is on a different gpu, instead of copying directly from one to the other, we do a D2D copy. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu test/datatype/Makefile.am. now we can use cudaMemcpy2D. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu. enable zero copy + fix GPU buffer bug. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu. put pipeline size into an MCA parameter
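Two copy paths mentioned in this run of commits can be illustrated with the plain CUDA runtime API: a device-to-device copy across GPUs via cudaMemcpyPeer, and a strided vector-like layout moved in one cudaMemcpy2D call. All names and sizes below are illustrative, not the PR's code.

```cuda
#include <cuda_runtime.h>

void d2d_copy(void *dst, int dst_dev, const void *src, int src_dev, size_t n) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (can_access) {
        cudaSetDevice(dst_dev);
        cudaDeviceEnablePeerAccess(src_dev, 0);  // once per device pair
    }
    // cudaMemcpyPeer works either way; without peer access it stages
    // through the host instead of copying GPU-to-GPU directly.
    cudaMemcpyPeer(dst, dst_dev, src, src_dev, n);
}

void copy_vector(char *dst, const char *src,
                 size_t blocklen, size_t stride, size_t count) {
    // One strided copy instead of `count` small cudaMemcpy calls;
    // assumes blocklen <= stride, with a tightly packed destination.
    cudaMemcpy2D(dst, blocklen,   /* destination and its pitch */
                 src, stride,     /* strided source            */
                 blocklen, count, cudaMemcpyDeviceToDevice);
}
```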
…iteration of the datatype based on a NULL pointer. This list will then contain the displacement and the length of each fragment of the datatype memory layout, and can be used for any packing/unpacking purpose.
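A minimal sketch of that idea: iterate with a NULL base pointer so each entry records a pure displacement plus a length, reusable against any real buffer address later. ddt_iov_t and build_raw_iov are hypothetical stand-ins for the convertor machinery, shown for a simple vector layout.

```cuda
#include <stddef.h>

typedef struct {
    ptrdiff_t disp;  /* displacement of the fragment from the buffer start */
    size_t    len;   /* contiguous length of the fragment in bytes         */
} ddt_iov_t;

/* Example layout: `count` blocks of `blocklen` bytes, every `stride` bytes. */
static int build_raw_iov(ddt_iov_t *iov, int max,
                         size_t blocklen, size_t stride, int count) {
    int n = 0;
    for (int i = 0; i < count && n < max; i++, n++) {
        iov[n].disp = (ptrdiff_t)(i * stride);  /* NULL base => pure offset */
        iov[n].len  = blocklen;
    }
    return n;  /* the same entries serve both packing and unpacking */
}
```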
Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_unpack.c. Fix pipeline bug
Conflicts: opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu. fix zero-copy
…functions. Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_cuda_internal.cuh opal/datatype/cuda/opal_datatype_pack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu opal/datatype/cuda/opal_datatype_unpack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_gpu.c
rewritten pipeline is up and running. PUT size in an MCA parameter. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu. Conflicts: opal/mca/btl/btl.h. fewer bugs. Conflicts: ompi/mca/pml/monitoring/pml_monitoring_component.c opal/mca/mpool/gpusm/mpool_gpusm.h. fix pipelining for non-contiguous to contiguous
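A generic double-buffered pipeline of the kind this commit describes, where the fragment size comes from a tunable parameter (the MCA parameter in the PR) and each chunk's copy overlaps the previous chunk's kernel. unpack_kernel and all names are assumptions, not the PR's code; the host buffer is assumed pinned (cudaMallocHost) for real copy/compute overlap.

```cuda
#include <cuda_runtime.h>

__global__ void unpack_kernel(char *buf, size_t n) { /* placeholder work */ }

void pipelined_unpack(char *gpu_buf, const char *host_src,
                      size_t total, size_t pipeline_size) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);
    for (size_t off = 0, i = 0; off < total; off += pipeline_size, i++) {
        size_t chunk = (total - off < pipeline_size) ? total - off
                                                     : pipeline_size;
        cudaStream_t s = stream[i & 1];
        // This chunk's H2D copy and kernel queue on the same stream, while
        // the other stream is still busy with the previous chunk.
        cudaMemcpyAsync(gpu_buf + off, host_src + off, chunk,
                        cudaMemcpyHostToDevice, s);
        unpack_kernel<<<(chunk + 255) / 256, 256, 0, s>>>(gpu_buf + off, chunk);
    }
    cudaStreamSynchronize(stream[0]);
    cudaStreamSynchronize(stream[1]);
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
```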
reorder datatypes to cache boundaries. silence warnings
this file is not used anymore
…multi-GPU when OMPI supports multi-GPU in the future. fix a cuda stream bug for iov, remove some stream syncs in openib, disable rdma for non-contiguous gpu data
rename some functions. checkpoint
Add support for caching the unpacked datatype description via the opal_convertor_raw_cached function.
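opal_convertor_raw_cached is the PR's function; the sketch below only illustrates the caching pattern presumably behind it: compute the raw {disp, len} description once, stash it on the datatype, and hand back the cached copy on every later call. The struct fields and build_raw_description helper are invented for the example.

```cuda
#include <stddef.h>
#include <stdlib.h>

typedef struct { ptrdiff_t disp; size_t len; } raw_iov_t;

typedef struct datatype_s {
    raw_iov_t *cached_iov;   /* NULL until the first raw iteration */
    int        cached_count;
} datatype_t;

/* Assumed helper: runs the full raw iteration and returns a malloc'ed list. */
raw_iov_t *build_raw_description(datatype_t *ddt, int *count);

int convertor_raw_cached(datatype_t *ddt, const raw_iov_t **iov, int *count) {
    if (ddt->cached_iov == NULL) {
        /* First use: do the expensive walk once and keep the result. */
        ddt->cached_iov = build_raw_description(ddt, &ddt->cached_count);
        if (ddt->cached_iov == NULL) return -1;
    }
    *iov   = ddt->cached_iov;    /* later packs/unpacks reuse this list */
    *count = ddt->cached_count;
    return 0;
}
```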
cached iov is working for count = 1
checkpoint: use raw_cached, but cuda iov caching is not enabled
checkpoint: split iov into two versions, non-cached and cached
checkpoint: iov cache
another checkpoint
checkpoint: cuda iov is cached, but not used for pack/unpack
checkpoint: ready to use cached cuda iov
checkpoint: cached cuda iov is working with multiple sends, but not for count > 1
checkpoint, fix a bug for partial unpack
checkpoint, fix unpack size
cache the entire cuda iov
checkpoint, during unpack, cache the entire iov before unpack
another checkpoint
checkpoint, remove unnecessary cuda stream sync
use bit operations to replace %
roll back to using %, not bit operations, since it is faster; not sure why
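The equivalence these two commits toggle between: for a power-of-two N, x % N and x & (N - 1) compute the same value, so the choice is purely a codegen/performance question, which is consistent with the "not sure why" above. A self-contained check:

```cuda
#include <assert.h>

int main(void) {
    const unsigned N = 256;                  /* must be a power of two */
    for (unsigned x = 0; x < 10000; x++)
        assert(x % N == (x & (N - 1)));      /* identical results */
    return 0;
}
```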
now cuda iov is {nc_disp, c_disp}
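One plausible reading of that entry layout, assuming nc_disp is the fragment's offset in the non-contiguous user buffer and c_disp its offset in the contiguous packed buffer, so a single entry gives a kernel thread both addresses:

```cuda
#include <stddef.h>

typedef struct {
    ptrdiff_t nc_disp;  /* offset into the non-contiguous source/destination */
    ptrdiff_t c_disp;   /* offset into the contiguous packed buffer          */
} cuda_iov_entry_t;
```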
clean up kernel, put variables used multiple times into registers
cached cuda iov is working for count > 1
another checkpoint
now convertor->count > 1 is working
move the cuda iov caching into a separate function
these two variables are useless now
fix a bug for ib: the current count of the convertor should be set in set_cuda_iov_position
cleanup, move cudaMalloc into the cuda iov caching
rearrange variables
if cuda_iov is not big enough, use realloc. However, cudaMallocHost does not work with realloc, so use malloc instead.
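The constraint behind this commit: memory from cudaMallocHost can only be released with cudaFreeHost, so realloc cannot grow it; growing a pinned buffer means allocate-copy-free by hand, which is why falling back to plain malloc is simpler. grow_pinned below is an illustrative helper, not code from the PR.

```cuda
#include <cuda_runtime.h>
#include <string.h>

void *grow_pinned(void *old, size_t old_size, size_t new_size) {
    void *bigger = NULL;
    if (cudaMallocHost(&bigger, new_size) != cudaSuccess)
        return NULL;                       /* caller keeps the old buffer */
    if (old != NULL) {
        memcpy(bigger, old, old_size);     /* manual "realloc" of pinned memory */
        cudaFreeHost(old);                 /* free only through cudaFreeHost */
    }
    return bigger;
}
```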
make sure the pointer is not NULL before freeing it
rewrite the non-cached iov, unifying it with the cached iov
checkpoint, rewrite non-cached version
fix for the non-cached iov
fix the non-cached iov; setting the position should be done first
move the ddt-iov-to-cuda-iov conversion into a function
merge the cached and non-cached iov paths
for the non-cached iov, if there is not enough cuda iov space, break
Force-pushed from c2a29eb to 8b85c3d.
This is my first complete review of the code. Many things need to be cleaned up, but overall the code looks pretty good.
…datatype support enabled or not; check cuda calls.
…don't have outer_stream
…o do not init kernel support until confirming buffer is gpu buffer.
Force-pushed from 69b4614 to 7063d19.
thananon pushed a commit that referenced this pull request on Oct 27, 2017.
Signed-off-by: Clement Foyer <clement.foyer@inria.fr>