Conversation
Pull Request Overview
This PR implements GPU support for neural network operations using CUDA. The core change replaces CPU-based tensor operations with GPU-accelerated equivalents, including matrix operations, activation functions, and memory management.
Key Changes:
- Added CUDA infrastructure with tensor_gpu.hpp/.cu for GPU operations
- Modified Tensor class to support both CPU and GPU execution modes
- Updated all tensor operations to use GPU kernels when enabled
- Removed deprecated scalar activation functions in favor of tensor-based operations
Reviewed Changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| src/model/tensor_gpu.hpp/.cu | New CUDA implementation for GPU tensor operations and activation functions |
| src/model/tensor.cpp | Modified tensor implementation to support GPU mode with conditional execution paths |
| include/tensor.hpp | Updated tensor interface to remove deprecated methods and add GPU support |
| src/model/activations.cpp | Removed scalar activation functions and added GPU kernel calls |
| src/networks/fnn/ | Updated to use new tensor API with getValue/setValue methods |
| CMakeLists.txt | Added CUDA language support and removed sanitizers |
Comments suppressed due to low confidence (1)

src/model/tensor.cpp:171

[nitpick] The variable name 'j' is not descriptive. Consider renaming it to 'offset' or 'paramOffset' to clarify its purpose as an offset into the parameters.

```cpp
const size_t N = cpu_data.size();
```
```cpp
#include <cstddef>
#include <stdexcept>

namespace nn::global::tensor_gpu {
```
The CUDA_CHECK macro uses fprintf to stderr and throws exceptions, which can be problematic in CUDA kernels. Consider using a more robust error handling mechanism or ensuring this macro is only used in host code.
Suggested change:

```cpp
namespace nn::global::tensor_gpu {
// Ensure CUDA_CHECK is only available in host code
#ifndef __CUDA_ARCH__
```
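For reference, a minimal sketch of a host-only check along the lines the suggestion implies; the actual body of the PR's CUDA_CHECK is not shown here, so the macro below is an assumption about its shape, not the PR's code:

```cpp
#include <cstdio>
#include <stdexcept>
#include <cuda_runtime.h>

// Sketch: guard the macro so it only exists in host compilation passes.
// fprintf and throw are fine on the host but unusable in device code.
#ifndef __CUDA_ARCH__
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            throw std::runtime_error(cudaGetErrorString(err_));             \
        }                                                                   \
    } while (0)
#endif
```

Used as `CUDA_CHECK(cudaMalloc(&ptr, bytes));` from host code only; any use inside a `__global__` or `__device__` function then fails to compile instead of misbehaving at runtime.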
```cpp
// Softmax
// ==================================================
__global__ void softmaxKernel(const ValueType* input, ValueType* output, std::size_t count) {
    extern __shared__ ValueType shared[];
```
The softmax implementation uses a naive approach for finding max and sum across threads. Consider using more efficient parallel reduction techniques (e.g., warp shuffles or tree reduction) for better performance on larger tensors.
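To make the suggestion concrete, here is a minimal sketch of a shared-memory tree reduction for the max step, written with plain float instead of the PR's ValueType; the kernel name and per-block output buffer are hypothetical:

```cpp
#include <cstddef>
#include <cfloat>

// Sketch: each block reduces its slice of the input to a single maximum
// in O(log blockDim.x) steps. Assumes blockDim.x is a power of two and
// that a second pass (or a host-side loop) combines the per-block results.
__global__ void blockMaxKernel(const float* input, float* blockMax, std::size_t count) {
    extern __shared__ float shared[];
    const unsigned tid = threadIdx.x;
    const std::size_t idx = static_cast<std::size_t>(blockIdx.x) * blockDim.x + tid;

    // One element per thread; out-of-range threads contribute -FLT_MAX.
    shared[tid] = (idx < count) ? input[idx] : -FLT_MAX;
    __syncthreads();

    // Tree reduction: halve the active thread range each iteration.
    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            shared[tid] = fmaxf(shared[tid], shared[tid + stride]);
        }
        __syncthreads();
    }

    if (tid == 0) {
        blockMax[blockIdx.x] = shared[0];
    }
}
```

Launch with `blockMaxKernel<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(...)` so the dynamic shared array holds one value per thread. The same pattern applies to the sum step, with `+` in place of `fmaxf` and 0 as the padding value.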
src/model/tensor.cpp
Outdated
```cpp
ValueType *temp = gpu_data;
if (gpu_data_size != other.gpu_data_size) {
    temp = (ValueType *)tensor_gpu::allocate(other.gpu_data_size * sizeof(ValueType));
    gpu_data_size = other.gpu_data_size;
}
tensor_gpu::copyDeviceToDevice(gpu_data, other.gpu_data, gpu_data_size * sizeof(ValueType));

if (gpu_data_size != other.gpu_data_size) {
    tensor_gpu::deallocate(gpu_data);
    gpu_data = temp;
}
```
This copy writes into the old gpu_data buffer even though gpu_data_size has already been reassigned to other.gpu_data_size. If the sizes differ, copying other.gpu_data_size elements into a buffer sized for the old count causes a buffer overflow; the copy should target the newly allocated buffer or be moved after the pointer swap.
Suggested change:

```cpp
if (gpu_data_size != other.gpu_data_size) {
    ValueType *new_gpu_data = (ValueType *)tensor_gpu::allocate(other.gpu_data_size * sizeof(ValueType));
    tensor_gpu::copyDeviceToDevice(new_gpu_data, other.gpu_data, other.gpu_data_size * sizeof(ValueType));
    tensor_gpu::deallocate(gpu_data);
    gpu_data = new_gpu_data;
    gpu_data_size = other.gpu_data_size;
} else {
    tensor_gpu::copyDeviceToDevice(gpu_data, other.gpu_data, gpu_data_size * sizeof(ValueType));
}
```
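One property of the suggested version worth calling out: the new buffer is allocated and filled before the old one is freed, so if tensor_gpu::allocate or the device-to-device copy fails partway, gpu_data and gpu_data_size still describe the original, intact allocation rather than a dangling or half-initialized one.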
tests/binary_test.cpp
Outdated
```cpp
// nn::global::Tensor give_me_a_name({5, 3});
// printf("test: \n");
// give_me_a_name.fill(5);
// give_me_a_name.setValue({2, 1}, 5);
//
// nn::global::Tensor give_me_a_name1({5, 3});
// printf("test: \n");
// give_me_a_name1.fill(3);
// give_me_a_name1.setValue({2, 1}, 10);
// give_me_a_name1 += give_me_a_name;
// printf("test: %f\n", give_me_a_name1.getValue({2,1}));
// return 0;
```
There is a large block of commented-out test code (lines 80-91). This should either be removed if no longer needed or converted to proper test cases if it serves a purpose.
Suggested change: delete the commented-out block.
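If the block is meant to survive, the second option the comment mentions could look like the following sketch, derived directly from the commented-out code; the header path and exact Tensor API are assumptions based on this PR's file summary:

```cpp
#include <cassert>
#include "tensor.hpp"  // assumed include path, per include/tensor.hpp above

// Element {2, 1} should hold 5 + 10 = 15 after the in-place addition,
// and every other element should hold 5 + 3 = 8.
void testTensorInPlaceAdd() {
    nn::global::Tensor a({5, 3});
    a.fill(5);
    a.setValue({2, 1}, 5);

    nn::global::Tensor b({5, 3});
    b.fill(3);
    b.setValue({2, 1}, 10);

    b += a;
    assert(b.getValue({2, 1}) == 15);
    assert(b.getValue({0, 0}) == 8);
}
```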
src/model/activations.cpp
Outdated
```cpp
for (auto &value : metrix) {
    if (value > max) {
        max = value;
if (metrix.isGpu) {
```
The GPU path for the maxVector function is incomplete: the if block is empty. This causes incorrect behavior in GPU mode, since execution falls through to the CPU code even when isGpu is true.
Suggested change:

```cpp
if (metrix.isGpu) {
    // Assuming tensor_gpu::max exists and returns the max value on the GPU
    return tensor_gpu::max(metrix);
```
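Putting the suggestion in context, a sketch of the complete function with both paths; the signature, ValueType, and tensor_gpu::max are assumptions based on the snippets above:

```cpp
#include <limits>

// Dispatch to the GPU helper when the tensor lives on the device,
// otherwise fall back to the original CPU scan.
ValueType maxVector(nn::global::Tensor &metrix) {
    if (metrix.isGpu) {
        return tensor_gpu::max(metrix);  // assumed GPU-side reduction
    }
    ValueType max = std::numeric_limits<ValueType>::lowest();
    for (auto &value : metrix) {
        if (value > max) {
            max = value;
        }
    }
    return max;
}
```

Initializing max to `std::numeric_limits<ValueType>::lowest()` keeps the CPU path correct even when every value is negative.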