
Conversation

zcbenz (Collaborator) commented Feb 2, 2025

This PR adds a device parameter to mallocs, so that code written for the unified memory backend is also usable for a general GPU backend (#1789).

awni (Member) commented Feb 6, 2025

Per some offline discussion, we prefer not to merge this, as it breaks one of the key design goals of MLX (i.e. it is intended for unified memory systems).

For example, if we ever added a CUDA backend we would want to use the unified memory programming model (which it supports): https://developer.nvidia.com/blog/unified-memory-cuda-beginners/

I think WebGPU does things a bit in reverse: if you are on a unified memory system, copies between CPU and GPU can be no-ops, but the programmer is always required to manage memory explicitly.

I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So worst case scenario, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).
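
For concreteness, a rough sketch of what that indirection might look like (all of the names below are made up for illustration, not MLX's or WebGPU's actual API): a small handle that defers the real CPU or GPU allocation until a primitive actually asks for the corresponding pointer.

// Sketch of a lazily allocating buffer handle for a webgpu backend.
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Stand-ins for the real WebGPU buffer calls (e.g. wgpuDeviceCreateBuffer).
static std::uint64_t create_gpu_buffer(std::size_t /*nbytes*/) { return 1; }
static void destroy_gpu_buffer(std::uint64_t /*handle*/) {}

struct LazyBuffer {
  std::size_t nbytes = 0;
  void* cpu_ptr = nullptr;        // allocated on first cpu() call
  std::uint64_t gpu_handle = 0;   // allocated on first gpu() call

  explicit LazyBuffer(std::size_t n) : nbytes(n) {}
  LazyBuffer(const LazyBuffer&) = delete;
  LazyBuffer& operator=(const LazyBuffer&) = delete;

  // A primitive running on the CPU asks for a host pointer.
  void* cpu() {
    if (!cpu_ptr) cpu_ptr = std::malloc(nbytes);
    return cpu_ptr;
  }

  // A primitive running on the GPU asks for a device buffer.
  std::uint64_t gpu() {
    if (!gpu_handle) gpu_handle = create_gpu_buffer(nbytes);
    return gpu_handle;
  }

  ~LazyBuffer() {
    std::free(cpu_ptr);
    if (gpu_handle) destroy_gpu_buffer(gpu_handle);
  }
};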

zcbenz (Collaborator, Author) commented Feb 7, 2025

I totally understand the decision. My only goal is to make the code reusable, not to change MLX's design goals.

> I think there may also be an alternative solution by adding a level of indirection on top of the webgpu allocator which does lazy allocation of CPU / GPU pointers as needed (e.g. when actually called by the primitive).

The GPU memory can be allocated lazily, but there is no way to do the same for the CPU: we cannot hook the memcpy or *ptr calls.

A workaround is to always allocate CPU memory, which would only happen when using code from common, and I think if we do it carefully there won't be much impact on performance or memory usage. But this design complicates things a lot and is not very elegant.
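
For concreteness, a minimal sketch of that workaround (hypothetical names only, with the actual WebGPU upload stubbed out): the host side is always backed by a real allocation so code in common can write through raw pointers, and the contents are mirrored into a GPU buffer only right before a webgpu primitive runs.

// Sketch of the "always allocate CPU memory" workaround.
#include <cstddef>
#include <cstdint>
#include <vector>

struct MirroredBuffer {
  std::vector<std::uint8_t> host;  // always allocated, so common code can write through raw pointers
  std::uint64_t gpu_handle = 0;    // created on demand
  bool host_dirty = true;          // host holds data the GPU copy has not seen yet

  explicit MirroredBuffer(std::size_t nbytes) : host(nbytes) {}

  // What common code would receive from the allocator and memcpy into.
  void* raw_ptr() {
    host_dirty = true;
    return host.data();
  }

  // Called right before a webgpu primitive reads this buffer on the GPU.
  void upload_if_needed() {
    if (!gpu_handle) gpu_handle = 1;  // stand-in for creating the wgpu buffer
    if (host_dirty) {
      // Real backend: something like wgpuQueueWriteBuffer(queue, buffer, 0, host.data(), host.size());
      host_dirty = false;
    }
  }
};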

> I think the main reason you wanted to add the device parameter is so you can reuse a lot of the code in common. So worst case scenario, there will be some code duplication. It may be possible to refactor some of that code into more reusable components (feel free to make changes there).

I can define a hook function like malloc_for_array, which calls malloc_or_wait in the cpu/metal backends and something else in webgpu. For the code that I want to reuse, I just replace calls like

out.set_data(allocator::malloc_or_wait(nbytes));

with:

out.set_data(malloc_for_array(out, nbytes));

Though I think this option is essentially just a variant of this PR. With this PR I did not intend to change every invocation of malloc_or_wait to pass a device argument; instead, I only planned to change the code that can be reused, i.e. use malloc_or_wait(device, nbytes) for code that might be used in the webgpu backend and malloc_or_wait(nbytes) for everything else.
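
Roughly, the dispatch could look like the toy sketch below; aside from malloc_or_wait, which is the existing allocator call, every type and function name here is a stand-in for illustration rather than MLX's real API.

// Toy illustration of the proposed malloc_for_array hook.
#include <cstddef>
#include <cstdlib>

enum class Backend { cpu, metal, webgpu };
struct Buffer { void* ptr; };       // stand-in for allocator::Buffer
struct FakeArray { Backend backend; };  // stand-in for mlx::core::array

Buffer malloc_or_wait(std::size_t nbytes) { return Buffer{std::malloc(nbytes)}; }
Buffer webgpu_alloc(std::size_t nbytes) { return Buffer{std::malloc(nbytes)}; }  // would create a wgpu buffer instead

// Code in common calls this hook instead of malloc_or_wait directly, so each
// backend can decide how the array's data is backed.
Buffer malloc_for_array(const FakeArray& out, std::size_t nbytes) {
  if (out.backend == Backend::webgpu) {
    return webgpu_alloc(nbytes);
  }
  return malloc_or_wait(nbytes);  // cpu/metal keep today's behavior
}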

zcbenz (Collaborator, Author) commented Feb 11, 2025

I just realized a pitfall of webgpu: the bool type in webgpu takes 4 bytes instead of 1. So I will likely have to do conversions on the fly when uploading to / downloading from the GPU, and I can no longer assume the data on the CPU and the GPU has the same size.

Either alternative should still work though since the array's dtype information is available when doing allocations.
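
For example, a conversion along these lines would be needed (a sketch only, assuming bools are widened to 4-byte words on the GPU side as described above; the function names are made up):

// Widening/narrowing bool data between a 1-byte CPU layout and a 4-byte GPU layout.
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU -> GPU: widen each 1-byte bool to a 4-byte word before uploading.
std::vector<std::uint32_t> widen_bools(const std::uint8_t* data, std::size_t n) {
  std::vector<std::uint32_t> out(n);
  for (std::size_t i = 0; i < n; ++i) out[i] = data[i] ? 1u : 0u;
  return out;
}

// GPU -> CPU: narrow each 4-byte word back to a 1-byte bool after downloading.
std::vector<std::uint8_t> narrow_bools(const std::uint32_t* data, std::size_t n) {
  std::vector<std::uint8_t> out(n);
  for (std::size_t i = 0; i < n; ++i) out[i] = data[i] ? 1 : 0;
  return out;
}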

zcbenz (Collaborator, Author) commented Mar 21, 2025

Closing this PR for now since I'm not actively working on the webgpu backend.

zcbenz closed this Mar 21, 2025
zcbenz deleted the malloc-device branch on March 21, 2025 at 05:56