
Conversation

@oliverkinch (Collaborator) commented Nov 13, 2025

Implements CP for non-MoE models. Implementing CP for MoE models will be done in a separate PR.

Fixes #31.
#38 will be redundant given this PR.
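
For reviewers unfamiliar with the API: below is a minimal, hedged sketch of how context parallelism is typically wired into a training step with PyTorch's experimental context_parallel context manager (torchtitan-style). The mesh layout, the buffer names (inputs, labels) and the loss computation are placeholders for illustration, not code from this PR.

# Minimal CP sketch (assumptions: launched via torchrun, world size == cp_degree,
# tensors shaped [batch, seq, ...] so the sequence dimension is dim 1).
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.experimental import context_parallel

cp_degree = 2
mesh = init_device_mesh("cuda", (cp_degree,), mesh_dim_names=("cp",))
cp_mesh = mesh["cp"]

def train_step(model, inputs, labels, loss_fn, optimizer):
    # Shard the sequence dimension of the listed buffers across CP ranks for the
    # duration of forward + backward; the buffers are restored on exit.
    with context_parallel(cp_mesh, buffers=[inputs, labels], buffer_seq_dims=[1, 1]):
        logits = model(inputs)          # placeholder forward
        loss = loss_fn(logits, labels)  # placeholder loss
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()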

@oliverkinch (Collaborator, Author) commented:

From train.py we have:

if parallel_dims.cp_enabled: # the following is necessary for CP w/ flex attention
    from torch.distributed.tensor.experimental._attention import _set_cp_global_var, _DispatchMode, _cp_options

    # set_rotate_method("alltoall")  # alltoall or allgather (only allgather for flex)
    _set_cp_global_var("cp_shard_dim", 2)
    # _cp_options.enable_load_balance = True  # no load balancing for flex
    torch.distributed.tensor.experimental._attention._dispatch_mode = (
        _DispatchMode.TORCH_FUNCTION
    )

_set_cp_global_var is only available in torch 2.9.0, but if I force that version the code crashes when .backward() is called. Is _set_cp_global_var necessary?
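
One hedged option (not necessarily what this PR should do): guard the private helper by availability so the same code runs on torch versions with and without it. These are private, unstable APIs, so this is only a sketch.

# Sketch: call _set_cp_global_var only when the installed torch exposes it.
from torch.distributed.tensor.experimental import _attention

if hasattr(_attention, "_set_cp_global_var"):
    _attention._set_cp_global_var("cp_shard_dim", 2)

# Route flex attention through the TORCH_FUNCTION dispatch path, as in train.py above.
_attention._dispatch_mode = _attention._DispatchMode.TORCH_FUNCTION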

@oliverkinch (Collaborator, Author) commented:

Problems with FLASH_ATTENTION? It works with the MATH backend:

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=scale)
torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 192), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>), FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 192), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>), FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 128), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>)), **{'is_causal': True, 'scale': 0.07216878364870322}): got RuntimeError('No available kernel. Aborting execution.')
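
One likely cause (unverified): the flash SDPA backend generally requires q, k and v to share the same head dimension, and here q/k have head dim 192 while v has 128, so no flash kernel can be selected. A hedged workaround sketch is to pass sdpa_kernel a list of backends so SDPA falls back to one that supports the shapes; this is illustrative, not the fix adopted in this PR.

import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

def attention(q, k, v, scale):
    # Try flash first, then the memory-efficient kernel, then the math reference
    # implementation, which handles the widest range of shapes.
    backends = [
        SDPBackend.FLASH_ATTENTION,
        SDPBackend.EFFICIENT_ATTENTION,
        SDPBackend.MATH,
    ]
    with sdpa_kernel(backends):
        return F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=scale)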

@oliverkinch (Collaborator, Author) commented:

@rlrs Context parallelism now runs for Gemma and Llama.

@oliverkinch oliverkinch marked this pull request as ready for review November 13, 2025 09:59
@oliverkinch oliverkinch requested a review from rlrs November 13, 2025 09:59
New dcp script for models where YaRN has been used to extend the context length
@oliverkinch (Collaborator, Author) commented Dec 10, 2025

I have now also included the changes related to YaRN in this PR, see d36078d. A sketch of the corresponding rope-scaling configuration follows the plot legend below.

[image: plot comparing the three runs described below]
  • Blue: The base model.
  • Orange: The base model with its context window extended from 4k to 32k using YaRN, without any additional training.
  • Green: The same YaRN-extended model, further trained for 1,000 steps on long-context data (wiki_expanded).
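
For context, here is a hedged sketch of what a 4k → 32k YaRN extension looks like as a Hugging Face-style rope_scaling entry (factor = 32768 / 4096 = 8). Field names follow the transformers convention; the actual values and config format used in d36078d may differ.

# Hugging Face-style YaRN rope scaling sketch (illustrative values only).
rope_scaling = {
    "rope_type": "yarn",
    "factor": 8.0,                              # 32768 / 4096
    "original_max_position_embeddings": 4096,   # pre-extension context window
}
max_position_embeddings = 32768                 # extended context window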
