
Conversation

@zihugithub
Collaborator

PR Category

CICD

PR Types

Others

PR Description

  • Resume partial tests for inference and serving
  • Fix the sccache mechanism

@CLAassistant

CLAassistant commented Jan 9, 2026

CLA assistant check
All committers have signed the CLA.

@tengqm
Contributor

tengqm commented Jan 11, 2026

@zihugithub Looks like GitHub cannot determine your identity from your Git PR. Could you double-check the email used for the PR (git config user.email) and whether you have signed the CLA using that account?

tengqm previously approved these changes Jan 12, 2026
@tengqm
Contributor

tengqm commented Jan 12, 2026

@zihugithub There are conflicts to be resolved.

{ echo 'runs_on<<EOFRUNSON'; echo "$RUNNER_LABELS"; echo 'EOFRUNSON'; } >> $GITHUB_OUTPUT
{ echo 'container_volumes<<EOFVOLUMES'; echo "$VOLUMES"; echo 'EOFVOLUMES'; } >> $GITHUB_OUTPUT
{ echo 'unit_subsets<<EOFUNITSUBSETS'; echo "$UNIT_SUBSETS"; echo 'EOFUNITSUBSETS'; } >> $GITHUB_OUTPUT
{ echo 'unit_train_subsets<<EOFUNITSUBSETS'; echo "$UNIT_TRAIN_SUBSETS"; echo 'EOFUNITSUBSETS'; } >> $GITHUB_OUTPUT
Contributor

Why not use the simpler syntax like we did on line 74?
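
For reference, a minimal sketch of the single-line $GITHUB_OUTPUT form (only valid when the value contains no newlines; the heredoc form above is needed for multi-line values):

# single-line form; sufficient when the value has no embedded newlines
echo "runs_on=$RUNNER_LABELS" >> "$GITHUB_OUTPUT"
echo "container_volumes=$VOLUMES" >> "$GITHUB_OUTPUT"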

Comment on lines 76 to 77
"cd ${GITHUB_WORKSPACE}/Megatron-LM-FL"
"pip install --no-build-isolation . -vvv"
Contributor

would this change work?

# install FlagGems
INSTALL_FLAGGEMS=(
"pip install scikit-build scikit-build-core"
"pip install git+https://github.com/FlagOpen/FlagGems.git@v3.0"
Contributor

Suggested change:
- "pip install git+https://github.com/FlagOpen/FlagGems.git@v3.0"
+ "pip install git+https://github.com/flagos-ai/FlagGems.git@v3.0"

Comment on lines 104 to 117
"pip install gymnasium"
"pip install dm-tree"
Contributor

Can/shall we pin the package versions?
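
For illustration only, pinned entries could look like the following (version numbers are placeholders, not validated pins):

# illustrative pins; use the versions actually validated in CI
"pip install gymnasium==0.29.1"
"pip install dm-tree==0.1.8"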

Contributor

Shall/can we expose these dependencies in a more formal way?
I mean ... these packages are not just required by the workflow; they are required for the software to run. These requirements should therefore be declared by the software, not the workflow.

@zihugithub zihugithub closed this Jan 19, 2026
@zihugithub zihugithub force-pushed the update_workflows260109 branch from 6cf4f3b to b1db41c on January 19, 2026 01:56
@zihugithub zihugithub reopened this Jan 19, 2026
experiment:
exp_name: deepseek_r1_distill_qwen-flaggems
exp_dir: tests/functional_tests/test_cases/inference/deepseek_r1_distill_qwen-flaggems/results_test/7b-tp2
exp_dir: tests/functional_tests/inference/deepseek_r1_distill_qwen-flaggems/results_test/7b-tp2
Contributor

Rename tests/functional_tests/ to tests/functional/.
Similarly, rename tests/test_utils/ to tests/utils/.
You get simplicity without losing anything.
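
A sketch of the suggested renames as history-preserving moves:

git mv tests/functional_tests tests/functional
git mv tests/test_utils tests/utils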

Collaborator

@tengqm It must align with the test directory structure of FlagScale, which uses the functional_tests naming. We will stick to this convention unless a change is deemed necessary.


# Compare actual output and golden reference output line by line (ignoring newline character differences)
for result_line, gold_value_line in zip(result_lines, gold_value_lines):
    print(result_line, gold_value_line)
Contributor

maybe remove line 323?

# Concatenate the VLLM OpenAI-compatible interface URL
url = f"http://localhost:{deploy_config.port}/v1/completions"

# Set request headers (JSON format)
Contributor

You don't have to write comments for lines that speak for themselves.

if key_value.startswith("val-core/openai/gsm8k/reward/mean"):
    # Split by colon and extract the numeric part after the colon
    value = key_value.split(":")[-1]
    # Convert string value to float and add to the result list
Contributor

There are simply too many unnecessary comments like this in the code.

tengqm previously approved these changes Jan 21, 2026

jobs:
  functional_test:
    name: functional-${{ inputs.device }}-${{ inputs.task }}-${{ inputs.model }}--${{ inputs.case }}
Collaborator

Too redundant

@zihugithub zihugithub closed this Jan 26, 2026
@zihugithub zihugithub force-pushed the update_workflows260109 branch from c76e1e0 to 32a1a06 on January 26, 2026 02:00
fi
else
echo "ℹ️ Running tests with system Python"
fi
Contributor

I'm not sure it makes a lot of sense to use conda in the test workflow.

Comment on lines +156 to +158
pip uninstall -y flash_attn
pip install flash-attn==2.6.3 --no-build-isolation
python -c "import flash_attn; import flash_attn.layers; print('FlashAttention loaded successfully, version:', flash_attn.__version__)"
Contributor

Please don't install individual packages in the workflow.
All requirements should be declared in the pyproject.toml file.
Even if some packages are only needed for functional tests, we can declare them as optional dependencies, e.g.

[project.optional-dependencies]

functional_test = [
   'foo==1.3.4',
]

We can install them with a single command: uv pip install .[functional_test].
The command is reusable anywhere, and the dependencies are centrally managed.
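
As a sketch, the corresponding workflow step would then reduce to the following (the functional_test extra name is taken from the example above; the flash-attn import check mirrors the existing step):

uv pip install ".[functional_test]"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"  # optional sanity check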

Collaborator Author

This is a special use case whose environment is incompatible with those of the other use cases; it will be modified later.

Contributor

That is why pyproject.toml has an optional-dependencies section.

log_info "Installing vllm-FL"

# Clone repository
retry_git_clone "$vllm_url" "$vllm_dir" "$RETRY_COUNT"
Contributor

This mechanism is somewhat fragile.
We are hooking the stability of FlagScale to that of vllm-FL.
We have to anticipate a scenario where FlagScale works fine, but there is a bug in the vllm-FL head.
To isolate errors like this, I'd suggest we pin a release tag of the plugin.

Collaborator

@tengqm We will do it later when the release version of vllm-fl is ready.

Contributor

It doesn't have to be a formal release. It can be as simple as a tag, such as 20250129 or something.
Later on, when things do not work as expected, we will know where to start.
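
A minimal sketch of pinning to such a tag with plain git (the tag name 20250129 is illustrative; the existing retry_git_clone helper would need the tag plumbed through):

git clone --depth 1 --branch 20250129 "$vllm_url" "$vllm_dir"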

log_info "Installing vllm-FL"

# Clone repository
retry_git_clone "$vllm_url" "$vllm_dir" "$RETRY_COUNT"
Contributor

Well ... again, why are we replicating the same script?

Collaborator Author

From the current perspective, serve and inference run in the same environment. The duplication is there to distinguish between the serve and inference cases; the common source code will be installed and stored in a unified way in the future.

Contributor

My guess is that serve is something like a superset of inference.
There are no conflicts between these two use cases besides this.
Am I understanding this correctly?


# support 0.5b_multiple_instance ci test
ray==2.49.1
gymnasium
Contributor

where is this referenced?

# support 0.5b_multiple_instance ci test
ray==2.49.1
gymnasium
dm-tree
Contributor

where is this referenced?

run: |
  git config --global --add safe.directory $PROJECT_ROOT
- name: Check sccache installation and get path
Contributor

why are we installing sccache?

Collaborator Author

Mainly used for compiling vLLM.

Contributor

I think I need some more context on this to understand it.

Collaborator

@tengqm vLLM is compiled repeatedly across different jobs; sccache is introduced to reduce this compilation time.
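
For context, a sketch of how a build can be routed through sccache (the cache directory is hypothetical; the CMAKE_*_COMPILER_LAUNCHER variables are standard CMake knobs that a CMake-based vLLM build should honor):

export SCCACHE_DIR=/cache/sccache            # hypothetical shared cache location
export CMAKE_C_COMPILER_LAUNCHER=sccache
export CMAKE_CXX_COMPILER_LAUNCHER=sccache
export CMAKE_CUDA_COMPILER_LAUNCHER=sccache
pip install --no-build-isolation . -vvv      # vLLM build; C++/CUDA objects go through sccache
sccache --show-stats                         # report cache hits after the build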

Contributor

@Darryl233 Thanks for the clarification. Is the 'compilation' a build time thing or a run time thing?

Darryl233 (Collaborator) left a comment

LGTM
