Usage with AWS S3 and Ray #59

Usage

Cluster creation

ray up --yes cluster.yml
ray dashboard cluster.yml
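
The ray dashboard command forwards the Ray dashboard and the job submission API from the head node to http://localhost:8265, which is the --address used in the next step. Before submitting anything, the cluster can be inspected with the regular cluster launcher commands; a quick sketch, assuming the cluster.yml shown under Configuration below:

# Optional sanity checks on the freshly created cluster
ray exec cluster.yml 'ray status'   # autoscaler and node summary from the head node
ray attach cluster.yml              # open an SSH session on the head node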

Job submission

git clone https://github.com/mlfoundations/datacomp
ray job submit \
--address=http://localhost:8265 \
--working-dir=datacomp \
--runtime-env-json="$(
  jq --null-input '
    {
      conda: "datacomp/environment.yml",
      env_vars: {
        AWS_ACCESS_KEY_ID: env.AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY: env.AWS_SECRET_ACCESS_KEY,
        AWS_SESSION_TOKEN: env.AWS_SESSION_TOKEN
      }
    }
  '
)" \
-- \
python download_upstream.py \
--subjob_size=11520 \
--thread_count=128 \
--processes_count=1 \
--distributor=ray \
--metadata_dir=/tmp/metadata \
--data_dir=s3://datacomp-small \
--scale=small

Note

Image shards will be saved to the datacomp-small AWS S3 bucket specified with the --data_dir option.
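
ray job submit prints a submission ID when the job starts; the ID below is a placeholder. The job can then be followed through the same address, for example:

# Replace raysubmit_XXXXXXXX with the submission ID printed by ray job submit
ray job logs --address=http://localhost:8265 --follow raysubmit_XXXXXXXX
ray job status --address=http://localhost:8265 raysubmit_XXXXXXXX
ray job list --address=http://localhost:8265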

Cluster deletion

ray down --yes cluster.yml

Configuration

Sample cluster.yml

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

docker:
  run_options: [--dns=127.0.0.1]
  image: rayproject/ray:2.6.1-py310
  container_name: ray

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg
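
The initialization_commands turn every node into its own caching DNS resolver: they install Knot Resolver, start 48 kresd instances, and point /etc/resolv.conf at 127.0.0.1, while --dns=127.0.0.1 under docker.run_options points the Ray container at the same local resolver, presumably to keep the downloader's very high volume of DNS lookups from overwhelming the default resolver. A rough way to confirm the resolver is actually serving, assuming a shell on a node (for example via ray attach cluster.yml):

# Check that the local Knot Resolver answers queries
systemctl is-active kresd@1.service   # should print "active"
getent hosts example.com              # resolves through 127.0.0.1 as set in /etc/resolv.conf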

Obscure details

  • When --data_dir points to cloud storage such as S3, a local --metadata_dir must also be specified, because the downloader script doesn't support saving metadata to cloud storage.

  • The last pip install in the setup_commands section is needed for compatibility with AWS S3, because the required libraries aren't included in the conda environment file.

  • There is no need to provide additional AWS credentials when the destination bucket is in the same account as the cluster, because the nodes already have full S3 access through an instance profile.

    • In practice, although the cluster does get a default instance profile granting full S3 access, it didn't seem to work as intended (probably due to rate limiting on the IMDS endpoint), and I ended up passing my local AWS credentials as environment variables instead.

  • The Python version in environment.yml must match the Python version of the Ray cluster; make sure that docker.image in cluster.yml uses exactly the same Python version as the environment.yml from this project (see the check below).
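
For example, the rayproject/ray:2.6.1-py310 image above ships Python 3.10, so the environment file has to pin the same minor version. A quick way to compare the two (a sketch; the grep pattern assumes environment.yml pins python explicitly, and the docker command assumes Docker is available locally):

# The image tag already encodes the Python version (py310 -> 3.10);
# environment.yml must pin the same minor version.
grep 'python=' environment.yml                                  # expect something like "- python=3.10"
docker run --rm rayproject/ray:2.6.1-py310 python --version     # Python 3.10.x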
