Usage with AWS S3 and Ray #59

Usage

Cluster creation

ray up --yes cluster.yml
ray dashboard cluster.yml
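
The ray dashboard command forwards the Ray dashboard and the job submission API from the head node to http://localhost:8265, which is the --address used in the next step. Before submitting anything, the cluster can be inspected with the regular cluster launcher commands; a quick sketch, assuming the cluster.yml shown under Configuration below:

# Optional sanity checks on the freshly created cluster
ray exec cluster.yml 'ray status'   # autoscaler and node summary from the head node
ray attach cluster.yml              # open an SSH session on the head node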

Job submission

git clone https://github.com/mlfoundations/datacomp
ray job submit \
--address=http://localhost:8265 \
--working-dir=datacomp \
--runtime-env-json="$(
  jq --null-input '
    {
      conda: "datacomp/environment.yml",
      env_vars: {
        AWS_ACCESS_KEY_ID: env.AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY: env.AWS_SECRET_ACCESS_KEY,
        AWS_SESSION_TOKEN: env.AWS_SESSION_TOKEN
      }
    }
  '
)" \
-- \
python download_upstream.py \
--subjob_size=11520 \
--thread_count=128 \
--processes_count=1 \
--distributor=ray \
--metadata_dir=/tmp/metadata \
--data_dir=s3://datacomp-small \
--scale=small

Note

Image shards will be saved to the datacomp-small AWS S3 bucket specified with the --data_dir option.
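
ray job submit prints a submission ID when the job starts; the ID below is a placeholder. The job can then be followed through the same address, for example:

# Replace raysubmit_XXXXXXXX with the submission ID printed by ray job submit
ray job logs --address=http://localhost:8265 --follow raysubmit_XXXXXXXX
ray job status --address=http://localhost:8265 raysubmit_XXXXXXXX
ray job list --address=http://localhost:8265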

Cluster deletion

ray down --yes cluster.yml

Configuration

Sample cluster.yml

cluster_name: datacomp-downloader

min_workers: 0
max_workers: 10
upscaling_speed: 1.0

docker:
  run_options: [--dns=127.0.0.1]
  image: rayproject/ray:2.6.1-py310
  container_name: ray

provider:
  type: aws
  region: us-east-1
  cache_stopped_nodes: false

available_node_types:
  ray.head.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2
  ray.worker.default:
    resources: {}
    node_config:
      InstanceType: m5.12xlarge
      ImageId: ami-068d304eca3399469
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            DeleteOnTermination: true
            VolumeSize: 200
            VolumeType: gp2

initialization_commands:
  - wget https://secure.nic.cz/files/knot-resolver/knot-resolver-release.deb
  - sudo dpkg --install knot-resolver-release.deb
  - sudo apt-get update
  - sudo apt-get install --yes knot-resolver
  - echo $(hostname --all-ip-addresses) $(hostname) | sudo tee --append /etc/hosts
  - sudo systemctl start kresd@{1..48}.service
  - echo nameserver 127.0.0.1 | sudo tee /etc/resolv.conf
  - sudo systemctl stop systemd-resolved

setup_commands:
  - sudo apt-get update
  - sudo apt-get install --yes build-essential ffmpeg
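
The initialization_commands turn every node into its own caching DNS resolver: they install Knot Resolver, start 48 kresd instances, and point /etc/resolv.conf at 127.0.0.1, while --dns=127.0.0.1 under docker.run_options points the Ray container at the same local resolver, presumably to keep the downloader's very high volume of DNS lookups from overwhelming the default resolver. A rough way to confirm the resolver is actually serving, assuming a shell on a node (for example via ray attach cluster.yml):

# Check that the local Knot Resolver answers queries
systemctl is-active kresd@1.service   # should print "active"
getent hosts example.com              # resolves through 127.0.0.1 as set in /etc/resolv.conf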

Obscure details

  • When --data_dir points to cloud storage such as S3, a local --metadata_dir must also be specified, because the downloader script doesn't support saving metadata to cloud storage.

  • The last pip install in the setup_commands section is needed for compatibility with AWS S3, because the required libraries aren't included in the conda environment file.

  • There is no need to provide additional AWS credentials when the destination bucket is in the same account as the cluster, because the nodes already have full S3 access through an instance profile.

    • In practice, although the cluster does get a default instance profile granting full S3 access, it didn't seem to work as intended (probably due to rate limiting on the IMDS endpoint), and I ended up passing my local AWS credentials as environment variables instead.

  • The Python version in environment.yml must match the Python version of the Ray cluster; make sure that docker.image in cluster.yml uses exactly the same Python version as the environment.yml from this project (see the check below).
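
For example, the rayproject/ray:2.6.1-py310 image above ships Python 3.10, so the environment file has to pin the same minor version. A quick way to compare the two (a sketch; the grep pattern assumes environment.yml pins python explicitly, and the docker command assumes Docker is available locally):

# The image tag already encodes the Python version (py310 -> 3.10);
# environment.yml must pin the same minor version.
grep 'python=' environment.yml                                  # expect something like "- python=3.10"
docker run --rm rayproject/ray:2.6.1-py310 python --version     # Python 3.10.x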
