running eval correctly? #80

@bogedy

Hey there, I'm trying to run the eval on the random split with retrieval. I'm running 10 workers across two 40 GB A100s, and it's been going for almost 12 hours now. From my profiling, Lean seems to be a much bigger bottleneck than the model itself (see attached): the highlighted pink cell corresponds to calls to the transformer model, while the adjacent cell to the left is all `pexpect` calls to Lean. Is there anything I'm missing about how to run this, or is this the expected duration?
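For reference, here's roughly how I'm attributing time between the model and Lean. This is a minimal, self-contained sketch, not the actual ReProver code; the `time.sleep` calls stand in for a real transformer generation call and a real `pexpect` round-trip to Lean:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, totals):
    # Accumulate wall-clock time per labeled section.
    start = time.monotonic()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + (time.monotonic() - start)

totals = {}
with timed("model", totals):
    time.sleep(0.01)  # stand-in for a transformer generation call
with timed("lean", totals):
    time.sleep(0.02)  # stand-in for a pexpect round-trip to Lean

# The dominant section is the one with the largest accumulated time.
print(max(totals, key=totals.get))
```

In my actual runs, the "lean" bucket dominates by a wide margin.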

Sorry for my amateur approach, but from text-searching the logs I believe I've only completed ~200 proof searches in this time (I don't see any other progress indicator in the logs). At this rate the eval will take days, so it seems I might be doing something wrong. I'm using the default values for the timeout and the number of sampled tactics:

```python
num_sampled_tactics: int = 64,
timeout: int = 600,
```
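For a rough sense of scale, here's my back-of-envelope estimate of the remaining time, assuming the test split has on the order of 2,000 theorems (an assumed figure; I haven't verified the exact count):

```python
# Back-of-envelope estimate of total eval time from observed throughput.
theorems = 2000        # assumed test-split size (not verified)
completed = 200        # proof searches finished so far, per the logs
hours_elapsed = 12

rate = completed / hours_elapsed            # searches/hour across all workers
remaining_hours = (theorems - completed) / rate
print(f"~{rate:.0f} searches/hour; ~{remaining_hours / 24:.1f} days remaining")
```

which is where my "it will take days" estimate comes from.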

running:

```bash
python prover/evaluate.py \
  --data-path data/leandojo_benchmark_4/random/ \
  --gen_ckpt_path kaiyuy/leandojo-lean4-retriever-tacgen-byt5-small \
  --ret_ckpt_path kaiyuy/leandojo-lean4-retriever-byt5-small \
  --indexed-corpus-path $PATH_TO_INDEXED_CORPUS \
  --split test --num-workers 10 --num-gpus 2 --save-results
```

Thanks

[Attached: profiler screenshot — highlighted pink cell: transformer model calls; adjacent cell to the left: `pexpect` calls to Lean]
