Merged
131 commits
837149c
add 500GB runs
Tmonster Sep 4, 2024
24b6e91
add more helper files
Tmonster Sep 4, 2024
69d68bc
added 500GB. need to modify some solutions to go on disk for the 500G…
Tmonster Sep 4, 2024
96bb929
add back in on_disk check for some solutions
Tmonster Sep 4, 2024
758be4b
add script to set up and run benchmark
Tmonster Sep 5, 2024
f7f2a5b
make it easier to choose what gets run
Tmonster Sep 5, 2024
10a90c9
fixed some scripts
Tmonster Sep 5, 2024
28754db
change ~duckdb to duckdb
Tmonster Sep 5, 2024
aa76eb5
fix more path issues
Tmonster Sep 5, 2024
ae508c6
one last fix
Tmonster Sep 5, 2024
07ea245
change permissions on run files
Tmonster Sep 5, 2024
67b1ec5
move -c
Tmonster Sep 5, 2024
c64fd3e
small update
Tmonster Sep 5, 2024
1179bf4
fix polar src_dataname
Tmonster Sep 5, 2024
a637af5
fix polar scale factor again
Tmonster Sep 5, 2024
bcb2831
neato
Tmonster Sep 5, 2024
66dc1a6
modify setup script to be more modular
Tmonster Sep 6, 2024
39fdd45
use aws s3 copy and create a _setup_utils specifically for setting up…
Tmonster Sep 6, 2024
6a4c1bc
fix run.conf and run.sh
Tmonster Sep 6, 2024
aef8e50
modify regression benchmark runner
Tmonster Sep 6, 2024
5b31321
fix some datafusion things
Tmonster Sep 6, 2024
960c07c
datafusion needs better scale factor
Tmonster Sep 6, 2024
1dce3c6
fix datafusion
Tmonster Sep 6, 2024
894eeef
some more updates to duckdb
Tmonster Sep 10, 2024
293d5d0
typo
Tmonster Sep 10, 2024
a3cba57
fix merge conflicts
Tmonster Sep 12, 2024
1a30c75
Merge branch 'add_500GB_run' of github.com:Tmonster/db-benchmark into…
Tmonster Sep 12, 2024
a6dbf35
add new line to path.env
Tmonster Sep 12, 2024
7563d4a
add datafusion ability to go off disk
Tmonster Sep 12, 2024
0b0e43f
some updates
Tmonster Sep 13, 2024
0855b4d
clean up, will add 500GB runs later
Tmonster Jan 14, 2025
d0ea331
add machine type to file names
Tmonster Jan 15, 2025
ce0d398
Add new column for machine type to time.csv and logs.csv
Tmonster Jan 15, 2025
491e61a
have scripts handle new machine type column
Tmonster Jan 15, 2025
b708a0c
write machine type to logs as well
Tmonster Jan 15, 2025
9cdac26
logs also needs machine type
Tmonster Jan 15, 2025
db1b5d8
change header from machine_size to machine_type
Tmonster Jan 15, 2025
f04e2fa
add machine type to data passed around
Tmonster Jan 15, 2025
3c93ec2
remove traces of 500GB
Tmonster Jan 15, 2025
1493560
fix active tab
Tmonster Jan 15, 2025
7ae9789
fix other solutions to use machine type
Tmonster Jan 16, 2025
ee1d220
fix naming when unpacking data
Tmonster Jan 16, 2025
348b59e
no default machine type sizes
Tmonster Jan 16, 2025
35e7877
more fixes to adding a new machine type
Tmonster Jan 17, 2025
3895781
Merge branch 'main' into add_new_machine_type
Tmonster Jan 17, 2025
d2c8bd4
some minor changes for PR
Tmonster Jan 17, 2025
ba0eb40
remove unused extract files
Tmonster Jan 17, 2025
64da7a1
remove traces of 1e10 code
Tmonster Jan 17, 2025
b302fbd
Merge branch 'main' into add_new_machine_type
Tmonster Jan 20, 2025
986f1c6
get rid of some clickhouse setup files. Use 'spill_dir' instead of mo…
Tmonster Jan 21, 2025
9a16a3e
modify some more clickhouse things
Tmonster Jan 21, 2025
548292b
clickhouse should store data on disk
Tmonster Jan 21, 2025
42a1bfb
fix syntax
Tmonster Jan 21, 2025
a52eed7
export on disk
Tmonster Jan 21, 2025
9121932
fix spark memory usage
Tmonster Jan 22, 2025
7533188
read machine_type env variable dask
Tmonster Jan 22, 2025
22b97fd
use aws machine names in time and logs
Tmonster Jan 22, 2025
4f4d17d
solutions have different naming now too
Tmonster Jan 22, 2025
0edc841
Run scripts also use new machine names
Tmonster Jan 22, 2025
9bc804b
more fixes to help with setup
Tmonster Jan 22, 2025
2a7f8ae
fix some small things
Tmonster Jan 22, 2025
83ef8f3
proper update to time and logs
Tmonster Jan 23, 2025
0e1d9dd
add new timings
Tmonster Jan 24, 2025
af13e6c
duckdb should spill to disk when machine is small
Tmonster Jan 24, 2025
49c038c
clickhouse should set up user that has low memory restraint
Tmonster Jan 24, 2025
cf1b854
helpers should not default machine type
Tmonster Jan 24, 2025
f3811ad
trying to figure out why the report wont generate
Tmonster Jan 24, 2025
980b72c
report correctly generates now
Tmonster Jan 24, 2025
b082ef0
index report now has all reports
Tmonster Jan 24, 2025
7032c7b
modify gitignore
Tmonster Jan 24, 2025
9960e67
modify datas
Tmonster Jan 24, 2025
cd3d62f
modify data back to original
Tmonster Jan 24, 2025
37ce639
add no sign request
Tmonster Jan 24, 2025
b83b604
fix some path stuff
Tmonster Jan 24, 2025
8736a0e
smaller changes to help with report creation
Tmonster Jan 27, 2025
e8fcddc
dont run 50GB join on c6id.4xlarge
Tmonster Jan 29, 2025
f4c716e
duckdb should have a temp table
Tmonster Jan 30, 2025
918b949
do not run window query on small machine
Tmonster Feb 5, 2025
3ee5cb9
add new duckdb results
Tmonster Feb 5, 2025
47add3f
change duckdb version to v1.2.0
Tmonster Feb 5, 2025
fcbf35a
update index.Rmd to show join results as well
Tmonster Feb 5, 2025
bc6006f
write machine type for duckdb join
Tmonster Feb 5, 2025
f82b66e
fix logs.csv
Tmonster Feb 5, 2025
1523b73
update error messages for 50GB datasets on small machine
Tmonster Feb 5, 2025
7059a00
add -p to mkdir
Tmonster Feb 5, 2025
ab685ff
update index.Rmd
Tmonster Feb 5, 2025
7553a9d
more changes to clickhouse setup (more permissions)
Tmonster Feb 5, 2025
9b9e9e5
more clickhouse checks
Tmonster Feb 5, 2025
de6c715
more recent clickhouse times
Tmonster Feb 5, 2025
851152c
add back in v1.2.0 results
Tmonster Feb 5, 2025
44a6a38
fix time.csv one more time
Tmonster Feb 5, 2025
6f3530c
update duckdb and clickhouse versions
Tmonster Feb 5, 2025
3b5f13b
fix duckdb times for small joins
Tmonster Feb 5, 2025
ea86c8b
always try to run large join even on small machine
Tmonster Feb 6, 2025
f011b3a
fix typo in clickhouse script and fix comment in run_large
Tmonster Feb 6, 2025
197b256
fix datafusion script
Tmonster Feb 6, 2025
68377d5
fix run.sh
Tmonster Feb 6, 2025
2e34c23
new clickhouse times
Tmonster Feb 6, 2025
646aceb
add time and logs that somehow disappeared
Tmonster Feb 6, 2025
c779a74
fix spark join script
Tmonster Feb 6, 2025
b1d385b
fix logs for clickhouse
Tmonster Feb 6, 2025
3c8cd44
clarify errors in report
Tmonster Feb 6, 2025
8c69917
fix regression script
Tmonster Feb 6, 2025
889589c
fix regression.yml again
Tmonster Feb 6, 2025
171137e
more fixing regression script
Tmonster Feb 6, 2025
1c793e2
in setup_utils, not utils
Tmonster Feb 6, 2025
be19cc4
run on both machine types, otherwise errors during validation
Tmonster Feb 6, 2025
e9917ca
fix spill dirs
Tmonster Feb 6, 2025
f5760b4
don't source the run.conf, will override the command line machine type
Tmonster Feb 6, 2025
0180e9d
remove machine type from run.conf anyway
Tmonster Feb 6, 2025
9a17d13
spark needs to shut down java, so sleep in between runs
Tmonster Feb 6, 2025
707b78e
more regression.yml fixes
Tmonster Feb 7, 2025
2cb4652
fix small report text
Tmonster Feb 7, 2025
618d4bb
Merge branch 'main' into add_new_machine_type
Tmonster Feb 7, 2025
36feff1
try to get julia working
Tmonster Feb 7, 2025
f314730
hopefully fix dask and pydatatable
Tmonster Feb 7, 2025
9bb3658
change name for github solo solutions
Tmonster Feb 10, 2025
dc9b0a0
remove steps from repro that are not needed
Tmonster Feb 10, 2025
515ce2c
try to fix julia ds again
Tmonster Feb 10, 2025
8fa928f
try to fix datafusion
Tmonster Feb 10, 2025
3ffd0a8
Merge branch 'main' into add_new_machine_type
Tmonster Feb 10, 2025
e9e6709
fix some setup scripts
Tmonster Feb 10, 2025
5ba2714
install dataframes in juliads so CI passes
Tmonster Feb 10, 2025
5e6575f
now julia should be fixes
Tmonster Feb 10, 2025
b3bbfa8
fix helpers.jl
Tmonster Feb 10, 2025
1a6623d
remove references to helpersds
Tmonster Feb 10, 2025
3a55cdc
fix (hopefully) last juliads problems
Tmonster Feb 10, 2025
9b8ac43
Revert "fix (hopefully) last juliads problems"
Tmonster Feb 10, 2025
d0e3779
Revert "remove references to helpersds"
Tmonster Feb 10, 2025
ef6d93a
Revert "now julia should be fixes"
Tmonster Feb 10, 2025
5b17cc1
give juliads its own helpers file
Tmonster Feb 10, 2025
41 changes: 27 additions & 14 deletions .github/workflows/regression.yml
@@ -18,7 +18,7 @@ jobs:
fail-fast: false
matrix:
solution: [data.table, collapse, dplyr, pandas, pydatatable, spark, juliadf, juliads, polars, R-arrow, duckdb, datafusion, dask, clickhouse]
name: Regression Tests solo solutions
name: Solo solutions
runs-on: ubuntu-20.04
env:
CC: gcc-10
@@ -36,7 +36,7 @@ jobs:

- name: Install libraries
shell: bash
run: ./_utils/setup-small.sh
run: ./_setup_utils/setup_small.sh

- name: Generate 500mb datasets
shell: bash
@@ -50,7 +50,7 @@
shell: bash
run: source path.env && python3 _setup_utils/install_all_solutions.py ${{ matrix.solution }}

- name: Turn swap off
- name: Turn swap off
shell: bash
run: sudo swapoff -a

@@ -61,23 +61,32 @@ jobs:
shell: bash
if: ${{ matrix.solution == 'clickhouse' || matrix.solution == 'all' }}
run: |
python3 _utils/prep_solutions.py --task=groupby --solution=clickhouse
python3 _setup_utils/prep_solutions.py --task=groupby --solution=clickhouse
source path.env
TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
MACHINE_TYPE="c6id.4xlarge" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60
MACHINE_TYPE="c6id.metal" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60

- name: Run mini GroupBy benchmark
shell: bash
run: |
python3 _utils/prep_solutions.py --task=groupby --solution=${{ matrix.solution }}
python3 _setup_utils/prep_solutions.py --task=groupby --solution=${{ matrix.solution }}
source path.env
TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
MACHINE_TYPE="c6id.4xlarge" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60
MACHINE_TYPE="c6id.metal" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60

- name: Run mini Join benchmark
shell: bash
run: |
python3 _utils/prep_solutions.py --task=join --solution=${{ matrix.solution }}
python3 _setup_utils/prep_solutions.py --task=join --solution=${{ matrix.solution }}
source path.env
TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
MACHINE_TYPE="c6id.4xlarge" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60
MACHINE_TYPE="c6id.metal" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60

- name: Validate benchmark results and report generation
shell: bash
@@ -123,7 +132,7 @@ jobs:

- name: Install libraries
shell: bash
run: ./_utils/setup-small.sh
run: ./_setup_utils/setup_small.sh

- name: Generate 500mb datasets
shell: bash
@@ -144,16 +153,20 @@
- name: Run mini GroupBy benchmark
shell: bash
run: |
python3 _utils/prep_solutions.py --task=groupby --solution=all
python3 _setup_utils/prep_solutions.py --task=groupby --solution=all
source path.env
TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
MACHINE_TYPE="c6id.4xlarge" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60
MACHINE_TYPE="c6id.metal" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh

- name: Run mini Join benchmark
shell: bash
run: |
python3 _utils/prep_solutions.py --task=join --solution=all
python3 _setup_utils/prep_solutions.py --task=join --solution=all
source path.env
TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
MACHINE_TYPE="c6id.4xlarge" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh
sleep 60
MACHINE_TYPE="c6id.metal" TEST_RUN=true TEST_MOUNT_DIR=$GITHUB_WORKSPACE ./run.sh

- name: Validate benchmark results and report generation
shell: bash
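The workflow changes above replace each single `./run.sh` invocation with one run per machine type, exporting `MACHINE_TYPE` to the script and sleeping between runs. A minimal Python sketch of that pattern, with illustrative helper names that are not from the repo:

```python
import os

# The two machine types the updated workflow exercises.
MACHINE_TYPES = ["c6id.4xlarge", "c6id.metal"]

def benchmark_envs(workspace, machine_types=MACHINE_TYPES):
    """Build one run.sh environment per machine type, mirroring the
    MACHINE_TYPE=... TEST_RUN=true TEST_MOUNT_DIR=... invocations above."""
    envs = []
    for machine_type in machine_types:
        env = dict(os.environ)
        env.update(MACHINE_TYPE=machine_type,
                   TEST_RUN="true",
                   TEST_MOUNT_DIR=workspace)
        envs.append(env)
    return envs

# Each env would then be passed to subprocess.run(["./run.sh"], env=env),
# with a time.sleep(60) between runs so the previous solution (e.g. Spark's
# JVM) can shut down fully, as the "sleep 60" steps above suggest.
```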
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,6 +5,8 @@ metastore_db/*
*.csv
!time.csv
!logs.csv
!_control/data_small.csv
!_control/data_large.csv
*.md5
.Rproj.user
.Rhistory
41 changes: 21 additions & 20 deletions R-arrow/groupby-R-arrow.R

Large diffs are not rendered by default.

21 changes: 11 additions & 10 deletions R-arrow/join-R-arrow.R
@@ -17,6 +17,7 @@ cache = TRUE
on_disk = FALSE

data_name = Sys.getenv("SRC_DATANAME")
machine_type = Sys.getenv("MACHINE_TYPE")
src_jn_x = file.path("data", paste(data_name, "csv", sep="."))
y_data_name = join_to_tbls(data_name)
src_jn_y = setNames(file.path("data", paste(y_data_name, "csv", sep=".")), names(y_data_name))
@@ -46,15 +47,15 @@ t = system.time({
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
rm(ans)
t = system.time({
ans<-collect(inner_join(x, small, by="id1"))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
@@ -68,15 +69,15 @@ t = system.time({
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
rm(ans)
t = system.time({
ans<-collect(inner_join(x, medium, by="id2"))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
@@ -90,15 +91,15 @@ t = system.time({
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
rm(ans)
t = system.time({
ans<-collect(left_join(x, medium, by="id2"))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
@@ -112,15 +113,15 @@ t = system.time({
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
rm(ans)
t = system.time({
ans <- collect(inner_join(x, medium, by="id5"))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
@@ -134,15 +135,15 @@ t = system.time({
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
rm(ans)
t = system.time({
ans<-collect(inner_join(x, big, by="id3"))
print(dim(ans))
})[["elapsed"]]
m = memory_usage()
chkt = system.time(chk <- collect(summarise(ans, sum(v1, na.rm=TRUE), sum(v2, na.rm=TRUE))))[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk, machine_type=machine_type)
ans <- collect(ans)
print(head(ans, 3))
print(tail(ans, 3))
26 changes: 19 additions & 7 deletions _benchplot/benchplot-dict.R
@@ -267,10 +267,12 @@ groupby.syntax.dict = {list(
)}
groupby.data.exceptions = {list( # exceptions as of run 1575727624
"collapse" = {list(
"Not Tested" = c("G1_1e9_1e2_0_0")
)},
"data.table" = {list(
"timeout" = c("G1_1e9_1e1_0_0", # not always happened, q8 probably #110
"G1_1e9_2e0_0_0") # q4 #110 also sometimes segfaults during fread but not easily reproducible
"G1_1e9_2e0_0_0"),
"Not Tested" = c("G1_1e9_1e2_0_0") # q4 #110 also sometimes segfaults during fread but not easily reproducible
)},
"dplyr" = {list(
"timeout" = c("G1_1e8_2e0_0_0"), # q10
@@ -285,7 +287,8 @@ groupby.data.exceptions = {list(
"csv reader NAs bug: datatable#2808" = c("G1_1e9_1e2_5_0")
)},
"spark" = {list(
"timeout" = "G1_1e9_1e2_5_0" ## seems that both runs have finished but second run timing was not logged to time.csv due to timeout
"timeout" = "G1_1e9_1e2_5_0", ## seems that both runs have finished but second run timing was not logged to time.csv due to timeout
"Not Tested" = c("G1_1e9_1e2_0_0")
)},
"dask" = {list(
"not yet implemented: dask#6986" = c("G1_1e7_1e2_5_0","G1_1e8_1e2_5_0","G1_1e9_1e2_5_0"), # #171
@@ -307,9 +310,11 @@ groupby.data.exceptions = {list(
"CSV import Segfault: JuliaLang#55765" = c("G1_1e7_1e2_0_0","G1_1e7_1e1_0_0","G1_1e7_2e0_0_0","G1_1e7_1e2_0_1","G1_1e7_1e2_5_0","G1_1e8_1e2_0_0","G1_1e8_1e1_0_0","G1_1e8_2e0_0_0","G1_1e8_1e2_0_1","G1_1e8_1e2_5_0","G1_1e9_1e2_0_0","G1_1e9_1e1_0_0","G1_1e9_2e0_0_0","G1_1e9_1e2_0_1","G1_1e9_1e2_5_0")
)},
"clickhouse" = {list(
"Out of Memory" = c("G1_1e9_1e2_0_0")
)},
"polars" = {list(
# "out of memory" = c("G1_1e9_1e2_0_0","G1_1e9_1e1_0_0","G1_1e9_2e0_0_0","G1_1e9_1e2_0_1","G1_1e9_1e2_5_0") # q10
"Not Tested" = c("G1_1e9_1e2_0_0")
)},
"R-arrow" = {list(
# "timeout" = c(), # q10
@@ -325,7 +330,9 @@ groupby.data.exceptions = {list(
# "out of memory" = c("G1_1e9_1e2_0_0","G1_1e9_1e1_0_0","G1_1e9_2e0_0_0","G1_1e9_1e2_0_1","G1_1e9_1e2_5_0"),
# "incorrect: duckdb#1737" = c("G1_1e7_1e2_5_0","G1_1e8_1e2_5_0")
)},
"datafusion" = {list()}
"datafusion" = {list(
"Not Tested" = c("G1_1e9_1e2_0_0")
)}
)}
groupby.exceptions = task.exceptions(groupby.query.exceptions, groupby.data.exceptions)

@@ -463,6 +470,7 @@ join.query.exceptions = {list(
)}
join.data.exceptions = {list( # exceptions as of run 1575727624
"collapse" = {list(
"Not tested" = c("J1_1e9_NA_0_0")
)},
"data.table" = {list(
"timeout" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1") # fread
@@ -478,7 +486,8 @@ join.data.exceptions = {list(
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_0_1") # q5 out of memory due to a deep copy
)},
"spark" = {list(
"timeout" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1") # q5 using new 8h timeout #126
# "timeout" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1"), # q5 using new 8h timeout #126
"Not tested" = c("J1_1e9_NA_0_0")
)},
"dask" = {list(
"internal error: dask#7015" = c("J1_1e7_NA_0_0","J1_1e7_NA_5_0","J1_1e7_NA_0_1", # dask/dask#7015
@@ -494,6 +503,7 @@ join.data.exceptions = {list(
"CSV import Segfault: JuliaLang#55765" = c("J1_1e7_NA_0_0", "J1_1e7_NA_5_0", "J1_1e7_NA_0_1", "J1_1e8_NA_0_0", "J1_1e8_NA_5_0", "J1_1e8_NA_0_1", "J1_1e9_NA_0_0")
)},
"clickhouse" = {list(
"Out of Memory" = c("J1_1e9_NA_0_0")
)},
"polars" = {list(
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")
@@ -504,15 +514,17 @@
)},
"duckdb" = {list(
# "internal error: duckdb#1739" = c("J1_1e7_NA_0_0","J1_1e7_NA_5_0","J1_1e7_NA_0_1","J1_1e8_NA_0_0","J1_1e8_NA_5_0","J1_1e8_NA_0_1"),
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")#,
# "out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")#,
#"incorrect: duckdb#1737" = c("J1_1e7_NA_5_0","J1_1e8_NA_5_0")
)},
"duckdb-latest" = {list(
# "internal error: duckdb#1739" = c("J1_1e7_NA_0_0","J1_1e7_NA_5_0","J1_1e7_NA_0_1","J1_1e8_NA_0_0","J1_1e8_NA_5_0","J1_1e8_NA_0_1"),
"out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")#,
# "out of memory" = c("J1_1e9_NA_0_0","J1_1e9_NA_5_0","J1_1e9_NA_0_1")#,
#"incorrect: duckdb#1737" = c("J1_1e7_NA_5_0","J1_1e8_NA_5_0")
)},
"datafusion" = {list()}
"datafusion" = {list(
"Not tested" = c("J1_1e9_NA_0_0")
)}
)}
join.exceptions = task.exceptions(join.query.exceptions, join.data.exceptions)

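`_benchplot/benchplot-dict.R` keeps the exceptions as nested solution → reason → datasets lists, while the report needs them per (solution, dataset). A small Python sketch of that inversion, using a few entries transcribed from the `join.data.exceptions` diff; the real code is the R `task.exceptions()` helper, not shown here:

```python
# A few entries transcribed from the join.data.exceptions diff above.
join_data_exceptions = {
    "clickhouse": {"Out of Memory": ["J1_1e9_NA_0_0"]},
    "collapse":   {"Not tested":    ["J1_1e9_NA_0_0"]},
    "spark":      {"Not tested":    ["J1_1e9_NA_0_0"]},
    "datafusion": {"Not tested":    ["J1_1e9_NA_0_0"]},
}

def exceptions_by_dataset(per_solution):
    """Invert solution -> reason -> [datasets] into (solution, dataset) -> reason,
    the shape a report needs when annotating a single missing result."""
    flat = {}
    for solution, reasons in per_solution.items():
        for reason, datasets in reasons.items():
            for dataset in datasets:
                flat[(solution, dataset)] = reason
    return flat
```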
6 changes: 5 additions & 1 deletion _benchplot/benchplot.R
@@ -1,8 +1,12 @@
## Based on Matt Dowle scripts from 2014
## https://github.com/h2oai/db-benchmark/commit/fce1b8c9177afb49471fcf483a438f619f1a992b
## Original grouping benchmark can be found in: https://github.com/Rdatatable/data.table/wiki/Benchmarks-:-Grouping
suppressPackageStartupMessages(library(bit64))

format_comma = function(x) {
format(as.integer64(x), big.mark=",")
}

format_comma = function(x) format(as.integer(x), big.mark=",")
format_num = function(x, digits=3L) { # at least 3+1 chars on output, there is surely some setting to achieve that better with base R but it is not obvious to find that among all features there
cx = sprintf("%0.2f", x)
int = sapply(strsplit(cx, ".", fixed=TRUE), `[`, 1L)
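The `format_comma` change swaps `as.integer()` for `bit64`'s `as.integer64()`, presumably because R's base integers are 32-bit: `as.integer()` returns `NA` for counts above 2^31 - 1 (about 2.1e9), which would break row-count labels for datasets beyond 1e9 rows. In Python, where integers are arbitrary precision, the same formatting needs no workaround:

```python
def format_comma(n):
    """Comma-group a row count, e.g. 1e9 -> '1,000,000,000'.

    Python ints never overflow, so no bit64-style widening is needed;
    int(n) only normalizes float inputs such as 1e9 to integers first.
    """
    return f"{int(n):,}"
```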
7 changes: 7 additions & 0 deletions _control/data_large.csv
@@ -0,0 +1,7 @@
task,data,nrow,k,na,sort,active
groupby,G1_1e9_1e2_0_0,1e9,1e2,0,0,1
groupby,G1_1e9_1e1_0_0,1e9,1e1,0,0,1
groupby,G1_1e9_2e0_0_0,1e9,2e0,0,0,1
groupby,G1_1e9_1e2_0_1,1e9,1e2,0,1,1
groupby,G1_1e9_1e2_5_0,1e9,1e2,5,0,1
join,J1_1e9_NA_0_0,1e9,NA,0,0,1
17 changes: 17 additions & 0 deletions _control/data_small.csv
@@ -0,0 +1,17 @@
task,data,nrow,k,na,sort,active
groupby,G1_1e7_1e2_0_0,1e7,1e2,0,0,1
groupby,G1_1e7_1e1_0_0,1e7,1e1,0,0,1
groupby,G1_1e7_2e0_0_0,1e7,2e0,0,0,1
groupby,G1_1e7_1e2_0_1,1e7,1e2,0,1,1
groupby,G1_1e7_1e2_5_0,1e7,1e2,5,0,1
groupby,G1_1e8_1e2_0_0,1e8,1e2,0,0,1
groupby,G1_1e8_1e1_0_0,1e8,1e1,0,0,1
groupby,G1_1e8_2e0_0_0,1e8,2e0,0,0,1
groupby,G1_1e8_1e2_0_1,1e8,1e2,0,1,1
groupby,G1_1e8_1e2_5_0,1e8,1e2,5,0,1
join,J1_1e7_NA_0_0,1e7,NA,0,0,1
join,J1_1e7_NA_5_0,1e7,NA,5,0,1
join,J1_1e7_NA_0_1,1e7,NA,0,1,1
join,J1_1e8_NA_0_0,1e8,NA,0,0,1
join,J1_1e8_NA_5_0,1e8,NA,5,0,1
join,J1_1e8_NA_0_1,1e8,NA,0,1,1
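The new `_control/data_small.csv` and `_control/data_large.csv` files let the runner pick datasets by machine type. A sketch of how such a control file might be consumed — the actual parsing lives in the repo's launcher scripts, not shown here:

```python
import csv
import io

def active_datasets(csv_text, task=None):
    """Return active dataset names from a control file, optionally
    filtered by task ('groupby' or 'join')."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["data"] for row in rows
            if row["active"] == "1" and (task is None or row["task"] == task)]

# First and last data lines of _control/data_large.csv from the diff above:
DATA_LARGE = """\
task,data,nrow,k,na,sort,active
groupby,G1_1e9_1e2_0_0,1e9,1e2,0,0,1
join,J1_1e9_NA_0_0,1e9,NA,0,0,1
"""
```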