Investigate Memory Usage of Scoring #145

@singjc

Proteomics Dataset

16 runs, ~32K precursors (target + decoy), ~196K transitions, ~1.4M precursor features (peak-groups)

Command
/usr/bin/time pyprophet score --in merged_osw.parquet --level ms1ms2 --classifier SVM --xeval_num_iter 3 --ss_num_iter 3 --threads 3 --profile

Peak RAM usage (maximum resident set size) is ~17.34 GiB (18182704 KiB, the maxresident value below)

1902.14user 1397.48system 23:12.49elapsed 236%CPU (0avgtext+0avgdata 18182704maxresident)k
320392inputs+1639776outputs (285major+10407902minor)pagefaults 0swaps
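
The peak RAM figure quoted above can be reproduced from the `maxresident` field of the `/usr/bin/time` output. This is a minimal sketch, assuming GNU time's default output format, where that field is in kilobytes of 1024 bytes; the helper name `max_rss_gib` is made up for illustration.

```python
import re

def max_rss_gib(time_output: str) -> float:
    """Return the max resident set size in GiB from GNU time's default output."""
    match = re.search(r"(\d+)maxresident\)k", time_output)
    if match is None:
        raise ValueError("no maxresident field found")
    kib = int(match.group(1))   # GNU time reports this field in KiB
    return kib / 1024**2        # KiB -> GiB

line = ("1902.14user 1397.48system 23:12.49elapsed 236%CPU "
        "(0avgtext+0avgdata 18182704maxresident)k")
print(f"{max_rss_gib(line):.2f} GiB")  # 17.34 GiB
```

Dividing by 1000**2 instead would give ~18.18 GB, so the ~17.34 value is in binary units (GiB).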

Note: The total memory allocated reported by memray is the cumulative virtual memory requested over the whole run (e.g. by pandas, numpy, duckdb), including memory that was later freed. It is not the actual materialized physical memory in use at any one time.

$ memray stats memray_pyp_score.bin
📏 Total allocations:
	4923580

📦 Total memory allocated:
	13.453GB

📊 Histogram of allocation size:
	min: 1.000B
	----------------------------------------------
	< 7.000B   :   79403 ▇
	< 49.000B  : 2911664 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 345.000B : 1472040 ▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 2.370KB  :  391343 ▇▇▇▇
	< 16.643KB :   55240 ▇
	< 116.825KB:    6330 ▇
	< 820.058KB:    6666 ▇
	< 5.621MB  :     443 ▇
	< 39.460MB :     396 ▇
	<=276.990MB:      55 ▇
	----------------------------------------------
	max: 276.990MB

📂 Allocator type distribution:
	 MALLOC: 4916019
	 MMAP: 6732
	 REALLOC: 702
	 CALLOC: 127

🥇 Top 15 largest allocating locations (by size):
	- <stack trace unavailable> -> 6.132GB     <- This is mostly duckdb
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 1.719GB
	- copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 1.017GB
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 978.292MB
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 790.236MB
	- _merge_ms1ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:218 -> 401.331MB
	- _merge_blocks:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2301 -> 331.308MB
	- vstack:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/shape_base.py:287 -> 331.302MB
	- _stack_arrays:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2252 -> 316.366MB
	- maybe_convert_platform:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:138 -> 222.684MB
	- collect:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/lazyframe/frame.py:2207 -> 135.000MB
	- get_join_indexers_non_unique:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/reshape/merge.py:1795 -> 130.348MB
	- maybe_infer_to_datetimelike:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/cast.py:1189 -> 111.343MB
	- _isna_array:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/dtypes/missing.py:300 -> 107.266MB
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/scoring/data_handling.py:239 -> 103.221MB

🥇 Top 15 largest allocating locations (by number of allocations):
	- <stack trace unavailable> -> 3708032
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 897673
	- __init__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:317 -> 97523
	- _merge_ms1ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:218 -> 89539
	- read:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:46 -> 64332
	- _init_duckdb_views:/home/singjc/Documents/github/pyprophet/pyprophet/io/_base.py:982 -> 46096
	- open_binary:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/psutil/_common.py:711 -> 7398
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 5711
	- read_schema:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:2348 -> 2208
	- _build_nested_paths:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:337 -> 1797
	- _to_pandas_without_object_columns:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/dataframe/frame.py:2483 -> 613
	- table_to_dataframe:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/pandas_compat.py:808 -> 285
	- _subst_vars:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/sysconfig.py:156 -> 180
	- _extend_dict:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/sysconfig.py:168 -> 168
	- _to_pandas_without_object_columns:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/polars/dataframe/frame.py:2484 -> 145
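
To see which libraries dominate the "by size" list, the text output can be grouped by package. This is a rough sketch, not part of pyprophet or memray: the helper names are hypothetical, the sample holds only the first four lines of the list above, and it assumes memray's size suffixes are binary multiples (1 GB = 1024**3 bytes).

```python
import re
from collections import defaultdict

# Assumption: memray formats sizes with 1024-based units.
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
# Matches "- <location> -> <number><unit>"; count-only lines have no unit and are skipped.
LINE = re.compile(r"^\s*-\s+(?P<loc>.+?)\s+->\s+(?P<num>[\d.]+)(?P<unit>[KMG]?B)\b")

def package_of(location: str) -> str:
    """Map an allocation location to a coarse owner label."""
    if location.startswith("<stack trace unavailable>"):
        return "(no Python stack)"
    m = re.search(r"site-packages/([^/]+)/", location)
    if m:
        return m.group(1)
    if "/pyprophet/pyprophet/" in location:
        return "pyprophet"
    return "other"

def bytes_by_package(stats_text: str) -> dict:
    """Sum the sizes of all 'by size' lines per owner label."""
    totals = defaultdict(float)
    for line in stats_text.splitlines():
        m = LINE.match(line)
        if m:
            totals[package_of(m["loc"])] += float(m["num"]) * UNITS[m["unit"]]
    return dict(totals)

sample = """
    - <stack trace unavailable> -> 6.132GB     <- This is mostly duckdb
    - __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 1.719GB
    - copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 1.017GB
    - _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/parquet.py:130 -> 978.292MB
"""
totals = bytes_by_package(sample)
for owner, size in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{owner:>18}: {size / 1024**3:.3f} GiB")
```

Pasting the full `memray stats` output into `sample` gives the per-package totals for the whole list. These are cumulative allocation sizes, so they rank allocation churn rather than resident memory.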

Phosphoproteomics Dataset

20 runs, ~45K precursors (target + decoy), ~5.7M transitions, ~1.8M precursor features (peak-groups)

Command
/usr/bin/time pyprophet score --in merged.oswpq --level ms1ms2 --ss_num_iter 3 --xeval_num_iter 3 --profile

Peak RAM usage (maximum resident set size) is ~9.67 GiB (10141896 KiB, the maxresident value below)

1271.60user 615.97system 18:56.20elapsed 166%CPU (0avgtext+0avgdata 10141896maxresident)k
168inputs+1204336outputs (95major+6227858minor)pagefaults 0swaps

Note: The total memory allocated reported by memray is the cumulative virtual memory requested over the whole run (e.g. by pandas, numpy, duckdb), including memory that was later freed. It is not the actual materialized physical memory in use at any one time.
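
A toy illustration of why the allocation total reported below (179.028GB) can far exceed the ~9.67 peak from `/usr/bin/time`. It assumes the total is a running sum over every allocation request, so temporaries that are allocated and freed repeatedly are counted each time. The workload is hypothetical and does not model pyprophet's actual allocation pattern.

```python
def total_and_peak(events):
    """events: signed byte counts; positive = allocation, negative = free.

    Returns (sum of all allocations, highest amount live at one time).
    """
    total = live = peak = 0
    for delta in events:
        if delta > 0:
            total += delta      # every allocation adds to the total
        live += delta           # frees reduce only the live amount
        peak = max(peak, live)
    return total, peak

MB = 1024**2
events = []
# Hypothetical loop: build a 100 MB temporary, free it, repeat ten times.
for _ in range(10):
    events += [100 * MB, -100 * MB]

total, peak = total_and_peak(events)
print(total // MB, peak // MB)  # 1000 100
```

On this reading, a large total alongside a modest peak points to allocation churn (many short-lived copies) rather than to memory that stays resident.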

$ memray stats memray_score.bin

📏 Total allocations:
	8573096

📦 Total memory allocated:
	179.028GB

📊 Histogram of allocation size:
	min: 1.000B
	----------------------------------------------
	< 7.000B   :  195538 ▇
	< 60.000B  : 5567520 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 473.000B : 1390693 ▇▇▇▇▇▇▇
	< 3.604KB  :  627854 ▇▇▇
	< 28.088KB :  384064 ▇▇
	< 218.924KB:  303930 ▇▇
	< 1.666MB  :   79467 ▇
	< 12.988MB :   23061 ▇
	< 101.226MB:     882 ▇
	<=788.964MB:      87 ▇
	----------------------------------------------
	max: 788.964MB

📂 Allocator type distribution:
	 MALLOC: 8375834
	 REALLOC: 136973
	 CALLOC: 42899
	 MMAP: 17390

🥇 Top 15 largest allocating locations (by size):
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 88.081GB
	- <stack trace unavailable> -> 23.864GB
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:173 -> 13.667GB
	- copy:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/blocks.py:796 -> 8.521GB
	- __array__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/series.py:1031 -> 4.738GB
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:438 -> 4.728GB
	- plot_identification_consistency:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:176 -> 4.695GB
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:440 -> 3.522GB
	- vstack:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/shape_base.py:287 -> 3.334GB
	- _merge_blocks:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2301 -> 2.884GB
	- _evaluate_standard:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/computation/expressions.py:73 -> 2.780GB
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:124 -> 1.554GB
	- _stack_arrays:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/internals/managers.py:2252 -> 1.499GB
	- take:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:1239 -> 1.236GB
	- _getitem_bool_array:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/frame.py:4154 -> 1.234GB

🥇 Top 15 largest allocating locations (by number of allocations):
	- <stack trace unavailable> -> 5635464
	- _fetch_ms2_features:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:124 -> 948497
	- _init_duckdb_views:/home/singjc/Documents/github/pyprophet/pyprophet/io/_base.py:1246 -> 216931
	- _take_nd_ndarray:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/array_algos/take.py:157 -> 147627
	- __init__:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pyarrow/parquet/core.py:317 -> 119820
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:440 -> 107224
	- <listcomp>:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:173 -> 105035
	- _any:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/numpy/_core/_methods.py:64 -> 86112
	- unique_with_mask:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/algorithms.py:438 -> 84076
	- plot_identification_consistency:/home/singjc/Documents/github/pyprophet/pyprophet/report.py:176 -> 79352
	- read:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:49 -> 64332
	- _write_parquet_with_scores:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:345 -> 64260
	- maybe_convert_indices:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/pandas/core/indexers/utils.py:280 -> 63383
	- transform_affine:/home/singjc/anaconda3/envs/pyprophet/lib/python3.9/site-packages/matplotlib/transforms.py:1865 -> 55828
	- _write_parquet_with_scores:/home/singjc/Documents/github/pyprophet/pyprophet/io/scoring/split_parquet.py:351 -> 51864
