
Conversation

@jayckaiser (Collaborator) commented Sep 27, 2024

This is a research branch trying to resolve the lack of PyArrow strings being used in Python 3.12. There are a few main changes:

  • Force the datatype in read_csv() from str (i.e., plain Python strings in object columns) to "string" (i.e., the pandas string dtype).
  • Remove the forced string-deconversion from FileSource.execute().
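A minimal illustration of what the first change amounts to, assuming pandas (the data here is made up; only the dtype behavior matters):

```python
import io
import pandas as pd

data = "id,name\n1,alice\n2,bob\n"

# dtype=str: every value is a plain Python str inside an object-dtype column
df_obj = pd.read_csv(io.StringIO(data), dtype=str)

# dtype="string": columns use pandas' dedicated StringDtype instead
df_string = pd.read_csv(io.StringIO(data), dtype="string")

print(df_obj["name"].dtype)     # object
print(df_string["name"].dtype)  # string
```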

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes). The following fixes are required:

  • Overload the fromjson() Jinja macro to fall back to ast.literal_eval() when the JSON uses single quotes.
  • Require fromjson() to be applied in Jinja templating in YAML when retrieving nested fields.
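For reference, a sketch of what such an overloaded fromjson() helper could look like; the function body is illustrative, not the actual earthmover macro:

```python
import ast
import json

def fromjson(value):
    """Deserialize a JSON-ish string into Python objects.

    Falls back to ast.literal_eval() for single-quoted payloads
    (e.g. the repr of a Python dict) that json.loads() rejects.
    """
    if not isinstance(value, str):
        return value  # already deserialized
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return ast.literal_eval(value)

print(fromjson('{"a": 1}'))         # {'a': 1}
print(fromjson("{'a': {'b': 2}}"))  # {'a': {'b': 2}}
```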

Some observations:

  • The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one-third the time with two-thirds the memory during earthmover -t.
  • We can remove the mandatory calls to the pyarrow backend, since this is turned on by default when available and raises an error in 3.8 otherwise.
  • I have yet to run this on a larger dataset where these performance impacts would be more noteworthy. Please try running one and let me know whether we only see poorer performance on smaller datasets (given the serialization overhead).
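One way to drop the mandatory calls while staying 3.8-safe would be a guard like the following (a hypothetical sketch, not earthmover's actual code; note dtype_backend requires pandas >= 2.0):

```python
import io
import pandas as pd

# Only request the pyarrow dtype backend when pyarrow is importable;
# in Python 3.8 environments without pyarrow, fall back to NumPy dtypes.
read_kwargs = {}
try:
    import pyarrow  # noqa: F401
    read_kwargs["dtype_backend"] = "pyarrow"
except ImportError:
    pass

df = pd.read_csv(io.StringIO("a,b\nx,1\n"), **read_kwargs)
print(df["a"].iloc[0])
```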

Please let me know your thoughts.

@tomreitz (Collaborator)

Thanks for this, @jayckaiser - exciting work. I'll dig into it, test, etc. probably the week after next.

In the meantime, I want to ask/clarify two things:

The difference in runtime between Python 3.8 and 3.12 is drastic: Python 3.8 runs in one-third the time with two-thirds the memory during earthmover -t.

I'm confused about which is faster. Are you saying that, with your PyArrow changes/optimizations, the runtime is slower on 3.12? (This seems counterintuitive to me.) Or is it the other way around, and it's faster on 3.12?

That second change causes breakages when using nested JSON data (since we are no longer using generic object datatypes. [and the two bullet points below this]

Suppose earthmover reads in a JSONL file containing a line/row/payload like the following (un-linearized):

{
  "field": {
    "some": {
      "deeply": {
        "nested": {
          "property": "value"
        }
      }
    }
  }
}

Are you saying that

  • previously field would be passed through dataframes as an object column, and hence could be referenced in a .jsont template with {{ field.some.deeply.nested.property }}
  • with these changes, the field would be passed through dataframes as a (PyArrow) string column, and thus a .jsont template would have to do {% set field_object = fromjson(field) %}{{field_object.some.deeply.nested.property}} (or similar)

(I'd really like to avoid changes to earthmover that would require changes to projects' earthmover.yml and/or *.jsont, so I'm hoping I'm misunderstanding here.)
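To make the question concrete, the two access patterns can be mimicked with plain jinja2 (fromjson here is just json.loads standing in for the earthmover macro):

```python
import json
from jinja2 import Environment

env = Environment()
env.globals["fromjson"] = json.loads  # stand-in for the earthmover macro

payload = {"field": {"some": {"deeply": {"nested": {"property": "value"}}}}}

# object column: dotted access works directly on the nested dict
t_obj = env.from_string("{{ field.some.deeply.nested.property }}")
print(t_obj.render(field=payload["field"]))  # value

# string column: the template must deserialize first
t_str = env.from_string(
    "{% set f = fromjson(field) %}{{ f.some.deeply.nested.property }}"
)
print(t_str.render(field=json.dumps(payload["field"])))  # value
```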

Base automatically changed from feature/python-312 to main October 16, 2024 21:48
@tomreitz (Collaborator)

@jayckaiser I ran example_projects/01_simple/big_earthmover.yaml with Python 3.10:

$ python3 -V
Python 3.10.12
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 10:38:33.953 earthmover INFO starting...
2024-10-18 10:38:34.024 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 11:59:51.878 earthmover INFO done!
        User time (seconds): 3684.89
        System time (seconds): 81.00
        Percent of CPU this job got: 77%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:21:19
        Maximum resident set size (kbytes): 1347476
        ...

(so 1hr 21min, max 1.3GB memory used - this was with a 3.2GB input TSV file, producing a 28GB JSONL file)

With Python 3.12:

$ python3 -V
Python 3.12.5
$ /usr/bin/time -v earthmover run -c big_earthmover.yaml 
2024-10-18 12:22:34.749 earthmover INFO starting...
2024-10-18 12:22:34.813 earthmover INFO skipping hashing and run-logging (no `state_file` defined in config)
2024-10-18 14:16:41.572 earthmover INFO done!
        User time (seconds): 5460.58
        System time (seconds): 127.11
        Percent of CPU this job got: 81%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:09
        Maximum resident set size (kbytes): 1722360
        ...

so 1hr 54min (40% longer), max 1.7GB memory used (28% more). This confirms your result on a large dataset: slower and less memory efficient under Python 3.12 with PyArrow strings 😢.
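For the record, recomputing the overhead from the wall-clock times and max RSS figures above:

```python
# Wall-clock seconds and max RSS (kB) from the two runs above.
t_310 = 1 * 3600 + 21 * 60 + 19   # 1:21:19 on Python 3.10
t_312 = 1 * 3600 + 54 * 60 + 9    # 1:54:09 on Python 3.12
mem_310 = 1_347_476
mem_312 = 1_722_360

runtime_pct = 100 * (t_312 / t_310 - 1)
memory_pct = 100 * (mem_312 / mem_310 - 1)
print(f"runtime: {runtime_pct:.0f}% longer")  # runtime: 40% longer
print(f"memory:  {memory_pct:.0f}% more")     # memory:  28% more
```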
