genbindings: Streaming parse clang output to reduce memory, get correct filepaths by arnetheduck · Pull Request #226 · mappu/miqt

arnetheduck · 2025-05-25T08:24:21Z

To prepare genbindings for accepting arbitrary libraries from the command line, we must retain parse types not only from the header being processed but also its includes, to understand things like type hierarchies etc without depending on a particular module order.

This is the first baby step that addresses clang AST filtering.

The change has some collateral benefits: the improved accuracy of the AST parsing keeps more relevant information but at the same time reduces the memory footprint since filtering is done streaming instead of loading everything into memory first (ditto when writing the cache).

A consequence is that a lot more clang processes can run in parallel without OOMing.

parse "file" correctly to discover type provenance - using the standard go json parser for this does not work since the depend on serialization order which go discards (see
https://github.com/dtolnay/clang-ast?tab=readme-ov-file#source-locations)
stream clang output to JSON decoder and stream-write the cache to reduce memory footprint
with the newfound memory efficiency, bump up the number of parallel jobs and use threads for the parsing as well
fix missing classes such as QSysInfo using the corrected file source location field
don't try to re-run clang if it fails (OOM is unlikely anyway)

arnetheduck · 2025-05-25T08:25:58Z

draft because of the numbered renames - need some help with these in case we want to merge this

cmd/genbindings/clangast.go

To prepare genbindings for accepting arbitrary libraries from the command line, we must retain parse types not only from the header being processed but also its includes, to understand things like type hierarchies etc without depending on a particular module order. This is the first baby step that addresses clang AST filtering. The change has some collateral benefits: the improved accuracy of the AST parsing keeps more relevant information but at the same time reduces the memory footprint since filtering is done streaming instead of loading everything into memory first (ditto when writing the cache). A consequence is that a lot more clang processes can run in parallel without OOMing. * parse "file" correctly to discover type provenance - using the standard go json parser for this does not work since the depend on serialization order which go discards (see https://github.com/dtolnay/clang-ast?tab=readme-ov-file#source-locations) * stream clang output to JSON decoder and stream-write the cache to reduce memory footprint * with the newfound memory efficiency, bump up the number of parallel jobs and use threads for the parsing as well * fix missing classes such as QSysInfo using the corrected `file` source location field * don't try to re-run clang if it fails (OOM is unlikely anyway)

arnetheduck · 2025-05-28T07:12:29Z

qt6/qml/gen_qqmllist.h

 #endif

 QQmlListReference* QQmlListReference_new();
-QQmlListReference* QQmlListReference_new2(QVariant* variant);


the interesting thing about these removed overloads is that 3 and 4 are the same .. 🤷

arnetheduck · 2025-06-03T09:38:39Z

the qmllist changes look legit as far as I can tell, so undrafting

qt/gen_qstringlist.h

arnetheduck · 2025-07-07T06:15:50Z

ping @mappu anything more needed here?

mappu · 2025-07-13T05:26:16Z

I think this LGTM 🚢 The change makes sense to me, just need a spare evening to test out the performance delta on my PC.

I really wish there was a cleaner way of doing the JSON munging but at least it's pretty well isolated.

rcalixte · 2025-07-13T05:38:03Z

Do we want to assess the conflicts this will have with #118? The thought of rebasing somewhat terrifies me. I don't know that I'll be applying this change downstream either.

arnetheduck · 2025-07-13T15:51:09Z

Do we want to assess the conflicts this will have

Semantically, I believe there is no conflict - this PR basically enriches the incoming JSON from clang with correct provenance information - from that point of view, the two pr:s address different stages of the processing pipeline, this PR fixing the input to what #118 has to transform - in a way, this PR simply reduces the garbage coming into #118 and later stages and thus reduces the garbage coming out of them.

There might indeed be some minor code adjustments to make, though these should in principle be trivial since it's still the same data being processed (just that it's now correct:er)

rcalixte · 2025-07-14T05:25:12Z

Semantically, I believe there is no conflict

Handling clang2il.go and main.go seem to be the worst of it. It would help a little if you could revert the changes in config-libraries.go as they are just formatting fixes that conflict with the entire restructuring in the other PR. I also structured the commits in the other PR so that they are mostly single files (the exceptions being config-libraries.go & main.go since they are logically consistent and the rebuild of the library which is mostly trivial to rebase). If I'd gotten the other PR done in time, we could've avoided this. Apologies. 😔

arnetheduck · 2025-07-14T06:56:01Z

revert the changes in config-libraries.go

9d2f200

rcalixte · 2025-07-14T21:22:41Z

revert the changes in config-libraries.go

9d2f200

Thank you... and I hate to be a nag but this PR will likely not be squashed and merged which means rebasing still means processing each commit from the PR... meaning this adds to the rebase effort. Rather than nagging like this, would it be easier if we just got the other PR sorted first? Is there any rush on this being merged?

arnetheduck · 2025-07-15T05:59:03Z

I don't mind either way, from the test merge I did locally of the two, there were only a few small adjustments regardless of the order. Conflicts happen, it's a fact of life in async development ;)

I have a few other PR:s lined up that could also maybe benefit miqt that depend on this one, so it's more a matter that it's holding those up as well and since yours is still in draft, I didn't want to wait indefinitely before posting this one.

rcalixte · 2025-07-15T06:04:50Z

I have a few other PR:s lined up that could also maybe benefit miqt that depend on this one, so it's more a matter that it's holding those up as well and since yours is still in draft, I didn't want to wait indefinitely before posting this one.

That's on me. I've been too excited which led to me being sloppy and the draft is to prevent an accidental merge of something that I know needs changes. I need to rebase with the fixes for the iterator types and then figure out something cleaner for Qt 5 and then there's the quirk of the private signals disappearing. But I think it's important enough to land since it can unblock or trivialize supporting Qt 6.8 (and more?) as a base.

mappu · 2025-09-04T05:51:32Z

I ran some genbindings benchmarks on this branch:

v0.11.2:

empty cachedir: 38m12.864s
warm cachedir: 0m10.542s

9d2f200:

empty cachedir: 21m21.183s 🥳
warm cachedir: 0m18.376s Bit slower but probably this is just the gunzipping?

I did some small tests and a single -ast-dump=json takes an average of 1.50s on my PC - there is a non-JSON version that takes half the time, that might be worth investigating if it produces enough data.

I think i'd like to merge this PR for now. The correctness issue with 'file' gets us onto much better footing. We can help resolve conflicts with the other branch,

rcalixte · 2025-09-04T06:29:32Z

We can help resolve conflicts with the other branch,

I'll take a shot at rebasing either this weekend or the next. I've had to touch up the implementation anyways but it's been stress tested quite a bit by now as well. The quirks that I remember are finding a way to fence off Qt 5 from the changes and the private signals disappearing on Qt 6.

@mappu Would you rather see the PR for 6.8 resolved first or upstreaming support for FunctionDecl and FieldDecl? I still need to read through the reimplementation from this PR but the patches for those Decls from downstream should hopefully still carry over.

mappu · 2025-09-04T06:37:23Z

I think FunctionDecl/FieldDecl are the more important changes, as they're more user facing. But either is fine (whatever makes your life easier!)

arnetheduck · 2025-09-04T06:39:16Z

probably this is just the gunzipping

Nah, there's just more data to parse, in general, because each cached file now contains more data - basically, the full translation unit (including #include) instead of just the symbols from the one header file. unzip usually is balanced out by the smaller read on disk but if you unzip all of them, you'll see that they've grown quite big.

After this PR it's possible to further refactor the code to process each header file out-of-order, ie all the globals in the generator can be turned to locals.

In another language, one would probably use a "cache"/work area per thread to save on the analysis, not sure how that's done in golang since you don't control the threads - probably a "cache pool" of some sort where each "job" checks out a cache and hands it back when it's done.

With that out of the way, the header processing itself can be paralelised, but more importantly, you can run it on any one single header file and get a deterministic result.

arnetheduck marked this pull request as draft May 25, 2025 08:24

mappu reviewed May 26, 2025

View reviewed changes

cmd/genbindings/clangast.go Outdated Show resolved Hide resolved

mappu mentioned this pull request May 26, 2025

Add bindings for KDE Frameworks #216

Open

arnetheduck force-pushed the miqt-ast-full branch from d564893 to 6de4e76 Compare May 27, 2025 18:58

arnetheduck commented May 28, 2025

View reviewed changes

update comments

9af1bfc

arnetheduck marked this pull request as ready for review June 3, 2025 09:37

mappu reviewed Jun 11, 2025

View reviewed changes

qt/gen_qstringlist.h Outdated Show resolved Hide resolved

arnetheduck added 2 commits June 13, 2025 19:30

ignore stringlist

f85157b

Merge remote-tracking branch 'upstream/master' into miqt-ast-full

9237b27

restore config-libraries

9d2f200

mappu changed the title ~~Improve clang AST parsing~~ genbindings: Streaming parse clang output to reduce memory, get correct filepaths Sep 4, 2025

mappu merged commit 78d69e1 into mappu:master Sep 4, 2025
10 checks passed

Conversation

arnetheduck commented May 25, 2025

Uh oh!

arnetheduck commented May 25, 2025

Uh oh!

Uh oh!

arnetheduck May 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

arnetheduck commented Jun 3, 2025

Uh oh!

Uh oh!

arnetheduck commented Jul 7, 2025

Uh oh!

mappu commented Jul 13, 2025

Uh oh!

rcalixte commented Jul 13, 2025

Uh oh!

arnetheduck commented Jul 13, 2025

Uh oh!

rcalixte commented Jul 14, 2025

Uh oh!

arnetheduck commented Jul 14, 2025

Uh oh!

rcalixte commented Jul 14, 2025

Uh oh!

arnetheduck commented Jul 15, 2025

Uh oh!

rcalixte commented Jul 15, 2025

Uh oh!

mappu commented Sep 4, 2025

Uh oh!

Uh oh!

rcalixte commented Sep 4, 2025

Uh oh!

mappu commented Sep 4, 2025

Uh oh!

arnetheduck commented Sep 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

arnetheduck May 28, 2025 •

edited

Loading