-
Notifications
You must be signed in to change notification settings - Fork 334
[ENH] Add TextFeatures transformer for text feature extraction #880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
ankitlade12
commented
Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ankitlade12
Thanks a lot!
This transformer, function-wise, I'd say it's ready. I made a few suggestions regarding how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs file and we'll be good to go :)
Thanks again!
|
We need to rebase main so the 2 remaining tests pass. |
solegalli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ankitlade12
I am very sorry for the delayed review. I am travelling till end of April, so I am a bit slower than usual.
I think, for the first version of the transformer, let's enforce the user to pass the names of the text variables. They can pass one or more variables in case there are more than one text column.
Other than that, we need to add the tranformer in the docs/index file, in the readme, and in the docs/api, and adjust the tests and the demo to the newer functionality. Then it is good to merge.
Thank you very much for this great addition.
ba25768 to
ba69e23
Compare
|
Hey @solegalli, I tracked down the cause of the CI failures. They are caused by Pandas 2.2/3.0 breaking changes in the CI environment (specifically datetime string formatting and select_dtypes behavior). Because these are breaking the entire library (280 failures), they need to be fixed in the main branch first. My |
ba69e23 to
41f9528
Compare
|
Hi @ankitlade12 Something went wrong here. The arcsintransformer files, for some reason are in this PR. We need to remove all commits starting from Could you please take a look? Thanks a lot! |
|
I made the tests passing optional, because there is a lot to fix on the pandas 3 side. If you want to go ahead and remove unrelated files, we can merge this transformer while we work on the maintenance fixes. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #880 +/- ##
==========================================
- Coverage 98.23% 98.10% -0.14%
==========================================
Files 114 116 +2
Lines 4886 4963 +77
Branches 772 788 +16
==========================================
+ Hits 4800 4869 +69
- Misses 55 61 +6
- Partials 31 33 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
solegalli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ankitlade12
This PR has a bunch of unrelated files. The files corresponding to the text transformer are gone. I suggest resetting hard to the last commit where the files corresponding to the transformer are present, and addressing the comments again from there.
- Add TextFeatures class to extract 19 text features from string columns - Features include char_count, word_count, sentence_count, digit_count, etc. - Add comprehensive test suite (16 tests)
a7aef4d to
88d15bc
Compare
|
Hi @solegalli, Thanks for the feedback! I've cleaned up the PR: Removed unrelated files:
Restored TextFeatures transformer:
The branch now only contains the TextFeatures implementation with 16 passing tests. I rebased on latest main to resolve the merge conflict. Ready for re-review! |
Use pd.api.types.is_string_dtype() and is_object_dtype() instead of checking dtype == 'object' directly, which fails with StringDtype.
|
Hey @ankitlade12 Thank you so much for the quick turnaround. Unfortunately, now all the user guide and api files are lost. We need to put those back in :). Together with the readme, doc/index, etc. |