Conversation

@sundy1994 sundy1994 commented Jun 17, 2025

Added tests for all columns in the feature dict. Closes #220, Closes #359

@sundy1994 sundy1994 requested a review from xehu June 17, 2025 23:24

@xehu xehu left a comment

Looks really good! Just a few small comments/questions.

"""
emoji_pattern = r'[:;]-?\)+'
emojis = re.findall(emoji_pattern, text)
# emoji_pattern = r'[:;]-?\)+'

Note to remove the commented-out pattern before putting this in.
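For reference, a quick hypothetical illustration of what the emoticon pattern matches (the sample text and variable names below are made up, not taken from the PR):

import re

# The pattern matches ":" or ";", an optional "-", then one or more ")".
emoji_pattern = r'[:;]-?\)+'
sample_text = "great job :) thanks ;-) see you soon :))"
print(re.findall(emoji_pattern, sample_text))  # [':)', ';-)', ':))']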

xehu commented Jun 25, 2025

Note: it looks like this branch crashes locally.

This can be reproduced by installing the testing_100 version of the code, navigating to the tests directory, and manually running run_tests.py. We get a failure in a mimicry feature.

Here's what the error looks like:

Successfully installed team_comm_tools-0.1.8
(tpm_virtualenv) xehu@WHA-ODD44VVQ-ML team_comm_tools % cd tests
(tpm_virtualenv) xehu@WHA-ODD44VVQ-ML tests % python3 run_tests.py
[nltk_data] Downloading package wordnet to /Users/xehu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Initializing Featurization...
Confirmed that data has conversation_id_col column: conversation_num!
Confirmed that data has speaker_id_col column: speaker_nickname!
Confirmed that data has message_col column: message!
Generating SBERT sentence vectors...
100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.21it/s]
Generating RoBERTa sentiments...
100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.87it/s]
ERROR: The length of the sentiment data does not match the length of the chat data. Regenerating...
Generating RoBERTa sentiments...
100%|█████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.18it/s]
Chat Level Features ...
  0%|                                                                        | 0/16 [00:00<?, ?it/s]WARNING: Failed to generate lexicons due to unexpected error: object of type 'float' has no len()
 31%|███████████████████▋                                           | 5/16 [00:00<00:00, 215.07it/s]
Traceback (most recent call last):
  File "/Users/xehu/Desktop/Team Process Mapping/team_comm_tools/tests/run_tests.py", line 57, in <module>
    test_positivity.featurize()
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 479, in featurize
    self.chat_level_features()
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/feature_builder.py", line 632, in chat_level_features
    self.chat_data = chat_feature_builder.calculate_chat_level_features(self.feature_methods_chat)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_chat_level_features.py", line 102, in calculate_chat_level_features
    method(self)
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/utils/calculate_chat_level_features.py", line 326, in calculate_word_mimicry
    self.chat_data["content_word_accommodation_per_conv"] = Content_mimicry_score_per_conv(
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/features/word_mimicry.py", line 165, in Content_mimicry_score_per_conv
    ContWordFreq = compute_frequency_per_conv(df_conv, column_count_frequency)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/tpm_virtualenv/lib/python3.11/site-packages/team_comm_tools/features/word_mimicry.py", line 106, in compute_frequency_per_conv
    return (dict(pd.Series(np.concatenate(df_temp[on_column])).value_counts()))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: need at least one array to concatenate

xehu commented Jun 25, 2025

It seems like the source of the crash is that we're getting a nan in our conversation ID:

Currently processing mimicry for conversation I
  conversation_num  speaker_nickname  ... function_word_accommodation content_word_accommodation
0                I               1.0  ...                           0                          0
1                I               2.0  ...                           0                          0
2                I               3.0  ...                           0                          0
3                I               1.0  ...                           0                          0
4                I               3.0  ...                           0                          0

[5 rows x 25 columns]
content_words
Currently processing mimicry for conversation J
  conversation_num  speaker_nickname  ... function_word_accommodation content_word_accommodation
0                J               1.0  ...                           0                          0
1                J               2.0  ...                           0                          0
2                J               3.0  ...                           0                          0
3                J               1.0  ...                           0                          0
4                J               3.0  ...                           0                          0

[5 rows x 25 columns]
content_words
Currently processing mimicry for conversation nan
Empty DataFrame

The original input dataframe only has two conversation IDs -- I and J. For some reason, we're getting three: I, J, and nan.
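For context, a small hypothetical pandas sketch (not the library's actual loop) of why a stray nan ID produces the "need at least one array to concatenate" error: nan != nan under boolean indexing, so the nan group selects zero rows and np.concatenate has nothing to combine.

import numpy as np
import pandas as pd

# Illustrative data only: two real conversations plus one stray nan ID.
chat_data = pd.DataFrame({
    "conversation_num": ["I", "I", "J", np.nan],
    "content_words": [["plan", "team"], ["plan"], ["go"], ["x"]],
})

for conv_id in chat_data["conversation_num"].unique():
    df_conv = chat_data[chat_data["conversation_num"] == conv_id]
    print(conv_id, len(df_conv))  # "I" -> 2 rows, "J" -> 1 row, nan -> 0 rows
    if len(df_conv) == 0:
        # np.concatenate(df_conv["content_words"]) would raise
        # ValueError: need at least one array to concatenate
        continue
    word_counts = pd.Series(np.concatenate(df_conv["content_words"])).value_counts()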

@sundy1994 (Collaborator, Author) commented:

This issue was because I tried to optimize RAM usage: instead of concatenating data frames inside the for loop, I wrote each intermediate df to the output file in append mode. As a result, if the output file already exists and we need to overwrite it, it isn't deleted but just keeps getting longer.

The commit above resolves this: we now append the intermediate dfs to a list and concatenate them only once at the end. This still saves RAM while making minimal changes.
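For illustration, a minimal sketch of that pattern (get_sentiment, the batches, and the file name below are stand-ins, not the actual team_comm_tools code):

import pandas as pd

# Stand-in for the real per-batch sentiment call.
def get_sentiment(batch):
    return pd.DataFrame({"text": batch, "positive": [0.5] * len(batch)})

batches = [["hi team", "sounds good"], ["let's start planning"]]

# Collect intermediate dfs in a list and concatenate exactly once at the end,
# then write the result with a single overwrite rather than appending.
batch_dfs = [get_sentiment(batch) for batch in batches]
sentiments_df = pd.concat(batch_dfs, ignore_index=True)
sentiments_df.to_csv("sentiments.csv", index=False)

Compared with concatenating inside the loop, this avoids repeatedly copying a growing DataFrame, and the single to_csv overwrite means a stale output file from an earlier run can't linger and keep growing.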

batch_df = get_sentiment(batch)
batch_sentiments_df = pd.concat([batch_sentiments_df, batch_df], ignore_index=True)
batch_df = get_sentiment(batch, model_bert, device)
batch_df.to_csv(output_path, mode='a', header=first, index=False)

Note to self - we're now appending here; Emily to run vector tests locally and confirm that everything passes

@xehu xehu merged commit 1df76ce into dev Jun 26, 2025
1 check passed
@xehu xehu deleted the testing_100 branch June 26, 2025 22:08

Successfully merging this pull request may close these issues.

Make sentiment BERT feature run faster
FB Fails if Indices are Missing/Out of Order
