Expand 4cat to other languages (Finnish) #470

anitabraida · 2024-12-12T07:46:42Z

These are the changes proposed in this pull request:

Wordlists added:

Finnish hatespeech wordlist (no offensiveness scores available)
Finnish wordlist with common vocabulary

Processors changed:

lexical_filter. Added the Finnish hate speech wordlist and an option for the user to select the language. If the selected language is highly inflected (in this case, Finnish), the user can choose between exact word match (already available) and subword match (added in this pull request).
wikipedia_network. Added the option to find Wikipedia links in a Finnish language dataset.
neologisms. Added the Finnish wordlist with common vocabulary. The user now has the option to choose either English (default) or Finnish as the language. Pipeline changed accordingly.
tokenise. Added Finnish wordlist with common vocabulary for the filter. Added a lemmatization option for Finnish (uses SpaCy).
similar_words (presets). Included additional information.

In the setup, SpaCy is included again to allow for lemmatization on Finnish text.
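To illustrate the difference between the two lexical_filter match types described above, here is a minimal sketch using only Python's `re` module. The helper `build_pattern`, the sample word, and the sample sentence are assumptions made for illustration; they are not taken from the processor code or from the wordlists in this pull request.

```python
import re

def build_pattern(words, match_subwords=False):
    """Compile one case-insensitive pattern for a list of plain words (hypothetical helper)."""
    # Longest words first so that longer alternatives are tried before shorter ones.
    escaped = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    if match_subwords:
        # Subword match: the word may occur anywhere inside a longer token,
        # which catches inflected forms and compounds.
        return re.compile(f"(?:{escaped})", re.IGNORECASE)
    # Exact match: the word must stand alone between word boundaries.
    return re.compile(rf"\b(?:{escaped})\b", re.IGNORECASE)

words = ["talo"]            # Finnish for "house"; illustrative only
text = "Hän asuu talossa."  # "talossa" is an inflected form ("in the house")

exact_hit = build_pattern(words).search(text) is not None
subword_hit = build_pattern(words, match_subwords=True).search(text) is not None
```

With exact matching the inflected form is missed, while subword matching finds it. The trade-off is that subword matching can also fire on unrelated words that happen to contain the same character sequence.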

matnel · 2025-03-01T12:10:31Z

@ErikBorra @stijn-uva Would be lovely if you could check and merge this soonish! :) Please let me know if there are any issues with the code, happy to patch this PR.

dale-wahl · 2025-03-03T12:35:22Z

@sal-uva you removed spacy last year. I cannot recall or find what the specific issue was and only vaguely remember some dependency problems. Do you recall?

sal-uva · 2025-03-03T14:19:09Z

> @sal-uva you removed spacy last year. I cannot recall or find what the specific issue was and only vaguely remember some dependency problems. Do you recall?

There were some dependency problems (iirc it didn't like our version of NumPy), and because of its resource-heavy nature it was only suited to small-scale datasets for (among others) entity recognition, which can now be done much better with LLMs.

But if it's handy for Finnish and it doesn't cause any dependency errors for you, all good to add back in!

matnel · 2025-03-04T09:31:09Z

I think we did not encounter issues, but happy to test more if that helps - any hints on what to test out?

dale-wahl

I think the lexical_filter needs a change to be clearer to users, but otherwise this looks good.

dale-wahl · 2025-04-03T15:35:23Z

processors/filtering/lexical_filter.py

            "help": "Custom word list (separate with commas)"
        },
+        #the language determines whether the user can choose the "Match inflections and compounds" option under "match-type"
+        "language": {


I do not like the language option, as it does not do anything at the moment yet implies to the user that it does.

I see that the Finnish list is already regex, unlike the English lists. I understand that is why you added the match-subword option, as the regex is designed to handle those particular inflections. The problem is that we allow users to select multiple lexicons as well as add an optional custom list, and right now the options all combine to run on all lexicons. This needs a rework. I see now that as_regex would run on the English lists as well, though it should not.

Perhaps language could allow two separate option lists to be shown (Finnish/English), and it could be made clearer which options are relevant to the custom word list versus the pre-made ones. Loading the lexicons could then be made clearer.
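One way to read this suggestion is that each lexicon carries its own matching mode, so regex handling is never applied to a plain word list. The sketch below is only an illustration under that assumption: the `Lexicon` class, the `entries_are_regex` flag, and the sample entries are hypothetical and do not reflect how lexical_filter currently loads its lexicons.

```python
import re
from dataclasses import dataclass

@dataclass
class Lexicon:
    name: str
    entries: list
    entries_are_regex: bool = False  # True only for lists that already ship as regex

def compile_lexicon(lexicon):
    """Compile one pattern per lexicon, honouring that lexicon's own mode."""
    if lexicon.entries_are_regex:
        # Entries are trusted to be regex fragments and are used as-is.
        body = "|".join(f"(?:{entry})" for entry in lexicon.entries)
    else:
        # Plain words are escaped and matched as whole words only.
        body = r"\b(?:" + "|".join(re.escape(e) for e in lexicon.entries) + r")\b"
    return re.compile(body, re.IGNORECASE)

def matching_lexicons(text, lexicons):
    """Return the names of all lexicons with at least one hit in the text."""
    return [lex.name for lex in lexicons if compile_lexicon(lex).search(text)]

lexicons = [
    Lexicon("english-plain", ["house", "a.b"]),
    Lexicon("finnish-regex", [r"talo\w*"], entries_are_regex=True),
]
```

Because the plain list is escaped, an entry such as `a.b` only matches the literal string and not `acb`, while the regex list keeps its inflection-aware patterns. A custom word list could be treated as one more `Lexicon` with its own flag.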

dale-wahl · 2025-04-03T15:52:21Z

processors/text-analysis/tokenise.py

 						   "'bicycles' becomes 'bicycle', 'better' becomes 'good'.",
 				"requires": "language==english"
 			},
+			#it seems as if requires only allows one option, so I added another option exclusive to finnish


Not exactly true. You could use "^=lemm_" and start the eng/finn keys with lemm_ for example. But this is a poorly documented feature.
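As a rough illustration of the idea, the toy function below evaluates a `requires` string that uses either `==` (equals) or `^=` (starts with). It is a sketch for discussion only and is not 4CAT's implementation: the function name `requirement_met`, the settings dictionary, and the `lemm_english`/`lemm_finnish` keys are all assumptions about how the prefixed keys might look.

```python
def requirement_met(requires, settings):
    """Toy check of 'option==value' or 'option^=prefix' against current settings."""
    for operator in ("^=", "=="):
        if operator in requires:
            option, expected = requires.split(operator, 1)
            current = str(settings.get(option, ""))
            if operator == "^=":
                # Prefix match: one rule can cover several option keys.
                return current.startswith(expected)
            # Equality match: only a single option key satisfies the rule.
            return current == expected
    raise ValueError(f"Unsupported requires expression: {requires!r}")

# Hypothetical language keys sharing the "lemm_" prefix.
show_for_finnish = requirement_met("language^=lemm_", {"language": "lemm_finnish"})
show_for_english = requirement_met("language^=lemm_", {"language": "lemm_english"})
show_for_other = requirement_met("language^=lemm_", {"language": "other"})
```

Under this reading, a single lemmatization option with a prefix rule would be shown for both languages, so a second option exclusive to Finnish would not be needed.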

anitabraida and others added 30 commits September 13, 2024 12:13

#5 added processors for wikipedia networks Finnish only

b3e43de

#7 added link to the finnish version of spacy

d73c13a

#7 added processor for linguistic extraction for finnish text

defbab0

#7 added processor get_entities for finnish text

d43f790

#7 added processor get_nouns for finnish text

0330b1c

#6 added neologisms in finnish processor (added wordlist for issue #2 too)

762b1d2

#7 updated tokenisation processors to allow finnish wordlist in filtering

8afc6d7

Update README.md

34517f9

Update README.md

eef33bc

Update README.md

ac6c836

#8 language identification method added

1336fd9

Merge branch 'master' of github.com:uh-dcm/4cat_fi

7864294

#8 added option for user to modify the dataset language

0ed43e1

#10 changed compatibility for linguistic extractor

319ef30

#10 changed language compatibility

92e95ef

#10 changed compatibilites for presets

c7028f7

#10 changed compatibilites for network processors

ea6066c

#2 hatespeech wordlist regex

2325c44

#1 lexical filter for finnish and #10 changed compatibility

3dfede3

#11 reference added

56282cc

added "punkt_tab" to avoid errors

01c106a

#8 Errors with previous wrapper. Using fasttext (light) model directly instead.

480513a

#8 children datasets inherit language from parent dataset

f521499

#1 added lexical filter with lemmatization

2a22d5b

#10 changed compatibilities for hatebase

c7091b4

#6 added language

83b37a1

#7 added lemmatization for finnish text and fixed compatibilities

f7a8e8a

#8 changed to more reasonable number

9f149d2

#8

27e31fd

Merge remote-tracking branch 'upstream/master'

f4716b4

anitabraida and others added 19 commits October 18, 2024 15:29

SpaCy added again

801c112

Updated to requirements: removed language detection, combined processors. Missing: filter

e691208

#1 combined filter according to requirements and updated hatespeech wordlist

afe9804

Merge remote-tracking branch 'upstream/master'

c8c2e04

Fixed some compatibility issues

2b3f832

fixed descriptions

6d8267a

Merge remote-tracking branch 'upstream/master'

0ff8c95

fixed tokenization error

74edaba

Fixed tokenise issues, modified lexical_filter

dd1a9ea

Merge remote-tracking branch 'upstream/master'

0cb371c

Merge branch 'digitalmethodsinitiative:master' into master

eac9e38

Merge branch 'digitalmethodsinitiative:master' into master

e9e9313

Merge branch 'digitalmethodsinitiative:master' into master

d68f28b

Update README.md

6cff740

Merge branch 'digitalmethodsinitiative:master' into master

87af9d6

Update README.md

6bff0ed

Update README.md

33adcc6

Remove language compatibility from top_hatebase.py

73fced0

Merge branch 'digitalmethodsinitiative:master' into master

7889935

dale-wahl reviewed Apr 8, 2025

4 participants
