The project is an implementation of the paper [1] on 'Minimally Supervised Induction of Grammatical Gender' for the evaluation of English-Hindi pair.
The objective was to determine the grammatical gender of Hindi nouns, having known a few true-gender instances. We use morphological features, and bootstrapping algorithms over the true-gender instances.
- IIT-B parallel-corpus [2], available for download here.
- UDPipe [3] for analysis and processing of the parallel-corpus mentioned above.
- Google's pygtrie Python library [4].
- UD 2.x Treebanks for Hindi Language [5].
The pipeline of the entire project, as reproducible through the makefile with certain changes, is explained as follows. The section headings give an overview of makefile targets that achieve the mentioned process.
-
After the parallel corpus is downloaded and then extracted with their names intact, they are processed with UDPipe, using version 2.0 models. While using UDPipe, be careful to use the
--immediateand--tokenizer=presegmentedas arguments, so as to process the files on Hindi side keeping the one-line concurrency between the two files. -
Having stored the processed files in
hi.conllufile, we extract all the nouns from each of them using the extract_dict.py's-pos NOUN --countswitch, which also giveshi.most_commonfile along with not-neededhi.NOUNfile.
-
We now get a smaller corpus. In this case, using the UD Treebanks [5], and extract all nouns from the pre-processed conllu files again using extract_dict.py's
-pos NOUN --countswitch, their usages in.outputfiles and again get rid of*.NOUNfiles obtained. -
We combine different
.most_commonfiles using filter_data.py's--combineswitch, making sure the final filesample.most_commonhas the counters combined from different.most_commonfiles. -
Next, we generate the values as in Table 1 of the paper for the true gender nouns in Hindi language, using process_sample.py's
translationandtranslation_checkarguments, thus generating the resulting file called default_seeds_table.tsv. The format of this file is explained here.
Having generated all the needed data, we start with contextual bootstrapping.
-
First, we generate a file containing the list of seeds, from true_gender_eng_to_hin.tsv file, using filter_data.py's
-dsswitch, gettingseeds.LISTfile. -
We generate a list of words that are to be removed. This list (remove_list) is manually compiled and removes the following cases:
- words which are in non-devanagri script
- words which can be used for both male/female translations.
- numbers
- words which are transcribed version of English words, and don't undergo the morphology of the hindi language.
-
Next, we remove all the tokens from
sample.most_commonfile which also appear in remove_list file orseeds.LISTfile using filter_data.py's-fl -lswitches. The resulting file is calledcontext_bootstrap. -
For context bootstraping, we get the usage information from the different
.outputfiles generated from the UD treebanks, and consider only the tokens with count >= 4. -
Once a token shows high enough confidence value with enough information about the relative distribution of the gender distribution, we update the
seeds.LISTfile with the token and the discovered gender. -
Next, we filter
context_bootstrapfile again with the updatedseeds.LISTfile as in step 3 above. We call this resulting filemorphoand serves as input for the next step of the pipeline.
-
We use trie model using the Google's pygtrie python library [4] and process the tokens in
morphfile to get the values with confirmed values inmorpho.outputusing filter_data.py's--morphology --conlluswitches. The tokens that could not be disambiguated based on morphology, are stored inmorpho.not_analysedfile. The--conlluswitch is used to input certain values with special parameters in the trie. More about this can be read in-file documentation of filter_data.py file. -
The values in
morpho.outputare concatenated toseeds.LISTfile again, to update the latter. -
We next go back to
hi.outputfile we had from the first part of the pipeline, and filter the tokens appearing inseeds.LIST,morpho_not_analysedor remove_list file, calling the resulting filehi.morph. -
We again process the resulting file without using context bootstraping, and directly use it for morphological analysis, using filter_data.py's
--morphologyswitch, and WITHOUT using--conlluswitch, treating all the values equally. -
The confirmed values are again stored with their output gender tags in
hi.morph.outputfile, while the ones not-analyzed are stored inhi.morph.not_analysedfile. -
Finally, the true_gender values in
hi.morphfile are found using--get_accuracy --conlluswitch and the accuracy is computed against the predictions found inhi.morph.outputfile with the bootstrapped values. The results are presented here.
This section includes the files contained, with the switches, if available with their intended purpose.
- extract_dict.py
This file is used to mainly extract all tokens with a POS tag, and display them in a file, with the number of occurences intoken : countformat. The file uses following arguments:
-i , --input : Required argument for the input file (Expected format- Files with data in 10-column format of CONLL-U files).
-pos, --pos_tag : Optional argument to extract the tokens with the argument value as the POS tag.
--count : Optional but recommended argument, outputs the counters in a file. Without this, only the list of the tokens is generated. - filter_data.py
This is the main file responsible for context bootstraping as well as morphological analysis of both, the smaller and the main corpus. The file uses following arguments, all optional. Running the file without any of the arguments does nothing, while completing execution.
-i , --input : Input file.
--conllu : Files with data in 10-column format of CONLL-U files. Multiple values readable.
-fl, --filter_by_list :
-l , --list_file : The list file needed as a compulsory argument for -fl above.
--combine : Combines files with format as "token : count" values per file, to result in a single file with added counters from different files. Multiple values readable.
-ds, --get_default_seeds : Gets a list of seeds in a seperate file from the input file, for the --bootstrap and --morphology switches to work. Requires --input.
--bootstrap : Runs the contextual bootstraping algorithm on input file, calculating from the provided CONLL-U formatted files. Requires --input and --conllu.
--morphology : Runs trie-based morphological analysis of the tokens in the input file. Requires --input. Different functionalities depending on if --conllu is or is not provided.
--get_accuracy : Prints the accuracy of the finished pipeline, using input file as predicted values and provided CONLL-U format files as truth values generated from bootstrapping. Requires --input and --conllu.-
makefile
This file contains the pipeline as described above arranged in neat commands for LINUX based systems. Before using the makefile, please make sure to edit the variableUDPIPE_Modelsto point to the UDpipe models directory. -
process_sample.py
This file is used to generate a table as in Table 1 in the source paper. The final output result is stored in here. The file considers the first argument as the switch. The following are the possible switches with their function:
translation: reads input from true_gender file, and checks the gender against trailing CONLLU files for their gender.
translation_check: compares two files, to give the output based on how correct are the values in generated tsv to truth values.-
remove_list
This file contains the list of tokens that were removed manually, while processing the sample_data, as was explained in thebootstrapsegment of the pipeline. -
true_gender_eng_to_hin.tsv
This file contains the natural gender values with their english translations, and the true gender, marked asFfor female,Mfor male. The columns in the file are eng_token, hin_token, true_gender_value, arranged in tabular seperated format. More details about this file are at the end of the section. -
true_gender_hin.list
This file contains the values of the second column in the true_gender_eng_to_hin.tsv file, to get the values in a nice list format in a non-alphabetical order.
Some notes for true_gender_eng_to_hin.tsv file:
- Generated from https://www.collinsdictionary.com/us/dictionary/english-hindi
- In case the word was found to be not common, not included.
- If a translation was not present, added after checking other resources.
- Overlaps between the translations, as clear cut not defined.
- A lot of translations also serve as proper nouns, so might have been removed then, or errorenously tagged that way.
- Instead of using
gentlemanandladyas in paper,sirandmadamwas used instead.
A table of statistics similar to Table 1 in the source paper can be found in this file. The following is the structure of the file:
Token_eng Token_hin Frequency Male Female Quest KeyToken_eng: An English translation of the natural-gender word.
Token_hin: The token with the english translation in previous field.
Frequency: Total number of times the token appeared.
Male: Number of times the token was used in Masculine context.
Female: Number of times the token was used in Feminine context.
Quest: Number of times the token was used without specific gender based context.
Key: Possible values are as follows:
+ One or more translations, with all gender correct
- One or more translations, with all gender incorrect
? One or more translations, some correct, some incorrect
+~ One or more translations, most correct, some unresolved
-~ One or more translations, most incorrect, some unresolvedExtracted Tokens: 8829
Tokens removed: 931
Effective number of tokens: 7954
Natural-gender seeds: 106
Used seeds: 59
Coverage: Used Seeds * 100 / Effective number of tokens = 5900/7954 = 0.7417 %
Tokens for context bootstraping (T): 7895
Tokens succesfully bootstraped (B): 2887
Coverage: B*100 / T = 36.5674 %
Tokens before morphoanalysis (TM): 5008
Tokens successfully analyzed (BM): 4793
Success Rate: BM*100 / TM = 95.7068 %
Extracted Tokens: 159 447
Removed Tokens: 7 668
Tokens to process: 151 779
Tokens processed: 14 472
Total true-gender words (TT): 12 904
Found true-gender words (FF): 9 541
Recall (computed): FF * 100 / TT = 73.93831 %
[1] Cucerzan, S., & Yarowsky, D. (2003). Minimally supervised induction of grammatical gender. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1 (pp. 40–47). Association for Computational Linguistics.
[2] Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya. The IIT Bombay English-Hindi Parallel Corpus. 2017
[3] UDPipe project, maintained by UFAL, Charles University, Prague.
[4] Google's pygtrie library, implementing trie-structures. Available on GitHub here.
[5] Treebanks for UDPipe Version 2.x available here. Date of last access: August 25, 2018.