This is an NLP domain term extraction project for the Hindi language, written in Python. It needs a POS-tagged corpus from each of 3 different domains; we used an online POS tagger for Hindi to produce them.
Please keep the files below (all part of the zipped file) at some local path -
- Term_Extraction_Project.py - This is the module version of our project code.
- corpusPath.txt - contains the paths to the corpora of all 3 domains. Content of "corpusPath.txt" -
\Domains\Banking\Corpus
\Domains\Jyotish\Corpus
\Domains\Vamaniki\Corpus
Kindly replace these with the appropriate paths in corpusPath.txt before running the program.
- hindi_stop_words.txt - contains the Hindi stop words used by our program.
- Domains.txt - contains the domain names, in the same order as the domain corpus paths stored in "corpusPath.txt" above.
- Keep the "Domains" folder (you will get it after unzipping the main file) so that the folder structure matches the paths mentioned in "corpusPath.txt" -
\Domains\Banking\Corpus
\Domains\Jyotish\Corpus
\Domains\Vamaniki\Corpus
Here, the part of each path before \Domains is the local path where you are keeping all the files listed above.
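Before running the program, it can help to confirm that every path listed in corpusPath.txt points to an existing folder. The following is a minimal sketch and not part of Term_Extraction_Project.py: the helper names `read_nonempty_lines` and `missing_corpus_dirs` are hypothetical, and it assumes corpusPath.txt holds one corpus path per line.

```python
from pathlib import Path


def read_nonempty_lines(file_path):
    """Return the stripped, non-empty lines of a UTF-8 text file."""
    with open(file_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_corpus_dirs(corpus_path_file):
    """Return the listed corpus paths that are not existing directories.

    Assumption: the file contains one corpus path per line.
    """
    listed_paths = read_nonempty_lines(corpus_path_file)
    return [path for path in listed_paths if not Path(path).is_dir()]
```

For example, `missing_corpus_dirs("./corpusPath.txt")` returns an empty list when every listed folder exists, and otherwise returns the entries that still need to be corrected.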
Command to run the module - In the command prompt, go to the src folder where you have kept the files above and run the command below -
python Term_Extraction_Project.py ./corpusPath.txt ./hindi_stop_words.txt ./Domains.txt
Output Format - The output file will be generated at the same path. Output file name - Codeoutput.txt
Implementation Limitations -
- It requires a POS-tagged corpus for each of the 3 domains.
- The code can run for only 3 domains as of now; however, it can be generalized later to run for as many domains as are passed to the program.
- The sequence of domain names in "Domains.txt" must be the same as the sequence of domain corpus paths in "corpusPath.txt".
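The ordering requirement between "Domains.txt" and "corpusPath.txt" can be illustrated with a short sketch. This is not the project's actual code: the function name `pair_domains_with_paths` is hypothetical, and the domain names in the usage note are assumed from the folder names in the corpus paths. It pairs entries by position and works for any number of domains, which is one way the 3-domain limit could be lifted.

```python
def pair_domains_with_paths(domain_names, corpus_paths):
    """Pair each domain name with the corpus path at the same position.

    Raises ValueError when the two lists differ in length, since a
    positional pairing would then be ambiguous.
    """
    if len(domain_names) != len(corpus_paths):
        raise ValueError(
            f"{len(domain_names)} domain names but {len(corpus_paths)} corpus paths"
        )
    # zip pairs the i-th name with the i-th path, preserving file order
    return list(zip(domain_names, corpus_paths))
```

For example, calling it with `["Banking", "Jyotish"]` and two corpus paths yields two `(name, path)` tuples in the same order as the input files.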
Data Collection:
We collected data from Hindi Wikipedia and used an online POS tagger. The tool used for this was Selenium.