This repository contains an application for training an NLP model and using it to evaluate a document against predefined benchmarks.
The training module reads documents (PDF files) from an input directory of zip files and trains a GloVe model to create a Room for a specific industry or sector.
It is initialized from a pretrained 300-dimensional GloVe model (glove_6B_300d.txt). The pretrained model was collected from:
https://nlp.stanford.edu/projects/glove
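The pretrained GloVe files are plain text, with one word per line followed by its vector components separated by spaces. A minimal sketch of reading such a file into a dictionary is given here; the function name load_glove_vectors and the use of plain Python lists are assumptions for illustration, not the code used in this repository.

```python
import io


def load_glove_vectors(lines, dim=300):
    """Parse GloVe text lines into a {word: [float, ...]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny in-memory example with 3-dimensional vectors (hypothetical data).
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vectors = load_glove_vectors(sample, dim=3)

# With the real file this would look like:
# with open("glove_6B_300d.txt", encoding="utf-8") as handle:
#     vectors = load_glove_vectors(handle)
```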
The module also uses a benchmark list to keep the benchmarks in the dictionary:
file_bench = 'rooted_bench.csv'
The module has several initial configuration settings, all of which are set at the beginning of the module.
The trained model is saved periodically (after every 20 files processed) to:
trained_model='glove_dictionary_test.pkl'
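The periodic save could be sketched roughly as follows. The helper names (save_model, train_with_checkpoints, record_file) and the write-then-rename step are assumptions for illustration; only the 20-file interval and the pickle file name come from this README.

```python
import os
import pickle
import tempfile

SAVE_EVERY = 20  # checkpoint interval stated in this README


def save_model(model, path):
    """Write the model to a temporary file, then rename it over `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(model, handle)
    os.replace(tmp_path, path)  # a crash mid-write leaves the old file intact


def train_with_checkpoints(files, model, update_model, path, save_every=SAVE_EVERY):
    """Call update_model(model, name) for each file; return the number of saves."""
    saves = 0
    for count, name in enumerate(files, start=1):
        update_model(model, name)
        if count % save_every == 0:
            save_model(model, path)
            saves += 1
    save_model(model, path)  # final save after the last file
    return saves + 1


# Demo with a stand-in "model" that only records the files it has seen.
def record_file(model, name):
    model.setdefault("seen", []).append(name)


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "glove_dictionary_test.pkl")
demo_files = [f"doc_{i}.pdf" for i in range(45)]
demo_saves = train_with_checkpoints(demo_files, {}, record_file, demo_path)
with open(demo_path, "rb") as handle:
    demo_model = pickle.load(handle)
```

With 45 files the demo saves after file 20, after file 40, and once more at the end.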
Larger files are split because of memory limitations.
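How the splitting is done is not described here. One simple approach, shown purely as an assumption, is to cut the extracted text into chunks on word boundaries so that no chunk exceeds a size limit; split_text and max_chars are hypothetical names.

```python
def split_text(text, max_chars=100_000):
    """Split text on whitespace into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        # size counts each word plus one separating space
        if current and size + len(word) > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_chunks = split_text("aa bb cc dd", max_chars=5)
```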
If the process is stopped for any reason, rerun the module. The next run automatically processes only the input files that were not already processed during earlier training.
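A resume step of this kind could be sketched as below, assuming the set of already processed file names is stored somewhere between runs (for example next to the saved model). The function names and the demo file names are hypothetical.

```python
import os
import tempfile


def list_zip_files(input_dir):
    """Return the zip file names in input_dir, sorted for a stable order."""
    return sorted(n for n in os.listdir(input_dir) if n.lower().endswith(".zip"))


def pending_files(all_files, processed):
    """Keep only the files that an earlier run has not processed yet."""
    return [name for name in all_files if name not in processed]


# Demo: three zip files in the input directory, one already processed.
demo_input_dir = tempfile.mkdtemp()
for name in ("a.zip", "b.zip", "c.zip", "notes.txt"):
    open(os.path.join(demo_input_dir, name), "w").close()

already_done = {"a.zip"}
todo = pending_files(list_zip_files(demo_input_dir), already_done)
```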
After all input files have been read, a message reports that the process is complete. The final trained model will be:
glove_dictionary_test.pkl
The evaluation module uses a trained model as the Room, a benchmark list, and a directory of documents. Each input document is evaluated against the Room and the benchmark list, producing an output
file containing the level of similarity of the input document's words to the benchmarks.
The benchmarks were categorized and weighted with the help of SMEs. The categories are used to group the required similarity measures. The output is saved as a CSV file:
sim_to_bench_file='sim_to_bench.csv'
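The similarity measure itself is not specified in this README. As a rough sketch only, the example below assumes cosine similarity between word vectors and takes, for each benchmark, the highest score over the document's words; the function names, the column names, and the toy vectors are all hypothetical.

```python
import csv
import io
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_to_benchmarks(doc_words, benchmarks, vectors):
    """For each known benchmark, return its best score over the known document words."""
    rows = []
    for bench in benchmarks:
        if bench not in vectors:
            continue  # benchmark missing from the model's vocabulary
        scores = [cosine_similarity(vectors[w], vectors[bench])
                  for w in doc_words if w in vectors]
        rows.append((bench, max(scores) if scores else 0.0))
    return rows


# Toy 2-dimensional vectors standing in for the trained model.
toy_vectors = {"solar": [1.0, 0.0], "energy": [1.0, 0.0], "tax": [0.0, 1.0]}
rows = similarity_to_benchmarks(["solar", "unknown"], ["energy", "tax"], toy_vectors)

# Written to an in-memory buffer here; a real run would open 'sim_to_bench.csv'.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["benchmark", "similarity"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```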
The required functions are imported through:
from Carlo_ngrams_tool.utilities_recommendation import
which imports them from:
utilityies_recomndation.py
This file must be located where the main modules can access it. utilityies_recomndation.py in turn imports a library of text processing functions:
chunking_bforce_plus_space_add.py
It must also be accessible to the main modules. The library contains all of the text processing functions.
The directory structure is as follows:
├── working_directory
│ ├── Recomendation_system__training_v26.py
│ ├── Recomendation_system_NLP_v26.py
│ ├── glove_dictionary_test.pkl
│ ├── rooted_benchmarks_11_07_2022.csv
│ ├── bench-to-crawled-duck_unique.csv
│ ├── Carlo_ngrams_tool
│ │ ├── utilityies_recomndation.py
│ │ ├── chunking_bforce_plus_space_add.py