GitHub - haibonlp/basilisk: BASILISK (Bootstrapping Approach to SemantIc Lexicon Induction using Semantic Knowledge)

Repository files:
sample-data
BASILISK2.0.jar
LICENSE.txt
README
extractionFormatConvertor.py
stopwords.dat

How to Use Basilisk:
====================

To extract semantic lexicons with Basilisk, first manually generate seeds for each semantic class, then prepare pattern extractions, and finally run Basilisk.

(1) Select Seeds:

It is best to select at least 10 seeds for each semantic class. A simple way to generate seeds is to sort the words in the whole corpus by frequency and manually identify the 10 most frequent nouns that belong to each semantic category.

Seeds that belong to the same semantic class are stored in one file, one seed word per line. All seed files are placed in the same directory.
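
The seed selection step can be supported by a small script. The following is a minimal sketch, not part of the Basilisk distribution: it ranks corpus words by frequency so that a person can pick the nouns for each category, and it writes the chosen seeds one per line. The function names, the simple letters-only tokenizer, and the ".seeds" file extension are assumptions made for illustration.

```python
import re
from collections import Counter
from pathlib import Path


def top_candidates(text, k=50, stopwords=frozenset()):
    """Return the k most frequent lowercase word tokens, skipping stopwords.

    The result still has to be reviewed by hand: this sketch does no
    part-of-speech tagging, so it cannot tell nouns from other words.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(k)


def write_seed_file(directory, category, seeds):
    """Write one seed word per line to <directory>/<category>.seeds.

    The ".seeds" extension is a hypothetical naming choice.
    """
    path = Path(directory) / f"{category}.seeds"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(seeds) + "\n", encoding="utf-8")
    return path
```

A typical use would be to call top_candidates on the raw corpus text, read through the ranked list, and pass the manually selected nouns for each category to write_seed_file with the same target directory.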

(2) Prepare Pattern Extractions:
Any pattern extractions can be used with Basilisk, but each line of the extraction file must have this form:
'extractedNoun * extractionPattern '.
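
Before running Basilisk it can help to check that an extraction file follows this line format. The sketch below is an illustration only. It assumes that the noun and the pattern are separated by an asterisk with a single space on each side, and it ignores trailing whitespace; whether Basilisk itself requires the trailing space shown in the example above is not stated here. The function names are hypothetical.

```python
def parse_extraction(line):
    """Split one extraction line into (noun, pattern).

    Assumes the separator is " * " (space, asterisk, space).
    Raises ValueError if the line does not have that shape.
    """
    noun, sep, pattern = line.strip().partition(" * ")
    if not sep or not noun or not pattern:
        raise ValueError(f"malformed extraction line: {line!r}")
    return noun, pattern


def find_malformed(lines):
    """Return the 1-based numbers of non-blank lines that fail to parse."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped rather than reported
        try:
            parse_extraction(line)
        except ValueError:
            bad.append(number)
    return bad
```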

Here is an example of how to generate pattern extractions using the Stanford Dependency Parser.
Raw data file: test.txt
Use the Stanford Dependency Parser to generate the dependency file: test.parse
Then use extractionFormatConvertor.py to convert it to an extraction file that can be used with Basilisk.
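
To make the conversion idea concrete, here is a rough sketch of turning one typed-dependency line into one extraction line. It is not the code of extractionFormatConvertor.py and may differ from it in every detail. It assumes the parser output uses the common "relation(governor-index, dependent-index)" notation, keeps only an illustrative set of relations, and invents a "relation_governor" naming scheme for the pattern. It also does not check that the dependent is actually a noun.

```python
import re

# Matches lines such as "nsubj(exploded-3, bomb-2)".
DEP_RE = re.compile(r"^([\w:]+)\((.+?)-(\d+)'*, (.+?)-(\d+)'*\)$")

# Hypothetical choice of relations worth turning into patterns.
KEPT_RELATIONS = {"nsubj", "dobj"}


def dep_to_extraction(dep_line):
    """Convert one dependency line to 'noun * pattern', or return None.

    The pattern name (relation + "_" + governor word) is an assumption
    made for this sketch, not a format required by Basilisk.
    """
    match = DEP_RE.match(dep_line.strip())
    if match is None:
        return None
    relation, governor, _, dependent, _ = match.groups()
    if relation not in KEPT_RELATIONS:
        return None
    return f"{dependent.lower()} * {relation}_{governor.lower()}"
```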

(3) Run Basilisk

Command line:

java -jar BASILISK2.0.jar seed_slists extractions_file stopwords_file [(options) (flags)]

seed_slists:
An 'slist' file must have a directory path on the first
line, followed by the names of individual files found
in that directory. The 'seed_slists' file should list
the files containing the seed words for each semantic
category to be learned. When running in single category
mode, the slist file should only list a single seed
file. When running in multiple category mode, the slist
file should list two or more seed files.
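
An slist file as described above can be written by hand or generated. The following sketch writes the directory path on the first line and one seed file name on each following line. The function name and the example file names are hypothetical, and whether the directory path needs a trailing separator is not specified here, so the path is written exactly as given.

```python
from pathlib import Path


def write_slist(slist_path, seed_dir, seed_files):
    """Write an slist file: directory path first, then one file name per line.

    seed_files should hold one name for single category mode and two or
    more names for multiple category mode.
    """
    if not seed_files:
        raise ValueError("at least one seed file is required")
    lines = [str(seed_dir)] + list(seed_files)
    Path(slist_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, write_slist("seeds.slist", "seeds/", ["weapon.seeds", "location.seeds"]) would describe a two-category run, assuming those seed files exist in the seeds/ directory.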

Options:
-n [num_iterations]
    Number of iterations to run basilisk for.
    Default value is 5.

-c [0 or 1]
    0: simple conflict resolution; 1: improved conflict resolution
    Default is improved conflict resolution

-o [directory]
    Output directory.
    Default is the root directory.

-s
    Runs Basilisk in "Snowball" mode. That is, Basilisk will only select
    the top scorer from each category.
    Default is to not use this feature.

-t
    Also writes a trace file outlining the words and their scoring during each
    iteration of basilisk.
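
When Basilisk is launched from a script, the command line can be assembled from the arguments and switches listed above. This is a minimal sketch that only builds the argument list; the function name and its keyword parameters are hypothetical, and it uses only the switches described in this README, with valued options placed before flags.

```python
def basilisk_command(jar, seed_slists, extractions_file, stopwords_file,
                     iterations=None, conflict=None, output_dir=None,
                     snowball=False, trace=False):
    """Build the argument list for a Basilisk run.

    Options left as None are omitted so that Basilisk uses its defaults.
    """
    cmd = ["java", "-jar", jar, seed_slists, extractions_file, stopwords_file]
    if iterations is not None:
        cmd += ["-n", str(iterations)]
    if conflict is not None:
        if conflict not in (0, 1):
            raise ValueError("conflict must be 0 (simple) or 1 (improved)")
        cmd += ["-c", str(conflict)]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if snowball:
        cmd.append("-s")  # select only the top scorer from each category
    if trace:
        cmd.append("-t")  # also write a trace file
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the Python standard library. File names such as "seeds.slist" or "test.extractions" in any call are placeholders for your own files.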