GenCor

##About

GenCor uses the vocabulary and n-gram probability model of a text corpus (Jane Austen novels) to either generate new sentences or correct noisy ones.

###Generating Text Using a Markov Model

Generates sentences using a trigram model. In this model, each X_t is a discrete random variable denoting the t-th word in a sentence, taking values from the vocabulary. Each word is sampled conditioned on the two words that precede it.
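The sampling step can be sketched roughly as follows. This is a minimal illustration, not GenCor's actual code: the class `TrigramSampler` and its methods `addEntry`, `sampleNext`, and `generate` are hypothetical names, and the in-memory table layout is an assumption. Entries are given as log10 probabilities, matching the n-gram files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of trigram sampling; not taken from GenCor's source.
final class TrigramSampler {
    // Key packs the context (i, j); each value is {k, P(k | j, i)}.
    private final Map<Long, List<double[]>> table = new HashMap<>();
    private final Random rng;

    TrigramSampler(long seed) {
        rng = new Random(seed);
    }

    private static long key(int i, int j) {
        return ((long) i << 32) | (j & 0xffffffffL);
    }

    /** Registers one table entry, given as a log10 probability. */
    void addEntry(int i, int j, int k, double log10Prob) {
        table.computeIfAbsent(key(i, j), unused -> new ArrayList<>())
             .add(new double[] {k, Math.pow(10.0, log10Prob)});
    }

    /** Samples the next word index given the two previous ones, or -1 if the context is unseen. */
    int sampleNext(int i, int j) {
        List<double[]> candidates = table.get(key(i, j));
        if (candidates == null) {
            return -1;
        }
        double total = 0.0;
        for (double[] c : candidates) {
            total += c[1];
        }
        // Scale by the row total in case the row does not sum exactly to 1.
        double r = rng.nextDouble() * total;
        for (double[] c : candidates) {
            r -= c[1];
            if (r <= 0) {
                return (int) c[0];
            }
        }
        return (int) candidates.get(candidates.size() - 1)[0];
    }

    /** Extends the two-word context until endIndex is produced, the context is unseen, or maxLen is reached. */
    List<Integer> generate(int first, int second, int endIndex, int maxLen) {
        List<Integer> out = new ArrayList<>(List.of(first, second));
        while (out.size() < maxLen && out.get(out.size() - 1) != endIndex) {
            int next = sampleNext(out.get(out.size() - 2), out.get(out.size() - 1));
            if (next < 0) {
                break;
            }
            out.add(next);
        }
        return out;
    }
}
```

How the first two words are chosen is left open here; one option would be to pick the second word from the bigram table conditioned on the start-of-sentence token.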

###Sentence Correction Using Hidden Markov Models

Corrects noisy sentences using a first-order hidden Markov model and the Viterbi algorithm, which computes the most likely sequence of intended words given the observed (noisy) words.
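For reference, a generic log-space Viterbi decoder might look like the sketch below. It is not GenCor's implementation: `ViterbiSketch` and `decode` are hypothetical names, and the emission scores are taken as input because this README does not describe the noise model (an edit-distance-based score would be one possible choice). The transition scores would plausibly come from the bigram table.

```java
// Hypothetical sketch of Viterbi decoding for a first-order HMM; not taken from GenCor's source.
final class ViterbiSketch {
    /**
     * @param logInit  logInit[s]     = log P(state_0 = s)
     * @param logTrans logTrans[p][s] = log P(state_t = s | state_{t-1} = p)
     * @param logEmit  logEmit[t][s]  = log P(observation_t | state_t = s)
     * @return the most likely state sequence, one state index per observation
     */
    static int[] decode(double[] logInit, double[][] logTrans, double[][] logEmit) {
        int steps = logEmit.length;
        int states = logInit.length;
        if (steps == 0) {
            return new int[0];
        }
        double[][] best = new double[steps][states]; // best log score of any path ending in state s at time t
        int[][] back = new int[steps][states];       // predecessor state on that best path

        for (int s = 0; s < states; s++) {
            best[0][s] = logInit[s] + logEmit[0][s];
        }
        for (int t = 1; t < steps; t++) {
            for (int s = 0; s < states; s++) {
                best[t][s] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < states; p++) {
                    double score = best[t - 1][p] + logTrans[p][s] + logEmit[t][s];
                    if (score > best[t][s]) {
                        best[t][s] = score;
                        back[t][s] = p;
                    }
                }
            }
        }

        // Pick the best final state, then follow the back-pointers.
        int[] path = new int[steps];
        for (int s = 1; s < states; s++) {
            if (best[steps - 1][s] > best[steps - 1][path[steps - 1]]) {
                path[steps - 1] = s;
            }
        }
        for (int t = steps - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }
}
```

Working in log space avoids underflow on long sentences and fits the log10 values stored in the n-gram files (any log base works as long as it is used consistently).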

##Usage

GenCor is a command-line tool. The vocabulary and n-gram files must be located in the same directory as the .jar file.

###Sentence Generation

Use the -g flag.

    java -jar gencor.jar -g

###Sentence Correction

Use the -c flag, followed by the sentence to correct.

    java -jar gencor.jar -c misspelled sentence to correct

###Example Output

Sentence generation:

    <s> I am very saucy . </s>
    <s> They could go directly to the comfort of the company of some great uncle . </s>
    <s> She was quite a different sort of young lady I learnt , in short , and all in want of complexion . </s>
    <s> She is loveliness itself . </s>

Sentence correction:

    "she haf heard them"
    - <s> She had heard them
    "She was ulreedy quit live"
    - <s> She was already quite like
    "he said nit word by"
    - <s> He said it would be

##Format

The format of the data files is as follows. vocab.txt contains the vocabulary, one word per line, each preceded by its index:

    1 word1
    2 word2
    ...
    n wordn

There are two special words in the vocabulary. The words `<s>` and `</s>` denote the beginning and the end of a sentence, respectively.
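Reading the vocabulary could be sketched as below. This is an assumption-laden illustration rather than GenCor's own loader: `VocabSketch`, `parse`, and `isBoundary` are hypothetical names, and the method takes the file's lines as a list so that it does not depend on where the file lives.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a vocab.txt parser; not taken from GenCor's source.
final class VocabSketch {
    static final String START = "<s>";
    static final String END = "</s>";

    /** Maps each index to its word, expecting lines of the form "index word". */
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> indexToWord = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] fields = trimmed.split("\\s+", 2);
            if (fields.length != 2) {
                throw new IllegalArgumentException("Malformed vocab line: " + line);
            }
            indexToWord.put(Integer.parseInt(fields[0]), fields[1]);
        }
        return indexToWord;
    }

    /** True for the sentence-boundary markers. */
    static boolean isBoundary(String word) {
        return START.equals(word) || END.equals(word);
    }
}
```

The lines could be supplied with `Files.readAllLines(Path.of("vocab.txt"))` from `java.nio.file`.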

The files *gram_counts.txt contain a 3-gram probability model, with one conditional probability table entry per line. Despite the file names, the stored values are base-10 log probabilities, and i, j, and k are word indices. For unigram_counts.txt, each line of the file contains:

    i log10 P(x_t = i)

For bigram_counts.txt, each line of the file contains:

    i j log10 P(x_t = j | x_{t-1} = i)

For trigram_counts.txt, each line of the file contains:

    i j k log10 P(x_t = k | x_{t-1} = j, x_{t-2} = i)
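Since all three files share the shape "word indices followed by a log10 probability", a single parser can handle any of them. The sketch below is illustrative only: `NgramEntry`, `parse`, and `probability` are hypothetical names, and whitespace-separated fields are an assumption.

```java
// Hypothetical sketch of an n-gram line parser; not taken from GenCor's source.
final class NgramEntry {
    final int[] words;      // word indices, oldest first; the last one is the predicted word
    final double log10Prob; // log10 of the conditional probability

    NgramEntry(int[] words, double log10Prob) {
        this.words = words;
        this.log10Prob = log10Prob;
    }

    /** Parses a unigram, bigram, or trigram line: one to three indices, then the log10 value. */
    static NgramEntry parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Malformed n-gram line: " + line);
        }
        int[] words = new int[fields.length - 1];
        for (int n = 0; n < words.length; n++) {
            words[n] = Integer.parseInt(fields[n]);
        }
        double log10Prob = Double.parseDouble(fields[fields.length - 1]);
        return new NgramEntry(words, log10Prob);
    }

    /** Converts the stored log10 value back to a plain probability. */
    double probability() {
        return Math.pow(10.0, log10Prob);
    }
}
```

For example, `NgramEntry.parse("3 7 12 -0.30103")` (a made-up line) yields the indices 3, 7, 12 and a probability of about 0.5.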

##To Do

- Dramatically improve performance
- Improve correction accuracy
- Improve documentation
