kin3tik/GenCor

# GenCor

## About

GenCor generates or corrects sentences using the vocabulary and n-gram probability model of a text corpus (the novels of Jane Austen).

### Generating Text Using a Markov Model

Sentences are generated with a trigram model. In this model, each X_t is a discrete random variable denoting the t-th word of a sentence and taking values from the vocabulary; each word is drawn conditioned on the two words before it.
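As a rough illustration of trigram sampling, the sketch below draws each word from a distribution keyed by the two preceding words. The class `TrigramSampler`, its methods, the use of a single `<s>` context for the first word, and the probabilities in `main` are all assumptions made for this example; they are not taken from GenCor's source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only: names are hypothetical, not GenCor's API.
class TrigramSampler {
    // context ("w1 w2", or "<s>" for the first word) -> next word -> probability
    private final Map<String, Map<String, Double>> table = new LinkedHashMap<>();
    private final Random random;

    TrigramSampler(Random random) {
        this.random = random;
    }

    // Records P(next | context) = probability.
    void add(String context, String next, double probability) {
        table.computeIfAbsent(context, k -> new LinkedHashMap<>()).put(next, probability);
    }

    // Draws one word from the distribution stored for the context.
    String sample(String context) {
        Map<String, Double> dist = table.get(context);
        if (dist == null) {
            return "</s>"; // assumption: an unseen context simply ends the sentence
        }
        double total = 0.0;
        for (double p : dist.values()) {
            total += p;
        }
        double r = random.nextDouble() * total;
        String word = "</s>";
        for (Map.Entry<String, Double> e : dist.entrySet()) {
            word = e.getKey();
            r -= e.getValue();
            if (r < 0) {
                break;
            }
        }
        return word; // falls back to the last candidate on rounding error
    }

    // Generates words starting from <s> until </s> is drawn or maxWords is reached.
    List<String> generate(int maxWords) {
        List<String> words = new ArrayList<>();
        words.add("<s>");
        while (words.size() < maxWords) {
            int n = words.size();
            String context = (n == 1)
                    ? words.get(0)
                    : words.get(n - 2) + " " + words.get(n - 1);
            String next = sample(context);
            words.add(next);
            if (next.equals("</s>")) {
                break;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        TrigramSampler sampler = new TrigramSampler(new Random());
        // Made-up probabilities; real values would come from the n-gram files.
        sampler.add("<s>", "She", 1.0);
        sampler.add("<s> She", "is", 1.0);
        sampler.add("She is", "loveliness", 1.0);
        sampler.add("is loveliness", "itself", 1.0);
        sampler.add("loveliness itself", ".", 1.0);
        sampler.add("itself .", "</s>", 1.0);
        System.out.println(String.join(" ", sampler.generate(20)));
    }
}
```

Because every context in this toy table has exactly one continuation, the output is deterministic; with a real model each context would offer many candidate words.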

### Sentence Correction Using Hidden Markov Models

Noisy sentences are corrected with a first-order hidden Markov model: the Viterbi algorithm computes the most likely sequence of intended words given the observed (possibly misspelled) words.
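The Viterbi step can be sketched generically as below. This is a minimal log-space decoder over precomputed tables, not GenCor's implementation: the class `ViterbiSketch`, its array layout, and the toy numbers in `main` are assumptions, and the README does not say how GenCor scores a misspelled word against a candidate (the emission model).

```java
import java.util.Arrays;

// Illustrative sketch only: not GenCor's actual code.
class ViterbiSketch {
    /**
     * logStart[s]    = log P(first state = s)
     * logTrans[p][s] = log P(state s | previous state p)
     * logEmit[t][s]  = log P(observation at step t | state s)
     * Returns the most likely state index for each observation.
     */
    static int[] decode(double[] logStart, double[][] logTrans, double[][] logEmit) {
        int steps = logEmit.length;
        int states = logStart.length;
        if (steps == 0) {
            return new int[0];
        }
        double[][] best = new double[steps][states]; // best log score ending in state s at step t
        int[][] back = new int[steps][states];       // predecessor that achieved that score

        for (int s = 0; s < states; s++) {
            best[0][s] = logStart[s] + logEmit[0][s];
        }
        for (int t = 1; t < steps; t++) {
            for (int s = 0; s < states; s++) {
                double bestScore = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < states; p++) {
                    double score = best[t - 1][p] + logTrans[p][s];
                    if (score > bestScore) {
                        bestScore = score;
                        bestPrev = p;
                    }
                }
                best[t][s] = bestScore + logEmit[t][s];
                back[t][s] = bestPrev;
            }
        }

        // Pick the best final state, then follow the back-pointers.
        int last = 0;
        for (int s = 1; s < states; s++) {
            if (best[steps - 1][s] > best[steps - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[steps];
        path[steps - 1] = last;
        for (int t = steps - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy two-state model with made-up probabilities.
        double[] start = {Math.log(0.6), Math.log(0.4)};
        double[][] trans = {
            {Math.log(0.7), Math.log(0.3)},
            {Math.log(0.4), Math.log(0.6)}
        };
        double[][] emit = {
            {Math.log(0.5), Math.log(0.1)},
            {Math.log(0.4), Math.log(0.3)},
            {Math.log(0.1), Math.log(0.6)}
        };
        System.out.println(Arrays.toString(decode(start, trans, emit))); // prints [0, 0, 1]
    }
}
```

In a sentence corrector the states would be vocabulary words and the transition table would come from the bigram probabilities; working in log space avoids underflow on long sentences and matches the log10 values stored in the n-gram files.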

## Usage

GenCor is a command-line tool. The vocabulary and n-gram files must be in the same directory as the .jar file.

### Sentence Generation

Use the -g flag:

    java -jar gencor.jar -g

### Sentence Correction

Use the -c flag, followed by the sentence to correct:

    java -jar gencor.jar -c misspelled sentence to correct

### Example Output

Sentence generation:

  • <s> I am very saucy . </s>
  • <s> They could go directly to the comfort of the company of some great uncle . </s>
  • <s> She was quite a different sort of young lady I learnt , in short , and all in want of complexion . </s>
  • <s> She is loveliness itself . </s>

Sentence correction:

  • "she haf heard them"
    • <s> She had heard them
  • "She was ulreedy quit live"
    • <s> She was already quite like
  • "he said nit word by"
    • <s> He said it would be

## Format

The input files are formatted as follows. vocab.txt contains the vocabulary, one word per line, each preceded by its index:

1 word1
2 word2
...
n wordn

There are two special words in the vocabulary. The words <s> and </s> denote the beginning and the end of a sentence, respectively.
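Reading this format could look roughly like the sketch below, which assumes each line is an integer index, whitespace, and then the word. `VocabParser` is a hypothetical helper written for this example and is not part of GenCor; the sample lines in `main` are invented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a hypothetical helper, not GenCor's own loader.
class VocabParser {
    // Turns lines of the form "<index> <word>" into an index -> word map.
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> vocab = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed vocab line: " + line);
            }
            vocab.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return vocab;
    }

    public static void main(String[] args) {
        // Invented sample lines; the real indices and words are in vocab.txt.
        Map<Integer, String> vocab = parse(Arrays.asList("1 <s>", "2 </s>", "3 she"));
        System.out.println(vocab);
    }
}
```

To read the actual file, the lines could be obtained with `java.nio.file.Files.readAllLines(Paths.get("vocab.txt"))` and passed to `parse`.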

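The *gram-counts.txt files together contain a 3-gram probability model, with one conditional probability table entry per line. In each entry, i, j, and k are word indices from vocab.txt, and the final field is a base-10 log probability. For unigram-counts.txt, each line of the file contains: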

i log10 P(X_t = i)

For bigram-counts.txt, each line of the file contains:

i j log10 P(X_t = j | X_{t-1} = i)

For trigram-counts.txt, each line of the file contains:

i j k log10 P(X_t = k | X_{t-1} = j, X_{t-2} = i)
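All three files share the same shape: one or more word indices followed by a log10 probability. The sketch below parses a single line of any of them and converts the stored value back to a plain probability. `NgramEntry` is a hypothetical class for this example, and the sample line in `main` is invented rather than taken from the real files.

```java
import java.util.Arrays;

// Illustrative sketch only: a hypothetical record of one n-gram table entry.
class NgramEntry {
    final int[] ids;        // word indices: {i}, {i, j} or {i, j, k}
    final double log10Prob; // base-10 log of the conditional probability

    NgramEntry(int[] ids, double log10Prob) {
        this.ids = ids;
        this.log10Prob = log10Prob;
    }

    // Parses "i log10p", "i j log10p" or "i j k log10p".
    static NgramEntry parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed n-gram line: " + line);
        }
        int[] ids = new int[parts.length - 1];
        for (int n = 0; n < ids.length; n++) {
            ids[n] = Integer.parseInt(parts[n]);
        }
        double log10Prob = Double.parseDouble(parts[parts.length - 1]);
        return new NgramEntry(ids, log10Prob);
    }

    // Converts the stored log10 value back to a probability in [0, 1].
    double probability() {
        return Math.pow(10.0, log10Prob);
    }

    public static void main(String[] args) {
        // Invented trigram line: indices 3, 7, 12 with log10 probability -1.0.
        NgramEntry entry = parse("3 7 12 -1.0");
        System.out.println(Arrays.toString(entry.ids) + " -> " + entry.probability());
    }
}
```

Keeping the values in log space is often preferable in practice: the log probability of a whole sentence is then the sum of its per-word entries, which avoids multiplying many small numbers.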

## To Do

  • Dramatically improve performance
  • Improve correction accuracy
  • Improve documentation
