Muhomorik/DSS_Capstone

🚫 This document is automatically generated from README.Rmd 🚫

Diagrams are created using Pencil (.ep file).

Note to myself: non-reactive version seems to be faster.

Foreword

This is the final project for the Data Science Specialization at Coursera.

This is a longer version of the About page of the Shiny app.

Data files are not included because they are over 500 MB in size; the ngrams generator downloads them.

The ready-made database is not included either because it is too big. It also takes quite a while to build from scratch, and its structure is not really described here, so this repo is useless for anybody except me 👿

Structure

Figure: App flow

The app consists of an offline part, which pre-processes the data into a database, and an online part, the app itself.

The offline part and the pre-processing are described later in this document.

Some top predictors are listed in the Examples section. Try entering them slowly and see how the values change.

Some numbers

NGrams by N-size chart and creation time for each ngram table in hours.

The same data as a table. The scaling column defines the fraction of the corpus used (2 means 1/2 of the file).

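| nGram | skipGram | GramsSize | scaling | runtime (hours) |
|-------|----------|-----------|---------|-----------------|
| 1     | 0        | 109249    | 1       | 4               |
| 2     | 0        | 504013    | 3       | 5.5             |
| 3     | 0        | 543075    | 3       | 16.1            |
| 4     | 0        | 275753    | 3       | 18.5            |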

An SQLite database is used for storage. The database stores values in a binary format, which is very compact: the file takes only 44.0 MB.

Prediction

The app uses the simplest Stupid Back-off prediction scheme. First it tries to find matching four-grams; if that fails, it backs off to trigrams, then bigrams, then unigrams. The size of the bars shows the approximate ngram size.

The tricky part was filtering the Twitter set because it includes lots of typos, hashtags (approx. 60%) and web pages. In fact, words like "aaaaaghhhhh" are still in the prediction table because they are widely used, but they have very low values.

My own tests show 68-75% coverage for bigrams, depending on skip-grams and the test set.

To illustrate the idea, every intermediate step returns a prediction.
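The back-off order can be sketched in a few lines of base R. This is only an illustration under assumptions: the toy table, its column names (`n`, `context`, `word`, `score`) and the function `predict_backoff()` are hypothetical and are not taken from the app, and the sketch shows the lookup order only, not the scoring used by the real database.

```r
# Toy n-gram table: "context" holds the preceding words, "word" the candidate.
# Column names and scores are invented for illustration.
ngrams <- data.frame(
  n       = c(4, 3, 2, 2, 1),
  context = c("thanks for the", "for the", "the", "the", ""),
  word    = c("follow", "follow", "same", "first", "the"),
  score   = c(0.1, 0.3, 0.5, 0.6, 0.9),
  stringsAsFactors = FALSE
)

# Try the longest context first and back off to shorter ones.
# Every level that matches contributes its best candidate, so the
# intermediate steps stay visible in the result.
predict_backoff <- function(words, table, max_n = 4) {
  hits <- list()
  for (n in max_n:1) {
    k <- n - 1                                 # context length of an n-gram
    if (k > length(words)) next                # not enough input words
    context <- paste(tail(words, k), collapse = " ")
    rows <- table[table$n == n & table$context == context, ]
    if (nrow(rows) > 0) {
      best <- rows[which.min(rows$score), ]    # assumption: lower score is better
      hits[[length(hits) + 1]] <- best
    }
  }
  do.call(rbind, hits)
}

result <- predict_backoff(c("thanks", "for", "the"), ngrams)
print(result[, c("n", "word", "score")])
```

With a single input word, only the bigram and unigram levels can match, which is the back-off behaviour described above.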

Figure: prediction scheme

I am using the quanteda package and the inversemax weighting scheme described here: https://en.wikipedia.org/wiki/Tf–idf.

Figure: inversemax formula

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. Basically, as a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.

In other words, in this application lower values are better.
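A small numeric sketch may help here. It assumes that the inverse-max weight is the logarithm of the largest document frequency divided by the term's own document frequency; that form, the natural logarithm, and the toy counts are assumptions for illustration, not something confirmed by the formula image or by the app's code.

```r
# Document frequency: in how many documents each term appears (toy numbers).
doc_freq <- c(the = 8, follow = 4, capstone = 1)

# Assumed form of the inverse-max weight: log of the largest document
# frequency divided by the term's own document frequency.
inverse_max <- function(df) log(max(df) / df)

weights <- inverse_max(doc_freq)
print(round(weights, 3))
```

Under this assumption the most common term gets a weight of 0 and rarer terms get larger weights, which matches the remark that lower values are better.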

How it works (short)

Figure: Prediction output back to user

Pre-processed values are stored in the SQLite database and accessed using the RSQLite package.

User input is split and filtered into variables that are passed to the query builder and then to the SQLite database. The result (last row) is returned.
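The split-and-query step could look roughly like the sketch below. Since the database structure is not described in this document, everything about the layout is hypothetical: the table names (`grams2`, `grams3`, ...), the columns (`w1`, `w2`, ..., `word`, `value`) and the helpers `split_input()` and `build_query()` are invented for illustration.

```r
# Split raw user input into lower-case words and keep the last `n` of them.
split_input <- function(text, n = 3) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]              # drop empty pieces
  tail(words, n)
}

# Build a parameterised query for a HYPOTHETICAL layout: one table per
# n-gram size, context words in w1..wk, the predicted word in `word`.
build_query <- function(words) {
  k <- length(words)
  table <- paste0("grams", k + 1)
  where <- if (k > 0) {
    paste0(" WHERE ", paste0("w", seq_len(k), " = ?", collapse = " AND "))
  } else {
    ""
  }
  list(
    sql    = paste0("SELECT word, value FROM ", table, where,
                    " ORDER BY value LIMIT 1"),
    params = as.list(words)
  )
}

q <- build_query(split_input("Thanks for the "))
print(q$sql)
```

Placeholders (`?`) keep user input out of the SQL text. With an open connection, such a query could be run with `DBI::dbGetQuery(con, q$sql, params = q$params)`.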

Data analysis

Analysis was done using the ChecknGramCoverage.R script (not included?)

Figure: Exploratory data analysis

The script takes a small sample as the Train set and the next sample, not included in Train, as the Test set. Each sample is around 3 MB.

The same rules are applied to both sets. Both results are then put into data.table objects and joined using a LEFT OUTER JOIN.

The LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table. After that, a row filter keeps only the rows where the Train value is NA, that is, the keys present in Test but missing from Train.

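# LEFT OUTER JOIN: keep every Test row, add Train columns where the key matches
Result <- merge(freqDt.test, freqDt.train, all.x = TRUE, by = "keyName")
# Keep only the keys that are missing from Train
Result <- Result[is.na(valueTrain)] # [, long := nchar(keyName)][long > 3]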

The result table contains the keys that are missing from Train. It can additionally be filtered by key length (the commented-out part of the code).
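The join and filter can be tried on toy data without the real corpus. The sketch below uses plain data frames and invented counts; the coverage formula (share of distinct Test keys that also occur in Train) is an assumption about how coverage could be measured, not necessarily how ChecknGramCoverage.R does it.

```r
# Toy frequency tables keyed by n-gram (invented values).
freq_test <- data.frame(
  keyName   = c("of the", "in the", "thanks for", "zzz top"),
  valueTest = c(10, 7, 3, 1),
  stringsAsFactors = FALSE
)
freq_train <- data.frame(
  keyName    = c("of the", "in the", "for the"),
  valueTrain = c(12, 9, 4),
  stringsAsFactors = FALSE
)

# LEFT OUTER JOIN: keep every Test row, fill Train columns with NA if absent.
joined <- merge(freq_test, freq_train, all.x = TRUE, by = "keyName")

# Test n-grams that Train has never seen.
missing_in_train <- joined[is.na(joined$valueTrain), ]

# Assumed coverage measure: share of distinct Test keys also found in Train.
coverage <- 1 - nrow(missing_in_train) / nrow(joined)
print(missing_in_train$keyName)
print(coverage)
```

The key "for the" exists only in Train, so the left join drops it; inspecting Train for values that should have been filtered is a separate step.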

The Result is then inspected for max/min values and may show values that should NOT have been filtered out.

The Train set is inspected the same way for values that should have been filtered out.

After that, the filter functions trim() and selectFeature() are adjusted. If the result is OK, the filter can be applied to the bigger set.

Creating the database

The prediction data (the ngrams) is stored in a single SQLite database file. It can easily be viewed and edited using [SQLiteStudio (free)](http://sqlitestudio.pl/).

Figure: ngrams generator flow

The database is created by running a single file, FillGramsDb.R. The file accepts the minimum and maximum ngram sizes to be calculated and contains configuration for the other parameters.

The script loops over the ngram sizes and writes the words to the database file: grams_db1.sqlite
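The loop over ngram sizes can be sketched in base R as follows. This is a simplified illustration, not the generator itself: `make_ngrams()` is a hypothetical helper, the token vector is invented, and the sketch only counts ngrams in memory instead of weighting them and writing them to the SQLite file.

```r
# Build all n-grams of size n from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "rt")

# Loop from the minimum to the maximum n-gram size and count how often
# each n-gram occurs, most frequent first.
counts <- list()
for (n in 1:3) {
  counts[[n]] <- sort(table(make_ngrams(tokens, n)), decreasing = TRUE)
}
print(counts[[3]])
```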

Figure: ngrams generator running

The image shows the generator running inside the ConEmu Windows terminal (it's like cmd.exe, but with tabs).

The best way to run it is to create a cmd script that calls Rscript.exe.

"C:\Program Files\Microsoft\MRO\R-3.2.3\bin\Rscript.exe" "HelpersSQLite\FillGramsDb.R" 1 4

The R file can be passed to Rscript.exe and will execute in its own session, with different parameters and without starting RStudio.

Multiple files can be run at the same time without interfering with each other.

Figure: cmd file

Parameters can be accessed from R like:

args <- commandArgs(trailingOnly = TRUE)
nGrams.Min <- as.numeric(args[1])
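Reading both values with some validation could look like the sketch below. The function `parse_gram_args()` and its field names are hypothetical; they only illustrate how the two arguments from the cmd example (`1 4`) might be checked before the generator starts.

```r
# Parse "<min> <max>" from a character vector such as the one returned
# by commandArgs(trailingOnly = TRUE).
parse_gram_args <- function(args) {
  if (length(args) < 2) stop("usage: FillGramsDb.R <min> <max>")
  n_min <- suppressWarnings(as.integer(args[1]))
  n_max <- suppressWarnings(as.integer(args[2]))
  if (is.na(n_min) || is.na(n_max) || n_min < 1 || n_max < n_min) {
    stop("min and max must be integers with 1 <= min <= max")
  }
  list(min = n_min, max = n_max)
}

# Same values as in the cmd example: "... FillGramsDb.R 1 4"
cfg <- parse_gram_args(c("1", "4"))
print(cfg)
```

Inside a script, the call would be `parse_gram_args(commandArgs(trailingOnly = TRUE))`, so a wrong argument stops the run immediately instead of hours into the generation.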

About

Data Science Capstone SwiftKey project.
