Readme

About

🚫 This document is automatically generated from README.Rmd 🚫

Diagrams are created using Pencil (.ep file).

Note to myself: non-reactive version seems to be faster.

Foreword

This is the final project for the Data Science Specialization at Coursera.

Milestone report HTML
Slide deck, R Presentation.
ShinyApp, when running.

This is a longer version of the About page in the Shiny app.

Data files are not included because they are more than 500 MB in size; the ngrams generator downloads them.

The ready-made database is not included either because it's too big. It also takes quite a while to build from scratch, and its structure is not really described here, so this repo is useless for anybody except me 👿

Structure

App flow

The app has an offline part, which pre-processes the database, and an online part, the app itself.

The offline part and pre-processing are described later in this document.

Some top predictors are listed in the Examples section. Try entering them slowly and see how the values change.

Some numbers

NGrams by N-size chart and creation time for each ngram table in hours.

The same data as a text table. The scaling value defines how much of the corpus is used (2 means 1/2 of the file).

nGram	skipGram	GramsSize	scaling	runtime (hours)
1	0	109249	1	4
2	0	504013	3	5.5
3	0	543075	3	16.1
4	0	275753	3	18.5

For storage, an SQLite database is used. The database stores values in a binary format, which is very compact: the whole file takes only 44.0 MB.

Prediction

The app uses the simplest Stupid Back-off prediction scheme. First it tries to find a matching four-gram; if that fails, it backs off to trigrams, then bigrams, then unigrams. The size of the bars shows the approximate ngram size.
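The back-off order described above can be sketched in a few lines of base R. This is only an illustration under assumptions: the `tables` list, its contents, and the function name `predict_backoff` are hypothetical stand-ins for the SQLite lookups, and the score discounting used by the full Stupid Back-off scheme is left out.

```r
# Hypothetical lookup tables keyed by context length (number of preceding words).
# Names are contexts, values are the stored next word. Toy data only.
tables <- list(
  `3` = c("one of the" = "most"),
  `2` = c("of the" = "year", "thanks for" = "the"),
  `1` = c("the" = "first", "for" = "the"),
  `0` = c("the")  # unigram fallback: the single most common word
)

predict_backoff <- function(input, tables, max_context = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  k <- min(max_context, length(words))
  # Try the longest available context first, then shorten it by one word
  while (k >= 1) {
    context <- paste(tail(words, k), collapse = " ")
    hit <- tables[[as.character(k)]][context]
    if (!is.na(hit)) {
      return(list(word = unname(hit), order = k + 1))
    }
    k <- k - 1
  }
  # Nothing matched: fall back to the unigram table
  list(word = tables[["0"]][[1]], order = 1)
}

p4 <- predict_backoff("He is one of the", tables)  # matched as a four-gram
p3 <- predict_backoff("top of the", tables)        # backs off to a trigram
p1 <- predict_backoff("zzz", tables)               # falls through to unigrams
```

The returned `order` tells which ngram table produced the prediction, which is the kind of information the bars in the app convey.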

The tricky part was filtering the Twitter set because it includes lots of typos, hashtags (approx. 60%) and web pages. In fact, words like "aaaaaghhhhh" are still in the prediction table because they are widely used, but they have very low values.

My own tests show 68-75% coverage for bigrams, depending on skip-grams and the test set.

To illustrate the idea, all intermediate steps return a prediction.

I am using the quanteda package and the inversemax weighting scheme described here: https://en.wikipedia.org/wiki/Tf–idf.

inversemax formula

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. Basically, as a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.

In other words, in this application lower values are better.
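To make the weighting concrete, here is a small base R sketch. It assumes the "inverse document frequency max" variant in the form log(max df / df), with a natural logarithm and no smoothing; the exact form used by quanteda may differ in base and smoothing. The toy counts are made up.

```r
# Toy document-term counts: rows are documents, columns are terms (made-up data)
counts <- matrix(
  c(2, 1, 0,
    1, 0, 1,
    3, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("the", "cat", "zebra"))
)

# Document frequency: in how many documents each term appears
doc_freq <- colSums(counts > 0)

# Assumed inversemax form: the most widespread term gets log(1) = 0,
# rarer terms get larger values
idf_max <- log(max(doc_freq) / doc_freq)
```

Here "the" appears in every document and gets 0, while "zebra" appears in one document and gets the largest value, so common terms end up with the lowest weights.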

How it works (short)

Prediction output back to user

Pre-processed values are stored in the SQLite database and accessed using the RSQLite package.

User input is split and filtered into variables that are passed to the query builder and then to the SQLite database. The result (last row) is returned.
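A rough sketch of the split-and-query step, in base R. Everything named here is an assumption for illustration: the functions `split_input()` and `build_query()`, the table names `grams2` to `grams4`, and the columns `word1`..`wordN`, `prediction` and `value` are hypothetical, since the real schema is not described in this document.

```r
# Lowercase the input, drop everything except letters, apostrophes and spaces,
# and keep only the last few words as the lookup context
split_input <- function(input, max_words = 3) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(input))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  words <- words[nzchar(words)]
  tail(words, max_words)
}

# Build a parameterised query for the table matching the context length.
# Table and column names are hypothetical.
build_query <- function(words) {
  n <- length(words)
  stopifnot(n >= 1)
  conditions <- paste0("word", seq_len(n), " = ?", collapse = " AND ")
  sql <- sprintf(
    "SELECT prediction FROM grams%d WHERE %s ORDER BY value LIMIT 1",
    n + 1L, conditions
  )
  list(sql = sql, params = as.list(words))
}

words <- split_input("Thanks for the  FOLLOW!")
query <- build_query(c("for", "the"))
```

With an open RSQLite connection `con`, such a query could be run as `DBI::dbGetQuery(con, query$sql, params = query$params)`. Using `?` placeholders keeps user input out of the SQL text.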

Data analysis

Analysis was done using the ChecknGramCoverage.R script (not included?)

Exploratory data analysis

The script takes a small sample as the Train set and the next sample, not included in Train, as the Test set. A sample is around 3 MB.

The same rules are applied to both sets. Both results are then stored as data.tables and merged using a LEFT OUTER JOIN.

The LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table. After that, a row filter is applied that keeps only the rows where the Train value is NA, that is, the Test ngrams missing from Train.

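# Left outer join: keep every Test key and attach the Train value where one exists
Result <- merge(freqDt.test, freqDt.train, all.x = TRUE, by = "keyName")
# Keep only the keys missing from Train (the optional length filter is commented out)
Result <- Result[is.na(valueTrain)] #[, long := nchar(keyName)][long > 3]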

The result table contains the values missing from Train. They can be additionally filtered by length (the commented-out part).
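For readers without the script, the same check can be reproduced on toy data. This sketch uses base R data frames instead of data.table, and the frequencies are made up; only the column names `keyName` and `valueTrain` come from the snippet above, while `valueTest` is an assumed name.

```r
# Made-up frequency tables standing in for the Train and Test samples
freq_train <- data.frame(
  keyName = c("of the", "in the", "thanks for"),
  valueTrain = c(0.1, 0.2, 0.9),
  stringsAsFactors = FALSE
)
freq_test <- data.frame(
  keyName = c("of the", "thanks for", "aaaaaghhhhh ok", "to be"),
  valueTest = c(0.1, 0.8, 2.3, 0.3),
  stringsAsFactors = FALSE
)

# Left outer join: every Test row is kept, valueTrain is NA when Train has no match
result <- merge(freq_test, freq_train, by = "keyName", all.x = TRUE)

# Test ngrams that Train does not cover
missing_keys <- result[is.na(result$valueTrain), ]

# Share of Test ngrams that were found in Train
coverage <- 1 - nrow(missing_keys) / nrow(freq_test)
```

In this toy case two of the four Test bigrams are absent from Train, so the coverage is 0.5.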

The Result is then inspected for max/min values and may show values that should NOT have been filtered out.

The Train set is inspected the same way for values that should have been filtered.

After that, the filter functions, trim() and selectFeature(), are adjusted. If the result is OK, the filter can be applied to the bigger set.

Creating database.

The prediction data (ngrams) are stored in a single SQLite database file. It can easily be viewed and edited using SQLiteStudio (free): http://sqlitestudio.pl/.

ngrams generator flow

The database is created by running a single file, FillGramsDb.R. The script accepts the minimum and maximum ngram sizes to be calculated as arguments; other parameters are configured inside the file.

The script loops over the ngrams and writes the words to the database file grams_db1.sqlite.
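The loop over ngram sizes can be pictured with a small in-memory sketch. This is not the FillGramsDb.R code: the function names `make_ngrams()` and `count_ngrams()` are hypothetical, tokenisation is a plain whitespace split in base R rather than quanteda, and the counts stay in a list instead of being written to SQLite.

```r
# Build all ngrams of size n from a vector of words
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count ngrams for every size from n_min to n_max, one list entry per size
count_ngrams <- function(text, n_min, n_max) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  result <- list()
  for (n in n_min:n_max) {
    tab <- table(make_ngrams(words, n))
    counts <- setNames(as.integer(tab), names(tab))
    result[[as.character(n)]] <- sort(counts, decreasing = TRUE)
  }
  result
}

res <- count_ngrams("to be or not to be", 1, 3)
```

Each list entry corresponds to what would be one ngram table in the database; in the real generator that per-size step is where the write to the database file happens.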

ngrams generator running

The image shows the generator running inside the ConEmu Windows terminal (it's like cmd.exe, but with tabs).

The best way to run it is to create a cmd script that calls Rscript.exe.

"C:\Program Files\Microsoft\MRO\R-3.2.3\bin\Rscript.exe" "HelpersSQLite\FillGramsDb.R" 1 4

The R file can be passed to Rscript.exe and will execute in its own session with different parameters and without starting RStudio.

Multiple files can be run at the same time without interfering with each other.

cmd file

Parameters can be accessed from R like:

args <- commandArgs(trailingOnly = TRUE)
nGrams.Min <- as.numeric(args[1])
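A slightly more defensive version of the same idea reads both values and checks them before the long run starts. The function name `parse_ngram_args()` and the error messages are hypothetical; only `commandArgs(trailingOnly = TRUE)` and the two positional arguments come from the lines above.

```r
# Parse and validate the two positional arguments: minimum and maximum ngram size
parse_ngram_args <- function(args) {
  if (length(args) < 2) {
    stop("Usage: Rscript FillGramsDb.R <min ngram size> <max ngram size>")
  }
  # Non-numeric input becomes NA; the warning is suppressed and handled below
  n_min <- suppressWarnings(as.integer(args[1]))
  n_max <- suppressWarnings(as.integer(args[2]))
  if (is.na(n_min) || is.na(n_max) || n_min < 1 || n_max < n_min) {
    stop("Both sizes must be whole numbers with 1 <= min <= max")
  }
  list(min = n_min, max = n_max)
}

# In the script itself this would be called as:
#   cfg <- parse_ngram_args(commandArgs(trailingOnly = TRUE))
cfg <- parse_ngram_args(c("1", "4"))
```

Failing early matters here because a single table can take many hours to build.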
