15,801 changes: 15,801 additions & 0 deletions A tale of two cities by Dickens.txt

3,438 changes: 3,438 additions & 0 deletions Christmas Carol by Dickens.txt

666 changes: 666 additions & 0 deletions Crito by Plato.txt

5,002 changes: 5,002 additions & 0 deletions Great Expectations by Dickens.txt

1,949 changes: 1,949 additions & 0 deletions Metarmophosis by Kafka.txt

21,460 changes: 21,460 additions & 0 deletions Moby Dick by Melville.txt

16,538 changes: 16,538 additions & 0 deletions Platos Republic by Plato.txt

84 changes: 84 additions & 0 deletions Project Reflection.ipynb
@@ -0,0 +1,84 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project Writeup and Reflection\n",
"\n",
"\n",
"#### Project Overview \n",
"What data source(s) did you use and what technique(s) did you use analyze/process them? What did you hope to learn/create?\n",
"\n",
"I used several e-books from Project Gutenberg and put them into my cosine similarity algorithm, in order to measure similarity between different books and authors. \n",
"\n",
"\n",
"#### Implementation\n",
"\n",
"The goal is to measure the similarity between two strings by the amount of usage of certain words. This should be implemented by measuring the cosine between two vectors. The Vector gets created by a wordcounter that returns a counter with all words and its frequencies. The comparison takes place in the cosine function. Its syntax is based on the eucledian dot product. To measure the similarity, you take the cosine of the angle between both. If both are colinear (the same) the angle is 0 degree, so the cosine is 1. if they have no similarities the angle is 90 degree, which means the cosine is 0. As result the output range goes from 0 to 1.\n",
"\n",
"\n",
"#### Results\n",
" \n",
"I chose \"Great Expectations\", \"A tale of two cities\" and \"A christmas carol\" by Charles Dickens, \"Crito\" and \"Plato's Republic\" by Plato \"Metarmophosis\" by Kafka, and Moby Dick form Melville for my analyis. \n",
" \n",
"Dickens and Melville are really similar both, which can be caused by the fact, that both are from the same era. Plato's Crito also really similar to the mentioned Dickens and Melville. Nneither the text of Kafka nor the ones from plato were originally written in english, wherefore there's a tremendous impact by the translater of the book, what needs to be noted.\n",
"\n",
"To sum it up, you are able to get information about date the book emerged, but probably no (at least secure) information about the author. You should also be cautious with using texts that wasn't originally written in the same language as the others.\n",
"\n",
"[[1.0, 0.95, 0.93, 0.85, 0.81, 0.86, 0.9],\n",
" [0.95, 1.0, 0.97, 0.92, 0.86, 0.92, 0.97],\n",
" [0.93, 0.97, 1.0, 0.89, 0.83, 0.91, 0.94],\n",
" [0.85, 0.92, 0.89, 1.0, 0.93, 0.81, 0.92],\n",
" [0.81, 0.86, 0.83, 0.93, 1.0, 0.77, 0.84],\n",
" [0.86, 0.92, 0.91, 0.81, 0.77, 1.0, 0.88],\n",
" [0.9, 0.97, 0.94, 0.92, 0.84, 0.88, 1.0]]\n",
"\n",
"#### Reflection \n",
" \n",
"what went well?\n",
"\n",
"I structured a complex program in many functions on my own for the first time. Also the output was informative.\n",
"\n",
"What could you improve?\n",
"\n",
"I am still not 100 % familar with the process of Recursion and tried to avoid it. Because of the. \n",
"\n",
"Was your project appropriately scoped?\n",
"\n",
"It took a while to get in touch, especially because it wasn't really clear what my task was. I also did a lot of research for this topic. I also wanted to deal with the markov synthesis, but because of other homework I unfortunatel didn't have the time to do so. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
8 changes: 7 additions & 1 deletion README.md
@@ -1,3 +1,9 @@
# TextMining

This is the base repo for the text mining and analysis project for Software Design at Olin College.
Required libraries: requests, re, math, collections, operator (only requests is not part of the Python standard library)

Which file to run to get your results: run [python text_mining.py](https://github.com/flxbrhrdt/TextMining/blob/master/python%20text_mining.py) and it will analyze all .txt files in your folder
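
For example, the similarity functions can also be called directly from Python. The snippet below is a minimal sketch (it assumes you run it from the repo root, and it uses `importlib` only because the script's file name contains a space, which rules out a plain `import`):

```python
# Minimal usage sketch -- the script and the Gutenberg .txt files are expected
# to sit in the current working directory.
import importlib.util

spec = importlib.util.spec_from_file_location("text_mining", "python text_mining.py")
text_mining = importlib.util.module_from_spec(spec)
spec.loader.exec_module(text_mining)

# Pairwise similarity of two of the bundled texts: a 2x2 matrix rounded to two decimals
print(text_mining.Text_similarity("Crito by Plato.txt", "Metarmophosis by Kafka.txt"))
```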

Reflection: [Project Reflection.ipynb](https://github.com/flxbrhrdt/TextMining/blob/master/Project%20Reflection.ipynb)

Also, there are no all-caps variable names anymore, and the final function that ties everything together has been implemented.
114 changes: 114 additions & 0 deletions python text_mining.py
@@ -0,0 +1,114 @@
import requests
import re, math
from collections import Counter
from operator import itemgetter
import operator

def text_preprocessing(text):
    """Check whether the input is a plain string or a path to a .txt file; either way, return the text as a string.

>>> text_preprocessing(4)

>>> text_preprocessing('4')
'4'
"""
    if isinstance(text, str):
        if text.endswith('.txt'):
            # Treat the string as a file path and read the whole file
            with open(text, 'r') as f:
                return f.read()
        else:
            return text
    else:
        return None

def word_counter(text):
    """Return a Counter mapping each word in the text to its frequency.

    >>> word_counter('spam spam eggs')
    Counter({'spam': 2, 'eggs': 1})
    """
word = re.compile(r'\w+')
text = text_preprocessing(text)
words = word.findall(text)
wordcount = Counter(words)
return wordcount

def word_frequency(text):
    """Take a string, count the word frequencies in it, and return them sorted in descending order of frequency

>>> word_frequency('text text hello text python')
[('text', 3), ('hello', 1), ('python', 1)]
"""
    text = text.lower()  # Preprocess: lower-case everything so counting is case-insensitive
    word_frequency = word_counter(text)  # Count the words with the helper above
return sorted(word_frequency.items(), key=lambda pair: pair[1], reverse=True)

def cosine(vec1, vec2):
    """Calculate the cosine of the angle between two word-count vectors, used as a measure of similarity.

"""
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
    # Guard against a zero denominator (e.g. one of the texts contains no words at all)
    if denominator:
        return float(numerator) / denominator
    else:
        return 0.0

def Cosine_similarity(text_1, text_2):
    """Find the similarity between two texts (strings or .txt file paths) by comparing how often they use the same words.

>>> Cosine_similarity('This is a test.', 'This is the second test.')
0.6708203932499369
"""
    # Transform the two inputs into word-count vectors
vector_1 = word_counter(text_1)
vector_2 = word_counter(text_2)
return cosine(vector_1, vector_2)


def listmaker(*texts):
"""Make a list that gets iterated in the Text_similarity function

>>> listmaker('a', 'c', 'y', 'x', 'gfhj')
['a', 'c', 'y', 'x', 'gfhj']
"""

    text_list = []
    for item in texts:
        text_list.append(text_preprocessing(item))
    return text_list

def Text_similarity(*texts):
    """Compare any number of texts with each other and return a matrix of pairwise
    cosine similarities, rounded to two decimal places.

    >>> Text_similarity('This is the first test', 'This is the second test, what a Test. One more Test', 'This is the firs test')
    [[1.0, 0.5, 0.8], [0.5, 1.0, 0.5], [0.8, 0.5, 1.0]]
    """
    # Preprocess every input (file paths are read into strings)
    text = listmaker(*texts)
    # Build the similarity matrix row by row
    sims1 = []
for p in range(len(text)):
sims2 =[]
for i in range(len(text)):
sim = Cosine_similarity(text[p], text[i])
sims2.append(float("{0:.2f}".format(sim)))
sims1.append(sims2)
return sims1



def final_function():
    """Tie the program together: build the similarity matrix for every .txt file in the current folder."""
    from os import listdir
    # Collect every .txt file in the current working directory
    filenames = tuple(f for f in listdir() if f.endswith(".txt"))
    return Text_similarity(*filenames)


if __name__ == '__main__':
    # Print the similarity matrix when the script is run directly
    print(final_function())