15,801 changes: 15,801 additions & 0 deletions A tale of two cities by Dickens.txt

3,438 changes: 3,438 additions & 0 deletions Christmas Carol by Dickens.txt

666 changes: 666 additions & 0 deletions Crito by Plato.txt

5,002 changes: 5,002 additions & 0 deletions Great Expectations by Dickens.txt

1,949 changes: 1,949 additions & 0 deletions Metarmophosis by Kafka.txt

21,460 changes: 21,460 additions & 0 deletions Moby Dick by Melville.txt

16,538 changes: 16,538 additions & 0 deletions Platos Republic by Plato.txt

84 changes: 84 additions & 0 deletions Project Reflection.ipynb
@@ -0,0 +1,84 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project Writeup and Reflection\n",
"\n",
"\n",
"#### Project Overview \n",
"What data source(s) did you use and what technique(s) did you use analyze/process them? What did you hope to learn/create?\n",
"\n",
"I used several e-books from Project Gutenberg and put them into my cosine similarity algorithm, in order to measure similarity between different books and authors. \n",
"\n",
"\n",
"#### Implementation\n",
"\n",
"The goal is to measure the similarity between two strings by the amount of usage of certain words. This should be implemented by measuring the cosine between two vectors. The Vector gets created by a wordcounter that returns a counter with all words and its frequencies. The comparison takes place in the cosine function. Its syntax is based on the eucledian dot product. To measure the similarity, you take the cosine of the angle between both. If both are colinear (the same) the angle is 0 degree, so the cosine is 1. if they have no similarities the angle is 90 degree, which means the cosine is 0. As result the output range goes from 0 to 1.\n",
"\n",
"\n",
"#### Results\n",
" \n",
"I chose \"Great Expectations\", \"A tale of two cities\" and \"A christmas carol\" by Charles Dickens, \"Crito\" and \"Plato's Republic\" by Plato \"Metarmophosis\" by Kafka, and Moby Dick form Melville for my analyis. \n",
" \n",
"Dickens and Melville are really similar both, which can be caused by the fact, that both are from the same era. Plato's Crito also really similar to the mentioned Dickens and Melville. Nneither the text of Kafka nor the ones from plato were originally written in english, wherefore there's a tremendous impact by the translater of the book, what needs to be noted.\n",
"\n",
"To sum it up, you are able to get information about date the book emerged, but probably no (at least secure) information about the author. You should also be cautious with using texts that wasn't originally written in the same language as the others.\n",
"\n",
"[[1.0, 0.95, 0.93, 0.85, 0.81, 0.86, 0.9],\n",
" [0.95, 1.0, 0.97, 0.92, 0.86, 0.92, 0.97],\n",
" [0.93, 0.97, 1.0, 0.89, 0.83, 0.91, 0.94],\n",
" [0.85, 0.92, 0.89, 1.0, 0.93, 0.81, 0.92],\n",
" [0.81, 0.86, 0.83, 0.93, 1.0, 0.77, 0.84],\n",
" [0.86, 0.92, 0.91, 0.81, 0.77, 1.0, 0.88],\n",
" [0.9, 0.97, 0.94, 0.92, 0.84, 0.88, 1.0]]\n",
"\n",
"#### Reflection \n",
" \n",
"what went well?\n",
"\n",
"I structured a complex program in many functions on my own for the first time. Also the output was informative.\n",
"\n",
"What could you improve?\n",
"\n",
"I am still not 100 % familar with the process of Recursion and tried to avoid it. Because of the. \n",
"\n",
"Was your project appropriately scoped?\n",
"\n",
"It took a while to get in touch, especially because it wasn't really clear what my task was. I also did a lot of research for this topic. I also wanted to deal with the markov synthesis, but because of other homework I unfortunatel didn't have the time to do so. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
8 changes: 7 additions & 1 deletion README.md
@@ -1,3 +1,9 @@
# TextMining

This is the base repo for the text mining and analysis project for Software Design at Olin College.
Required libraries: requests, re, math, collections, operator (only requests is not part of the Python standard library)

Which file to run to get your results: run [python text_mining.py](https://github.com/flxbrhrdt/TextMining/blob/master/python%20text_mining.py) and it will analyze all .txt files in your folder
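
For example, the similarity functions can also be called directly from Python. The snippet below is a minimal sketch (it assumes you run it from the repo root, and it uses `importlib` only because the script's file name contains a space, which rules out a plain `import`):

```python
# Minimal usage sketch -- the script and the Gutenberg .txt files are expected
# to sit in the current working directory.
import importlib.util

spec = importlib.util.spec_from_file_location("text_mining", "python text_mining.py")
text_mining = importlib.util.module_from_spec(spec)
spec.loader.exec_module(text_mining)

# Pairwise similarity of two of the bundled texts: a 2x2 matrix rounded to two decimals
print(text_mining.Text_similarity("Crito by Plato.txt", "Metarmophosis by Kafka.txt"))
```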

Reflection: [Project Reflection.ipynb](https://github.com/flxbrhrdt/TextMining/blob/master/Project%20Reflection.ipynb)

Also, there are no all-caps variable names anymore, and the final function that ties everything together has been implemented.
114 changes: 114 additions & 0 deletions python text_mining.py
@@ -0,0 +1,114 @@
import requests
import re, math
from collections import Counter
from operator import itemgetter
import operator

def text_preprocessing(text):
    """Check whether the input is a plain string or a path to a .txt file; either way, return the text as a string.

>>> text_preprocessing(4)

>>> text_preprocessing('4')
'4'
"""
    if isinstance(text, str):
        if text.endswith('.txt'):
            # Treat the string as a file path and read the whole file
            with open(text, 'r') as f:
                return f.read()
        else:
            return text
    else:
        return None

def word_counter(text):
    """Return a Counter mapping each word in the text to its frequency.

    >>> word_counter('spam spam eggs')
    Counter({'spam': 2, 'eggs': 1})
    """
word = re.compile(r'\w+')
text = text_preprocessing(text)
words = word.findall(text)
wordcount = Counter(words)
return wordcount

def word_frequency(text):
    """Take a string, count the word frequencies in it, and return them sorted in descending order of frequency

>>> word_frequency('text text hello text python')
[('text', 3), ('hello', 1), ('python', 1)]
"""
    text = text.lower()  # Preprocess: lower-case everything so counting is case-insensitive
    word_frequency = word_counter(text)  # Count the words with the helper above
return sorted(word_frequency.items(), key=lambda pair: pair[1], reverse=True)

def cosine(vec1, vec2):
    """Calculate the cosine of the angle between two word-count vectors, used as a measure of similarity.

"""
intersection = set(vec1.keys()) & set(vec2.keys())
numerator = sum([vec1[x] * vec2[x] for x in intersection])
sum1 = sum([vec1[x]**2 for x in vec1.keys()])
sum2 = sum([vec2[x]**2 for x in vec2.keys()])
denominator = math.sqrt(sum1) * math.sqrt(sum2)
    # Guard against a zero denominator (e.g. one of the texts contains no words at all)
    if denominator:
        return float(numerator) / denominator
    else:
        return 0.0

def Cosine_similarity(text_1, text_2):
    """Find the similarity between two texts (strings or .txt file paths) by comparing how often they use the same words.

>>> Cosine_similarity('This is a test.', 'This is the second test.')
0.6708203932499369
"""
    # Transform the two inputs into word-count vectors
vector_1 = word_counter(text_1)
vector_2 = word_counter(text_2)
return cosine(vector_1, vector_2)


def listmaker(*texts):
"""Make a list that gets iterated in the Text_similarity function

>>> listmaker('a', 'c', 'y', 'x', 'gfhj')
['a', 'c', 'y', 'x', 'gfhj']
"""

    text_list = []
    for item in texts:
        text_list.append(text_preprocessing(item))
    return text_list

def Text_similarity(*texts):
    """Compare any number of texts with each other and return a matrix of pairwise
    cosine similarities, rounded to two decimal places.

    >>> Text_similarity('This is the first test', 'This is the second test, what a Test. One more Test', 'This is the firs test')
    [[1.0, 0.5, 0.8], [0.5, 1.0, 0.5], [0.8, 0.5, 1.0]]
    """
    # Preprocess every input (file paths are read into strings)
    text = listmaker(*texts)
    # Build the similarity matrix row by row
    sims1 = []
for p in range(len(text)):
sims2 =[]
for i in range(len(text)):
sim = Cosine_similarity(text[p], text[i])
sims2.append(float("{0:.2f}".format(sim)))
sims1.append(sims2)
return sims1



def final_function():
    """Tie the program together: build the similarity matrix for every .txt file in the current folder."""
    from os import listdir
    # Collect every .txt file in the current working directory
    filenames = tuple(f for f in listdir() if f.endswith(".txt"))
    return Text_similarity(*filenames)


if __name__ == '__main__':
    # Print the similarity matrix when the script is run directly
    print(final_function())