diff --git a/Frankenstein.txt b/Frankenstein.txt
new file mode 100644
index 0000000..812272b
Binary files /dev/null and b/Frankenstein.txt differ
diff --git a/Frankenstein_tfidf.png b/Frankenstein_tfidf.png
new file mode 100644
index 0000000..4883cde
Binary files /dev/null and b/Frankenstein_tfidf.png differ
diff --git a/Frankenstein_wf.png b/Frankenstein_wf.png
new file mode 100644
index 0000000..649c871
Binary files /dev/null and b/Frankenstein_wf.png differ
diff --git a/Paradise_Lost.txt b/Paradise_Lost.txt
new file mode 100644
index 0000000..581063a
Binary files /dev/null and b/Paradise_Lost.txt differ
diff --git a/Paradise_Lost_tfidf.png b/Paradise_Lost_tfidf.png
new file mode 100644
index 0000000..98b98b7
Binary files /dev/null and b/Paradise_Lost_tfidf.png differ
diff --git a/Paradise_Lost_wf.png b/Paradise_Lost_wf.png
new file mode 100644
index 0000000..ab20813
Binary files /dev/null and b/Paradise_Lost_wf.png differ
diff --git a/Project Writeup and Reflection.md b/Project Writeup and Reflection.md
new file mode 100644
index 0000000..3aa44cf
--- /dev/null
+++ b/Project Writeup and Reflection.md
@@ -0,0 +1,136 @@
+# Project Writeup and Reflection
+
+### Project Overview [Maximum 100 words]
+
+*What data source(s) did you use and what technique(s) did you use to analyze/process them? What did you hope to learn/create?*
+
+For this project, I wanted to compare the word frequency and sentiment of novels with very different genres. I pickled the novels from Project Gutenberg and performed word frequency analysis on each of them after filtering out stop words [1, 2]. I performed sentiment analysis for the most common words in each novel as well as for the entire novel. Then I created a word cloud [3] of the most common words. In my second iteration, I implemented TF-IDF features [4] in order to compare the "most important words" of each novel. I also performed sentiment analysis and created a word cloud for those words.
+
+From this project, I hoped to learn more about parsing and storing text from the Internet, using dictionaries, creating functions, applying different computational methods to compare my texts, implementing relevant libraries, and figuring out how to use those libraries for my purposes. I also wanted to compare the results of using word frequency analysis and TF-IDF.
+
+### Implementation [~2-3 paragraphs]
+
+*Describe your implementation at a system architecture level. You should NOT walk through your code line by line, or explain every function (we can get that from your docstrings). Instead, talk about the major components, algorithms, data structures and how they fit together. You should also discuss at least one design decision where you had to choose between multiple alternatives, and explain why you made the choice you did.*
+
+I began with a simple word frequency analyzer. This consisted of three parts: grab the text from a URL and store it in a cache, filter the text (strip away header comments, punctuation, etc.) and return the individual words in a list, and find the frequency of each unique word and sort the words by frequency. This could have been done with three functions or even fewer, but I thought it would give me more room to play with the code in the future if I split the functions as I did. With the main portion done, I added a sentiment analyzer and a word cloud generator to obtain visual data.
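+
+As a condensed sketch, the top-level flow looks like this (the function names come from `text_mining.py` below; the URL and file name are just examples):
+
+```python
+# fetch the book once and cache it on disk, then strip the Project Gutenberg header/footer
+text = filter_PG_text(get_cache('http://www.gutenberg.org/cache/epub/84/pg84.txt', 'Frankenstein.txt'))
+# lowercase everything, strip punctuation, and drop stop words
+word_list = get_word_list(text)
+# count the unique words and keep the 50 most frequent
+top_words = get_top_n_words(word_list, 50)
+```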
+
+At first, I manually plugged in the title and URL of each novel I wanted to analyze, but eventually I made a tuple with all the titles and a dictionary that mapped each title to its URL. I used placeholders for the title and number of words to analyze to minimize the manual work when I wanted to analyze a different text and use a different word count. Because I tested my code as a whole at each stage instead of writing doctests, I ran the frequency analyzer, sentiment analyzer, and word cloud functions from the main code instead of writing another function to run them.
+
+There were many instances in which I had to decide on the data structure to use. I decided to use a dictionary for the histogram because it would match each unique word with its word count. In turn, I decided to return a word list after filtering the text because it would be easier to find the frequency of each unique word in a list with a dictionary. However, the sentiment analyzer and word cloud generator required the input to be a string, so I added a conditional to those functions that would convert the input to a string if it were originally a list. Most of my decisions were made based on how little I needed to change the code to make it work. Since I relied on a lot of libraries, whose inner workings I felt were outside the scope of this project, I chose to treat each one as a black box and work around it instead of messing with it too much.
+
+At some point, while I was still fixing up the word frequency analysis script, I decided to try to implement TF-IDF. I considered importing the functions from the original file I was working with, but I didn't want to mess anything up by accident, so I copied what I had over into another file. Googling led me to a very helpful tutorial [4], which I used as a reference to write a function that calculates the TF-IDF score. I very much wanted the rest of my code to stay as close to the original as possible, but calculating the TF-IDF scores for every word in the entire text proved too time-consuming. I left the code running for over an hour before deciding that it wasn't worth the time. I went back and changed the get_top_n_words function to output a string with the top n words scaled by the number of times each word occurs (e.g. 'three three three' if the word 'three' appears three times). Then I used that to calculate the TF-IDF scores. It still took a while, but not nearly long enough to be worth figuring out how to make the code run faster. Finally, I changed the outputs to show the top words, sentiment, and word cloud for the text based on the TF-IDF scores.
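+
+For reference, the score computed per word (by the `tfidf` function in `text_mining_tfidf.py` below, following the tutorial's formulation [4]) is the product of the term frequency and the inverse document frequency:
+
+```latex
+\mathrm{tf}(w, d) = \frac{f_{w,d}}{|d|} \qquad
+\mathrm{idf}(w, D) = \log \frac{|D|}{1 + |\{ d' \in D : w \in d' \}|} \qquad
+\mathrm{tfidf}(w, d, D) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w, D)
+```
+
+where f_{w,d} is the number of times word w appears in text d, and D is the set of all the texts being compared; the 1 + in the denominator is the tutorial's smoothing term.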
+
+### Results [~2-3 paragraphs + figures/examples]
+
+*Present what you accomplished:*
+- *If you did some text analysis, what interesting things did you find? Graphs or other visualizations may be very useful here for showing your results.*
+- *If you created a program that does something interesting (e.g. a Markov text synthesizer), be sure to provide a few interesting examples of the program’s output.*
+
+I compared three novels: Frankenstein, Paradise Lost, and The Romance of Lust. Essentially, as a "feel-good" novel, The Romance of Lust is overall the most positive and the least negative. Even calculated with the top 50 words (using either word frequency analysis or TF-IDF), The Romance of Lust has the highest positive score. On the other end of the spectrum, Frankenstein is the most negative, which makes sense because it belongs to a depressing horror genre. Meanwhile, Paradise Lost is the most neutral, not falling into either extreme of romance or horror.
+
+In terms of the comparison between word frequency analysis and TF-IDF, each has its pros and cons. Word frequency analysis is significantly easier to implement than TF-IDF, and hard-coding a list or two of known stop words weeds out most of the irrelevant words. TF-IDF, however, surfaces words unique to each text by penalizing words that also appear in the other texts. As a result, the top 50 words and the resulting word cloud generated with TF-IDF are much more exciting to look at than those generated with word frequency analysis. However, depending on how the score is calculated, TF-IDF can overlook motifs of a text that happen to appear in other texts as well. For example, "man", "life", and "father" are very important words in Frankenstein, but since they are common to most other texts, they are deemed less important by TF-IDF and don't show up in the top 50 words.
+
+One interesting thing to note: with TF-IDF, the compound score, or sentiment intensity, for each novel also has more variety. Frankenstein, in particular, has a compound score of -0.7184, showing that it is indeed uniquely depressing in its genre (top words include "miserable", "misery", "despair", "horror", "tears", and "grief", just like my life).
+
+The top 50 words, sentiments, and word clouds for each novel are included below.
+
+Also, going totally off topic: if you pickle the HTML text from https://google.com and open the .txt file, the encoded characters are a hodgepodge of questionable Chinese characters and radioactive symbols. If you copy it into Google Translate and try to translate it into English, you end up with more repeating Chinese characters as well as gems such as:
+
+- "This item is eligible for Free International Shipping This item is eligible for Free International Shipping This item is eligible for Free International Shipping This item is eligible for Free International Shipping Important information about purchasing this product: This product is out of print and no longer available from the publisher Related promotions"
+- "The King of Cream is a kind of a cup of coffee and a cup of rice and a cup of rice."
+- "In this case, you will not be able to do so. If you have any questions, please do not hesitate to do so. If you have any questions, please do not hesitate to contact us."
+- "The United States of America" x100 + +**Overall** + +Sentiment of Frankenstein: + +- {'neg': 0.139, 'neu': 0.7, 'pos': 0.162, 'compound': 1.0} + +Sentiment of Paradise_Lost: + +- {'neg': 0.12, 'neu': 0.724, 'pos': 0.157, 'compound': 1.0} + +Sentiment of The_Romance_of_Lust: + +- {'neg': 0.089, 'neu': 0.707, 'pos': 0.204, 'compound': 1.0} + +**Computed Using Word Frequency Analysis** + +Top 50 Words in Frankenstein: + +- ['man', 'life', 'father', 'shall', 'eyes', 'said', 'time', 'saw', 'night', 'elizabeth', 'mind', 'day', 'felt', 'death', 'heart', 'feelings', 'thought', 'dear', 'soon', 'friend', 'passed', 'miserable', 'heard', 'like', 'love', 'place', 'little', 'human', 'appeared', 'clerval', 'misery', 'friends', 'justine', 'country', 'nature', 'words', 'cottage', 'feel', 'great', 'old', 'away', 'hope', 'felix', 'return', 'happiness', 'know', 'despair', 'days', 'voice', 'long'] + +Sentiment of Top 50 Words in Frankenstein: + +- {'neg': 0.169, 'neu': 0.492, 'pos': 0.339, 'compound': 0.9201} + +![Frankenstein WF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/Frankenstein_wf.png) + +Top 50 Words in Paradise_Lost: + +- ['thir', 'thy', 'thou', 'thee', "heav'n", 'shall', 'th', 'god', 'earth', 'man', 'high', 'great', 'death', 'till', 'hath', 'hell', 'stood', 'day', 'good', 'like', 'things', 'night', 'light', 'farr', 'love', 'eve', 'o', 'world', 'adam', 'soon', 'let', 'hee', 'son', 'life', 'know', 'place', 'long', 'forth', 'self', 'mee', 'ye', 'way', 'power', 'hand', 'new', 'deep', 'end', 'fair', 'men', 'satan'] + +Sentiment of Top 50 Words in Paradise_Lost: + +- {'neg': 0.122, 'neu': 0.573, 'pos': 0.305, 'compound': 0.8957} + +![Paradise Lost WF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/Paradise_Lost_wf.png) + +Top 50 Words in The_Romance_of_Lust: + +- ['prick', 'dear', 'time', 'delicious', 'cunt', 'said', 'little', 'oh', 'aunt', 'hand', 'miss', 'way', 'pleasure', 'doctor', 'shall', 'quite', 'long', 'night', 'head', 'delight', 'took', 'great', 'excited', 'bed', 'felt', 'day', 'away', 'gave', 'let', 'told', 'lay', 'mrs', 'arms', 'mamma', 'fuck', 'frankland', 'soon', 'room', 'mother', 'clitoris', 'came', 'boy', 'exquisite', 'come', 'moment', 'lips', 'darling', 'began', 'mouth', 'course'] + +Sentiment of Top 50 Words in The_Romance_of_Lust: + +- {'neg': 0.141, 'neu': 0.501, 'pos': 0.358, 'compound': 0.9548} + +[The Romance of Lust WF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/The_Romance_of_Lust_wf.png) + +**Computed Using TF-IDF** + +Top 50 Words in Frankenstein: + +- ['elizabeth', 'feelings', 'miserable', 'sometimes', 'clerval', 'misery', 'friends', 'justine', 'country', 'several', 'cottage', 'felix', 'despair', 'scene', 'horror', 'creature', 'ice', 'affection', 'months', 'countenance', 'soul', 'possessed', 'geneva', 'mountains', 'journey', 'forever', 'hours', 'around', 'believe', 'discovered', 'resolved', 'remained', 'tale', 'cold', 'tears', 'sensations', 'existence', 'family', 'monster', 'appearance', 'companion', 'arrived', 'letter', 'read', 'science', 'girl', 'grief', 'endeavoured', 'wind', 'beauty'] + +Sentiment of Top 50 Words in Frankenstein: + +- {'neg': 0.257, 'neu': 0.571, 'pos': 0.171, 'compound': -0.7184} + +![Frankenstein TF-IDF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/Frankenstein_tfidf.png) + +Top 50 Words in Paradise_Lost: + +- ['thir', 'thou', 'thee', "heav'n", 'th', 'though', "'d", 'o', 'till', 'hath', 'hell', 'things', 'farr', 'eve', 'adam', 'hee', 'self', 'mee', 'ye', 
'fair', 'call', 'satan', 'gods', 'hast', 'paradise', "heav'ns", 'onely', 'spake', 'wide', 'fruit', 'bright', 'least', 'oft', 'angel', 'thence', 'warr', 'tree', 'works', 'behold', 'taste', 'seat', 'king', 'ere', 'angels', 'throne', 'created', 'eternal', 'divine', "heav'nly", 'fall']
+
+Sentiment of Top 50 Words in Paradise_Lost:
+
+- {'neg': 0.073, 'neu': 0.687, 'pos': 0.24, 'compound': 0.8555}
+
+![Paradise Lost TF-IDF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/Paradise_Lost_tfidf.png)
+
+Top 50 Words in The_Romance_of_Lust:
+
+- ['prick', 'delicious', 'cunt', 'aunt', 'miss', 'doctor', 'quite', 'bottom', 'excited', 'bed', 'told', 'mrs', 'mamma', 'fuck', 'frankland', 'clitoris', 'boy', 'exquisite', 'darling', 'mouth', 'got', 'charming', 'charlie', 'lust', 'three', 'get', 'legs', 'excitement', 'harry', 'evelyn', 'fucking', 'passions', 'ellen', 'fucked', 'count', 'cock', 'mary', 'lizzie', 'done', 'exciting', 'wife', 'movements', 'fine', 'belly', 'begged', 'lascivious', 'dale', 'gently', 'together', 'sisters']
+
+Sentiment of Top 50 Words in The_Romance_of_Lust:
+
+- {'neg': 0.21, 'neu': 0.434, 'pos': 0.356, 'compound': 0.9137}
+
+[The Romance of Lust TF-IDF Word Cloud](https://github.com/vivienyuwenchen/TextMining/blob/master/The_Romance_of_Lust_tfidf.png)
+
+### Reflection [~1 paragraph]
+
+*From a process point of view, what went well? What could you improve? Other possible reflection topics: Was your project appropriately scoped? Did you have a good plan for unit testing? How will you use what you learned going forward? What do you wish you knew before you started that would have helped you succeed?*
+
+I got more working than I had hoped for. Originally, I was just thinking of comparing the most common words and sentiments of novels of various genres and making an aesthetically pleasing word cloud for each novel. I started the project early enough to get that code working, so I decided to try to implement TF-IDF as well. I ended up relying quite heavily on a Python tutorial to figure out what TF-IDF even was (I have a hard time keeping myself awake when faced with Wikipedia articles; hence, my project involved text from Project Gutenberg and not Wikipedia). However, the tutorial helped me a lot with figuring out new syntax, so that was great. I also spent a lot of time working with the code because I didn't want to change my original code too much, so I feel like I have a strong grasp of the concept now. Going forward, I will likely reuse the idioms I learned and the libraries I found. I kind of wish I had more of a direction for my project before I started, but I nevertheless had a good time figuring things out along the way. As a result, though, I ended up running my functions from the main code instead of writing doctests, which I actually found easier in this case because I already had working output from my Word Frequency Toolbox to compare my results against. I ended up improving upon the code I had written for the toolbox as well. I kind of want to consolidate my code further and try out different sources and analyses, but at this point, I feel that I've spent too much time on this project and will leave that endeavor for future, bored me.
+
+### References
+
+[1] https://pypi.python.org/pypi/stop-words
+
+[2] http://xpo6.com/list-of-english-stop-words/
+
+[3] https://github.com/amueller/word_cloud
+
+[4] http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/
diff --git a/README.md b/README.md
index 8cce527..8eb9629 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,18 @@
 # TextMining
 This is the base repo for the text mining and analysis project for Software Design at Olin College.
+
+[Project Writeup and Reflection](https://github.com/vivienyuwenchen/TextMining/blob/master/Project%20Writeup%20and%20Reflection.md)
+
+Required packages:
+- pip install nltk requests vaderSentiment
+- pip install stop_words
+- pip install wordcloud
+  - might have to install Microsoft Visual C++ 2015, in which case there will be an error message with a link to the download page
+- pip install textblob
+
+To run the text mining code with word frequency analysis:
+- python text_mining.py
+
+To run the text mining code with TF-IDF:
+- python text_mining_tfidf.py
diff --git a/The_Romance_of_Lust.txt b/The_Romance_of_Lust.txt
new file mode 100644
index 0000000..d3018fb
Binary files /dev/null and b/The_Romance_of_Lust.txt differ
diff --git a/text_mining.py b/text_mining.py
new file mode 100644
index 0000000..e5eae4e
--- /dev/null
+++ b/text_mining.py
@@ -0,0 +1,217 @@
+"""Mines text from Project Gutenberg. Performs word frequency and sentiment analysis
+on text and generates a word cloud from the most common words.
+
+@author: Vivien Chen
+
+"""
+
+import requests
+import string
+import math
+from pickle import dump, load
+from os.path import exists
+from stop_words import get_stop_words  # pip install stop_words
+from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
+from wordcloud import WordCloud  # pip install wordcloud
+
+
+def get_cache(url, file_name):
+    """Downloads the text from url and pickles it into file_name if file_name does
+    not exist. Either way, returns a file object for reading file_name. Assumes that
+    the text in file_name matches the text from url if file_name exists.
+
+    Args:
+        url: url you want to read text from
+        file_name: name of file you want to write to/read from
+
+    Returns:
+        a file object for the cached text from url
+    """
+    if not exists(file_name):
+        # the with block closes the file after the dump, so the pickled text is
+        # fully flushed to disk before the file is reopened below
+        with open(file_name, 'wb') as file_:
+            text = requests.get(url).text
+            dump(text, file_)
+
+    # reopen as text; errors='ignore' drops the pickle framing bytes that are not valid UTF-8
+    return open(file_name, 'r', encoding='utf-8', errors='ignore')
+
+
+def filter_PG_text(text):
+    """Takes a file object with the raw text of a Project Gutenberg book as input.
+    Strips away header comments and returns the book portion of the text.
+
+    Args:
+        text: a file object containing the raw text of a Project Gutenberg book
+
+    Returns:
+        the book portion of the Project Gutenberg book
+    """
+    lines = text.readlines()
+
+    start_line = 0
+    while lines[start_line].find('START OF THIS PROJECT GUTENBERG EBOOK') == -1:
+        start_line += 1
+    lines = lines[start_line+1:]
+
+    end_line = 0
+    while lines[end_line].find('END OF THIS PROJECT GUTENBERG EBOOK') == -1:
+        end_line += 1
+    lines = lines[:end_line-3]
+
+    return ' '.join(lines)
+
+
+def get_word_list(text):
+    """Takes a string of text as input. Strips away punctuation and whitespace.
+    Converts all words to lowercase. Returns a list of the words used in the string
+    with stop words removed.
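+
+    Example (illustrative; the exact output depends on the two stop word lists):
+
+        >>> get_word_list('The quick, brown Fox!')
+        ['quick', 'brown', 'fox']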
+ + Args: + text: a string of text, such as the filtered text from a Project Gutenberg book + + Returns: + a list of words, all lowercase, from the string with stop words removed + """ + text = text.lower() + word_list = text.split() + + for i in range(len(word_list)): + word_list[i] = word_list[i].strip(string.punctuation) + + stop_words = get_stop_words('en') + stop_words_2 = ["a", "about", "above", "across", "after", "afterwards", "again", "against", "all", + "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", + "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", + "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", + "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", + "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", + "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", + "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", + "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", + "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", + "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", + "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", + "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", + "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", + "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", + "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", + "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", + "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", + "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", + "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", + "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", + "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", + "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", + "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", + "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", + "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", + "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", + "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", + "yourselves", "the"] + + filtered_word_list = [word for word in word_list if word not in stop_words and word not in stop_words_2] + + return filtered_word_list + + +def get_histogram(word_list): + """Takes a list of words as input and returns a dictionary with all the 
unique
+    words and their word counts.
+
+    Args:
+        word_list: a list of words (assumed to all be in lower case with no punctuation)
+
+    Returns:
+        a histogram; a dictionary with all the unique words and their word counts
+    """
+    word_counts = dict()
+
+    for word in word_list:
+        if word != '':
+            word_counts[word] = word_counts.get(word, 0) + 1
+
+    return word_counts
+
+
+def get_top_n_words(word_list, n):
+    """Takes a list of words as input and returns a list of the n most frequently
+    occurring words ordered from most to least frequently occurring.
+
+    Args:
+        word_list: a list of words (assumed to all be in lower case with no punctuation)
+        n: the number of words to return
+
+    Returns:
+        a list of the n most frequently occurring words ordered from most to least frequently occurring
+    """
+    word_counts = get_histogram(word_list)
+
+    ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
+
+    return ordered_by_frequency[:n]
+
+
+def sentiment_analyzer(text):
+    """Takes a string of text as input and returns the sentiment analysis of the text.
+    Converts text to string if text is not already a string.
+
+    Args:
+        text: a string of text to be analyzed
+
+    Returns:
+        a sentiment analysis of the text
+    """
+    if type(text) != str:
+        text = ' '.join(text)
+    analyzer = SentimentIntensityAnalyzer()
+    return analyzer.polarity_scores(text)
+
+
+def word_cloud(text, title):
+    """Takes a string of text and a title as input and saves the generated wordcloud as a png
+    with the title as the file name. Converts text to string if text is not already a string.
+
+    Args:
+        text: a string of text used to generate a wordcloud
+        title: the file name of the generated wordcloud
+
+    Returns:
+        nothing; saves the wordcloud to title_wf.png
+    """
+    if type(text) == list:
+        text = ' '.join(text)
+    wordcloud = WordCloud(width=1000, height=500, background_color="white").generate(text)
+    wordcloud.to_file('%s_wf.png' % title)
+
+
+if __name__ == "__main__":
+    titles = ('Frankenstein',
+              'Paradise_Lost',
+              'The_Romance_of_Lust',
+              )
+
+    url = {'Frankenstein': 'http://www.gutenberg.org/cache/epub/84/pg84.txt',
+           'Paradise_Lost': 'http://www.gutenberg.org/cache/epub/20/pg20.txt',
+           'The_Romance_of_Lust': 'http://www.gutenberg.org/cache/epub/30254/pg30254.txt',
+           }
+
+    # for each title
+    for title in titles:
+        # get the text from url and strip the header comments
+        text = filter_PG_text(get_cache(url[title], '%s.txt' % title))
+        # get a list of the filtered words from the text
+        word_list = get_word_list(text)
+        # top n words
+        n = 50
+        # get a list of the top n words
+        top_n_words = get_top_n_words(word_list, n)
+        # print the top n words
+        print('Top %d Words in %s:' % (n, title))
+        print(top_n_words, '\n')
+        # print the sentiment of the top n words
+        print('Sentiment of Top %d Words in %s:' % (n, title))
+        print(sentiment_analyzer(top_n_words), '\n')
+        # print the sentiment of the whole text
+        print('Sentiment of %s:' % title)
+        print(sentiment_analyzer(text), '\n\n')
+        # generate a wordcloud with the filtered words from the text
+        word_cloud(word_list, title)
diff --git a/text_mining_tfidf.py b/text_mining_tfidf.py
new file mode 100644
index 0000000..a35b7b1
--- /dev/null
+++ b/text_mining_tfidf.py
@@ -0,0 +1,172 @@
+"""Mines text from Project Gutenberg. Performs TF-IDF and sentiment analysis
+on text and generates a word cloud from the most important words.
+
+@author: Vivien Chen
+
+"""
+
+import requests
+import string
+import math
+from pickle import dump, load
+from os.path import exists
+from stop_words import get_stop_words  # pip install stop_words
+from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
+from wordcloud import WordCloud  # pip install wordcloud
+from textblob import TextBlob as tb  # pip install textblob
+from text_mining import get_cache, filter_PG_text, get_histogram
+
+
+def get_word_list(text):
+    """Takes a string of text as input. Strips away punctuation and whitespace.
+    Converts all words to lowercase. Returns a list of the words used in the string.
+
+    Args:
+        text: a string of text, such as the filtered text from a Project Gutenberg book
+
+    Returns:
+        a list of words, all lowercase, from the string
+    """
+    text = text.lower()
+    word_list = text.split()
+    for i in range(len(word_list)):
+        word_list[i] = word_list[i].strip(string.punctuation)
+
+    return word_list
+
+
+def get_top_n_words(word_list, n):
+    """Takes a list of words as input and returns a string of the n most frequently
+    occurring words, adjusted for the number of times each word occurs, ordered from
+    most to least frequently occurring.
+
+    Args:
+        word_list: a list of words (assumed to all be in lower case with no punctuation)
+        n: the number of words to consider
+
+    Returns:
+        a string of the n most frequently occurring words, adjusted for the number of times
+        each word occurs, ordered from most to least frequently occurring
+    """
+    word_counts = get_histogram(word_list)
+
+    ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)
+    ordered_by_frequency = ordered_by_frequency[0:n]
+
+    list_ = []
+    # for the top n words, repeat each word by its count to simulate the frequency of the word in the text
+    for word in ordered_by_frequency:
+        list_.append((word + ' ') * word_counts[word])
+
+    return ' '.join(list_)
+
+
+def sentiment_analyzer(text):
+    """Takes a string of text as input and returns the sentiment analysis of the text.
+    Converts text to string if text is not already a string.
+
+    Args:
+        text: a string of text to be analyzed
+
+    Returns:
+        a sentiment analysis of the text
+    """
+    # make sure the text is a string before using the sentiment analyzer
+    if type(text) != str:
+        text = ' '.join(text)
+    analyzer = SentimentIntensityAnalyzer()
+    return analyzer.polarity_scores(text)
+
+
+def word_cloud(text, title):
+    """Takes a string of text and a title as input and saves the generated wordcloud as a png
+    with the title as the file name. Converts text to string if text is not already a string.
+
+    Args:
+        text: a string of text used to generate a wordcloud
+        title: the file name of the generated wordcloud
+
+    Returns:
+        nothing; saves the wordcloud to title_tfidf.png
+    """
+    # make sure the text is a string before using the wordcloud generator
+    if type(text) == list:
+        text = ' '.join(text)
+    wordcloud = WordCloud(width=1000, height=500, background_color="black").generate(text)
+    wordcloud.to_file('%s_tfidf.png' % title)
+
+
+def tfidf(word, text, text_list):
+    """Calculates and returns the TF-IDF score of a word from the text by computing
+    the TF score from the text and the IDF score from all the texts combined.
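+
+    Example (illustrative; tb is the TextBlob constructor imported at the top of this file):
+
+        >>> blobs = [tb('cat dog'), tb('dog bird'), tb('fish')]
+        >>> round(tfidf('cat', blobs[0], blobs), 4)
+        0.2027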
+
+    Args:
+        word: a given word from the text
+        text: a TextBlob of the text for which the TF score of the word is calculated
+        text_list: a list of TextBlobs of all the texts to compare against
+
+    Returns:
+        the TF-IDF score of the word in the text
+    """
+    # term frequency = number of times the word appears in the text / total number of words in the text
+    tf = text.words.count(word) / len(text.words)
+    # number of texts containing the word ('other' avoids shadowing the 'text' parameter)
+    n_containing = sum(1 for other in text_list if word in other.words)
+    # inverse document frequency = log(number of texts / (1 + number of texts containing the word))
+    idf = math.log(len(text_list) / (1 + n_containing))
+    # term frequency-inverse document frequency = term frequency * inverse document frequency
+    return tf * idf
+
+
+if __name__ == "__main__":
+    titles = ('Frankenstein',
+              'Paradise_Lost',
+              'The_Romance_of_Lust',
+              )
+
+    url = {'Frankenstein': 'http://www.gutenberg.org/cache/epub/84/pg84.txt',
+           'Paradise_Lost': 'http://www.gutenberg.org/cache/epub/20/pg20.txt',
+           'The_Romance_of_Lust': 'http://www.gutenberg.org/cache/epub/30254/pg30254.txt',
+           }
+
+    # create a list for the text from each title
+    text_list = []
+
+    # for each title
+    for title in titles:
+        # get the text from url and strip the header comments
+        text = filter_PG_text(get_cache(url[title], '%s.txt' % title))
+        # convert the frequency-adjusted top 500 words to a TextBlob and append it to the list of texts;
+        # this saves time by only calculating TF-IDF scores over each text's top 500 words
+        text_list.append(tb(get_top_n_words(get_word_list(text), 500)))
+
+    # create a list of lists, one for each title, to store the top n words
+    list_ = [[] for _ in range(len(titles))]
+    n = 50
+
+    # for each text
+    for i, text in enumerate(text_list):
+        print('Top %d Words in %s with TF-IDF Scores:' % (n, titles[i]))
+        # create a dictionary with the words as keys and their TF-IDF scores as values
+        scores = {word: tfidf(word, text, text_list) for word in text.words}
+        # sort the words by score from highest to lowest
+        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
+
+        for word, score in sorted_words[:n]:
+            # print the word and its TF-IDF score for the top n words
+            print('\tWord: {}, TF-IDF: {}'.format(word, round(score, 10)))
+            # append the word to the appropriate list
+            list_[i].append(word)
+        print('')
+        print('')
+
+    # for each title
+    for i, title in enumerate(titles):
+        # print the top n words
+        print('Top %d Words in %s:' % (n, title))
+        print(list_[i], '\n')
+        # print the sentiment of the top n words
+        print('Sentiment of Top %d Words in %s:' % (n, title))
+        print(sentiment_analyzer(list_[i]), '\n')
+        # generate a wordcloud with the top n words
+        word_cloud(list_[i], title)