Complete text mining #1
Top 50 Words in Paradise_Lost:
- ['thir', 'thy', 'thou', 'thee', "heav'n", 'shall', 'th', 'god', 'earth', 'man', 'high', 'great', 'death', 'till', 'hath', 'hell', 'stood', 'day', 'good', 'like', 'things', 'night', 'light', 'farr', 'love', 'eve', 'o', 'world', 'adam', 'soon', 'let', 'hee', 'son', 'life', 'know', 'place', 'long', 'forth', 'self', 'mee', 'ye', 'way', 'power', 'hand', 'new', 'deep', 'end', 'fair', 'men', 'satan']
Hmm. I wonder how much of the difference could come from unrecognized words, e.g., archaic spellings or contractions that the sentiment analyzer doesn't recognize and therefore scores as "neutral".
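One way to check this hunch is to measure how many of the top words a sentiment lexicon covers at all. A minimal sketch, assuming a made-up toy lexicon (`TOY_LEXICON`) in place of the real analyzer's vocabulary; the helper `split_by_coverage` is also hypothetical:

```python
# Hypothetical toy lexicon standing in for the analyzer's real vocabulary.
TOY_LEXICON = {"good", "great", "death", "hell", "love", "fair"}


def split_by_coverage(words, lexicon):
    """Return (known, unknown) word lists, preserving the input order."""
    known = [w for w in words if w in lexicon]
    unknown = [w for w in words if w not in lexicon]
    return known, unknown


# A few entries taken from the top-50 list above.
top_words = ["thir", "thy", "thou", "heav'n", "god",
             "great", "death", "hell", "good", "love"]

known, unknown = split_by_coverage(top_words, TOY_LEXICON)
coverage = len(known) / len(top_words)

print("covered:", known)
print("not covered:", unknown)
print(f"coverage: {coverage:.0%}")
```

If most of the archaic spellings land in the "not covered" list, they are probably being treated as neutral rather than contributing to the score.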
text_mining.py
Outdated
Returns:
text from url
"""
if exists(file_name) == False:
Small style thing: rather than checking whether a boolean is equal to False, we can just write "if not exists(file_name):".
Side note, I like the structure of reading a file unless the file doesn't exist, then grabbing from the URL instead. It makes the program nice and portable!
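For reference, the read-from-file-or-fetch pattern with the suggested "if not" check might look like the sketch below. Only the name and signature of `get_cache` come from the diff; the body is an assumption, not the author's actual code.

```python
from os.path import exists
from urllib.request import urlopen


def get_cache(url, file_name):
    """Return the text in file_name, downloading it from url first if needed."""
    if not exists(file_name):
        # First run: fetch the text and save a local copy.
        with urlopen(url) as response:
            text = response.read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as f:
            f.write(text)
    # Later runs read the cached copy and never touch the network.
    with open(file_name, encoding="utf-8") as f:
        return f.read()
```

Because the download only happens when the file is missing, the script keeps working offline once the text has been fetched.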
word_list[i] = word_list[i].strip(string.punctuation)

stop_words = get_stop_words('en')
stop_words_2 = ["a", "about", "above", "across", "after", "afterwards", "again", "against", "all",
Out of curiosity, what happens if you only use the stop_words words, rather than your manually-assembled ones?
stop_words = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']
It worked pretty well, but there were a few common words that weren't included in stop_words (I can only remember 'one' being a very common word off the top of my head), so I just googled another set of stop words and copied it over.
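If both lists stay in use, merging them into a single set avoids checking each word twice. A rough sketch, with short made-up stand-ins for the two lists (the real ones are much longer) and a hypothetical helper `remove_stop_words`:

```python
# Short stand-ins for the package list and the manually copied list.
package_stop_words = ["a", "about", "the", "and", "of"]
extra_stop_words = ["a", "about", "one", "the", "upon"]

# A set drops duplicates and makes membership tests fast.
all_stop_words = set(package_stop_words) | set(extra_stop_words)


def remove_stop_words(word_list, stop_words):
    """Return word_list without stop words, keeping the original order."""
    return [word for word in word_list if word not in stop_words]


words = ["one", "of", "the", "angels", "fell"]
filtered_words = remove_stop_words(words, all_stop_words)
print(filtered_words)
```

Here 'one' is caught only because the second list contains it, which matches the gap described above.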
text_mining.py
Outdated
ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)

return ordered_by_frequency[0:n]
FWIW, you can shorten [0:n] to [:n]; the 0 is implied. Your version still works, though. It's just a personal preference, really.
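A quick illustration that the two slices are equivalent, using made-up counts. The `Counter.most_common` line is a standard-library alternative, not something the PR uses:

```python
from collections import Counter

# Made-up counts for illustration only.
word_counts = {"thou": 3, "thy": 5, "heav'n": 2, "thir": 7}
n = 2

ordered_by_frequency = sorted(word_counts, key=word_counts.get, reverse=True)

explicit = ordered_by_frequency[0:n]  # start index written out
implicit = ordered_by_frequency[:n]   # start index defaults to 0

# Counter can do the sorting and the slicing in one call.
via_counter = [word for word, _ in Counter(word_counts).most_common(n)]

print(explicit, implicit, via_counter)
```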
# print the sentiment of the top n words
print('Sentiment of Top %d Words in %s:' % (n, title))
print(sentiment_analyzer(top_n_words), '\n')
# print the sentiment of the whole text
This might be a little too much commenting - print statements like these mostly stand on their own.
Though, a bit too much documentation is better than a bit too little!
text_mining_tfidf.py
Outdated
from textblob import TextBlob as tb  # pip install textblob


def get_cache(url, file_name):
One thing to look into: you can import your own functions in Python (e.g., "from text_mining import get_cache"). It'd save you some repeated code!
Revised. Fixed syntax. Imported functions from text_mining instead of repeating them in text_mining_tfidf. Removed the redundant stop-word removal from text_mining_tfidf; this changes the TF-IDF score of each word, since stop words are now weighed into the score.
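To see why keeping or dropping stop words shifts the scores: term frequency divides a word's count by the document's total word count, so filtering words out shrinks that denominator. A sketch assuming the common textbook definitions (tf = count / total words, idf = log(N / (1 + number of documents containing the word))); the formulas in text_mining_tfidf may differ, and the sample documents are made up:

```python
import math


def tf(word, doc):
    """Share of the document's words that are `word`."""
    return doc.count(word) / len(doc)


def n_containing(word, docs):
    """Number of documents that contain `word`."""
    return sum(1 for doc in docs if word in doc)


def idf(word, docs):
    """Log-scaled rarity of `word` across all documents."""
    return math.log(len(docs) / (1 + n_containing(word, docs)))


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# Made-up documents, already split into words.
docs = [
    "the angels of the deep fell".split(),
    "the sun rose over the hill".split(),
    "the river ran to the sea".split(),
]
stop_words = {"the", "of", "over", "to"}
filtered = [[w for w in doc if w not in stop_words] for doc in docs]

print(tfidf("angels", docs[0], docs))          # stop words kept
print(tfidf("angels", filtered[0], filtered))  # stop words removed
print(tfidf("the", docs[0], docs))             # a stop word's own score
```

With these definitions, a word that appears in every document gets a negative idf, so stop words sink to the bottom of the ranking without being filtered out first.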