My next step in learning Machine Learning was to implement a Twitter Sentiment Analysis. I wanted to do a 3-class classification with “negative”, “neutral” and “positive”. A RandomForestClassifier handles multiple classes natively, whereas an SVM is inherently a binary classifier and needs a one-vs-one or one-vs-rest scheme for more than two classes, so the random forest was the better fit here.
Before you can start you need the following data:
a) List of stop words (e.g. “and”, “because”, “it” – words that carry little sentiment information)
b) List of slang words and their replacements
c) List of affinity words (words with a known sentiment polarity)
d) List of emoticons (e.g. “:-)”)
e) Set of negative, neutral and positive tweets (for training the model)
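These resource lists are typically plain-text files with one entry per line. A minimal sketch of a loader (the helper name `loadWordList` and the `stopwords.txt` file name are my own placeholders, not part of the project):

```python
import io

def loadWordList(f):
    """Read one entry per line from an open file object, skipping blank lines."""
    return [line.strip() for line in f if line.strip()]

# with a real file: stopWords = loadWordList(open("stopwords.txt", encoding="utf-8"))
stopWords = loadWordList(io.StringIO("and\nbecause\nit\n"))
```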
Then you can implement the sentiment analysis like this:
1) Download unclassified tweets
2) Pre-process tweets (remove stop words, replace slang words, …)
3) Create bag of words set (from the negative, neutral and positive tweets)
3a) Optional: you can do a k-cross validation to find the best attributes for training the model
4) Train the model
5) Classify the downloaded tweets
5a) Optional: store the classified tweets
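To make steps 3–5 concrete, here is a minimal, self-contained sketch on toy data. It uses scikit-learn’s `CountVectorizer` for the bag-of-words step instead of the hand-rolled n-gram lists the project uses, so it is an illustration of the idea rather than the project’s actual code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# tiny training set standing in for the negative/neutral/positive tweet files
train_tweets = ["i love this", "great day", "i hate this",
                "awful day", "it is a day", "just a tweet"]
train_labels = ["positive", "positive", "negative",
                "negative", "neutral", "neutral"]

vec = CountVectorizer()
X = vec.fit_transform(train_tweets)                 # 3) bag-of-words features
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, train_labels)                          # 4) train the model
pred = model.predict(vec.transform(["i love this day"]))  # 5) classify
```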
```python
from sklearn.ensemble import RandomForestClassifier

def createForestModel(estimators=100):
    rf = RandomForestClassifier(n_estimators=estimators)
    return rf
```
```python
def trainModel(X, Y, estimators):
    # the former knel/c parameters were SVM leftovers and are unused here
    clf = createForestModel(estimators)
    clf.fit(X, Y)
    return clf
```
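For the optional step 3a, scikit-learn’s `GridSearchCV` runs k-fold cross-validation over candidate parameters; a sketch with random toy data in place of the real features (the data here is only to keep the example self-contained):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy stand-in for the real feature matrix / three-class labels
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
Y = rng.randint(0, 3, 60)

# 3-fold cross-validation over candidate forest sizes
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [50, 100, 200]}, cv=3)
search.fit(X, Y)
best_estimators = search.best_params_["n_estimators"]
```

The winning `n_estimators` value can then be passed to `trainModel`.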
```python
def predict(tweet, model, scaler, normalizer, positive, negative, neutral,
            stopWords, slangs, affinity, emoticons):
    # test a tweet against a built model
    z = mapTweet(tweet, positive, negative, neutral, stopWords, slangs,
                 affinity, emoticons)         # map the tweet to a feature vector
    z_scaled = scaler.transform([z])[0]       # scaler expects a 2-D array
    z = normalizer.transform([z_scaled])[0].tolist()
    return model.predict([z]).tolist()[0]     # transform numpy array to list
```
```python
def buildBagOfWords(positiveTweetsFile, neutralTweetsFile, negativeTweetsFile):
    positive = ngramGenerator.mostFreqList(positiveTweetsFile, 3000)
    neutral = ngramGenerator.mostFreqList(neutralTweetsFile, 3000)
    negative = ngramGenerator.mostFreqList(negativeTweetsFile, 3000)

    # keep only uni-grams that are unique to one class; the original
    # remove-while-iterating loops skipped elements after each removal
    pos_set, neg_set, neu_set = set(positive), set(negative), set(neutral)
    positive = [w for w in positive if w not in neg_set | neu_set]
    negative = [w for w in negative if w not in pos_set | neu_set]
    neutral = [w for w in neutral if w not in pos_set | neg_set]

    # equalize uni-gram list sizes
    m = min(len(positive), len(negative), len(neutral))
    positive = positive[:m]
    negative = negative[:m]
    neutral = neutral[:m]
    return positive, negative, neutral
```
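`predict` above calls a `mapTweet` function that is not shown here. As an assumption about what such a mapping could look like (this is my sketch, not the project’s actual `mapTweet`), the feature vector might simply count how many of the tweet’s tokens fall into each word list:

```python
def mapTweet(tweet, positive, negative, neutral, stopWords, slangs,
             affinity, emoticons):
    # hypothetical feature mapping: one count per resource list
    words = tweet.lower().split()
    return [
        sum(w in positive for w in words),   # positive uni-gram hits
        sum(w in negative for w in words),   # negative uni-gram hits
        sum(w in neutral for w in words),    # neutral uni-gram hits
        sum(w in affinity for w in words),   # affinity-word hits
        sum(w in emoticons for w in words),  # emoticon hits
    ]

features = mapTweet("good day :)", ["good"], ["bad"], ["day"], [], [], [], [":)"])
```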
```python
def writeUnclassifiedTweetsCsv(filename, model, scaler, normalizer, positive,
                               negative, neutral, stopWords, slangs, affinity,
                               emoticons):
    # load the test file in csv format (sentiment, tweet), classify every
    # tweet and write tweet + classification to a result file
    tweets = []
    labels = []
    fo = open(filename + ".svm_result", "w", encoding="utf-16")
    fo.write("TWEET, CLASSIFICATION\n")
    fo.write("<< SEE ADDITIONAL INFORMATION (SENTIMENT RATIO, ...) AT THE END OF THE FILE >>\n")
    fo.write("---------------------------------------------------------------------------------\n")
    with open(filename, newline='', encoding="utf-16") as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for line in reader:
            tweet = line[3]
            tweet = preprocessing.processTweet(tweet, stopWords, slangs)
            nl = predict(tweet, model, scaler, normalizer, positive, negative,
                         neutral, stopWords, slangs, affinity, emoticons)
            fo.write('"' + tweet + '","' + str(nl) + '"\n')
            tweets.append(tweet)
            labels.append(nl)
    # the with-block closes csvfile automatically
    pos, neu, neg = getTweetsRatio(labels)
    fo.write("---------------------------------------------------------------------------------\n")
    fo.write("------------------ ADDITIONAL INFORMATION ---------------------------------------\n")
    fo.write('%.2f%% POSITIVE, %.2f%% NEUTRAL, %.2f%% NEGATIVE\n'
             % (pos * 100.0, neu * 100.0, neg * 100.0))
    fo.write("MODEL TYPE: %s\n" % type(model))
    fo.write("DATE TIME: %s\n" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    fo.write("---------------------------------------------------------------------------------\n")
    fo.close()
    return tweets, labels
```
You can download my project files here: Twitter Sentiment
You will need Scikit-Learn and NLTK to run the code. I use Eclipse + PyDev + Anaconda3 as my development environment.
Next steps: parallelize download of tweets and processing / classifying them.
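A sketch of how that parallelization could look with Python’s `concurrent.futures`, classifying tweets in a thread pool while downloads continue (the `classifyOne` helper is a placeholder standing in for the real `predict` call):

```python
from concurrent.futures import ThreadPoolExecutor

def classifyOne(tweet):
    # placeholder for predict(tweet, model, scaler, normalizer, ...)
    return "neutral"

tweets = ["first tweet", "second tweet", "third tweet"]
with ThreadPoolExecutor(max_workers=4) as pool:
    labels = list(pool.map(classifyOne, tweets))   # classify in parallel
```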