Sentiment Analysis of Indian Print Media News
Abstract ¶
Newspapers, or the print media, remain one of the most significant news publishing mediums in several parts of the world. India has the second-largest newspaper market in the world, with daily newspapers reporting a combined circulation of over 240 million copies as of 2018. The overall sentimental pulse of a location (for example, a city) can be gauged from the news published there on any given day. It would be interesting to observe how this sentiment varies across different cities on a particular day or over a period of time.
Problem Statement ¶
The project aims to identify any variations in the sentiment of newspaper articles published across various geographical locations in India.
Objectives¶
- Identify overall sentiment of news published in a location.
- Identify variation in sentiments across multiple locations.
Constraints¶
- Different news agencies may report the same news differently (i.e., positively or negatively, depending on their perspectives). To mitigate this issue, news articles from different locations published by a single news agency are used for the analysis.
[^top]
Analytical Approach ¶
Workflow¶
- Data source: News clippings (paper cuttings) of articles from a newspaper company that has publication and distribution centers in more than one location are selected as the data source, to minimize the constraint mentioned above.
- Optical Character Recognition (OCR): The news clippings are passed through an OCR engine to convert them into text.
- Sentiment Analysis: The OCR output is processed with an NLTK pipeline (tokenization, stop-word removal, lemmatization) and scored against positive/negative word lexicons to identify the overall sentiment of the news published in the location.
- Sentiment Variation: Steps 1 through 3 are repeated for the other locations, and the variation in sentiment across locations is measured.
Packages and Statistical tools¶
- zipfile to uncompress and access images from a zip file.
- PIL (Pillow, the Python Imaging Library) to read images.
- pytesseract, a Python wrapper for Tesseract, to perform Optical Character Recognition and convert images to text.
- nltk, the Python Natural Language Toolkit, to perform sentiment analysis: punkt for tokenization, stopwords for stop-word removal, and wordnet for lemmatization.
- wordcloud to generate word clouds.
- pandas to tabulate positive/negative word counts and sentiment scores.
- matplotlib to plot word clouds and bar graphs of the variation in sentiment across cities.
- Bar plots are used to compare the sentiments of cities.
[^top]
Data Preparation and Cleansing ¶
Data Collection¶
Newspaper clippings are collected from the Deccan Chronicle newspaper (dated 2020-09-18) for the cities of Hyderabad and Chennai. The images are stored in two separate zip files (hyderabad.zip, chennai.zip), one per city.
Data source: http://epaper.deccanchronicle.com/epaper_main.aspx
Datasets: https://gitlab.com/datasets_a/sentiment-analysis-of-indian-print-media-news
[^top]
Image to text conversion using OCR ¶
# Importing libraries
import zipfile as z
from PIL import Image
import pytesseract
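# Note (assumption): pytesseract is only a wrapper; the Tesseract OCR engine
# must be installed separately. If the binary is not on the system PATH,
# point pytesseract to it explicitly, for example:
# pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"  # example path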
# txtDB is a list of list of texts from each location.
# For example, txtDB[0] is a list of texts from Hyderabad
# and txtDB[1] is a list of texts from Chennai
# When news from more locations are added, they will
# be appended to this list
txtDB=[]
locations=["Hyderabad","Chennai"]
zipfiles=[location.lower()+".zip" for location in locations] # List of file paths of zip files
for filepath in zipfiles:                        # for each location's zip file
    txtList = []                                 # initialize an empty list for the OCR output
    with z.ZipFile(filepath) as myZip:           # access the zip file
        for archive in myZip.infolist():         # access each archive member
            with myZip.open(archive) as imagefile:            # access each image file
                image = Image.open(imagefile)                 # read the file into an image
                ocrText = pytesseract.image_to_string(image)  # image to text
                txtList.append(ocrText)                       # append the text to the list
    txtDB.append(txtList)                        # append the location's text list to the main list
# We have news from two locations so txtDB should have two elements
print(len(txtDB))
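# Expected output: 2 (one entry per location)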
# Location 1 Hyderabad had 24 news clippings and
# Location 2 Chennai had 35 news clippings.
# Checking if all news clippings are converted to texts
for i in txtDB:
    print(len(i))
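# Expected output: 24 and 35, matching the clipping counts noted above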
# Checking a sample text output.
# Example: Hyderabad, first news clipping
print(txtDB[0][0])
# Example: Chennai, last news clipping
print(txtDB[1][-1])
[^top]
Sentiment Analysis using NLTK ¶
# Importing libraries
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from wordcloud import WordCloud
import re
import matplotlib.pyplot as plt
stop_words = set(stopwords.words('english'))
sentiment_scores=[]
for Location, LocationText in zip(locations, txtDB):
    print("------------------\nLocation: {}\n------------------".format(Location))
    txt = " ".join(LocationText).lower()
    print("Initial text length (characters):", len(txt))
    # Remove single quotes
    txt = txt.replace("'", "")
    # Tokenization
    tokens = nltk.word_tokenize(txt)
    # Strip punctuation and special characters from each token
    txt = [''.join(re.split("[ =—.,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]", word)) for word in nltk.Text(tokens)]
    # Filter out stop words and empty entries
    txt = [w for w in txt if w not in stop_words and len(w) != 0]
    # Lemmatization
    lemmatizer = nltk.WordNetLemmatizer()
    txt = [lemmatizer.lemmatize(w) for w in txt]
    print("Token count after removing stop words: {}\n".format(len(txt)))
    # Plots
    (w, h, f) = (1000, 800, (10, 10))  # word cloud width/height and figure size
    # Plot word frequency
    freq = nltk.FreqDist(txt)
    plt.figure(figsize=(10, 4))
    freq.plot(20, cumulative=False, title=Location+" Word Frequency")
    plt.show()
    # Plot word cloud
    print("\n------------------------------------------------------------------")
    plt.figure(figsize=f, edgecolor="black")
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txt))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Word Cloud", fontsize=16)
    plt.show()
    # Positive words (positive-words.txt: plain-text lexicon, one word per line)
    print("\n------------------------------------------------------------------")
    with open("positive-words.txt", "r") as pos:
        poswords = pos.read().split("\n")
    txtpos = [w for w in txt if w in poswords]
    plt.figure(figsize=f)
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txtpos))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Positive Word Cloud", fontsize=16)
    plt.show()
    # Negative words (negative-words.txt: plain-text lexicon, one word per line)
    print("\n------------------------------------------------------------------")
    with open("negative-words.txt", "r") as neg:
        negwords = neg.read().split("\n")
    txtneg = [w for w in txt if w in negwords]
    plt.figure(figsize=f)
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txtneg))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Negative Word Cloud", fontsize=16)
    plt.show()
    # Record [positive count, negative count] for this location
    sentiment_scores.append([len(txtpos), len(txtneg)])
[^top]
Variation in Sentiment Scores ¶
Overall sentiment is calculated as (number of positive words − number of negative words) / (number of positive words + number of negative words); the score ranges from −1 (all negative) to +1 (all positive).
The variation in sentiment across locations is visualized with a bar graph of the overall sentiment scores.
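As a quick sanity check of this formula, here is a minimal sketch (the word counts below are made-up illustrative numbers, not results from the data):
def overall_sentiment(pos_count, neg_count):
    # (positive - negative) / (positive + negative), ranging from -1 to +1
    return (pos_count - neg_count) / (pos_count + neg_count)
print(overall_sentiment(120, 180))  # hypothetical counts: (120-180)/300 = -0.2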
import pandas as pd
df=pd.DataFrame(sentiment_scores,locations,["Pos","Neg"])
df["Sentiment"]=(df.Pos-df.Neg)/(df.Pos+df.Neg)
print(df)
ax=df.plot.bar(y=["Pos","Neg"],color=["lightblue","red"],edgecolor="black")
plt.title("Variation in Pos/Neg news across different locations",fontsize=14)
plt.ylabel('Word count',fontsize=12)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x()+0.07, p.get_height()*0.9),
                fontsize=12)
plt.xticks(rotation=0)
plt.show()
ax=df.plot.bar(y="Sentiment",color="darkred",edgecolor="black")
plt.title("Variation in Overall Sentiment across different locations",fontsize=14)
plt.ylabel('Sentiment score',fontsize=12)
plt.xticks(rotation=0)
for p in ax.patches:
    ax.annotate(str(round(p.get_height(), 2)), (p.get_x()+0.15, p.get_height()*0.95),
                fontsize=12, color="white")
ax.get_legend().remove()
ax.xaxis.tick_top()
plt.show()
[^top]
Conclusions ¶
- The overall sentiment of the news articles published in Hyderabad and Chennai on the sampled date (2020-09-18) is negative.
- The level of negative sentiment differs between the two cities: on that date, the Chennai edition carried more negative words and a more negative overall score than the Hyderabad edition.
[^top]
Future Directions ¶
- A time series analysis could be done on historical data to understand how sentiment changes over time (see the sketch after this list).
- The study can be expanded to more locations to create a nationwide or global heat map of news sentiment.
- Multiple newspapers can be compared to classify how positive or negative each news agency is on the overall spectrum.
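As a rough illustration of the time-series direction, a minimal sketch, assuming daily overall-sentiment scores per city have already been computed (the dates, values, and variable names below are hypothetical):
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical daily overall-sentiment scores per city (illustrative values only)
daily = pd.DataFrame(
    {"Hyderabad": [-0.15, -0.10, -0.20], "Chennai": [-0.30, -0.25, -0.35]},
    index=pd.to_datetime(["2020-09-16", "2020-09-17", "2020-09-18"]),
)
daily.plot(marker="o")
plt.title("Daily news sentiment by city (illustrative)")
plt.ylabel("Sentiment score")
plt.show()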
[^top]
Notes:
I did this project on 2020-09-18, when I attended the Artificial Intelligence workshop conducted by SASTRA in September 2020. However, it is being published three months later, so the data for the sentiment analysis is from September, not December 2020.
Last updated 2020-12-11 22:04:46.917359 IST