Sentiment Analysis of Indian Print Media News
Abstract ¶
Newspapers, or the print media, remain one of the most significant news publishing mediums in several parts of the world. India has the second-largest newspaper market in the world, with daily newspapers reporting a combined circulation of over 240 million copies as of 2018. The overall sentimental pulse of a location (for example, a city) can be gauged from the news published there on any given day. It would be interesting to observe how this sentiment varies across different cities on a particular day or over a period of time.
Problem Statement ¶
The project aims to identify any variations in the sentiment of newspaper articles published across various geographical locations in India.
Objectives¶
- Identify overall sentiment of news published in a location.
- Identify variation in sentiments across multiple locations.
Constraints¶
- Different news agencies may report the same news differently (i.e., positively or negatively, depending on their perspectives). To mitigate this issue, news articles from different locations published by a single news agency are used for the analysis.
[^top]
Analytical Approach ¶
Workflow¶
- Data source: News clippings (paper cuttings) of articles from a newspaper company that has publication and distribution centers in more than one location are selected as the data source, to minimize the constraint mentioned above.
- Optical Character Recognition (OCR): The news clippings are passed through an OCR engine to convert them into text.
- Sentiment Analysis: The OCR output is processed with an NLTK pipeline (tokenization, stop-word removal, lemmatization) and scored against positive/negative word lexicons to identify the overall sentiment of the news published in the location.
- Sentiment Variation: Steps 1 through 3 are repeated for the other locations, and the variation in sentiment across locations is measured.
Packages and Statistical tools¶
- zipfile to uncompress and access images from a zip file.
- PIL (Pillow, the Python Imaging Library) to read images.
- pytesseract, a Python wrapper for Tesseract, to perform Optical Character Recognition and convert images to text.
- nltk, the Python Natural Language Toolkit, to perform sentiment analysis: punkt for tokenization, stopwords for stop-word removal, and wordnet for lemmatization.
- wordcloud to generate word clouds.
- pandas to tabulate positive/negative word counts and sentiment scores.
- matplotlib to plot word clouds and bar graphs of the variation in sentiment across cities.
- Bar plots are used to compare the sentiments of cities.
[^top]
Data Preparation and Cleansing ¶
Data Collection¶
Newspaper clippings are collected from the Deccan Chronicle newspaper (dated 2020-09-18) for the cities of Hyderabad and Chennai. The images are stored in two separate zip files (hyderabad.zip, chennai.zip), one per city.
Data source: http://epaper.deccanchronicle.com/epaper_main.aspx
Datasets: https://gitlab.com/datasets_a/sentiment-analysis-of-indian-print-media-news
[^top]
Image to text conversion using OCR ¶
# Importing libraries
import zipfile as z
from PIL import Image
import pytesseract
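# Note (assumption): pytesseract is only a wrapper; the Tesseract OCR engine
# must be installed separately. If the binary is not on the system PATH,
# point pytesseract to it explicitly, for example:
# pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"  # example path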
# txtDB is a list of list of texts from each location.
# For example, txtDB[0] is a list of texts from Hyderabad
# and txtDB[1] is a list of texts from Chennai
# When news from more locations are added, they will
# be appended to this list
txtDB=[]
locations=["Hyderabad","Chennai"]
zipfiles=[location.lower()+".zip" for location in locations] # List of file paths of zip files
for filepath in zipfiles:                        # for each location's zip file
    txtList = []                                 # initialize an empty list for the OCR output
    with z.ZipFile(filepath) as myZip:           # access the zip file
        for archive in myZip.infolist():         # access each archive member
            with myZip.open(archive) as imagefile:            # access each image file
                image = Image.open(imagefile)                 # read the file into an image
                ocrText = pytesseract.image_to_string(image)  # image to text
                txtList.append(ocrText)                       # append the text to the list
    txtDB.append(txtList)                        # append the location's text list to the main list
# We have news from two locations so txtDB should have two elements
print(len(txtDB))
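# Expected output: 2 (one entry per location)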
# Location 1 Hyderabad had 24 news clippings and
# Location 2 Chennai had 35 news clippings.
# Checking if all news clippings are converted to texts
for i in txtDB:
    print(len(i))
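# Expected output: 24 and 35, matching the clipping counts noted above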
# Checking a sample text output.
# Example: Hyderabad, first news clipping
print(txtDB[0][0])
# Example: Chennai, last news clipping
print(txtDB[1][-1])
[^top]
Sentiment Analysis using NLTK ¶
# Importing libraries
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from wordcloud import WordCloud
import re
import matplotlib.pyplot as plt
stop_words = set(stopwords.words('english'))
sentiment_scores=[]
for Location, LocationText in zip(locations, txtDB):
    print("------------------\nLocation: {}\n------------------".format(Location))
    txt = " ".join(LocationText).lower()
    print("Initial text length (characters):", len(txt))
    # Remove single quotes
    txt = txt.replace("'", "")
    # Tokenization
    tokens = nltk.word_tokenize(txt)
    # Strip punctuation and special characters from each token
    txt = [''.join(re.split("[ =—.,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\\-]", word)) for word in nltk.Text(tokens)]
    # Filter out stop words and empty entries
    txt = [w for w in txt if w not in stop_words and len(w) != 0]
    # Lemmatization
    lemmatizer = nltk.WordNetLemmatizer()
    txt = [lemmatizer.lemmatize(w) for w in txt]
    print("Token count after removing stop words: {}\n".format(len(txt)))
    # Plots
    (w, h, f) = (1000, 800, (10, 10))  # word cloud width/height and figure size
    # Plot word frequency
    freq = nltk.FreqDist(txt)
    plt.figure(figsize=(10, 4))
    freq.plot(20, cumulative=False, title=Location+" Word Frequency")
    plt.show()
    # Plot word cloud
    print("\n------------------------------------------------------------------")
    plt.figure(figsize=f, edgecolor="black")
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txt))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Word Cloud", fontsize=16)
    plt.show()
    # Positive words (positive-words.txt: plain-text lexicon, one word per line)
    print("\n------------------------------------------------------------------")
    with open("positive-words.txt", "r") as pos:
        poswords = pos.read().split("\n")
    txtpos = [w for w in txt if w in poswords]
    plt.figure(figsize=f)
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txtpos))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Positive Word Cloud", fontsize=16)
    plt.show()
    # Negative words (negative-words.txt: plain-text lexicon, one word per line)
    print("\n------------------------------------------------------------------")
    with open("negative-words.txt", "r") as neg:
        negwords = neg.read().split("\n")
    txtneg = [w for w in txt if w in negwords]
    plt.figure(figsize=f)
    wordcloud = WordCloud(background_color='White', colormap='seismic', width=w, height=h).generate(" ".join(txtneg))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(Location+" Negative Word Cloud", fontsize=16)
    plt.show()
    # Record [positive count, negative count] for this location
    sentiment_scores.append([len(txtpos), len(txtneg)])
[^top]
Variation in Sentiment Scores ¶
Overall sentiment is calculated as (number of positive words − number of negative words) / (number of positive words + number of negative words); the score ranges from −1 (all negative) to +1 (all positive).
The variation in sentiment across locations is visualized with a bar graph of the overall sentiment scores.
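As a quick sanity check of this formula, here is a minimal sketch (the word counts below are made-up illustrative numbers, not results from the data):
def overall_sentiment(pos_count, neg_count):
    # (positive - negative) / (positive + negative), ranging from -1 to +1
    return (pos_count - neg_count) / (pos_count + neg_count)
print(overall_sentiment(120, 180))  # hypothetical counts: (120-180)/300 = -0.2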
import pandas as pd
df=pd.DataFrame(sentiment_scores,locations,["Pos","Neg"])
df["Sentiment"]=(df.Pos-df.Neg)/(df.Pos+df.Neg)
print(df)
ax=df.plot.bar(y=["Pos","Neg"],color=["lightblue","red"],edgecolor="black")
plt.title("Variation in Pos/Neg news across different locations",fontsize=14)
plt.ylabel('Word count',fontsize=12)
for p in ax.patches:
    ax.annotate(str(p.get_height()), (p.get_x()+0.07, p.get_height()*0.9),
                fontsize=12)
plt.xticks(rotation=0)
plt.show()
ax=df.plot.bar(y="Sentiment",color="darkred",edgecolor="black")
plt.title("Variation in Overall Sentiment across different locations",fontsize=14)
plt.ylabel('Sentiment score',fontsize=12)
plt.xticks(rotation=0)
for p in ax.patches:
    ax.annotate(str(round(p.get_height(), 2)), (p.get_x()+0.15, p.get_height()*0.95),
                fontsize=12, color="white")
ax.get_legend().remove()
ax.xaxis.tick_top()
plt.show()
[^top]
Conclusions ¶
- The overall sentiment of the news articles published in Hyderabad and Chennai on the sampled date (2020-09-18) is negative.
- The level of negative sentiment differs between the two cities: on that date, the Chennai edition carried more negative words and a more negative overall score than the Hyderabad edition.
[^top]
Future Directions ¶
- A time series analysis could be done on historical data to understand how sentiment changes over time (see the sketch after this list).
- The study can be expanded to more locations to create a nationwide or global heat map of news sentiment.
- Multiple newspapers can be compared to classify how positive or negative each news agency is on the overall spectrum.
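As a rough illustration of the time-series direction, a minimal sketch, assuming daily overall-sentiment scores per city have already been computed (the dates, values, and variable names below are hypothetical):
import pandas as pd
import matplotlib.pyplot as plt
# Hypothetical daily overall-sentiment scores per city (illustrative values only)
daily = pd.DataFrame(
    {"Hyderabad": [-0.15, -0.10, -0.20], "Chennai": [-0.30, -0.25, -0.35]},
    index=pd.to_datetime(["2020-09-16", "2020-09-17", "2020-09-18"]),
)
daily.plot(marker="o")
plt.title("Daily news sentiment by city (illustrative)")
plt.ylabel("Sentiment score")
plt.show()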
[^top]
Notes:
I did this project on 2020-09-18, when I attended the Artificial Intelligence workshop conducted by SASTRA in September 2020. However, it is being published three months later, so the data for the sentiment analysis is from September, not December 2020.
Last updated 2020-12-11 22:04:46.917359 IST