Text and Image Recognition with pytesseract and OpenCV

The goal of the project is to search for a keyword in each newspaper image and, if the keyword appears anywhere on a page, print out the faces shown on that page. Text extraction with Tesseract and face detection with OpenCV are both resource-intensive, so to save processing time and memory the code is written in two parts.

Part 1: Creating a database of texts and face canvases of newspaper images

This part of the code needs to be run only once; there is no need to re-run it for every keyword search.
Description: Extracting text and faces from the images on every search would be time-consuming. Instead, all the images are scanned once, and the extracted text and face canvases (collections of faces in the required output format) are stored in a list database ("searchDB"). This database can then be searched for keywords repeatedly without touching the images again, saving time.
The database does not contain the original images, as they are large and not needed. To minimize the memory footprint, it stores only the OCR text, the face canvas, and a boolean flag indicating whether the keyword was found in that image.
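As a sketch of that record layout (the field names in the comments are just for illustration; the code below stores plain three-element lists, with a string standing in here for the PIL canvas):

```python
# Each searchDB entry is a three-element list:
#   [0] ocrText - the text pytesseract extracted from the page
#   [1] canvas  - a PIL Image holding the face thumbnails for that page
#   [2] matched - a boolean flag, reset and set during each search
record = ["OCR text of the page ...", "<PIL.Image canvas>", False]
searchDB = [record]

# During a search, only the flag is mutated; the text and canvas are reused.
for entry in searchDB:
    entry[2] = "page" in entry[0].lower()

print(searchDB[0][2])  # True, because "page" occurs in the stored text
```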

In [1]:
# Importing libraries
import zipfile as z
from PIL import Image, ImageOps, ImageDraw, ImageFont
import pytesseract
import cv2 as cv
import numpy as np

# loading the face detection classifier
face_cascade = cv.CascadeClassifier('http://link.datascience.eu.org/p002d1')

# A function to create a canvas of search results for each image.
def createCanvas(image,filename,facelist):
    thumbsize=100
    imgsPerRow=5
    padding=5
    fontsize=18
    textpadding=3
    canvasWidth=thumbsize*imgsPerRow
    font = ImageFont.truetype("http://link.datascience.eu.org/p002d2", fontsize)
    if len(facelist)>0:
        canvasHeight=(len(facelist)//imgsPerRow)*thumbsize
        if len(facelist)%imgsPerRow>0:
            canvasHeight+=thumbsize
        blackcanvas=Image.new("RGB",(canvasWidth,canvasHeight),color=(0,0,0))
        canvasHeight+=fontsize+textpadding
        canvas=Image.new("RGB",(canvasWidth,canvasHeight),color=(255,255,255))
        draw = ImageDraw.Draw(canvas)
        draw.text((0,0), "Results found in file {}".format(filename), font=font,fill=(0,0,0))
        row,column=0,0
        for x,y,w,h in facelist:
            faceimage=image.resize((thumbsize,thumbsize),resample=Image.LANCZOS,box=(x,y,x+w,y+h))
            blackcanvas.paste(faceimage,(thumbsize*column,thumbsize*row))
            column+=1
            if column==imgsPerRow:
                column=0
                row+=1
        canvas.paste(blackcanvas,(0,fontsize+textpadding))
    else:
        canvasHeight=(fontsize+textpadding)*2
        canvas=Image.new("RGB",(canvasWidth,canvasHeight),color=(255,255,255))
        draw = ImageDraw.Draw(canvas)
        draw.text((0,0), "Results found in file {}".format(filename), font=font,fill=(0,0,0))
        draw.text((0,fontsize+textpadding), "But there were no faces in that file!", font=font,fill=(0,0,0))
    canvas=ImageOps.expand(canvas,border=padding,fill=(255,255,255))
    return canvas

# Accessing images from the zip file and creating a database of OCR texts and detected faces.
searchDB=[]
filepath="http://link.datascience.eu.org/p002d3"
with z.ZipFile(filepath) as myZip:
    filelist=myZip.namelist()
    i=0
    for archive in myZip.infolist():
        with myZip.open(archive) as imagefile:
            image = Image.open(imagefile)
            ocrText=pytesseract.image_to_string(image)
            cv_img=cv.cvtColor(np.array(image), cv.COLOR_RGB2GRAY)
            faces = face_cascade.detectMultiScale(cv_img, scaleFactor=1.3, minNeighbors=5)
            imageCanvas=createCanvas(image,filelist[i],faces)
            searchDB.append([ocrText,imageCanvas,False])
            i+=1
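The grid arithmetic inside createCanvas (rows of five 100-pixel thumbnails, plus a caption strip of fontsize + textpadding pixels) can be checked in isolation. This is a pure-Python restatement of that height calculation, using the same default values as the function above:

```python
def canvas_height(n_faces, thumbsize=100, imgs_per_row=5,
                  fontsize=18, textpadding=3):
    """Height of the canvas createCanvas builds for n_faces thumbnails."""
    if n_faces == 0:
        # Two text lines: the filename line and the "no faces" line.
        return (fontsize + textpadding) * 2
    rows = n_faces // imgs_per_row
    if n_faces % imgs_per_row:
        rows += 1  # a partially filled final row still occupies a full row
    return rows * thumbsize + fontsize + textpadding

print(canvas_height(0))   # 42  -> caption line + "no faces" line
print(canvas_height(5))   # 121 -> one full row of thumbnails + caption
print(canvas_height(6))   # 221 -> the sixth face spills onto a second row
```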
Part 2: Searching the keyword and creating output

It is sufficient to re-run only this part of the code to search for keywords; the whole notebook does not need to be re-run.
Description: The user is prompted for search keywords. The database is searched for each keyword and the results are displayed.
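The match test used below is a plain case-insensitive substring check, so a keyword like "mark" would also match inside "market"; a minimal illustration of the predicate:

```python
def keyword_matches(keyword, ocr_text):
    # Same predicate as the search loop: trim whitespace, compare lowercase.
    return keyword.strip().lower() in ocr_text.lower()

print(keyword_matches(" Chris ", "Interview with CHRIS EVANS"))  # True
print(keyword_matches("mark", "Stock market report"))            # True (substring match)
print(keyword_matches("Anna", "No such name here"))              # False
```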

In [3]:
searchKeys=input("Enter keywords separated by comma: ")
searchKeys=searchKeys.split(",")
for searchKey in searchKeys:
    # Searching keyword and calculating dimensions of the output image.
    textFound=False
    finalCanvasHeight=0
    for fileDB in searchDB:
        fileDB[2]=False
        if searchKey.strip().lower() in fileDB[0].lower():
            textFound=True
            fileDB[2]=True
            finalCanvasWidth=fileDB[1].width
            finalCanvasHeight+=fileDB[1].height
    # Creating the output image
    if textFound:
        yPos=0
        finalCanvas=Image.new("RGB",(finalCanvasWidth,finalCanvasHeight))
        for fileDB in searchDB:
            if fileDB[2]:
                finalCanvas.paste(fileDB[1],(0,yPos))
                yPos+=fileDB[1].height
        finalCanvas=ImageOps.expand(finalCanvas,border=1,fill=(128,128,128))
        print("\nResults for keyword '{}'".format(searchKey.strip()))
        display(finalCanvas)
    else:
        print("Keyword '{}' not found in any newspaper.".format(searchKey.strip()))
Enter keywords separated by comma: Chris, Mark

Results for keyword 'Chris'
Results for keyword 'Mark'
Notes:
  1. While testing, I observed that in this particular project pytesseract recognized more text when the image was in RGB mode than in grayscale. For example, a-5.png contained the word "Mark" and a-9.png contained both "Chris" and "Mark"; pytesseract missed them when the images were converted to grayscale but recognized them in RGB mode. Therefore the images are not converted to grayscale before extracting text.
  2. For face detection, the values scaleFactor=1.3 and minNeighbors=5 were chosen after several trials with scaleFactor ranging from 1.05 to 1.4 and minNeighbors ranging from 2 to 6.
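A parameter sweep like the one described can be scripted. The grid below mirrors the ranges mentioned; count_faces is a hypothetical stand-in for the real call to face_cascade.detectMultiScale, which would be substituted in when running against the actual images:

```python
from itertools import product

scale_factors = [1.05, 1.1, 1.2, 1.3, 1.4]
min_neighbors_values = [2, 3, 4, 5, 6]

def count_faces(scale_factor, min_neighbors):
    # Hypothetical stand-in: in the real sweep this would be
    # len(face_cascade.detectMultiScale(img, scaleFactor=scale_factor,
    #                                   minNeighbors=min_neighbors))
    return 0

# Evaluate every (scaleFactor, minNeighbors) combination in the grid.
results = {(sf, mn): count_faces(sf, mn)
           for sf, mn in product(scale_factors, min_neighbors_values)}
print(len(results))  # 25 parameter combinations evaluated
```

Inspecting the face counts across this grid against a few hand-labelled pages is one way to settle on values such as scaleFactor=1.3, minNeighbors=5.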
