
webscraper's Introduction

A newer, cleaner version is being written here. WIP

A scraper for multiple covid state websites. Triedcatched's ghost!

Used by www.covid19india.org admin teams. Not for general consumption :P :D

Usage

Currently there are three types of bulletins:

  1. Images - AP, AR, BR, CT, JH, JK, HP, MH, MP, NL, PB, RJ, TG, TN, UK, UP
  2. PDFs - HR, KA, KL, PB, TN, WB
  3. Dashboards - CH, GJ, ML, OR, PY, TR, Vaccines

For all states where OCR is supported (optical character recognition using the Google Vision API), the command to run is the following (a full example follows the parameter list):
./ocr.sh "fully qualified path to image" "State Name" "starting text, ending text" "True/False" "ocr/table/automation"

Parameter description:

  1. "fully qualified path to image": Example "./home/covid/mh.jpg" The path cannot be relative path but it should have the fully qualified path.
  2. "State Name": This is the state for which the image is being passed. Example: "Andhra Pradesh".
  3. "starting text, ending text": This is the starting text of an image which considered to be the begining of a bulletin. In case you want auto detection to kick in, use "auto,auto". In some of the cases, if the bulletin has a text above the table with district names, consider cropping the image to have only the table with district data.
  4. "True/False": This parameter is used in case you want to translate the district name (True: yes, please translate. False: No, do not translate). As of now this is applicable only to UP and BR bulletins.
  5. "ocr/table/automation": This is an option provided where in case you want to skip one or more of the steps (ocr, table creation or automation.py run), you can provide those steps in comma separated manner. Example: "ocr,automation" will skip both ocr step and the automation step. "ocr,table" will skip image reading and table creation, but will run automation.py step to compute the delta.

How does ocr.sh work?

Detailed Flow of ocr.sh

For any bulletin to be parsed, we use the Google Vision API free tier. All the steps are called via ocr.sh.

  1. First, the Google Vision API is called on the image to generate a bounds.txt file. This is the direct output of the Google Vision API and needs to be parsed to figure out the tabular structure.
  2. Next, the "ocrconfig.meta.orig" file is parsed to generate the ocrconfig.meta file. This file is used to tweak the way a table is interpreted in a bulletin.
  3. ocr.sh internally invokes the googlevision.py file as well. This file is responsible for using the ocrconfig.meta file to read the output generated by the Google Vision API (bounds.txt). The output of this step is an output.txt file, which holds the textually converted data of the image (essentially a CSV file with a row for each district given in the image).
  4. In the last step, the output.txt file is copied into automation/.tmp/stateCode.txt and automation.py is invoked to generate the delta values for the state across all districts.
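The same flow, rendered as a rough Python sketch (googlevision.py and automation.py are the real scripts; the two stub functions and the exact command-line arguments are assumptions, not the actual interface):

import shutil
import subprocess

def call_vision_api(image_path):
  # Stub: ocr.sh calls the Google Vision API here and writes bounds.txt.
  raise NotImplementedError

def render_ocr_config():
  # Stub: ocr.sh parses ocrconfig.meta.orig and writes ocrconfig.meta.
  raise NotImplementedError

def run_ocr_pipeline(image_path, state_name, state_code):
  call_vision_api(image_path)        # step 1: produce bounds.txt
  render_ocr_config()                # step 2: produce ocrconfig.meta
  # Step 3: googlevision.py reads bounds.txt + ocrconfig.meta and writes output.txt.
  subprocess.run(["python3", "googlevision.py", state_name], check=True)
  # Step 4: hand the table to automation.py, which computes the per-district deltas.
  shutil.copy("output.txt", "automation/.tmp/%s.txt" % state_code)
  subprocess.run(["./automation.py", state_name, "full"], check=True)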

NOTES:

  • Since output.txt is an intermediate file, if there are issues with how the bulletin was converted to text, you can correct the values in output.txt and rerun with the OCR and table-generation steps skipped. Example: ./ocr.sh "/path/to/image" "State Name" "auto,auto" False "ocr,table".
  • OCR is heavily dependent on image quality. If the quality of the image is bad, the output of the Google Vision API might not be good enough to generate data.
  • Since the googlevision.py script tries to auto-identify the bulletin table, it always searches for district names and assumes the line with the first occurrence of a district name is the start of the table. Hence, if there are notes containing district names above the table, the image has to be cropped to remove the text above the table.

How does googlevision.py work?

  1. The Google Vision API returns each text it recognises along with the coordinates of a rectangle around that text. Example:
  9248|bounds|245,326|281,326|281,343|245,343

This shows that "9248" was found with a bottom-left corner at 245,326, a bottom-right corner at 281,326, a top-right corner at 281,343 and a top-left corner at 245,343.
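For concreteness, a minimal sketch of splitting such a line into its text and four corner points (the line is the example above):

line = "9248|bounds|245,326|281,326|281,343|245,343"
parts = line.split("|")
text = parts[0]                                               # "9248"
corners = [tuple(map(int, p.split(","))) for p in parts[2:]]
# corners == [(245, 326), (281, 326), (281, 343), (245, 343)]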

The idea is to use this information to figure out which texts in an image fall on the same lines and in the same columns.

  2. googlevision.py uses the bounds.txt file, which contains this information, to generate an internal class instance per text. This class has the following definition:

class cellItem:
  def __init__(self, value, x, y, lbx, lby, w, h, col, row, index):
    self.value = value
    self.x = x
    self.y = y
    self.col = col
    self.row = row
    self.index = index
    self.lbx = lbx
    self.lby = lby
    self.h = h
    self.w = w

Definitions:

x - mid point of the text in x direction
y - mid point of the text in y direction
col - a column number assigned to the text (all texts that fall with same x coordinate with a given tolerance will have the same col number)
row - a row number assigned to the text (all texts that fall with same y coordinate with a given tolerance will have the same row number)
index - a unique number identifying each text
lbx - the left bottom x coordinate of the text (used for drawing a rectangle around the text)
lby - the left bottom y coordinate of the text (used for drawing a rectangle around the text)
h - height of the text (calculated using  left top y - left bottom y)
w - width of the text (calculated using right bottom x - left bottom x)
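Putting the definitions together, a minimal sketch of deriving the cellItem fields from the four corners of the example above (col and row start unassigned):

lb, rb, rt, lt = (245, 326), (281, 326), (281, 343), (245, 343)
x = (lb[0] + rb[0]) / 2            # mid point in the x direction
y = (lb[1] + lt[1]) / 2            # mid point in the y direction
lbx, lby = lb                      # left-bottom corner, kept for drawing
w = rb[0] - lb[0]                  # right bottom x minus left bottom x
h = lt[1] - lb[1]                  # left top y minus left bottom y
cell = cellItem("9248", x, y, lbx, lby, w, h, col=None, row=None, index=0)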
  3. For each text found in the image, the mid points are calculated from its rectangular coordinates.
  4. The next step is figuring out the row and column numbers. The logic is simple (see the sketch after this list):
    - If the x coordinates are the same, the texts lie in the same column (in case the Hough transform is used, all texts within the bounds of two consecutive lines get the same column number).
    - If the y coordinates are the same, the texts lie in the same row.
    However, a tolerance is applied while arriving at the col and row numbers.
  5. To arrive at the rows (lines) that matter, the starting-text and ending-text parameters are used. The moment a line with the starting text is encountered, it is assumed to be the first line of the table. If the starting text is "auto", the code instead takes the first line containing a district name as the start of the table.
  6. The next step is to print all texts with the same row number on the same line, sorted by the x coordinate (column value). While printing, corner cases such as district names containing spaces need to be considered and handled.
  7. If the district names are in Hindi, the text is translated into English before printing, using a translation dictionary.
  8. The output is written to a file named output.txt. This file is a 1-1 conversion of the bulletin table that holds the district information.
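A minimal sketch of the row assignment and printing described above (column assignment is analogous on x; the tolerance here is a made-up placeholder, while the real code derives it from text sizes):

def assign_rows(cells, y_tolerance=10):
  # Sort by the y mid point; whenever y jumps by more than the tolerance,
  # start a new row, otherwise stay on the current one.
  cells.sort(key=lambda c: c.y)
  row, prev_y = 0, None
  for cell in cells:
    if prev_y is not None and cell.y - prev_y > y_tolerance:
      row += 1
    cell.row = row
    prev_y = cell.y

def print_table(cells):
  # Everything with the same row number goes on one line, left to right by x.
  for r in sorted({c.row for c in cells}):
    line = sorted((c for c in cells if c.row == r), key=lambda c: c.x)
    print(",".join(c.value for c in line))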

How does automation.py work?

  1. automation.py uses the API endpoint at covid19india.org to figure out the per-district difference between the bulletin and the API data.
  2. automation.py has different modes of operation - ocr, pdf, dashboard.
    - For ocr, the .tmp/stateCode.txt file (produced by the ocr.sh run) is used to compute the delta.
    - For pdfs, pdftotext and camelot are used to convert a PDF into a CSV file, which is then used for the delta calculation (see the sketch after the Tripura example below).
    - For dashboards, BeautifulSoup or sometimes plain JSON pulls are used to get the information needed to calculate the delta.
  3. In case of pdfs, there's an option to specify which page number to read and parse. The format is:
  ./automation.py "stateName" full pdf=<urlOfThepdf>=<pageNumber>
  ./automation.py "stateName" full pdf==<pageNumber> (use this if you manually place the PDF at .tmp/stateCode.pdf)
  4. For dashboards, a meta file, automation.meta, has the dashboard endpoint from which to read and parse the data.
  5. Every state has to have an entry in the automation.meta file (even if it's driven by ocr). The meta file has the stateCode to use when picking up the file from the .tmp folder; the state code also keeps naming consistent across states. Each state has a GetData() function which acts as the entry point for its calculations. Example:
# Excerpt from automation.py; requests and BeautifulSoup are imported at module
# level, and metaDictionary, option and deltaCalculator are module globals.
import requests
from bs4 import BeautifulSoup

def TRGetData():
  # Fetch the Tripura dashboard page listed in automation.meta
  response = requests.request("GET", metaDictionary['Tripura'].url)
  soup = BeautifulSoup(response.content, 'html.parser')
  table = soup.find("tbody").find_all("tr")

  districtArray = []
  for index, row in enumerate(table):
    dataPoints = row.find_all("td")

    # Pick the district name and cumulative counts out of the fixed columns
    districtDictionary = {}
    districtDictionary['districtName'] = dataPoints[1].get_text().strip()
    districtDictionary['confirmed'] = int(dataPoints[8].get_text().strip())
    districtDictionary['recovered'] = int(dataPoints[10].get_text().strip())
    districtDictionary['deceased'] = int(dataPoints[12].get_text().strip())
    districtArray.append(districtDictionary)

  # Hand off to the shared delta calculator for comparison with the API data
  deltaCalculator.getStateDataFromSite("Tripura", districtArray, option)
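For the PDF mode mentioned earlier, a minimal camelot sketch (the file name and page number are placeholders; the actual column-to-field mapping is state specific):

import camelot

# Read page 2 of a bulletin PDF placed at .tmp/KL.pdf (placeholder path).
tables = camelot.read_pdf(".tmp/KL.pdf", pages="2")
tables[0].to_csv(".tmp/KL.csv")    # first detected table, written out as CSV
print(tables[0].df.head())         # same table as a pandas DataFrame, for inspection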

How does this code sit in the grand scheme of automation at covid19india.org?

[Image: botto.png]

Essentially, the idea is that volunteers send a request over a Telegram bot, which is configured to trigger the script when a command is issued.


webscraper's Issues

Use new xInterval and yInterval when startingText is given

As of now, the xInterval and yInterval for the OCR code are derived purely from all the texts detected. Instead, use the width and height of only those texts that lie within the bounds established by the starting text. This will prevent unwanted column and row merges caused by extraneous texts.
