
assessor-scraper's Introduction

assessor-scraper

The goal of this project is to transform the data from the Orleans Parish Assessor's Office website into formats that are better suited for data analysis.

development environment setup

prerequisites

You must have Python 3 installed. You can download it from python.org.

First, set up a Python virtual environment:

python3 -m venv .venv
. .venv/bin/activate

Then install the dependencies with pip:

pip install -r requirements.txt

Getting started

Set up the database

By default, the scraper is set up to load data into a PostgreSQL database. Docs on setting up and making changes to the database are here. You can quickly get the database running locally using Docker:

docker-compose up -d db

If you want to explore how to extract data using Scrapy, use the scrapy shell to work with the response interactively.

For example,

scrapy shell "http://qpublic9.qpublic.net/la_orleans_display.php?KEY=1500-SUGARBOWLDR"
owner = response.xpath('//td[@class="owner_value"]/text()').get()
total_value = response.xpath('//td[@class="tax_value"]/text()')[3].get().strip()
next_page = response.xpath('//td[@class="header_link"]/a/@href').get()

Get all the parcel ids

Getting a list of parcel ids allows us to build a URL for every property so we can scrape the data for that parcel. These parcel ids are used in the URL like http://qpublic9.qpublic.net/la_orleans_display.php?KEY=701-POYDRASST, where 701-POYDRASST is the parcel id.

The parcel_id_extractor.py script uses the owner search to extract all available parcel ids, then saves them to a file named parcel_ids.txt.

The file is checked into the repo, but if you want to run it yourself to update it with the latest parcel ids, run:

python parcel_id_extractor.py
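
As a rough sketch of how those ids map to scrape targets (this is illustrative, not the actual spider code), you can build the URL for each parcel from parcel_ids.txt like this:

# sketch: build a scrape URL for each parcel id in parcel_ids.txt
# (assumes one parcel id per line, e.g. 701-POYDRASST)
BASE_URL = "http://qpublic9.qpublic.net/la_orleans_display.php?KEY="

with open("parcel_ids.txt") as f:
    parcel_ids = [line.strip() for line in f if line.strip()]

urls = [BASE_URL + parcel_id for parcel_id in parcel_ids]
print(len(urls), "urls to crawl, e.g.", urls[0])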

Running the spider

Running the spider from the command line will crawl the assessor's website and output the data to a destination of your choice.

By default, the spider will output data to a PostgreSQL database, which is configured in scraper/settings.py. You can use a hosted PostgreSQL instance or run one locally using Docker, as shown above.
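
As a rough sketch of what the relevant settings might look like (the pipeline class name and the default connection string below are assumptions; check scraper/settings.py for the real values):

# scraper/settings.py (sketch only -- names are illustrative)
import os

# route scraped items through a pipeline that writes to PostgreSQL
ITEM_PIPELINES = {
    'scraper.pipelines.PostgresPipeline': 300,  # hypothetical pipeline class
}

# connection string for the target database
DATABASE_URL = os.environ.get(
    'DATABASE_URL', 'postgres://user:pass@localhost:5432/assessordb')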

Important Note: Scraping should always be done responsibly, so check the robots.txt file to ensure the site doesn't explicitly instruct crawlers not to crawl. Also, when running the scraper, be careful not to cause unexpected load on the assessor's website; consider running during off-peak hours or monitoring response latency to make sure you aren't overwhelming the servers.
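
Scrapy has built-in settings for this kind of throttling; a minimal sketch of options you could add to scraper/settings.py (the values below are illustrative, not the project's actual configuration):

# politeness settings for scraper/settings.py (illustrative values)
ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 1.0                # wait roughly a second between requests
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # keep concurrency against the site low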

To run the spider,

scrapy runspider scraper/spiders/assessment_spider.py

Warning: this will take a long time to run. You can kill the process with Ctrl+C.

To run the spider and output to a CSV file,

scrapy runspider scraper/spiders/assessment_spider.py -o output.csv

Running on Heroku

Set required environment variables:

heroku config:set DATABASE_URL=postgres://user:pass@host:5432/assessordb

You can run the scraper on Heroku by scaling up the worker dyno:

heroku ps:scale worker=1
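
For the worker dyno to exist, the app needs a worker entry in its Procfile; a minimal sketch, assuming the worker simply runs the spider:

worker: scrapy runspider scraper/spiders/assessment_spider.py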

See the Heroku docs for more info on how to deploy Python code.

Running in AWS with Terraform

  1. Install Terraform
  2. cd terraform
  3. terraform init
  4. terraform plan
  5. terraform apply
  6. ssh ubuntu@{public_dns}

assessor-scraper's People

Contributors

dstuck, mrcnc, nonsenseless


assessor-scraper's Issues

find a better way to geocode addresses

Currently geocoding is done using the Mapzen API, which is rate-limited to 25,000 geocoding requests per month. Since we're dealing with ~160,000 properties, geocoding everything at that rate would take more than six months, so we'll need to find another solution to geocode all the addresses.

One option is to set up a Pelias server, which is the same geocoding engine Mapzen uses anyway.

make Mapzen token configurable

Currently we're using Mapzen's geocoding service as a fallback if the coordinates cannot be parsed from the parcel link, but the token is hardcoded. It should instead be configurable via an environment variable, in accordance with the best practices of twelve-factor apps.
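
A minimal sketch of the intended change (MAPZEN_API_KEY is an assumed variable name, not necessarily what the code will use):

# read the Mapzen token from the environment instead of hardcoding it
# (MAPZEN_API_KEY is an illustrative name)
import os

MAPZEN_API_KEY = os.environ.get('MAPZEN_API_KEY')
if not MAPZEN_API_KEY:
    raise RuntimeError('MAPZEN_API_KEY environment variable is not set')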

scraper takes too long to run

Currently, on my machine with a 2.2 GHz i7 processor, the spider crawls ~600 pages/minute. Since there are ~165,000 parcels, a full crawl takes ~4.5 hours to complete. This is too long, and it would be a lot better if we could speed up this process.
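
One obvious lever is Scrapy's concurrency settings; a sketch of values that could be experimented with in scraper/settings.py (illustrative only, and higher concurrency also means more load on the assessor's site, so keep the responsible-scraping note above in mind):

# possible speed-up knobs (illustrative values)
CONCURRENT_REQUESTS = 32             # total concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # concurrent requests against the assessor's site
DOWNLOAD_DELAY = 0.25                # smaller fixed delay between requests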

Update dependency versions to fix security alerts

The following security alerts exist at the time of creation of this issue:

  • requests (moderate severity) • opened 16 days ago by GitHub • requirements.txt
  • pyopenssl (high severity) • opened on Oct 10 by GitHub • requirements.txt
  • cryptography (high severity) • opened on Jul 31 by GitHub • requirements.txt

These dependencies should be updated soon.

values are put into the incorrect keys

The parcel_map column should be a link (or null), and the legal description should be the legal text that looks like this:

1. PLUM ORCHARD SUB DIV SQ 14
 2. LOTS 39 40 CAMELIA AND DREUX
 3. 45X110   1/ST SGLE V/SIDING

Currently the assessment areas are null and the values are in the legal_description.

(screenshot of the misaligned record omitted)

This needs to be fixed because it's happening for a few other columns as well.

better error handling for properties with no data

Some properties have no data, like this one.

The page just displays no data (screenshot omitted).

It can cause errors like this:

 2017-11-22 01:15:25 [scrapy.core.scraper] ERROR: Spider error processing <GET http://qpublic9.qpublic.net/la_orleans_display.php?KEY=1761-PACEBL> (referer: None)
 Traceback (most recent call last):
   File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
     yield next(it)
   File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
     for x in result:
   File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
     return (_set_referer(r) for r in result or ())
   File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
     return (r for r in result or () if _filter(r))
   File "/app/.heroku/python/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
     return (r for r in result or () if _filter(r))
   File "/app/scraper/spiders/assessment_spider.py", line 38, in parse
     property_info = self.parse_property_info(response)
   File "/app/scraper/spiders/assessment_spider.py", line 91, in parse_property_info
     logging.warning("No parcel map link for " + info['location_address'])
 KeyError: 'location_address'
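
One way to avoid the KeyError is to treat the address as optional when logging; a sketch of the guard (info is the dict built in parse_property_info, per the traceback above):

# sketch: use dict.get so a parcel page with no data doesn't raise KeyError
address = info.get('location_address')
if address is None:
    logging.warning("No property data found for %s", response.url)
else:
    logging.warning("No parcel map link for %s", address)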

if no parcel map link exists, fall back to another geocoding method

Currently, if there is no link to the parcel map, the spider just logs it. However, without the parcel map link, we cannot determine the coordinates for the property. For these parcels, we should fall back to a geocoding service and look up the coordinates from the property address.
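
A sketch of that fallback logic (geocode_address and parse_coordinates_from_link are hypothetical helpers, named here only for illustration):

# sketch: fall back to geocoding the street address when there is no parcel map link
if parcel_map_link:
    coordinates = parse_coordinates_from_link(parcel_map_link)    # hypothetical helper
else:
    address = info.get('location_address')
    coordinates = geocode_address(address) if address else None   # hypothetical helper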
