idi-datadive-2018's Introduction

idi-datadive-scoping

IDI scoping repo - see Project Brief for more detailed information on background, goals, etc.

Proposed Methodology

At the DataDive, data scientists will develop a simple scraper tool that takes “search terms” as input and returns links to the projects found when those terms are searched. The tool will consist of several subtools, each of which searches a specific DFI website. Scrapers will be built in the order of IDI’s priorities. They will be written in Python, likely using some combination of Selenium, Beautiful Soup, requests, etc.

General Requirements

  • Tool(s) take a csv/excel list of search terms as input
  • Tool(s) return a csv/excel file with a list of links that were found for a specific search term. Each row should include (where available):
    • Project Name
    • Search Term Used
    • DFI site the result was scraped from (which bank is being scraped)
  • Tool(s) are easy to execute and have straightforward instructions for use.
  • Tool(s) cover high priority DFI sites
  • All scrapers return data in an identical format, meaning the output of every scraper has the same columns with the same meanings. If some data is only available for a subset of DFIs, that column will be present but empty for the DFIs that do not have that information.
  • Tool(s) deduplicate projects within each DFI site; they should not deduplicate across DFI sites.
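
The output requirements above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `Project URL` column, the helper names, and the choice to treat a (DFI site, URL) pair as the identity of a project are all assumptions made for the example.

```python
import csv
import io

# Assumed shared column set. Only the first three names come from the
# requirements above; "Project URL" is a hypothetical name for the link column.
COLUMNS = ["Project Name", "Search Term Used", "DFI Site", "Project URL"]


def normalize(rows):
    """Give every row the same columns, leaving missing values empty."""
    return [{col: row.get(col, "") for col in COLUMNS} for row in rows]


def dedupe_within_site(rows):
    """Drop repeated projects within one DFI site, never across sites."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: a project is identified by its site and its URL.
        key = (row["DFI Site"], row["Project URL"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def to_csv(rows):
    """Serialize rows to CSV text with a fixed header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows: two hits for the same IFC project, and one row from
# another site that lacks a project name.
raw = [
    {"Project Name": "Dam A", "Search Term Used": "dam",
     "DFI Site": "IFC", "Project URL": "https://example.org/a"},
    {"Project Name": "Dam A", "Search Term Used": "hydro",
     "DFI Site": "IFC", "Project URL": "https://example.org/a"},
    {"Search Term Used": "dam",
     "DFI Site": "World Bank", "Project URL": "https://example.org/a"},
]
clean = dedupe_within_site(normalize(raw))
print(to_csv(clean))
```

Here the second IFC row is dropped as a duplicate, while the World Bank row is kept even though it shares a URL, because deduplication is scoped to a single site. If the same project should be reported once per search term, the search term would need to be part of the key.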

Getting Started

Prerequisites

To start contributing to one of the scrapers

Most people will follow this route.

  1. Clone the repository (Click "Clone or download"; click the copy button; git clone {copied text})
  2. Set up a virtual environment (optional, but recommended): mkvirtualenv idi-datadive-2018
  3. Navigate to ./scrapers
  4. pip install -r requirements.txt to install the requirements
  5. If you decide you need Selenium, you may need to install ChromeDriver. Call out in Slack if you need help.
  6. Sign up for a scraper to work on, then use the following files in scrapers to get started on your own:
    • miga_scraper.py - blank template
    • worldbank_scraper.py - example of using requests and a data API
    • ifc_scraper.py - example of using Selenium and Beautiful Soup
  7. Once you're ready to test your work, update test_scraper.py to point to your new work. Then run python test_scraper.py
  8. Examine the outputs and the file, test.csv
  9. If everything looks good, submit it for checking and merging. Congrats! Share any tips or tricks you used with the others on Slack.
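
As a rough idea of the shape a contributed scraper might take, here is a small, dependency-free sketch. It does not reproduce miga_scraper.py or any other file in the repository: the `scrape` signature, the injected `fetch` function, and the `Project URL` column are hypothetical, and the standard library's `html.parser` stands in for Beautiful Soup so the example runs without extra installs.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape(search_terms, fetch, site="Example DFI"):
    """Return one row per link found for each search term.

    `fetch` is any callable that maps a search term to result-page HTML,
    so a real scraper could pass a function built on requests or Selenium.
    """
    rows = []
    for term in search_terms:
        parser = LinkCollector()
        parser.feed(fetch(term))
        for href, text in parser.links:
            rows.append({
                "Project Name": text,
                "Search Term Used": term,
                "DFI Site": site,
                "Project URL": href,
            })
    return rows


def fake_fetch(term):
    """Stand-in for a network call, returning canned HTML."""
    return '<ul><li><a href="/projects/1">Solar Park</a></li></ul>'


rows = scrape(["solar"], fake_fetch)
print(rows)
```

Passing the fetch step in as a function keeps the parsing logic testable offline, which can help when a DFI site is slow or rate-limited.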

To test the full app

This route is for anyone who wants to contribute directly to the Flask app. The demo Flask app and scrapers are found in /flask_scraper_demo.

  1. Clone the repository (Click "Clone or download"; click the copy button; git clone {copied text})
  2. Set up a virtual environment (optional, but recommended): mkvirtualenv idi-datadive-2018
  3. Navigate to ./flask_scraper_demo
  4. make setup to install the requirements
  5. make run to run the app
  6. You may need to install ChromeDriver. Call out in Slack if you need help.
  7. Navigate to http://localhost:5000/ to make sure the app works
  8. In the app, choose the Search_Terms.txt file and click "Submit"
  9. It should show 2 search terms: click "Run Scraper"
  10. Make sure the status of the scrapers is printed to your terminal
  11. Make sure the table displays on the web page at the end of the run
  12. The main files of interest are:
    • app/routes.py - This file defines the routes and does some of the data munging.
    • app/helpers.py - This file has a TableBuilder class to help with displaying and exporting results.
    • app/scrapers/execute_search.py - This file is responsible for calling all of the contributed scrapers and gathering results.

This proof-of-concept app takes a csv file of search terms as input and searches the IFC site; it currently works only for this site. To run it, use the demo search terms file located at flask_scraper_demo/Search_Terms.txt.
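
The role described for app/scrapers/execute_search.py, calling each contributed scraper and gathering the results, could look roughly like the sketch below. It is an assumption-laden illustration rather than the file's real contents: `run_all`, the scraper registry, and the two stand-in scrapers are hypothetical names.

```python
def run_all(scrapers, search_terms):
    """Call every scraper, print its status, and gather all rows."""
    results = []
    for name, scraper in scrapers.items():
        print(f"Running {name} scraper...")
        try:
            rows = scraper(search_terms)
        except Exception as error:
            # One failing site should not stop the remaining scrapers.
            print(f"{name} scraper failed: {error}")
            continue
        print(f"{name} scraper returned {len(rows)} rows")
        results.extend(rows)
    return results


def working_scraper(search_terms):
    """Stand-in scraper that returns one made-up row per search term."""
    return [{"Search Term Used": term, "DFI Site": "IFC"} for term in search_terms]


def broken_scraper(search_terms):
    """Stand-in scraper that simulates a site being unreachable."""
    raise RuntimeError("site unavailable")


all_rows = run_all(
    {"IFC": working_scraper, "Broken": broken_scraper},
    ["dam", "mine"],
)
print(len(all_rows))
```

Printing a line per scraper matches the step above that checks for scraper status in the terminal, and catching errors per scraper means a single unavailable site still leaves a usable results table.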

[Screenshot: Landing Page]

[Screenshot: Load Search Terms]

[Screenshot: Results Page]

Contributors

justalfred, mdgis, jimjshields, joepope44
