Giter Club home page Giter Club logo

google_patent_scraper's Introduction

Patent Scraper

A python package to scrape patents from 'https://patents.google.com/'. The package is made up of a single python class, scraper_class. This scraper can be used both to retreive parsed html of a single patents page or a list of patents.

The following elements are always returned by the scraper class:

title                     (str)   : title
application_number        (str)   : application number
inventor_name             (json)  : inventors of patent 
assignee_name_orig        (json)  : original assignees to patent
assignee_name_current     (json)  : current assignees to patent
pub_date                  (str)   : publication date
filing_date               (str)   : filing date
priority_date             (str)   : priority date
grant_date                (str)   : grant date
forward_cites_no_family   (json)  : forward citations that are not family-to-family cites
forward_cites_yes_family  (json)  : forward citations that are family-to-family cites
backward_cites_no_family  (json)  : backward citations that are not family-to-family cites
backward_cites_yes_family (json)  : backward citations that are family-to-family cites

There are optional elements you can have returned, see examples below:

abstract_text             (str)   : text of abstract (if exists)       

Package Installation

The package is available on PyPi, and can be installed using pip:

pip install google_patent_scraper

Main Use Cases

There are two primary ways to use this package:

  1. Scrape a single patent
# ~ Import packages ~ #
from google_patent_scraper import scraper_class

# ~ Initialize scraper class ~ #
scraper=scraper_class() 

# ~~ Scrape patents individually ~~ #
patent_1 = 'US2668287A'
patent_2 = 'US266827A'
err_1, soup_1, url_1 = scraper.request_single_patent(patent_1)
err_2, soup_2, url_2 = scraper.request_single_patent(patent_2)

# ~ Parse results of scrape ~ #
patent_1_parsed = scraper.get_scraped_data(soup_1,patent_1,url_1)
patent_2_parsed = scraper.get_scraped_data(soup_2,patent_2,url_2)
  1. Scrape a list of patents
# ~ Import packages ~ #
from google_patent_scraper import scraper_class
import json

# ~ Initialize scraper class ~ #
scraper=scraper_class()

# ~ Add patents to list ~ #
scraper.add_patents('US2668287A')
scraper.add_patents('US266827A')

# ~ Scrape all patents ~ #
scraper.scrape_all_patents()

# ~ Get results of scrape ~ #
patent_1_parsed = scraper.parsed_patents['US2668287A']
patent_2_parsed = scraper.parsed_patents['US266827A']

# ~ Print inventors of patent US2668287A ~ #
for inventor in json.loads(patent_1_parsed['inventor_name']):
  print('Patent inventor : {0}'.format(inventor['inventor_name']))

Additional optional information

Return Abstract Information

Scrape a patent and retrieve abstract information

# ~ Import packages ~ #
from google_patent_scraper import scraper_class

# ~ Initialize scraper class ~ #
scraper=scraper_class(return_abstract=True)  #<- TURN ON ABSTRACT TEXT  

# ~~ Scrape patents individually ~~ #
patent_1 = 'US7742806'
err_1, soup_1, url_1 = scraper.request_single_patent(patent_1)

# ~ Parse results of scrape ~ #
patent_1_parsed = scraper.get_scraped_data(soup_1,patent_1,url_1)

# ~ Print abstract text ~ #
print(patent_1_parsed['abstract_text'])

Example Files

I have provided two seperate example scripts for usage of this package in the /example/ folder:

  1. Examples from this readme: readme_example
  2. Scrape multiple patents using Python's multiprocessing package: multiprocess_example

google_patent_scraper's People

Contributors

ryanlstevens avatar crabdancing avatar riesi avatar a-kosykh avatar phamphituan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.