irahorecka / pycraigslist Goto Github PK

View Code? Open in Web Editor NEW

34.0 6.0 12.0 203 KB

Craigslist API wrapper

License: MIT License

Makefile 0.69% Python 99.31%

craigslist webscraping api python

pycraigslist's Introduction

pycraigslist

Craigslist API wrapper

⚠ January 2023: This library does not currently work as intended because of Craigslist's anti-scraping efforts. Craigslist is likely using the JavaScript experimental technology navigator.webdriver in the navigator interface, which can detect automation tools such as Selenium.

⚠ September 2021: Craigslist added a rate-limiter, and it's advised to throttle requests to prevent a 403 HTTP status code. View the Exceptions section below to handle this exception.

Disclaimer

I do not work or have an affiliation with Craigslist.
This library is intended for educational purposes.

Installation

pip install pycraigslist

Quick start

Find cars & trucks for sale with keyword "Mazda Miata" in the East Bay Area, California:

import pycraigslist

miatas = pycraigslist.forsale.cta(site="sfbay", area="eby", query="Mazda Miata")
for miata in miatas.search():
    print(miata)

>>> {'country': 'US',
    'region': 'CA',
    'site': 'sfbay',
    'area': 'eby',
    'category': 'cto',
    'id': '7291715564',
    'repost_of': '',
    'last_updated': '2021-03-15 09:06',
    'title': '1990 Mazda Miata',
    'neighborhood': 'oakland lake merritt / grand',
    'price': '$5,000',
    'url': 'https://sfbay.craigslist.org/eby/cto/d/oakland-1990-mazda-miata/7291715564.html'}
    # ...

Background

Search for anything on Craigslist with Python!

pycraigslist classes

pycraigslist.community (craigslist.org > community)
pycraigslist.events (craigslist.org > event calendar)
pycraigslist.forsale (craigslist.org > for sale)
pycraigslist.gigs (craigslist.org > gigs)
pycraigslist.housing (craigslist.org > housing)
pycraigslist.jobs (craigslist.org > jobs)
pycraigslist.resumes (craigslist.org > resumes)
pycraigslist.services (craigslist.org > services)

Search for posts in parent classes for a broader query. For example, finding paid gigs in Portland, Oregon:

import pycraigslist

paid_gigs = pycraigslist.gigs(site="portland", is_paid=True)
for gig in paid_gigs.search():
    print(gig)

>>> {'country': 'US',
    'region': 'OR',
    'site': 'portland',
    'area': 'mlt',
    'category': 'lbg',
    'id': '7295392821',
    'repost_of': '7292985211',
    'last_updated': '2021-03-22 13:00',
    'title': 'Packing and moving',
    'neighborhood': 'SE Portland',
    'price': '',
    'url': 'https://portland.craigslist.org/mlt/lbg/d/portland-packing-and-moving/7295392821.html'}
    # ...

pycraigslist subclasses

Most pycraigslist classes have subclasses to allow for a targeted query. For example:

pycraigslist.forsale.bia (craigslist.org > for sale > bikes)
pycraigslist.forsale.cta (craigslist.org > for sale > cars & trucks)
pycraigslist.housing.apa (craigslist.org > housing > apartments / housing for rent)
pycraigslist.housing.roo (craigslist.org > housing > apartments / rooms & shares)

Use class method .get_categories() to search for subclasses. The resulting keys are the subclass names.

import pycraigslist

print(pycraigslist.housing.get_categories())

>>> {'apa': 'apartments / housing for rent',
    'swp': 'housing swap',
    'off': 'office & commercial',
    'prk': 'parking & storage',
    'rea': 'real estate',
    'reb': 'real estate - by dealer',
    'reo': 'real estate - by owner',
    'roo': 'rooms & shares',
    'sub': 'sublets & temporary',
    'vac': 'vacation rentals',
    'hou': 'wanted: apts',
    'rew': 'wanted: real estate',
    'sha': 'wanted: room/share',
    'sbw': 'wanted: sublet/temp'}

E.g., use pycraigslist.housing.vac to search for vacation rentals.

Finding and using filters

Apply search filters to narrow your query. Use .get_filters() to find valid filters for a class or subclass instance.

Search filters are sensitive to the language of the region. E.g., get filters for cars & trucks for sale in Tokyo, Japan:

import pycraigslist

tokyo_autos = pycraigslist.forsale.cta(site="tokyo")
print(tokyo_autos.get_filters())

>>> {'query': '...', 'search_titles': 'True/False', 'has_image': 'True/False',
    'posted_today': 'True/False', 'bundle_duplicates': 'True/False',
    'search_distance': '...', 'zip_code': '...', 'min_price': '...', 'max_price': '...',
    'make_model': '...', 'min_year': '...', 'max_year': '...', 'min_miles': '...',
    'max_miles': '...', 'min_engine_displacement': '...', 'max_engine_displacement': '...',
    'condition': ['新品', 'ほぼ新品', '美品', '良品', '使用に問題なし', 'サルベージ'],
    'auto_cylinders': ['3気筒', '4気筒', '5気筒', '6気筒', '8気筒', '10気筒', '12気筒', 'その他'],
    'auto_drivetrain': ['前輪', '後輪', '4WD'],
    'auto_fuel_type': ['ガソリン', 'ディーゼル', 'ハイブリッド', '電気', 'その他'],
    'auto_paint': ['ブラック', 'ブルー', 'ブラウン', 'グリーン', 'グレー', 'オレンジ', 'パープル',
                   'レッド', 'シルバー', 'ホワイト', 'イエロー', 'カスタム'],
    'auto_size': ['コンパクト', 'フルサイズ', '中型', 'サブコンパクト'],
    'auto_title_status': ['クリーン', 'サルベージ', '再生', '部品のみ', '先取特権', '不明'],
    'auto_transmission': ['MT', 'AT', 'その他'],
    'auto_bodytype': ['バス', 'コンバーチブル', 'クーペ', 'ハッチバック', 'ミニバン', 'オフロード',
                      'ピックアップ', 'セダン', 'トラック', 'SUV', 'ワゴン', 'バン', 'その他'],
    'language': ['afrikaans', 'català', 'dansk', 'deutsch', 'english', 'español', 'suomi',
                 'français', 'italiano', 'nederlands', 'norsk', 'português', 'svenska',
                 'filipino', 'türkçe', '中文', 'العربية', '日本語', '한국말', 'русский',
                 'tiếng việt']}

E.g., use the filter parameter "クリーン" to find cars & trucks with clean titles:

import pycraigslist

tokyo_autos = pycraigslist.forsale.cta(site="tokyo", auto_title_status="クリーン")
for auto in tokyo_autos.search():
    print(auto)

>>> {'country': 'JP',
    'region': '',
    'site': 'tokyo',
    'area': '',
    'category': 'cto',
    'id': '7301105503',
    'repost_of': '',
    'last_updated': '2021-04-03 14:04',
    'title': 'Suzuki Jimny 660 XG 4WD Keyless Entry Aluminum Wheel Non-Smoking Car',
    'neighborhood': 'Chiba Ken, Noda shi, Funakata 1630-1',
    'price': '¥650,000',
    'url': 'https://tokyo.craigslist.org/cto/d/suzuki-jimny-660-xg-4wd-keyless-entry/7301105503.html'}
    # ...

When applying many filters, pass a dictionary of filters into the filters keyword parameter. Note: keyword argument filters will override filters if there are conflicting keys. For example:

import pycraigslist

bike_filters = {
    "bicycle_frame_material": "steel",
    # Array of filter values are accepted
    "bicycle_wheel_size": ["650C", "700C"],
    "bicycle_type": "road",
}
# This would search for titanium road bikes with size 650C or 700C wheels
titanium_bikes = pycraigslist.forsale.bia(
    site="sfbay", area="sfc", bicycle_frame_material="titanium", filters=bike_filters
)

Searching for posts

General search

To search for Craigslist posts, use .search(). .search() will return a dictionary of post attributes (type str) and will search for every post by default. Use the limit keyword parameter to add a stop limit to a query. For example, use limit=50 to get 50 posts. There is a maximum of 3000 posts per query.

E.g., find the first 20 posts for farming and gardening services in Denver, Colorado:

import pycraigslist

gardening_services = pycraigslist.services.fgs(site="denver")
for service in gardening_services.search(limit=20):
    print(service)

>>> {'country': 'US',
    'region': 'CO',
    'site': 'denver',
    'area': '',
    'category': 'fgs',
    'id': '7301324564',
    'repost_of': '6974119634',
    'last_updated': '2021-04-03 11:47',
    'title': '🌲 Tree Removal/Trimming, Stump Grind: LICENSED/INSURED! 720-605-1584',
    'neighborhood': 'All Areas',
    'price': '',
    'url': 'https://denver.craigslist.org/fgs/d/littleton-tree-removal-trimming-stump/7301324564.html'}
    # ...

Detailed search

Use .search_detail() to get detailed Craigslist posts. The limit keyword parameter in .search also applies to .search_detail. Set include_body=True to include the post's body in the output. By default, include_body=False. Disclaimer: .search_detail is more time consuming than .search.

E.g., get detailed posts with the post body for all cars & trucks for sale in Abilene, Texas:

import pycraigslist

all_autos = pycraigslist.forsale.cta(site="abilene")
for auto in all_autos.search_detail(include_body=True):
    print(auto)

>>> {'country': 'US',
    'region': 'TX',
    'site': 'abilene',
    'area': '',
    'category': 'cto',
    'id': '7309894792',
    'repost_of': '',
    'last_updated': '2021-04-20 12:17',
    'title': '2009 Mercedes GL-320',
    'neighborhood': 'Brownwood',
    'price': '$12,000',
    'url': 'https://abilene.craigslist.org/cto/d/brownwood-2009-mercedes-gl-320/7309894792.html',
    'lat': '31.729000',
    'lon': '-99.019000',
    'address': '',
    'misc': ['2009 mercedes-benz gl-class'],
    'condition': 'excellent',
    'drive': 'fwd',
    'fuel': 'diesel',
    'odometer': '100700',
    'paint_color': 'black',
    'title_status': 'clean',
    'transmission': 'automatic',
    'body': 'BEAUTIFUL car inside and out!! Diesel with only 100k, mechanic says its in great condition.'}
    # ...

Additional attributes

__doc__: Gets category name.
url: Gets full URL.
count: Gets number of posts.

import pycraigslist

east_bay_apa = pycraigslist.housing.apa(site="sfbay", area="eby", max_price=800)

# 1
print(east_bay_apa.__doc__)
>>> 'apartments / housing for rent'

# 2
print(east_bay_apa.url)
>>> 'https://sfbay.craigslist.org/search/eby/apa?searchNearby=1&s=0&max_price=800'

# 3
print(east_bay_apa.count)
>>> 56

Exceptions

pycraigslist has the following exceptions:

ConnectionError : exceeded maximum retries for a query
HTTPError : encountered a client or server error
InvalidFilterValue : filter is not recognized or has an invalid value

To use pycraigslist exceptions, import / import from pycraigslist.exceptions. For example:

import pycraigslist
from pycraigslist.exceptions import ConnectionError, HTTPError, InvalidFilterValue

try:
    sf_bikes = pycraigslist.forsale.bia(site="sfbay", area="sfc", min_price=50)
    for bike in sf_bikes.search():
        print(bike)
except ConnectionError:
    print("Yikes! Something's up with the network.")
except HTTPError as e:
    print(f"Bad HTTP response encountered: {e.status_code} {e.detail}")
except InvalidFilterValue as e:
    print(f"Craigslist filter validation failed. Filter: '{e.name}', Value: '{e.value}'")

Contribute

Support

If you are having issues or would like to propose a new feature, please use the issues tracker.

License

This project is licensed under the MIT license.

pycraigslist's People

Contributors

Stargazers

Watchers

Forkers

jgrugru luenix christopherjhart nick-forks swipswaps zad0xlik lukeandmiller zach-hill agwebberley bpnguyen johndennisjackson almastreet

pycraigslist's Issues

Filter error?

Using the test code below, it seems that the filter options isn't working properly. If i run the code, all type of bikes show in the results.

import pycraigslist

bike_filters = {
    'posted_today': True,
    'bundle_duplicates': True,
    'make': 'Harley',
    'model': 'Softail'
}

bikes = pycraigslist.forsale.mca(site='miami', area='brw', filters=bike_filters)

for bike in bikes.search(limit=5):
    print(bike)

Missing categories (subcategories?)

Hi,

Your API does not allow me me to use some categories (subcategories?) . For example I specifically want to use the "mcy" category which a subset of the "mca" category. The API return an error if I use this:

bikes = pycraigslist.forsale.mcy(site='miami', area='brw', filters=bike_filters)

This is the error return.

Traceback (most recent call last):
File "C:/Users/mgpd/PycharmProjects/molivo/py_clist.py", line 11, in <module>
bikes = pycraigslist.forsale.mcy(site='miami', area='brw', filters=bike_filters)
AttributeError: type object 'forsale' has no attribute 'mcy'

The "mcy" category is valid. This is the motorcycles for sale by owners.

This category is returned in the data set. Here the full test script:

import pycraigslist

bike_filters = {
    'posted_today': True,
    'bundle_duplicates': True,
    'make': 'Harley',
    'model': 'Softail'
}

bikes = pycraigslist.forsale.mca(site='miami', area='brw', filters=bike_filters)

for bike in bikes.search(limit=5):
    print(bike)

No results from query

The following Python code was returning accurate results for several months. In tests with results showing up properly on craigslist, the library is returning 0 results using the same query.

city = "newyork"
result = pycraigslist.gigs.cpg(site=city, posted_today=True, search_distance=100)

It appears Craigslist has added a redirect to the search resulting in adding something like this to the URL:

#search=1~list~0~0

Might this be affecting the library? I also noticed that when modifying searchNearby to 0 in the URL, the page returns results, but this is not able to be set using this library.

Error with get_filters()

From the beginning, I was using the following to get the filters.

import pycraigslist

print(pycraigslist.forsale.mca.get_filters())

I needed to verify my filters again and when I issued the command, it replied:

Traceback (most recent call last):
  File "C:\Users\mgpd\PycharmProjects\MOlivo\py_clist.py", line 3, in <module>
    print(pycraigslist.forsale.mca.get_filters())
TypeError: get_filters() missing 1 required positional argument: 'self'

pycraigslist.query.filters.parse_filters() should throw custom exception

Right now, the pycraigslist.query.filters.parse_filters() function throws ValueError with a string indicating a filter has an incorrect value.

It would be nice if we can create a custom exception inherited from ValueError that takes in the filter name and incorrect value as parameters. This promotes more graceful error-handling in applications using pycraigslist.

I'm thinking something like this in pycraigslist.exceptions:

class InvalidFilterValue(ValueError):
    def __init__(self, name, value):
        self.name = name
        self.value = value

Unable to fetch filters

query/filters.py:get_addl_filters is unable to crawl the page (search_html = next(sessions.yield_html(url)))

window.cl.specialCurtainMessages = { unsupportedBrowser: [ "We've detected you are using a browser that is missing critical features.", "Please visit craigslist from a modern browser." ], unrecoverableError: [ "There was an error loading the page." ] };

I guess Craigslist put some new anti-crawling features in place.

Limit function has problems.

Using the script below yields different results based on the environment. When used with Google Collab (python 3.8.8) the limit function works fine. When used in my PyCharm (also python 3.8.8) it returns exactly 3 times the limit requested. In this case 15 results are returned in PyCharm.

import pycraigslist

bike_filters = {
    'posted_today': True,
    'bundle_duplicates': True,
    'make': 'Harley',
    'model': 'Softail'
}

bikes = pycraigslist.forsale.mca(site='miami', area='brw', filters=bike_filters)

for bike in bikes.search(limit=5):
    print(bike)

Let it be known, that my PyCharm is on Windows 10.

Results count

Ira,

Is there a way we could get a result count before running the loop? Does not need an exact number. It just could be a False/True situation. I don't know where you would put this, but if the has_results = True you run the loop.

Not a deal-breaker, but it would be nice.

Thanks

Validate sites on object instantiation

Is there an easy way to verify if a site (e.g. sfbay, raleigh, etc.) is valid prior to searching via pycraigslist?

Consider the example below:

import pycraigslist

def search_miatas(site: str) -> bool:
    miatas = pycraigslist.forsale.cta(site=site, query="mx5|miata")
    for miata in miatas.search(limit=1):
        print(miata)

If I call this function using a known good site, then pycraigslist works as expected.

search_miatas("raleigh")

>>> {'country': 'US', 'region': 'NC', 'site': 'raleigh', 'area': '', 'category': 'ctd', 'id': '7358600557', 'repost_of': '7342453821', 'last_updated': '2021-07-30 15:51', 'title': '2016 Mazda MX-5 Miata Grand Touring 2dr Convertible 6M 8000 Miles', 'neighborhood': 'Durham', 'price': '$27,495', 'url': 'https://raleigh.craigslist.org/ctd/d/durham-2016-mazda-mx-miata-grand/7358600557.html'}

However, if I call this function with a known incorrect site, then pycraigslist eventually raises MaximumRequestsError.

search_miatas("test")

>>> MaximumRequestsError: Maximum requests attempted - check network connection.

This makes logical sense, because in the background, pycraigslist is trying to hit https://test.craigslist.org, which doesn't exist, so httpx can't connect, so tenacity raises TryError, which pycraigslist re-raises as MaximumRequestsError.

The problem is that the root cause of MaximumRequestsError could be unclear if I'm accepting site input from an unsanitized source (meaning, raw user input). From what I can tell, there's not an easy way to verify whether MaximumRequestsError is raised because there's a legitimate network connectivity issue, or if the user-defined site is invalid.

I can think of two ways to solve this, assuming it's not solved already:

Whenever a pycraigslist.api.BaseAPI object is instantiated, prior to fetching additional filters, parse the list of Craigslist sites, validate that the site parameter is in the list of sites, and raise an exception if it's not. Downside of this approach is increased execution time (meaning, decreased performance) whenever instantiating a new object.
Keep an internal mapping of valid site subdomains and perform the same validation described above. Downside of this approach is, whenever Craigslist adds a new site, you have to throw together a new release to add support for it.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.