Giter Club home page Giter Club logo

foursquare-crawler's Introduction

Foursquare-Crawler

A crawler used to crawl foursquare check-in data.

Features

  • Check-in Data Crawling: allows you to crawl check-in data from Foursquare web pages by providing a URL. It supports both batch and debug modes. In batch mode, you can crawl all check-in URLs from a file and save the results, while in debug mode, you can crawl a single URL and print the output.

  • Progress Bar: displays the progress of the crawling process, giving you real-time feedback on the current state of your crawling tasks.

  • Error Handling and Logging: saves both successful and failed crawling attempts. For failed attempts, logs the reason for failure, such as WrongUrl (incorrect URL) or PageNotFound (deleted or hidden check-in). See data/sample_url_failed.csv FYI.

  • Random User-Agent Generation: Supports the generation of random User-Agents to simulate requests from different browsers, enhancing the stealthiness of the crawling process.

  • Location statistics: besides the time, location, and category of each check-in, also gathers statistical information about the POI, including the total number of check-ins, total number of visitors, and tips count.

  • IP Pool Support: Utilizes an IP pool to counter anti-crawling measures, ensuring more reliable and continuous data crawling.

  • Multi processing Support: Support multi-process crawling to improve efficiency.

Notice: This repository is for research purposes only. I created it for fun.

Output Example

{
    "fsq_id": "4b89fd41f964a520685a32e3",
    "name": "Tokyo International (Haneda) Airport (HND)",
    "time": "2019-04-24 06:40:42",
    "category": [
        {
            "id": "63be6904847c3692a84b9c29",
            "name": "International Airport",
            "primary": true
        }
    ],
    "location": {
        "city": "東京",
        "lng": 139.7802546992898,
        "contextLine": "日本東京",
        "state": "東京都",
        "neighborhood": "羽田空港",
        "country": "日本",
        "postalCode": "144-0041",
        "address": "羽田空港3-3-2",
        "cc": "JP",
        "lat": 35.54846517643472
    },
    "uid": "61142668",
    "location_stats": {
        "checkinsCount": 1178831,
        "usersCount": 169816,
        "tipCount": 492
    },
    "fsq_url": "https://www.swarmapp.com/c/1XCAyGz7knL",
    "tweet_url": "https://t.co/9bSYbewb5K",
    "user_screenname": "test_user"
}

Requirement

# more details > requirements.txt
python3.9 
fake-useragent
bs4
requests
pandas 
tqdm 
lxml
pyyaml

Usage

# batch mode
python fsq_crawler.py --file data/sample_url.csv --outdir ./data/ 
# debug mode 
python fsq_crawler.py --mode debug --url https://t.co/Qm5WGMZiGE 
# large scale crawling mode
python batch_run.py --pool_size 200  --indir your_data_dir --outdir target_dir --ip_pool your_ip_pool_config 

# Output two files by default.
**.successed.csv 
**.failed.csv 

Appendix

Explanation for the Output Fields:

- fsq_id: The POI (Point of Interest) ID on Foursquare, e.g., '12345'
- name: The name of the POI, e.g., 'Subway Chikusa Station'
- time: The check in time. e.g., '2019-04-24 06:40:42'
- category: Information about the category of the POI
    - category['id']: The ID of the category, e.g., '4bf58dd8d48988d1fd931735'
    - category['name']: The name of the category, e.g., 'Metro Station'
    - category['primary']: Indicates if it is the primary category, e.g., True
- location: Information about the location of the POI
    - location['contextLine']: Contextual information (place name), e.g., 'Nagoya, Japan'
    - location['country']: The country, e.g., 'Japan'
    - location['state']: The prefecture, e.g., 'Aichi Prefecture'
    - location['city']: The city, e.g., 'Nagoya'
    - location['lat']: Latitude, e.g., 35.170653
    - location['lng']: Longitude, e.g., 136.929907
    - location['address']: The address, e.g., '3-15-21 Aoi, Higashi-ku'
    - location['postalCode']: The postal code, e.g., '461-0004'
    - location['cc']: The country code, e.g., 'JP'
    - location['neighborhood']: The neighborhood, e.g., 'Higashi-ku' (Note: some locations may not have this field)
- location_stats: Statistical information about the location
    - location_stats['checkinsCount']: Number of check-ins, e.g., 10523
    - location_stats['usersCount']: Number of users, e.g., 1306
    - location_stats['tipCount']: Number of tips, e.g., 3
- uid: User ID, e.g., '1272416'
- fsq_url: The Foursquare url, e.g., https://www.swarmapp.com/c/k3miIHJz7DU
- tweet_url: The Twitter url, e.g., https://t.co/Qm5WGMZiGE

foursquare-crawler's People

Contributors

sustc-chuangyang avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.