gosber's Introduction

GoSber

Go Postgres SQLite

Overview

GoSber is a versatile GoLang script for web scraping. It offers multiple modes for scraping, the ability to specify search prompts, and the option to scrape specific links. Additionally, it can automatically connect to a PostgreSQL database using the DB_URL environment variable.

A parser and a promo duplicator will be added soon. All functionality will eventually be combined into a single project, SberTool.

Features

  • Selection of different modes for customizable scraping
  • Search prompts to target specific content
  • Scraping of data from specific URLs
  • Automatic connection to a PostgreSQL database using DB_URL

Usage

To get started with sber-scrape, follow the instructions below:

Using pre-built executables

  1. Download the appropriate executable for your platform from the releases page

  2. Run it, with flags if needed

    ./sber-scrape -mode <mode> -search <search-prompt> -table-name <table-name>

Building from source

  1. Clone this repository:

    git clone https://github.com/malvere/GoSber.git
    cd GoSber

  2. Build the project:

    go build

  3. Run:

    ./sber-scrape -mode <mode> -search <search-prompt>

    3.1 Available Flags:

    -mode - Mode to run in. One mode (web) makes HTTP requests and parses the HTML body, another searches for an .html file, and a third uses API requests and parses the JSON body (preferred method).

    -search - Searches using the given prompt.

    -url - Parses using a predefined URL. You can set up your search prompt with filters on megamarket, then copy the URL and pass it to the scraper.

    -table-name - Name of the table in the database.

    -pages - Number of pages to parse.

    3.2 Usage:

    If -search is passed, the scraper searches by the given prompt.

    If -url is passed, the search follows the specified link.
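The flags listed above could be parsed with Go's standard flag package roughly as follows. This is a minimal sketch, not the project's actual code: the Config type, the parseArgs helper, every default value, and the rule that one of -search or -url must be present are assumptions made for illustration (the product_data default is borrowed from the table name that appears in the issue log).

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// Config holds the values of the documented flags (hypothetical type).
type Config struct {
	Mode      string
	Search    string
	URL       string
	TableName string
	Pages     int
}

// parseArgs parses the documented flags from args (without the program name).
// All default values are assumptions made for this sketch.
func parseArgs(args []string) (Config, error) {
	var cfg Config
	fs := flag.NewFlagSet("sber-scrape", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on bad input
	fs.StringVar(&cfg.Mode, "mode", "web", "mode to run in")
	fs.StringVar(&cfg.Search, "search", "", "search prompt")
	fs.StringVar(&cfg.URL, "url", "", "predefined megamarket URL")
	fs.StringVar(&cfg.TableName, "table-name", "product_data", "database table name")
	fs.IntVar(&cfg.Pages, "pages", 1, "number of pages to parse")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	// Assumption: the scraper needs either a prompt or a ready-made URL.
	if cfg.Search == "" && cfg.URL == "" {
		return Config{}, errors.New("pass either -search or -url")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Using a dedicated FlagSet instead of the global flag functions keeps the parsing testable with arbitrary argument slices.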

PostgreSQL Connection

If you have a PostgreSQL database, set the DB_URL environment variable and sber-scrape will use it to establish a connection.

    export DB_URL="postgres://username:password@localhost/database"
    ./sber-scrape
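How a program might decide on its output target from DB_URL can be sketched as below. The outputTarget helper and the scheme check are assumptions for illustration, not the project's actual logic; the sketch only inspects the variable and does not open a database connection.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
)

// outputTarget decides where results should go based on the DB_URL value.
// An empty value selects the .csv fallback; a postgres:// or postgresql://
// URL selects PostgreSQL. Anything else is reported as an error.
func outputTarget(dbURL string) (string, error) {
	if dbURL == "" {
		return "csv", nil
	}
	u, err := url.Parse(dbURL)
	if err != nil {
		return "", fmt.Errorf("invalid DB_URL: %w", err)
	}
	if u.Scheme != "postgres" && u.Scheme != "postgresql" {
		return "", fmt.Errorf("unsupported DB_URL scheme %q", u.Scheme)
	}
	return "postgres", nil
}

func main() {
	target, err := outputTarget(os.Getenv("DB_URL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("writing results to:", target)
}
```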

.csv support

If DB_URL is not specified, a .csv file with the parse results is generated next to the executable.
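Writing results as CSV can be sketched with the standard encoding/csv package. The Product type and the renderCSV helper are hypothetical; the column names mirror those visible in the issue log (title, price, bonuses, bonus_percent, discount, product_id, link), and the real file layout may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"strings"
)

// Product is a hypothetical record whose fields mirror the column names
// seen in the issue log; it is not a confirmed type from the project.
type Product struct {
	Title, Price, Bonuses, BonusPercent, Discount, ProductID, Link string
}

// renderCSV returns the products as CSV text with a header row.
func renderCSV(products []Product) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	header := []string{"title", "price", "bonuses", "bonus_percent", "discount", "product_id", "link"}
	if err := w.Write(header); err != nil {
		return "", err
	}
	for _, p := range products {
		row := []string{p.Title, p.Price, p.Bonuses, p.BonusPercent, p.Discount, p.ProductID, p.Link}
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush() // push buffered rows into the builder
	return sb.String(), w.Error()
}

func main() {
	out, err := renderCSV([]Product{
		{Title: "Example item", Price: "1000", Link: "https://example.com/item"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

The csv writer quotes fields that contain commas or quotes, so product titles do not need manual escaping.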

License

MIT

Contribution

Contributions are welcome! Feel free to open issues and submit pull requests to help improve this project.

gosber's Issues

"SQL logic error: near "COPY": syntax error (1)" when postgres is not active

$ bin/sber-scrape -mode web  -search велостанок
2023/12/22 23:53:15 Searching for:  велостанок
2023/12/22 23:53:15 Could not ping 'postgres' driver
2023/12/22 23:53:15 Driver 'sqlite' is active
2023/12/22 23:53:15 https://megamarket.ru/catalog/page-1/?q=%D0%B2%D0%B5%D0%BB%D0%BE%D1%81%D1%82%D0%B0%D0%BD%D0%BE%D0%BA
** extra
2023/12/22 23:53:15 Statement: {} COPY "product_data" ("title", "price", "bonuses", "bonus_percent", "discount", "product_id", "link") FROM STDIN
**
2023/12/22 23:53:15 SQL logic error: near "COPY": syntax error (1)
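The logged statement uses COPY ... FROM STDIN, which is PostgreSQL-specific, so SQLite rejects it. One possible direction, sketched here and not a confirmed fix, is to build a plain INSERT whose placeholders depend on the active driver. The insertStatement helper is hypothetical, and it assumes table and column names come from trusted code because it does not escape identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// insertStatement builds a plain INSERT statement for the given driver.
// PostgreSQL drivers use numbered placeholders ($1, $2, ...), while
// SQLite accepts positional question marks.
// Identifiers are not escaped: table and columns must be trusted values.
func insertStatement(driver, table string, columns []string) string {
	quoted := make([]string, len(columns))
	marks := make([]string, len(columns))
	for i, c := range columns {
		quoted[i] = `"` + c + `"`
		if driver == "postgres" {
			marks[i] = fmt.Sprintf("$%d", i+1)
		} else {
			marks[i] = "?"
		}
	}
	return fmt.Sprintf(`INSERT INTO "%s" (%s) VALUES (%s)`,
		table, strings.Join(quoted, ", "), strings.Join(marks, ", "))
}

func main() {
	cols := []string{"title", "price", "link"}
	fmt.Println(insertStatement("sqlite", "product_data", cols))
	fmt.Println(insertStatement("postgres", "product_data", cols))
}
```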

Script fails to find any products when -mode web is selected

It seems they added some protection to the site, and now the first GET request to https://megamarket.ru/catalog/page-1/?q= responds with a page that doesn't contain any data.

E.g. this request works in a browser, but not with curl:

curl 'https://megamarket.ru/catalog/page-1/?q=%D0%B2%D0%B5%D0%BB%D0%BE%D1%81%D1%82%D0%B0%D0%BD%D0%BE%D0%BA' --compressed -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate, br' -H 'Connection: keep-alive' -H 'Cookie: spid=1699508945700_7d47f0b8547e5ac417cfa40acb5678fd_ahqutpjknxm7e4bo; spsc=1703279058840_98dd34ed9d3545f7c6b6114fa599bda7_2dc4c47e5beb4aae25be080fa9d16c8093e7e989cef732b63b8bada59af3d7da; device_id=b2b63b6e-7ec3-11ee-9a11-0242ac110002; sbermegamarket_token=235148ee-432f-43aa-b069-61e918576e04; ecom_token=235148ee-432f-43aa-b069-61e918576e04; adspire_uid=AS.1209131030.1699508946; _ga_W49D2LL5S1=GS1.1.1703279059.5.1.1703280096.28.0.0; _ga=GA1.2.762944685.1699508947; ssaid=b29c3c10-7ec3-11ee-b7bc-e3315c72317d; adtech_uid=59d105e0-6780-406c-9950-8afbe88e1aa1%3Amegamarket.ru; top100_id=t1.6795753.2124336796.1699508947637; t3_sid_6795753=s1.1436342227.1703279060788.1703280104076.5.30; last_visit=1703269285128%3A%3A1703280085128; cfidsw-smm=/j3XwsHs1sfHENgrQbQ73rkeDWe2YVduQirW+qmfhsAcJQvjx2BnO410gkzwmgPh882JuwjJtqYIRrWYjsTDtPv8WkL5ysDd9bXc2VmcTFCQy6iV1lYP8/acajR8AzAiRkdQr64y4aYibLXrI4xe1kTw4MMWXvH4APYVUdE=; __zzatw-smm=MDA0dC0cTHtmcDhhDHEWTT17CT4VHThHKHIzd2UbN1ddHBEkWA4hPwsXXFU+NVQOPHVXLw0uOF4tbx5mR1whS1VNCSofGH1nFRtQSxgvS18+bX0yUCs5Lmw=xcwO/Q==; cfidsw-smm=/j3XwsHs1sfHENgrQbQ73rkeDWe2YVduQirW+qmfhsAcJQvjx2BnO410gkzwmgPh882JuwjJtqYIRrWYjsTDtPv8WkL5ysDd9bXc2VmcTFCQy6iV1lYP8/acajR8AzAiRkdQr64y4aYibLXrI4xe1kTw4MMWXvH4APYVUdE=; _sa=SA1.3c8e051a-c712-42c9-8785-506829ace190.1699508948; adid=169950894829407; _gcl_au=1.1.1421543266.1699508948; uxs_uid=b34d8ab0-7ec3-11ee-a6f3-25032278d108; rrpvid=10719368468437; rcuid=654c72d53c1a03d818b24b91; tmr_lvid=e58b043f20cedd507791013dbc7fd865; tmr_lvidTS=1699508949572; 
flocktory-uuid=85cd0595-b3c1-4689-8831-e7410518b9be-4; _ga_VD1LWDPWYX=GS1.2.1703279065.4.1.1703280087.0.0.0; _gpVisits={"isFirstVisitDomain":true,"idContainer":"10002472"}; adrcid=Axl3xXnbFWLzufO7qu7bkGw; _ym_uid=1703274186780428421; _ym_d=1703274186; __tld__=null; _ym_isad=2; st_uid=a705f103c38694dfd9c4c6f6e6bf937d; region_info=%7B%22displayName%22%3A%22%D0%9C%D0%BE%D1%81%D0%BA%D0%BE%D0%B2%D1%81%D0%BA%D0%B0%D1%8F%20%D0%BE%D0%B1%D0%BB%D0%B0%D1%81%D1%82%D1%8C%22%2C%22kladrId%22%3A%225000000000000%22%2C%22isDeliveryEnabled%22%3Atrue%2C%22geo%22%3A%7B%22lat%22%3A55.755814%2C%22lon%22%3A37.617635%7D%2C%22id%22%3A%2250%22%7D; _gid=GA1.2.2103616241.1703274189; _gp10002472={"hits":8,"vc":1,"ac":1,"a6":1}; tmr_detect=0%7C1703280096895; _ym_visorc=b; _gat=1' -H 'Upgrade-Insecure-Requests: 1' -H 'Sec-Fetch-Dest: document' -H 'Sec-Fetch-Mode: navigate' -H 'Sec-Fetch-Site: none' -H 'Sec-Fetch-User: ?1' -H 'TE: trailers'
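For reference, the search URL seen in these reports (a page-1 path segment plus a URL-encoded q parameter) can be reproduced with a short sketch. The searchURL helper is hypothetical, and the assumption that later pages follow the same page-N pattern is inferred from the single page-1 URL, not confirmed by the source.

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL builds a catalog search URL for the given prompt and page.
// The page-N path segment is an assumption inferred from "page-1".
func searchURL(prompt string, page int) string {
	return fmt.Sprintf("https://megamarket.ru/catalog/page-%d/?q=%s",
		page, url.QueryEscape(prompt))
}

func main() {
	// Non-ASCII prompts are percent-encoded as UTF-8 bytes.
	fmt.Println(searchURL("велостанок", 1))
}
```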
