Comments (11)
Yes, I made a custom version: I give it a list of URLs, and it crawls all of them and everything under their /* paths.
from gpt-crawler.
But it's not only that customization; I changed several other parts as well.
That's great! Any chance you can share an example?
Yes, but how should I do it? Should I publish it on my GitHub, or do I have to open a PR?
Maybe start by pasting your example into a comment here.
https://github.com/julian-passebecq/gpt-crawl-multiurl
I put the instructions in the README.
I don't understand your comment, "don't use config ts for the url to scrawl, only modify json name in config.ts put the url to crawl in tocrawl.csv". Could you please explain that more? Meaning, which file do I modify, and how do I modify it to crawl "tocrawl.csv"?
Before, you had to put the URL to crawl in config.ts. Now you only put the name of the output JSON in config.ts; the URL written there doesn't matter. To modify tocrawl.csv, you can edit it manually; just respect the three-column format I used. Personally, I keep all my URLs in allcrawl.csv, then I use crawlee.py to copy only one topic's URLs into tocrawl.csv.
So, to sum up: it automatically looks for the URLs to crawl in tocrawl.csv; just modify the JSON name in config.ts.
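For readers who want to reproduce the allcrawl.csv to tocrawl.csv step, here is a minimal sketch of a topic filter. It is not the code from crawlee.py. The thread only says the file has three columns, so the column names `topic`, `name`, and `url`, the function name `select_topic`, and the sample URLs are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical column layout: the thread only says "3 columns",
# so these header names are an assumption.
FIELDS = ["topic", "name", "url"]


def select_topic(all_rows_csv: str, topic: str) -> str:
    """Return CSV text holding only the rows whose topic matches."""
    reader = csv.DictReader(io.StringIO(all_rows_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    wanted = topic.strip().lower()
    for row in reader:
        # Missing cells come back as None, so fall back to "".
        if (row.get("topic") or "").strip().lower() == wanted:
            writer.writerow({k: (row.get(k) or "").strip() for k in FIELDS})
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample data standing in for allcrawl.csv.
    sample = (
        "topic,name,url\n"
        "docs,Home,https://example.com/docs\n"
        "blog,Posts,https://example.com/blog\n"
    )
    print(select_topic(sample, "docs"))
```

To use it on real files, you would read allcrawl.csv into a string, call `select_topic`, and write the result to the tocrawl.csv in the src folder.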
You don't have to use allcrawl.csv, crawlee.py, or txt.py (which converts the JSON to txt).
However, I didn't fix the missing-credentials issue, so if a website requires credentials it doesn't work, at least on some sites. Also, I'm not convinced that custom GPTs handle JSON well, so I convert it to txt.
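The JSON-to-txt conversion mentioned above could look roughly like the sketch below. It is not the code from txt.py. It assumes the crawler output is a JSON array of objects with `title`, `url`, and `html` keys; the function name `pages_to_text` and the separator layout are hypothetical choices.

```python
import json


def pages_to_text(json_text: str) -> str:
    """Flatten a JSON array of crawled pages into one plain-text document."""
    pages = json.loads(json_text)
    blocks = []
    for page in pages:
        # Assumed keys; missing or null values fall back to "".
        title = page.get("title") or ""
        url = page.get("url") or ""
        body = (page.get("html") or "").strip()
        blocks.append(f"# {title}\nURL: {url}\n\n{body}")
    # Separate pages with a rule so the boundaries stay visible.
    return "\n\n---\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Made-up sample standing in for the crawler's output file.
    sample = '[{"title": "A", "url": "https://example.com/a", "html": " hello "}]'
    print(pages_to_text(sample))
```

Keeping the title and URL as a header above each page preserves the source of every passage once the structure of the JSON is gone.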
EDIT: the tocrawl.csv to use is the one in the src folder.
Ok, thanks for the clarification!!
So, did you try my multi-URL version? What do you think?
Haven't tried it yet, but if I do, I'll let you know. :)