Robust crawler framework for collecting legal documents from websites
LINUX:
> mkdir '<create-a-folder>'
> cd '<create-a-folder>'
> sudo git clone https://github.com/PandorAstrum/reflect3746.git
> cd reflect3746/backend
> sudo pip install -r requirements.txt
> cd ..
> cd frontend
> sudo npm install
N.B.: In backend/App/db.py, change CONNECTION_STRING to point to your own MongoDB database.
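As an illustration only, the setting could look like the sketch below. It assumes db.py holds a plain module-level string; the environment variable name MONGO_URI, the fallback URI, and the helper looks_like_mongo_uri are hypothetical and not taken from the repository.

```python
import os

# Hypothetical: read the URI from an environment variable named MONGO_URI,
# falling back to a local MongoDB instance on the default port.
DEFAULT_URI = "mongodb://localhost:27017/"
CONNECTION_STRING = os.environ.get("MONGO_URI", DEFAULT_URI)


def looks_like_mongo_uri(uri: str) -> bool:
    """Cheap sanity check on the URI scheme; it does not contact the database."""
    return uri.startswith(("mongodb://", "mongodb+srv://"))
```

Keeping the URI in an environment variable is one way to avoid committing credentials; a hard-coded string works just as well for local testing.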
- From the backend directory, start the server with: sudo python run_server.py
- From the frontend directory, start the dashboard with: sudo npm run serve
Navigate to localhost:5000 in your browser to use the Dashboard.
For the API endpoints, see below.
project root
├── backend
|   ├── App
|   |   ├── __init__.py
|   |   ├── db.py               # contains mongodb settings
|   |   └── routes.py           # API routes
|   ├── scraper
|   |   └── scraper
|   |       ├── spiders         # contains spiders
|   |       ├── items.py
|   |       ├── middlewares.py
|   |       ├── pipelines.py
|   |       └── settings.py     # scrapy settings
|   ├── static
|   ├── template
|   ├── config.py               # flask config
|   └── run_server.py           # ENTRY POINT FOR FLASK
|
└── frontend
    ├── public
    └── src
Base URL: http://127.0.0.1:5000/api/v1
Append one of the following endpoints to the base URL:
/server_status
returns: json (boolean indicating whether Flask is running)
/database_status
returns: json (result of a connection attempt to MongoDB)
/spider
returns: json (all spiders created with the Scrapy CLI, e.g. the scrapy genspider command)
/run
returns: json (starts the selected spider with the provided parameters)
requires: parameters as the request body payload
e.g.:
{"spider_kwargs":{"baseurl":"https://www.example.com/","spider_name":"Example"},"spider_settings":{"sitemap":false,"delay":1}}
/results/<_id_here>
returns: json (documents from the MongoDB scraped collection matching the given id)
/all
returns: json (all documents from the MongoDB scraped collection)
/logs
returns: json (all documents from the MongoDB logs collection)
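The endpoints above can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. The helper names (endpoint_url, build_run_request, fetch_json) are hypothetical, and it assumes that /run accepts a POST request with a JSON body, since the README only states that the parameters are sent as a body payload.

```python
import json
import urllib.request

# Base URL taken from the README; adjust host/port if your server differs.
BASE_URL = "http://127.0.0.1:5000/api/v1"


def endpoint_url(path: str) -> str:
    """Join the base URL and an endpoint path such as '/server_status'."""
    return BASE_URL.rstrip("/") + "/" + path.lstrip("/")


def build_run_request(baseurl: str, spider_name: str,
                      sitemap: bool = False, delay: int = 1) -> urllib.request.Request:
    """Build (but do not send) a request for /run with a JSON body.

    Assumption: /run accepts POST with a JSON payload shaped like the
    example in the README.
    """
    payload = {
        "spider_kwargs": {"baseurl": baseurl, "spider_name": spider_name},
        "spider_settings": {"sitemap": sitemap, "delay": delay},
    }
    return urllib.request.Request(
        endpoint_url("/run"),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request_or_url, timeout: float = 10.0):
    """Send the request (or GET the URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the Flask server running, fetch_json(endpoint_url("/server_status")) would query the status endpoint, and fetch_json(build_run_request("https://www.example.com/", "Example")) would submit the example payload to /run.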
Ashiquzzaman Khan – @dreadlordn
https://github.com/PandorAstrum/reflect3746.git
- Fork it (https://github.com/PandorAstrum/reflect3746.git/fork)
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request