metakgp / mftp
CDC noticeboard scraper
License: GNU Affero General Public License v3.0
Dockerise the codebase for easy deployment.
So, here's the plan.
kgpian.iitkgp.ac.in
email.anonymous session
or client IP
Something about the notice format changed.
Implement mftp as a cronjob, along with a continuously running service in an infinite loop.
mftp cronjob disable
mftp cronjob enable
mftp cronjob enable 30
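The enable and disable commands above could be backed by a small helper that adds or removes a single crontab line. This is a minimal sketch, assuming the optional number is an interval in minutes (the source does not say so). The marker comment, the default interval and the function names are all hypothetical.

```python
MARKER = "# mftp-cronjob"  # hypothetical tag used to find our own entry


def cron_line(command, minutes=10):
    """Build a crontab line that runs `command` every `minutes` minutes."""
    if not 1 <= minutes <= 59:
        raise ValueError("minutes must be between 1 and 59")
    return f"*/{minutes} * * * * {command} {MARKER}"


def enable(crontab_text, command, minutes=10):
    """Return crontab text containing exactly one mftp entry."""
    kept = [line for line in crontab_text.splitlines() if MARKER not in line]
    kept.append(cron_line(command, minutes))
    return "\n".join(kept) + "\n"


def disable(crontab_text):
    """Return crontab text with the mftp entry removed."""
    kept = [line for line in crontab_text.splitlines() if MARKER not in line]
    return "\n".join(kept) + "\n" if kept else ""
```

The functions only transform text, so reading and writing the real crontab (for example through `crontab -l` and `crontab -`) is left to the caller.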
Cronjob will be best for:
Service will be best for:
This is there as a TODO in the README.
I have understood most of the code now, and I think it will be easy to implement this now. How do you want to do this? This is what I have so far:
I have a feeling that all this will take some time. Say 10 users, 4 keywords per user: that's 40 things to search for in the PDF. I am not sure if there's a bottleneck here, and if there is, what it is.
@amrav What do you think?
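For a sense of scale, the matching step itself can be sketched as below. It assumes the notice text has already been extracted from the PDF (extraction is not shown) and that subscriptions are a plain dictionary; the function and variable names are hypothetical. For 40 keywords this is only a handful of substring scans, so whether extraction or mailing is the slower step would need measuring.

```python
def match_subscribers(notice_text, subscriptions):
    """Return {user: [matched keywords]} for users with at least one hit."""
    haystack = notice_text.lower()  # lowercase once, reuse for every keyword
    hits = {}
    for user, keywords in subscriptions.items():
        matched = [kw for kw in keywords if kw.lower() in haystack]
        if matched:
            hits[user] = matched
    return hits
```

Usage would look like `match_subscribers(text, {"a@example.com": ["intern", "CV"]})`, with the address and keywords being placeholders.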
Implement log rotation to save disk space :)
Refer: https://en.wikipedia.org/wiki/Log_rotation
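One way to get rotation without extra tooling is the standard library's `RotatingFileHandler`. This is a sketch only: the logger name, size limit and backup count are assumptions, not values from the project.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_logger(path, max_bytes=1_000_000, backups=3):
    """Return a logger that rolls `path` over once it reaches `max_bytes`."""
    logger = logging.getLogger("mftp")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With `backups=3` the handler keeps the live file plus `.1`, `.2` and `.3`, and discards anything older.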
Heroku is removing the mLab add-on, so we need to migrate away from it to a better solution. Below are some references for the same.
Here are some official resources that can be referred to:
NOTE: The issue needs to be fixed by 8 December 2020.
Is it possible to host this project, so that people may actually use it?
I was thinking of creating a web UI where people can subscribe to get emails from CDC Notice Board. This will be different from the already existing CDC Notify app in that the attachments in the notice will be attached to the email.
I would like to know if this is a feasible goal.
Self host custom configured ntfy instance for our needs.
Line 22 in 3c69b05
Something has probably been changed here (the URL of the notice PDFs maybe?), and PDFs in our mail are incorrectly encoded.
Get notified of any abnormality, basically a sanity check. Will add and remove a cronjob which keeps checking on the status of mftp and logs and reports when it detects any abnormality.
mftp-doctor script
mftp doctor enable
mftp doctor enable 10
mftp doctor disable
Note: Use ntfy for sending the doctor's notifications.
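A rough sketch of what one doctor check might look like follows. It treats a log file that has not been written to recently as the "abnormality", which is an assumption (the source does not define one), and it prepares an ntfy publish request as a plain HTTP POST to the topic URL. The function names, the title header value and the threshold are hypothetical.

```python
import os
import time
import urllib.request


def log_is_stale(log_path, max_age_seconds, now=None):
    """True if the log is missing or has not been modified recently."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(log_path) > max_age_seconds
    except OSError:  # missing or unreadable log counts as unhealthy
        return True


def build_alert(server, topic, message):
    """Prepare (but do not send) an ntfy publish request."""
    return urllib.request.Request(
        f"{server.rstrip('/')}/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": "mftp doctor"},
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it; keeping the two steps apart makes the check testable without network access.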
Google groups suck (see this). Need to develop a Progressive Web App with the following necessary features:
Internship
and Placement
Heroku-18 (currently used, and the only stack that supports Python 2) has reached its end-of-life.
For newer builds to take place, we need to upgrade to Heroku-22, so the code must be migrated to the latest version of Python.
Features available to individual hosters:
Google accounts and Google Groups have many limitations, which make them unsuitable as a solution for a large-scale newsletter service.
Refer to the following sources:
Duplicate mails are sent repeatedly, with attachments of 0 bytes.
Currently, we have a cron job on the metakgp DigitalOcean server that pings the URL https://mftp.herokuapp.com/ and thereby fetches the new notices. Although this has been working, Heroku sometimes idles the application instance so that no ping reaches it, or the metakgp cron job itself fails.
It would be better to use the Heroku scheduler add-on and modify the script to work with it and set a periodic frequency for the script (or function) to run and send new notices. This would make us totally independent of the cron job as well.
Ref:
MFTP has not been showing any notices lately. It is probably because @amrav's ERP credentials have expired. Someone new needs to take up ownership of the project and use their own credentials. The information can also be scraped from http://10.3.100.27/notice/ (local site).
The work is being done on the revamp branch.
This is the ToDo list in order of priority.
README.md
Hi, in mftp, inappropriate dependency version constraints can cause risks.
Below are the dependencies and version constraints that the project is using:
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
certifi==2015.11.20.1
docopt==0.4.0
futures==3.0.3
pymongo==3.4
requests==2.8.1
singledispatch==3.4.0.3
six==1.10.0
tornado==4.3
wheel==0.24.0
python-dotenv==0.5.1
The version constraint == will introduce the risk of dependency conflicts because the scope of dependencies is too strict.
The version constraint No Upper Bound and * will introduce the risk of the missing API Error because the latest version of the dependencies may remove some APIs.
After further analysis, in this project,
The version constraint of dependency pymongo can be changed to >=3.0,<=4.1.1.
The above modification suggestions can reduce the dependency conflicts as much as possible,
and introduce the latest version as much as possible without calling Error in the projects.
The invocation of the current project includes all the following methods.
bson.json_util.loads pymongo.MongoClient.get_default_database pymongo.MongoClient.close bson.json_util.dumps pymongo.MongoClient
insert_from_file further_defaulters.append mc_old.get_default_database.notices.find defaulters.append start_database_export pymongo.MongoClient.close further_repeated.append pymongo.MongoClient.get_default_database open mc_new.get_default_database.notices.insert os.path.dirname bson.json_util.dumps pymongo.MongoClient dotenv.load_dotenv argparse.ArgumentParser.add_argument bson.json_util.loads argparse.ArgumentParser.add_mutually_exclusive_group os.path.join len parser.add_mutually_exclusive_group.add_argument f.write argparse.ArgumentParser argparse.ArgumentParser.parse_args f.read export_db format print repeated_notices.append insert_notice
@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.
Some notices are sent repeatedly.
https://github.com/metakgp/mftp/blob/master/update.py#L49
The data needs to be extracted from the Text portion of the message, which is inside notice['text']. The data is in HTML markup as: <b>Type</b> : PLACEMENT or <b>Type</b> : INTERNSHIP.
@zorroblue @DefCon-007 I am sure this is easy. See if you can put it together really quickly. I will be off trying to get this out of BS4 meanwhile.
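Since the markup is this regular, a regular expression may be enough without parsing the HTML. This is a minimal sketch, assuming the field always appears as a bold "Type" label followed by a colon and a single word; the function name is hypothetical.

```python
import re

# Tolerates extra whitespace around the label and the colon.
TYPE_RE = re.compile(r"<b>\s*Type\s*</b>\s*:\s*([A-Za-z]+)", re.IGNORECASE)


def notice_type(html_text):
    """Return e.g. 'PLACEMENT' or 'INTERNSHIP', or None if no Type field."""
    match = TYPE_RE.search(html_text)
    return match.group(1).upper() if match else None
```

In the scraper this would be called as `notice_type(notice['text'])`.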
Implement an algorithm to send notifications to subscribers (topic in case of ntfy or separate google groups) based on certain filters like:
It would be better if the subject is changed from "Notice: CV Submission - Company" to "Internship: CV Submission - Company" or "Placement: Urgent - CV Verification".
This is possible since the first line mentions type:placement or type:internship. This would help reduce confusion among people sitting for placements, who might mistake internship notices for placement notices.
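The renaming itself is a small string operation once the type is known. This sketch assumes subjects currently start with "Notice:" and that the type arrives as a string such as "INTERNSHIP" or "PLACEMENT"; the function name is hypothetical.

```python
def typed_subject(subject, notice_type):
    """Swap a leading 'Notice:' for the notice type, e.g. 'Internship:'."""
    if not notice_type:
        return subject  # unknown type: leave the subject untouched
    prefix = "Notice:"
    if subject.startswith(prefix):
        rest = subject[len(prefix):]
    else:
        rest = " " + subject
    return notice_type.capitalize() + ":" + rest
```

Subjects without a known type are returned unchanged, so notices that are neither internship nor placement keep the generic prefix.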
MFTP currently fails to run, probably because of some changes to ERP's TNP noticeboard. The error log is:
Unhandled error occured :
Traceback (most recent call last):
  File "main.py", line 27, in func
    update.check_notices()
  File "/app/erp.py", line 78, in wrapped_func
    *args, **kwargs)
  File "/app/erp.py", line 93, in wrapped_func
    func(session=session, sessionData=sessionData, *args, **kwargs)
  File "/app/update.py", line 48, in check_notices
    m = re.search(r'ViewNotice\("(.+?)","(.+?)"\)', a.attrs['onclick'])
KeyError: 'onclick'
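The crash comes from indexing `a.attrs['onclick']` on an anchor that has no such attribute. A defensive version might look like the sketch below, which works on plain attribute dictionaries so that it runs without BeautifulSoup; the function name is hypothetical and the regular expression is the one from the traceback.

```python
import re

VIEW_NOTICE = re.compile(r'ViewNotice\("(.+?)","(.+?)"\)')


def notice_ids(anchor_attrs):
    """Yield the two ViewNotice arguments from each anchor that has them."""
    for attrs in anchor_attrs:
        onclick = attrs.get("onclick")  # .get avoids the KeyError
        if onclick is None:
            continue
        match = VIEW_NOTICE.search(onclick)
        if match:
            yield match.group(1), match.group(2)
```

This only prevents the crash. If the noticeboard no longer uses `onclick` at all, the generator yields nothing and the scraping logic itself would need updating.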
After the above is done, need to provide relevant credentials to be able to post to any channel.
As per GitHub, the following dependencies need to be updated and are currently vulnerable:
A few of the notices on the ERP notice board have a PDF file attached to them. It would be nice to have these attachments also available in the MFTP forums.
If and when #38 is implemented using Heroku Scheduler, we will need a better way to get notified when the service is not working. Here are key things to note:
The notification should be localized (able to pinpoint the exact place that threw the error) and minimal (so that the maintainers do not mark it as spam and multiple maintainers can be added). A good way would be to use the existing mailing mechanism set up using the Mailgun REST API. We should also aim to keep the configuration and reconfiguration overhead as low as possible; if feasible, we should pass it from GitHub itself.
NOTE: this should ideally be done after #38
The r.history array becomes empty and thus this line throws this error:
Traceback (most recent call last):
  File "update.py", line 132, in <module>
    check_notices()
  File "path/erp.py", line 77, in wrapped_func
    r.history[1].headers['Location']).group(1)
IndexError: list index out of range
Instead, before accessing r.history[1], it should ensure that r.history's size is at least 2; if the size is less than 2, it should ask the user to check their secret answer settings (probably a typo in one of the answers).
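The proposed guard could be sketched as follows. `LoginError` and `redirect_location` are hypothetical names, and `history` stands in for `r.history`, the list of redirect responses that the requests library attaches to a response.

```python
class LoginError(Exception):
    """Raised when the login flow did not redirect as expected."""


def redirect_location(history):
    """Return the Location header of the second redirect, or fail clearly."""
    if len(history) < 2:
        raise LoginError(
            "Expected at least 2 redirects but got %d; check your secret "
            "answer settings (possibly a typo in one of the answers)."
            % len(history)
        )
    return history[1].headers["Location"]
```

The caller can then apply the existing regular expression to the returned header, and a failed login surfaces as a readable message instead of an IndexError.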