jobfunnel's Introduction

Hi there 👋

I'm Paul; I like to build software tools and hardware gadgets.

  • 🤔 I'm looking for help with: JobFunnel, a tool for automating your job search
  • 📫 I'm looking to collaborate on: other apps for democratizing data
  • ⚡ Fun fact: I make music under the name Scramble Suit

jobfunnel's People

Contributors

arax1, bunsenmurder, cclauss, itseez, jacenfox, lilysu, marchbnr, markkvdb, paulmcinnis, riyaagrahari, thebigg, zenahr


jobfunnel's Issues

Failing in Ubuntu OS

Failed to install funnel on Ubuntu 19.10. Am I missing something? Are there any prerequisites for funnel installation?

$ pip3 install git+https://github.com/PaulMcInnis/JobFunnel.git

Error logs

    ERROR: Command errored out with exit status 1:
     command: /home/xyz/.asdf/installs/python/3.8.0/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-dbbevzm6/install-record.txt --single-version-externally-managed --compile
         cwd: /tmp/pip-install-n91obe4e/scikit-learn/
    Complete output (90 lines):
    Partial import of sklearn during the build process.
    blas_opt_info:
    blas_mkl_info:
    customize UnixCCompiler
      libraries mkl_rt not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    blis_info:
      libraries blis not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    openblas_info:
      libraries openblas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_3_10_blas_threads_info:
    Setting PTATLAS=ATLAS
      libraries tatlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_3_10_blas_info:
      libraries satlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_threads_info:
    Setting PTATLAS=ATLAS
      libraries ptf77blas,ptcblas,atlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_info:
      libraries f77blas,cblas,atlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    accelerate_info:
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Optimized (vendor) Blas libraries are not found.
        Falls back to netlib Blas library which has worse performance.
        A better performance should be easily gained by switching
        Blas library.
      if self._calc_info(blas):
    blas_info:
      libraries blas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Blas (http://www.netlib.org/blas/) libraries not found.
        Directories to search for the libraries can be specified in the
        numpy/distutils/site.cfg file (section [blas]) or by setting
        the BLAS environment variable.
      if self._calc_info(blas):
    blas_src_info:
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Blas (http://www.netlib.org/blas/) sources not found.
        Directories to search for the sources can be specified in the
        numpy/distutils/site.cfg file (section [blas_src]) or by setting
        the BLAS_SRC environment variable.
      if self._calc_info(blas):
      NOT AVAILABLE
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 290, in <module>
        setup_package()
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 286, in setup_package
        setup(**metadata)
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/core.py", line 137, in setup
        config = configuration()
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 174, in configuration
        config.add_subpackage('sklearn')
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 1033, in add_subpackage
        config_list = self.get_subpackage(subpackage_name, subpackage_path,
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 999, in get_subpackage
        config = self._get_configuration_from_setup_py(
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 941, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "sklearn/setup.py", line 66, in configuration
        config.add_subpackage('utils')
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 1033, in add_subpackage
        config_list = self.get_subpackage(subpackage_name, subpackage_path,
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 999, in get_subpackage
        config = self._get_configuration_from_setup_py(
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 941, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "sklearn/utils/setup.py", line 8, in configuration
        from Cython import Tempita
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/xyz/.asdf/installs/python/3.8.0/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-dbbevzm6/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
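
The root cause is the last line of the log: pip is building scikit-learn from source (no prebuilt wheel matched this Python 3.8 install), and the source build needs Cython and NumPy present first. A likely workaround, sketched under that assumption:

    $ pip3 install --upgrade pip setuptools wheel
    $ pip3 install cython numpy
    $ pip3 install git+https://github.com/PaulMcInnis/JobFunnel.git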

Improved remote job search support

Description

It would be great to be able to search without specifying a city, as even if the filter is set to remote, the results returned are still based on the city indicated.

Steps to Reproduce

  1. Include '-Indeed' in 'providers:' in my_settings.yaml
  2. Run funnel load -s my_settings.yaml

Expected behavior

Successfully scrape and generate .csv

Actual behavior

Traceback and ValueError if 'province_or_state' or 'city' is not included

Environment

  • Build: 3.0.0
  • Operating system and version: Windows 10
    Desktop Environment and/or Window Manager: Chrome

Package for PyPI

Is your feature request related to a problem? Please describe.
We are missing some setup items for PyPI:
https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56

Describe the solution you'd like
We should make this package conform to PyPI specs so that we can offer a more accessible installation.

Describe alternatives you've considered
We currently offer installation via a GitHub URL; this works totally fine, but I would also like to offer it on PyPI.
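
For reference, the upload flow itself is the standard setuptools/twine one; a sketch of what cutting a PyPI release could look like once setup.py conforms:

    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*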

Let's pick a better name!

  • workmate or jobmate: too long?
  • slacker or slack: my favourite, but portrays the opposite; ironic?
  • workaholic: accurate but maybe not the best; too long?
  • opus or operis: both mean work in Latin

Let me know what you guys think, and maybe propose your own suggestions :)

Blurbs are still being retrieved for filtered out jobs

Description

Currently the scraper is still retrieving blurbs for jobs that have been filtered out by the pre_filter method.

Steps to Reproduce

  1. Run JobFunnel under any query and make sure the results are saved to a directory without a master_list.csv or duplicate_list.csv file.
  2. Run the scraper again and take note of the number of unique jobs found by the pre_filter, then count the number of individual jobs that are being scraped. You should notice that they don't match.

Expected behavior

The scraper should remove jobs identified by the pre_filter, and only obtain blurbs for the remaining jobs.

Actual behavior

The scraper retrieves blurbs for all jobs whether they were filtered out or not.

To fix the issue, the order of the creation of the scrape_list and the call to the pre_filter method would have to be switched. The screenshot below highlights the issue within the code and the debugger output:
[screenshot]

Although this could have been fixed in a pull request, making this fix would break the date_filter called by the pre_filter method in the main JobFunnel class.

Environment

  • Build: Master 0a246cb
  • Operating system and version: Arch Linux
  • [Linux] Desktop Environment and/or Window Manager: Gnome

GlassDoor support (fix and re-enable)

Issue

Description

Currently we get the second page of glassdoor via the URL of the 2 button, but this no longer works as it redirects you to the first page. This is the case whether we use the webdriver or not.

Steps to Reproduce

  1. navigate to https://www.glassdoor.ca/Job/waterloo-python-jobs-SRCH_IL.0,8_IC2280158_KO9,15.htm?radius=12&p=2

Expected behavior

We get to the second page of jobs

Actual behavior

We are redirected to the first page during the GET, which leads to every single page of jobs being a duplicate of the first page, with loads of TFIDF duplicate detection hits.

If you click the 2 button yourself, you will get a toast re: subscribing to email notifications, which will then navigate you to the second page.

Environment

  • Build: current development, or the branch on #85
  • Operating system and version: Ubuntu 20.04
  • [Linux] Desktop Environment and/or Window Manager: Chrome

log_path has an invalid value

Description

The log_path variable isn't being set. I tried setting log_path in settings too, but I'm not sure what the issue is.

Steps to Reproduce

  1. Install via pip
  2. Copy settings.yaml into ~/Virtualenv/Lib/site-packages/jobfunnel/config
  3. Run 'funnel'

Expected behavior

Funnel runs

Actual behavior

$ funnel
ConfigError: 'log_path' has an invalid value

Environment

  • Build: Latest as of writing
  • Operating system and version: Windows 10 using mingw64 w/ python 3.7.3

Monster results contain CSS in blurb field

Description

Hey everyone, I was gonna cut us a new release, but I noticed an issue:

Currently you get blurbs like the one below for all jobs scraped from Monster with the GlassDoorStatic scraper:

.css-1noe2rc *{color:#505863;line-height:1.4em;}.css-1noe2rc .ecgq1xb1{padding-left:0;}.css-1noe2rc .ecgq1xb1 .ecgq1xb0{margin:0 0 8px 0;}.css-1noe2rc ol,.css-1noe2rc ul{padding-left:32px;}.css-1noe2rc li{margin:10px;margin-bottom:5px;margin-left:20px;line-height:1.4em;}.css-58vpdc{margin-bottom:24px;}.css-58vpdc ul{margin:5px 0 10px 20px;}.css-58vpdc ul > br{display:none;}.css-58vpdc ul > li{margin-left:0;}.css-58vpdc li{padding:0;}PlayStation isn't just the Best Place to Play it's also the Best Place to Work. We've thrilled gamers since 1994, when we launched the original ...

Since glassdoor runs first, most of the duplicates are in other job sites, and as a result most of the jobs I scrape now have blurbs like the one above.

I was just checking out GlassDoorDynamic and it seems to work well, but it misses the date and blurb fields for jobs. As a side note, watching the browser windows go by made me feel like I was in the matrix 😎

Perhaps it is easier for us to purge some of the CSS from these blurbs in the GlassDoorStatic scraper with a regex for the longest string in the raw scrape? Open to suggestions; a regex sketch follows the list below.

Alternatively we could just:

  • rename blurb to something more 'machiney' to reflect that it's not gonna be human readable all the time
  • switch to scraping indeed first in default settings so most of the blurbs will be human-readable.
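
For the regex option mentioned above, a minimal sketch (the pattern is an assumption derived from the sample blurb, not something tested against a full scrape):

    import re

    # Matches inline CSS rule blocks like ".css-1noe2rc *{...}" from the sample blurb
    CSS_RULE = re.compile(r'\.css-[\w-]+[^{}]*\{[^}]*\}')

    def strip_css(blurb: str) -> str:
        return CSS_RULE.sub('', blurb).strip()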

Steps to Reproduce

Easily replicable with the stock YAML on current master via the command funnel -kw Engineer; the resulting ./search/masterlist.csv will contain the aforementioned results.

Expected behavior

blurb should not contain CSS

Actual behavior

blurb contains CSS.

Environment

  • Build: current Master b30b28453a0f3528095166d0cbbe871726929b64
  • Operating system and version: OSX, Python3 w/ fresh install in a Virtualenv.
  • [Linux] Desktop Environment and/or Window Manager: n/a

Search term specification

First off, great idea which could definitely be useful for a lot of people!

Trying to use the application does raise a few questions for me though. From the README and the demo it is unclear to me what type of search terms I can use. The demo provides the province, city, domain and radius for the region search term.

To be concrete:

  • Can I provide a country to search for?
  • What does domain mean?
  • I suppose radius is in kilometres (or miles)?

Unable to run funnel --help

After installation, when I try to run funnel --help, I get an error saying:

funnel : The term 'funnel' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of
the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1

+ funnel --help
    + CategoryInfo          : ObjectNotFound: (funnel:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Improve test coverage

Description

A cost of releasing version 3.0.0 is a significant loss of test coverage.

Steps to Reproduce

git clone https://github.com/PaulMcInnis/JobFunnel
cd JobFunnel
pytest

Expected behavior

We should cover JobFunnel and the scraper get/set methods with unit testing via pytest. Existing codecov was set to around 60%.

Coverage

I've added Notes for all the modules that need testing coverage in a kanban here: https://github.com/PaulMcInnis/JobFunnel/projects/2

Please keep this up-to-date so that we don't duplicate each other's work on upping the test coverage 👍

failed to scrape Indeed: 'NoneType' object has no attribute 'contents'

Ran
$ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml

and got

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 366
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
....

Cannot open shared object

Description

When I run any funnel command (--help, -s) I get the following errors:

Traceback (most recent call last):
  File "/home/pi/.local/bin/funnel", line 6, in <module>
    from jobfunnel.__main__ import main
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/__main__.py", line 11, in <module>
    from .jobfunnel import JobFunnel
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/jobfunnel.py", line 20, in <module>
    from .tools.delay import delay_alg
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/tools/delay.py", line 9, in <module>
    from scipy.special import expit
  File "/home/pi/.local/lib/python3.7/site-packages/scipy/special/__init__.py", line 633, in <module>
    from . import _ufuncs
ImportError: libf77blas.so.3: cannot open shared object file: No such file or directory

Environment

Raspbian OS
Raspberry Pi 4
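
The missing libf77blas.so.3 is part of the ATLAS BLAS runtime that scipy links against; on Raspbian it usually is not installed by default. A likely fix, inferred from the library name rather than confirmed in this thread:

    $ sudo apt-get install libatlas-base-dev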

Python 2

Python 2 has been deprecated for some time now, and its end of life is 2020. We should NOT support it.

Proxy support

Proxy support would be nice if you were somewhere that requires them.

E.g. in indeed.py

http_proxy = "http.foo.com:8000"
https_proxy = "http.foo.com:8000"
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
}
request_HTML = get(search, headers=self.headers, proxies=proxyDict)

Improved search keyword encoding with support for exact phrase

Description

For example, on Indeed, when you want to search for an exact phrase (multiple words) as a keyword, you put the phrase between double quotes.

When I try to use this feature with funnel, it removes the double quotes and returns wrong results.

Steps to Reproduce

  1. Use funnel with a multi-word keyword between double quotes
  2. Example: -kw "Data Distribution Service"

Expected behavior

Normally, when you enter these keywords on the Indeed website, this is the URL that is generated:
https://www.indeed.com/jobs?q=%22data+distribution+service%22&l=Saratoga%2C+CA&radius=25

Actual behavior

But funnel generates this URL:
getting indeed page 0 : http://www.indeed.com/jobs?q=Data Distribution Service&l=Saratoga%2C+CA&radius=25&limit=50&filter=0&start=0
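
The fix amounts to URL-encoding the keyword before the query string is assembled; a minimal sketch with the standard library (funnel's actual URL-building code is not shown here):

    from urllib.parse import quote_plus

    keyword = '"Data Distribution Service"'
    encoded = quote_plus(keyword.lower())
    # -> '%22data+distribution+service%22', matching what the Indeed site generates
    url = 'https://www.indeed.com/jobs?q=' + encoded + '&l=Saratoga%2C+CA&radius=25'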

Environment

  • Windows 10 Home

failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste'

I get an error when running funnel; an exception gets caught when scraping Indeed, and it then moves on to Monster...

 $ funnel -s JobFunnel/jobfunnel/config/settings.yaml

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 362
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
Found 291 indeed results for query=security
getting indeed page 0 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0
getting indeed page 1 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50
Starting new HTTP connection (1): www.indeed.ie
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 2 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 3 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 4 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 5 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 375
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 374
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 376
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 200 None
failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste'
jobfunnel monster to pickle running @ : 2020-01-05
Starting new HTTPS connection (1): www.monster.ie
https://www.monster.ie:443 "GET /jobs/search/?q=security&whe

Note that I've changed the location to xxxx for posting purposes.

The settings file looks like this

# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'
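
The crash suggests the scraper calls int() on Indeed's relative-date text, which can read "Just posted" or "Today" rather than "n days ago". A hedged sketch of a more tolerant parse (a stand-in illustrating the guard, not the project's actual post_date_from_relative_post_age):

    import re

    def parse_post_age_days(text: str) -> int:
        # "Just posted" / "Today" contain no digits; treat them as 0 days old
        match = re.search(r'\d+', text)
        return int(match.group()) if match else 0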

argument --no_scrape is broken

pauls-mbp $ funnel --no_scrape
jobfunnel initialized at 2019-07-07
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "/Users/paulmcinnis/JobFunnel/jobfunnel/__main__.py", line 30, in main
    jp.load_pickle()
  File "/Users/paulmcinnis/JobFunnel/jobfunnel/jobfunnel.py", line 62, in load_pickle
    pickle_filepath = os.path.join(args['data_path'], 'scraped',
NameError: name 'args' is not defined
pauls-mbp $

Integrate a Web-app

Description
It would be really nice to have some kind of web-app to use jobfunnel with, perhaps even just to provide a good demo experience.

i.e. https://pages.github.com/

Describe the solution you'd like
Ideally the user could run jobfunnel in-browser and review results with a simple web-app.

Describe alternatives you've considered
It may be more desirable to make something that runs on a user's local machine; limitations of GitHub Pages could prove a blocker.

[Feature Request] Export as JSON

I guess at some point dicts get converted to .csv, so maybe we could have a --json flag which exports the jobs in JSON format instead of CSV.
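
A minimal sketch of what the writer behind such a flag could look like (the function name is illustrative; the jobs dict shape follows the scraped data shown elsewhere on this page):

    import json

    def write_json(jobs: dict, path: str) -> None:
        # Mirror the CSV writeout, but dump the raw jobs dict as JSON
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(jobs, f, indent=2, ensure_ascii=False)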

Empty incoming jobs update()

Description

I seem to have botched this in the release... oops!

Everything seems to be working fine for the demo scrape, but suddenly you get no jobs added because we update it to be Nothing!

Steps to Reproduce

funnel load -s demo/settings.yaml

Expected behavior

We should be updating to accumulate incoming jobs.

Actual behavior

We never accumulate anything! But we get nice progress bars (and the scrape goes fine).

Environment

  • Build: 3.0.0

More Sites to Scrape

I attempted to adjust the 'providers' in settings.yaml, but I found a few that raised errors. The following would be great additions to the tool:

  • 'hire.google'
  • 'Angel.co'
  • 'greenhouse.io'
  • 'jobs.jobvite'
  • 'workable'

[Fix Included] Google Chrome Driver undefined

Description

Google Chrome Driver not defined

Steps to Correct

Replace line 158 of jobfunnel/tools/tools.py
webdriver.Chrome(ChromeDriverManager().install())
with
driver = webdriver.Chrome(ChromeDriverManager().install())
so that the function get_webdriver() has a value to return.
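
In context, the corrected tail of the function would look roughly like this (a sketch; the surrounding branches of get_webdriver are elided):

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    def get_webdriver():
        # Assign the driver so the function has a value to return
        driver = webdriver.Chrome(ChromeDriverManager().install())
        return driver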

Why not a pull request? Too small of a fix

Environment

  • Installed Commit: cb9f152
  • Operating system and version: Windows 10
  • Browser: Google Chrome

ValueError: empty vocabulary

Description

Standard search produces web scrape error

Steps to Reproduce

Standard search with

  • 'Indeed'
  • 'Monster'
  • 'GlassDoor'

Expected behavior

Results of query

Actual behavior

No loglevel

Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

query_words is empty and therefore cannot be fit_transformed by the vectorizer.
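
A sketch of a guard that would avoid the crash when every blurb is empty (names follow the traceback; skipping the filter outright is an assumption about the desired behavior):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def tfidf_similarities(query_words):
        if not any(words.strip() for words in query_words):
            return None  # nothing to compare; skip TF-IDF duplicate detection
        vectorizer = TfidfVectorizer()
        return cosine_similarity(vectorizer.fit_transform(query_words))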

Debug Loglevel

GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381722
Finished Request
Found 8 glassdoor results for query=Advertising-Marketing-Coordinator-Account-Agency
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 144
Finished Request
getting glassdoor page 1 : https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 14
Finished Request
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666
Finished Request
DELETE http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/window {}
http://127.0.0.1:50081 "DELETE /session/02b7e485dd5ae5ae4fb5c16bf406267a/window HTTP/1.1" 200 14
Finished Request
found 8 unique job ids and 0 duplicates from glassdoor
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Calculating delay...
Done! Starting scrape!
delay of 0.00s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770
Finished Request
delay of 22.19s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 22.34s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 24.76s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 27.24s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.04s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.64s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 18.15s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
glassdoor scrape job took 173.619s
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

webdriver manager returning 404 errors?

Variable Contents

prev_dict

None

cur_dict.values()

odict_values([{'status': 'new', 'title': 'Account Manager Digital Marketing - Professional Services - Entertainment and Media Industry Opportunity', 'company': 'Gannett', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699', 'id': '3596513699', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Marketing', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227', 'id': '3593859227', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Marketing Coordinator', 'company': 'Gourmet Marketing LLC', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566', 'id': '3319079566', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group, Inc.', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465', 'id': '3582441465', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'COLLEGE GRADS & INTERNS - Entry Level Marketing & Advertising', 'company': 'Millennium Events Management', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096', 'id': '3584976096', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Senior Account Executive (Marketing/Advertising)', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726', 'id': '3579768726', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748', 'id': '3504589748', 'provider': 'glassdoor', 'query': 
'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Digital Account Coordinator', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733', 'id': '3543437733', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}]) 

query_ids

['3596513699', '3593859227', '3319079566', '3582441465', '3584976096', '3579768726', '3504589748', '3543437733']

query_words

['', '', '', '', '', '', '', '']

Environment

  • Operating system and version: Windows 10

beautifulsoup4>=4.6.3 (4.9.1)
lxml>=4.2.4 (4.5.1)
requests>=2.19.1 (2.23.0)
python-dateutil>=2.8.0 (2.8.1)
PyYAML>=5.1 (5.3.1)
scikit-learn>=0.21.2 (0.23.1)
nltk>=3.4.1 (3.5)
scipy>=1.4.1 (1.4.1)
selenium>=3.141.0 (3.141.0)
webdriver-manager>=2.4.0 (3.1.0)
soupsieve>1.2 (2.0.1)
certifi>=2017.4.17 (2020.4.5.2)
urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9)
chardet<4,>=3.0.2 (3.0.4)
idna<3,>=2.5 (2.9)
six>=1.5 (1.15.0)
threadpoolctl>=2.0.0 (2.1.0)
joblib>=0.11 (0.15.1)
numpy>=1.13.3 (1.18.5)
click (7.1.2)
tqdm(4.46.1)
atomicwrites>=1.0; (1.4.0)
packaging (20.4)
pluggy<1.0,>=0.12 (0.13.1)

Glassdoor.com is not working

Description

Just today I discovered that when scraping Glassdoor.com, JobFunnel fails.

Steps to Reproduce

  1. Comment out Indeed and Monster from the providers options in settings.yaml as such:
        # - 'Indeed'
        # - 'Monster'
        - 'GlassDoor'
  2. Run job funnel: funnel -s settings.yaml

Expected behavior

Scrape Glassdoor.com and store jobs in master_list.csv

Actual behavior

JobFunnel output:

jobfunnel initialized at 2020-05-05
no master-list, filter-list was not updated
jobfunnel glassdoor to pickle running @ 2020-05-05
failed to scrape GlassDoor: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel==2.1.6', 'console_scripts', 'funnel')()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/__main__.py", line 55, in main
    jf.update_masterlist()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/jobfunnel.py", line 291, in update_masterlist
    raise ValueError('No scraped jobs, cannot update masterlist')

Environment

  • Operating system and version: Linux Mint (Ubuntu 18.04)
  • Desktop Environment and/or Window Manager: Cinnamon
  • Tested on .com (United States domain) and .ca (Canada domain)
    NOTE: I also ran JobFunnel in an isolated docker container (Ubuntu 18.04) and the issue persisted.

I discovered this while inspecting glassdoor.py for testing. I will try my best to tackle this issue in the upcoming days. Hopefully we'll fix it soon!

Cheers!

TFIDF content matching should check inter-scrape

Description

Currently we remove duplicates everywhere, but we only remove duplicates by description (TFIDF) between the masterlist and the scrape data.

We should allow masterlist to perform a content match to itself.

Steps to Reproduce

  1. scrape some jobs to .pkl
  2. copy-paste a row a few times, only changing the key_id
  3. run again with --no-scrape

Expected behavior

We should be running TFIDF on inter-scrape data and on the inter-master CSV.

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

  • Build: 3.0.0

Significant lag in CLI

Description

Hi all,

Installing the latest version of JobFunnel and running funnel --help takes a few seconds the first time. Later calls are slightly faster but are arguably still a tad slow.

I know I could have looked into the cause of the problem right now, but I'm really low on time as of late. If this issue still persists when I have more time, I will come back to this (kind of a reminder for myself then).

Steps to Reproduce

  1. Fresh install of JobFunnel
  2. Call funnel --help
  3. Wait

Expected behavior

A few ms delay before showing CLI help information.

Actual behavior

Significant delay between function call and presented info.

Environment

  • Build: #75
  • Operating system and version: macOS 10.14

pip installer is broken

pip installer fails to install because it can't find pipenv.

Currently, none of the installation instructions in the README work.

Implement European locales

Is your feature request related to a problem? Please describe.
We currently only offer CANADA_ENGLISH and USA_ENGLISH scrapers, but we should also implement some European locales.

Describe the solution you'd like
Implement some European locales such as UK_ENGLISH and FRANCE_FRENCH.

Describe alternatives you've considered
n/a

Additional context
Existing request #45

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation

Description

Currently we are using the JobFunnel class for too much; I want to break it down into the following:

from abc import ABC, abstractmethod
from typing import List
import datetime

class Job(object):
    def __init__(self, title: str, company: str, location: str,
                 tags: List[str], post_date: datetime.date,
                 key_id: str, url: str) -> None:
        ...

class Scraper(ABC):

    @abstractmethod
    def scrape(self) -> List[Job]:
        pass

def main():
    # instantiate scrapers
    # run filter on list of Job
    # dump pickle
    # writeout CSV
    ...

Note: if I get to it, I'd also like our filters to be an ABC.

Steps to Reproduce

This is a structural technical debt issue. (n/a)

Expected behavior

The abstract base class should not be halfway abstract; we need separation between JobFunnel, main(), and the inherited scrapers.

Actual behavior

JobFunnel being monolithic and half-abstract has allowed us to implement three script-like scrapers which share too many methods, without an actual Job object.

Environment

n/a


Current Status:

  • Job Object
  • Support for Internationalization
  • BaseScraper with get/set scraping logic
  • New YAML and CLI implemented
  • Schema Validation with Cerberus
  • Caching
  • Filtering with lists
  • Indeed
  • Monster
  • GlassDoorStatic (works but seems like it has bugs so fixing this).
  • Wage Scraping
  • GlassDoor Dynamic/Driven
  • Duplicates list file support
  • Integrate TFIDF similarity filter (special case filter)
  • Prevent writing out empty CSVs in --no-scrape mode
  • Prevent delayed get/set for jobs which fail filters
  • Fix multi-page Monster scraping
  • Handle duplicated jobs special case
  • Make JobFilter class
  • Add TAG scraping to Monster
  • Implement job filtering as own class
  • Fix paths from -s yaml being overwritten with defaults from the CLI
  • Fix concurrency issue with dependencies for get/set
  • Monkey / general usability testing
  • Update main README
  • Update other READMEs + tutorials
  • Add versioning to cache files (i.e. wrapper for dict with metadata)
  • Review various FIXMEs in-code
  • Fix build (Travis CI)
  • Test setup.py
  • Fix demo GIF
  • Document how to write new scrapers with localization


Future work:

  • Google jobs scraper
  • Ycombinator job scraper
  • Assess the update experience from V2.0 --> V3.0, provide a guide
  • cut a release
  • Add WAGE scraping to Indeed
  • Add REMOTE scraping to Indeed
  • Add REMOTE scraping to Monster

Bad url search keyword encoding

Description

We currently pass keywords verbatim, even if they include "+" or other special characters; we should be encoding these so as not to disrupt the URL formation.

Steps to Reproduce

  1. Search with any word containing +s such as "C++"

Expected behavior

We would get a query containing "%"-encoded values in place of +s

Actual behavior

We build a search string with erroneous + chars that break the search query URL (the query is not as intended)
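
As with the exact-phrase issue above, the standard library already covers this; a one-line sketch:

    from urllib.parse import quote_plus

    quote_plus('C++')  # -> 'C%2B%2B'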

Environment

  • Build: current dev
  • Operating system Ubuntu 18.04

Implement Remote scraping

Is your feature request related to a problem? Please describe.
Currently we do not scrape the Remote field for Indeed or Monster, which provide many jobs.

Indeed provides this fairly visibly; it's just a bit buried in tags.
Glassdoor has this implemented already (tags as well).
Monster does not seem to have this capability, but it may be worth looking into whether we can infer it.

Describe the solution you'd like
We should scrape Remote = True if a job is fully-remote on Indeed. Temporarily remote doesn't count, sorry!
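
A minimal sketch of that inference from scraped tag strings (the exact tag wording is an assumption):

    def is_fully_remote(tags: list) -> bool:
        # "Temporarily remote" should not count as Remote = True
        lowered = [tag.lower() for tag in tags]
        has_remote = any('remote' in tag for tag in lowered)
        is_temporary = any('temporarily' in tag for tag in lowered)
        return has_remote and not is_temporary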

Describe alternatives you've considered
n/a

Additional context
[screenshot]

Bad Status

When changing the status to applied and rerunning JobFunnel, master_list.csv has no fields!
Yikes!
Every run of JobFunnel thereafter will result in an empty list of jobs.
If you delete master_list.csv, a new list will be created the next time JobFunnel runs, but the previous jobs are gone.

We need two things:

  1. To fix this issue.
  2. Make a way to dump existing pickles to form master_list.csv in case this ever happens again.

I think by default I will have the software search for existing pickles by changing:
load_pickle(self, args) to load_pickles(self, args).
This way, all pickles are important and not just today's pickle.
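
A sketch of what load_pickles could look like (the 'scraped' subdirectory name comes from the --no_scrape traceback elsewhere on this page; the merge strategy is an assumption):

    import glob
    import os
    import pickle

    def load_pickles(self, args):
        # Merge every cached day's scrape, not just today's pickle
        jobs = {}
        for path in sorted(glob.glob(os.path.join(args['data_path'], 'scraped', '*.pkl'))):
            with open(path, 'rb') as file:
                jobs.update(pickle.load(file))
        return jobs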

Check config after argparse

Description

If the user provides an invalid configuration file, then the program will throw unclear errors. I think it's good practice to check all settings of the config dictionary and show a helpful error message in case the config is invalid; a sketch of such a check follows the steps below.

Steps to Reproduce

  1. Provide an invalid configuration file (settings.yaml)
  2. Run funnel -s settings.yaml
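
A hedged sketch of such a post-parse check (key names are taken from the settings file shown elsewhere on this page; the project's real validator would live in config/validate.py per the test-coverage checklist below):

    class ConfigError(ValueError):
        """Raised when settings.yaml fails validation."""

    def check_config(config: dict) -> None:
        required = ('output_path', 'providers', 'search_terms')
        missing = [key for key in required if key not in config]
        if missing:
            raise ConfigError('settings file is missing keys: {}'.format(missing))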

Using JobFunnel with PyCharm

[screenshot]

I am opening this issue because I think we need a pycharm folder for debugging this project.
PyCharm has become a very popular IDE and can really help with debugging this project.
If we can make a folder that gets developers up and running in debug mode in just a few simple steps then I think it is worth doing.

Main benefits of using PyCharm for developing this project:

  • PyCharm has real time break points that are easy to set.
  • PyCharm has an intuitive interface for running expressions in debug mode.
  • PyCharm recognizes URLs in the log output which allows quick "one-click" access to job postings for validation.

Increase test coverage

Description

Testing was introduced to the JobFunnel software a few weeks ago, but it only covers a fraction of the entire code base. I think increasing the quality of the testing framework with unit and integration tests can make it easier for reviewers to assess new pull requests, and it gives new contributors more feedback on whether their changes to the code base break functionality.

I came up with two ideas to provide a clear and productive environment to increase test coverage:

  1. Adding test coverage (next to the build status) could be a good way to keep track of the current state of the test environment.
  2. Have a checklist of all functions in the code base and whether this function is covered by tests.

Note that checking whether a function is 'covered' by tests can be a bit tricky, because some functions are rather long, e.g. parse_config. Hence, true unit tests are very difficult for these long functions but can be easily implemented for shorter functions. Realising that unit tests are complicated for certain functions might also be a sign that these functions should be broken into smaller parts.

Test coverage

glassdoor.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • parse_blurb
  • scrape

indeed.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • get_blurb_with_delay
  • parse_blurb
  • scrape

monster.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • get_blurb_with_delay
  • parse_blurb
  • scrape

jobfunnel.py

  • init_logging
  • load_pickle
  • load_pickles
  • dump_pickle
  • read_csv
  • write_csv
  • remove_jobs_in_filterlist
  • remove_blacklisted_companies
  • update_filterjson
  • pre_filter
  • delay_threader
  • update_masterlist

tools

delay.py

  • _c_delay
  • _lin_delay
  • _sig_delay
  • delay_alg

filters.py

  • id_filter
  • tfidf_filter

tools.py

  • filter_non_printables
  • post_date_from_relative_post_age
  • split_url
  • proxy_dict_to_url
  • change_nested_dict
  • config_factory

config

parser.py

  • parse_cli
  • cli_to_yaml
  • update_yaml
  • recursive_check_config_types
  • check_config_types
  • parse_config

validate.py

  • validate_region
  • validate_delay
  • validate_config

Want to contribute?

Do you like this project and do you want to make it even better? Feel free to discuss below if you want to contribute to this project. All help is welcome 👍.

Want to start with something (relatively) easy? The functions in validate.py, tools.py and delay.py are relatively easy to test (at first inspection)!
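
As a starting point, here is a sketch of what one such unit test could look like. The real signatures are not shown on this page, so the function under test is an illustrative stand-in for the tools.py helper of the same name:

    import pytest

    def filter_non_printables(text: str) -> str:
        # Illustrative stand-in, not the project's actual implementation
        return ''.join(ch for ch in text if ch.isprintable())

    @pytest.mark.parametrize('raw, expected', [
        ('plain', 'plain'),
        ('tab\tand\nnewline', 'tabandnewline'),
    ])
    def test_filter_non_printables(raw, expected):
        assert filter_non_printables(raw) == expected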

Implement wage scraping for Indeed and Monster

Is your feature request related to a problem? Please describe.
Currently Glassdoor performs wage scraping, but Indeed and Monster should do this as well.

Describe the solution you'd like
Add scraping for JobField.WAGE to indeed and monster scrapers

Apply NLP to condense description

Description

There are currently a number of state-of-the-art algorithms to summarize long passages of text.

We should apply one of these methods to produce the currently-unused short_description JobField

Example algorithms (whatever we use must provide a good license):

Steps to Reproduce

n/a

Expected behavior

We should populate short_description with a condensed (1-3 sentences?) description generated from the longer Job.description

Actual behavior

short_description is un-implemented.
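
Until a summarization method is chosen, a naive extractive baseline could populate the field; a sketch (leading sentences only, not a real NLP method):

    import re

    def make_short_description(description: str, max_sentences: int = 2) -> str:
        # Take the first sentences as a crude stand-in for a real summarizer
        sentences = re.split(r'(?<=[.!?])\s+', description.strip())
        return ' '.join(sentences[:max_sentences])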

Environment

n/a

Unknown Location Error

Files.zip

Description

I have 2 identical .yaml files. The only differences are the location where the information is stored, and that one looks for CFO positions and the other Human Resources. However, the Human Resources one has the following error whereas the CFO one does not:
Traceback (most recent call last):
  File "/home/pittsie/.local/bin/funnel", line 8, in <module>
    sys.exit(main())
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/__main__.py", line 28, in main
    job_funnel.run()
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/backend/jobfunnel.py", line 86, in run
    self.master_jobs_dict = self.read_master_csv()
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/backend/jobfunnel.py", line 405, in read_master_csv
    locale = locale.UNKNOWN
AttributeError: 'NoneType' object has no attribute 'UNKNOWN'

Both files have been attached in a zip

Expected behavior

Run

Actual behavior

The error shown above occurs only with Bill.yaml, not 2Mom.yaml.

Environment

  • Build: 3.0.0
  • Operating system and version: Ubuntu 20.04
