jobfunnel's Introduction

Hi there 👋

I'm Paul; I like to build software tools and hardware gadgets.

  • 🤔 I'm looking for help with: JobFunnel, a tool for automating your job search
  • 📫 I'm looking to collaborate on: other apps for democratizing data
  • ⚡ Fun fact: I make music under the name Scramble Suit

jobfunnel's People

Contributors

arax1, bunsenmurder, cclauss, itseez, jacenfox, lilysu, marchbnr, markkvdb, paulmcinnis, riyaagrahari, thebigg, zenahr


jobfunnel's Issues

Failing in Ubuntu OS

Failed to install funnel on Ubuntu 19.10. Am I missing something? Are there any prerequisites for funnel installation?

$ pip3 install git+https://github.com/PaulMcInnis/JobFunnel.git

Error logs

    ERROR: Command errored out with exit status 1:
     command: /home/xyz/.asdf/installs/python/3.8.0/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-dbbevzm6/install-record.txt --single-version-externally-managed --compile
         cwd: /tmp/pip-install-n91obe4e/scikit-learn/
    Complete output (90 lines):
    Partial import of sklearn during the build process.
    blas_opt_info:
    blas_mkl_info:
    customize UnixCCompiler
      libraries mkl_rt not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    blis_info:
      libraries blis not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    openblas_info:
      libraries openblas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_3_10_blas_threads_info:
    Setting PTATLAS=ATLAS
      libraries tatlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_3_10_blas_info:
      libraries satlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_threads_info:
    Setting PTATLAS=ATLAS
      libraries ptf77blas,ptcblas,atlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    atlas_blas_info:
      libraries f77blas,cblas,atlas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    accelerate_info:
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Optimized (vendor) Blas libraries are not found.
        Falls back to netlib Blas library which has worse performance.
        A better performance should be easily gained by switching
        Blas library.
      if self._calc_info(blas):
    blas_info:
      libraries blas not found in ['/home/xyz/.asdf/installs/python/3.8.0/lib', '/usr/local/lib', '/usr/lib', '/usr/lib/x86_64-linux-gnu']
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Blas (http://www.netlib.org/blas/) libraries not found.
        Directories to search for the libraries can be specified in the
        numpy/distutils/site.cfg file (section [blas]) or by setting
        the BLAS environment variable.
      if self._calc_info(blas):
    blas_src_info:
      NOT AVAILABLE
    
    /home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/system_info.py:1896: UserWarning:
        Blas (http://www.netlib.org/blas/) sources not found.
        Directories to search for the sources can be specified in the
        numpy/distutils/site.cfg file (section [blas_src]) or by setting
        the BLAS_SRC environment variable.
      if self._calc_info(blas):
      NOT AVAILABLE
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 290, in <module>
        setup_package()
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 286, in setup_package
        setup(**metadata)
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/core.py", line 137, in setup
        config = configuration()
      File "/tmp/pip-install-n91obe4e/scikit-learn/setup.py", line 174, in configuration
        config.add_subpackage('sklearn')
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 1033, in add_subpackage
        config_list = self.get_subpackage(subpackage_name, subpackage_path,
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 999, in get_subpackage
        config = self._get_configuration_from_setup_py(
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 941, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "sklearn/setup.py", line 66, in configuration
        config.add_subpackage('utils')
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 1033, in add_subpackage
        config_list = self.get_subpackage(subpackage_name, subpackage_path,
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 999, in get_subpackage
        config = self._get_configuration_from_setup_py(
      File "/home/xyz/.asdf/installs/python/3.8.0/lib/python3.8/site-packages/numpy/distutils/misc_util.py", line 941, in _get_configuration_from_setup_py
        config = setup_module.configuration(*args)
      File "sklearn/utils/setup.py", line 8, in configuration
        from Cython import Tempita
    ModuleNotFoundError: No module named 'Cython'
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/xyz/.asdf/installs/python/3.8.0/bin/python3.8 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"'; __file__='"'"'/tmp/pip-install-n91obe4e/scikit-learn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-dbbevzm6/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
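
The root cause is the last line of the log: pip is building scikit-learn from source (no prebuilt wheel matched this Python 3.8 install), and the source build needs Cython and NumPy present first. A likely workaround, sketched under that assumption:

    $ pip3 install --upgrade pip setuptools wheel
    $ pip3 install cython numpy
    $ pip3 install git+https://github.com/PaulMcInnis/JobFunnel.git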

Improved remote job search support

Description

It would be great to be able to search without specifying a city, as even if the filter is set to remote, the results returned are still based on the city indicated.

Steps to Reproduce

  1. Include '-Indeed' in 'providers:' in my_settings.yaml
  2. Run funnel load -s my_settings.yaml

Expected behavior

Successfully scrape and generate .csv

Actual behavior

Traceback and ValueError if 'province_or_state' or 'city' is not included

Environment

  • Build: 3.0.0
  • Operating system and version: Windows 10
    Desktop Environment and/or Window Manager: Chrome

Package for PyPI

Is your feature request related to a problem? Please describe.
We are missing some setup items for PyPI:
https://medium.com/@joel.barmettler/how-to-upload-your-python-package-to-pypi-65edc5fe9c56

Describe the solution you'd like
We should make this package conform to PyPI specs so that we can offer a more accessible installation.

Describe alternatives you've considered
We currently offer installation via a GitHub URL; this works totally fine, but I would also like to offer it on PyPI.
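
For reference, the upload flow itself is the standard setuptools/twine one; a sketch of what cutting a PyPI release could look like once setup.py conforms:

    $ python setup.py sdist bdist_wheel
    $ twine upload dist/*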

Let's pick a better name!

  • workmate or jobmate: too long?
  • slacker or slack: my favourite, but portrays the opposite; ironic?
  • workaholic: accurate but maybe not the best; too long?
  • opus or operis: both mean work in Latin

Let me know what you guys think, and maybe propose your own suggestions :)

Blurbs are still being retrieved for filtered out jobs

Description

Currently the scraper is still retrieving blurbs for jobs that have been filtered out by the pre_filter method.

Steps to Reproduce

  1. Run JobFunnel under any query and make sure the results are saved to a directory without a master_list.csv or duplicate_list.csv file.
  2. Run the scraper again and take note of the number of unique jobs found by the pre_filter, then count the number of individual jobs that are being scraped. You should notice that they don't match.

Expected behavior

The scraper should remove jobs identified by the pre_filter, and only obtain blurbs for the remaining jobs.

Actual behavior

The scraper retrieves blurbs for all jobs whether they were filtered out or not.

To fix the issue, the order of the creation of the scrape_list and the call to the pre_filter method would have to be switched. The screenshot below highlights the issue within the code and the debugger output:
[screenshot]

Although this could have been fixed in a pull request, making this fix would break the date_filter called by the pre_filter method in the main JobFunnel class.

Environment

  • Build: Master 0a246cb
  • Operating system and version: Arch Linux
  • [Linux] Desktop Environment and/or Window Manager: Gnome

GlassDoor support (fix and re-enable)

Issue

Description

Currently we get the second page of glassdoor via the URL of the 2 button, but this no longer works as it redirects you to the first page. This is the case whether we use the webdriver or not.

Steps to Reproduce

  1. navigate to https://www.glassdoor.ca/Job/waterloo-python-jobs-SRCH_IL.0,8_IC2280158_KO9,15.htm?radius=12&p=2

Expected behavior

We get to the second page of jobs

Actual behavior

We are redirected to the first page during the GET, which leads to every single page of jobs being a duplicate of the first page, with loads of TFIDF duplicate detection hits.

If you click the 2 button yourself, you will get a toast re: subscribing to email notifications, which will then navigate you to the second page.

Environment

  • Build: current development, or the branch on #85
  • Operating system and version: Ubuntu 20.04
  • [Linux] Desktop Environment and/or Window Manager: Chrome

log_path has an invalid value

Description

The log_path variable isn't being set. I tried setting log_path in settings too, but I'm not sure what the issue is.

Steps to Reproduce

  1. Install via pip
  2. Copy settings.yaml into ~/Virtualenv/Lib/site-packages/jobfunnel/config
  3. Run 'funnel'

Expected behavior

Funnel runs

Actual behavior

$ funnel
ConfigError: 'log_path' has an invalid value

Environment

  • Build: Latest as of writing
  • Operating system and version: Windows 10 using mingw64 w/ python 3.7.3

Monster results contain CSS in blurb field

Description

Hey everyone, I was gonna cut us a new release, but I noticed an issue:

Currently you get blurbs like the one below for all jobs scraped from Monster with the GlassDoorStatic scraper:

.css-1noe2rc *{color:#505863;line-height:1.4em;}.css-1noe2rc .ecgq1xb1{padding-left:0;}.css-1noe2rc .ecgq1xb1 .ecgq1xb0{margin:0 0 8px 0;}.css-1noe2rc ol,.css-1noe2rc ul{padding-left:32px;}.css-1noe2rc li{margin:10px;margin-bottom:5px;margin-left:20px;line-height:1.4em;}.css-58vpdc{margin-bottom:24px;}.css-58vpdc ul{margin:5px 0 10px 20px;}.css-58vpdc ul > br{display:none;}.css-58vpdc ul > li{margin-left:0;}.css-58vpdc li{padding:0;}PlayStation isn't just the Best Place to Play it's also the Best Place to Work. We've thrilled gamers since 1994, when we launched the original ...

Since glassdoor runs first, most of the duplicates are in other job sites, and as a result most of the jobs I scrape now have blurbs like the one above.

I was just checking out GlassDoorDynamic and it seems to work well, but it misses the date and blurb fields for jobs. As a side note, watching the browser windows go by made me feel like I was in the matrix 😎

Perhaps it is easier for us to purge some of the CSS from these blurbs in the GlassDoorStatic scraper with a regex for the longest string in the raw scrape? Open to suggestions; a regex sketch follows the list below.

Alternatively we could just:

  • rename blurb to something more 'machiney' to reflect that it's not gonna be human readable all the time
  • switch to scraping indeed first in default settings so most of the blurbs will be human-readable.
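
For the regex option mentioned above, a minimal sketch (the pattern is an assumption derived from the sample blurb, not something tested against a full scrape):

    import re

    # Matches inline CSS rule blocks like ".css-1noe2rc *{...}" from the sample blurb
    CSS_RULE = re.compile(r'\.css-[\w-]+[^{}]*\{[^}]*\}')

    def strip_css(blurb: str) -> str:
        return CSS_RULE.sub('', blurb).strip()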

Steps to Reproduce

Easily replicable with the stock YAML on current master via the command funnel -kw Engineer; the resulting ./search/masterlist.csv will contain the aforementioned results.

Expected behavior

blurb should not contain CSS

Actual behavior

blurb contains CSS.

Environment

  • Build: current Master b30b28453a0f3528095166d0cbbe871726929b64
  • Operating system and version: OSX, Python3 w/ fresh install in a Virtualenv.
  • [Linux] Desktop Environment and/or Window Manager: n/a

Search term specification

First off, great idea which could definitely be useful for a lot of people!

Trying to use the application does raise a few questions for me though. From the README and the demo it is unclear to me what type of search terms I can use. The demo provides the province, city, domain and radius for the region search term.

To be concrete:

  • Can I provide a country to search for?
  • What does domain mean?
  • I suppose radius is in kilometres (or miles)?

Unable to run funnel --help

After installation, when I try to run funnel --help, I get an error saying:

funnel : The term 'funnel' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of
the name, or if a path was included, verify that the path is correct and try again.
At line:1 char:1

+ funnel --help
    + CategoryInfo          : ObjectNotFound: (funnel:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Improve test coverage

Description

A cost of releasing version 3.0.0 is a significant loss of test coverage.

Steps to Reproduce

git clone https://github.com/PaulMcInnis/JobFunnel
cd JobFunnel
pytest

Expected behavior

We should cover JobFunnel and the scraper get/set methods with unit testing via pytest. Existing codecov was set to around 60%.

Coverage

I've added Notes for all the modules that need testing coverage in a kanban here: https://github.com/PaulMcInnis/JobFunnel/projects/2

Please keep this up-to-date so that we don't duplicate each other's work on upping the test coverage 👍

failed to scrape Indeed: 'NoneType' object has no attribute 'contents'

Ran
$ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml

and got

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 366
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
....

Cannot open shared object

Description

When I run any funnel command (--help, -s) I get the following errors:

Traceback (most recent call last):
  File "/home/pi/.local/bin/funnel", line 6, in <module>
    from jobfunnel.__main__ import main
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/__main__.py", line 11, in <module>
    from .jobfunnel import JobFunnel
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/jobfunnel.py", line 20, in <module>
    from .tools.delay import delay_alg
  File "/home/pi/.local/lib/python3.7/site-packages/jobfunnel/tools/delay.py", line 9, in <module>
    from scipy.special import expit
  File "/home/pi/.local/lib/python3.7/site-packages/scipy/special/__init__.py", line 633, in <module>
    from . import _ufuncs
ImportError: libf77blas.so.3: cannot open shared object file: No such file or directory

Environment

Raspbian OS
Raspberry Pi 4
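
The missing libf77blas.so.3 is part of the ATLAS BLAS runtime that scipy links against; on Raspbian it usually is not installed by default. A likely fix, inferred from the library name rather than confirmed in this thread:

    $ sudo apt-get install libatlas-base-dev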

Python 2

Python 2 has been deprecated for some time now, and its end of life is 2020. We should NOT support it.

Proxy support

Proxy support would be nice if you were somewhere that requires them.

E.g. in indeed.py

http_proxy = "http.foo.com:8000"
https_proxy = "http.foo.com:8000"
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
}
request_HTML = get(search, headers=self.headers, proxies=proxyDict)

Improved search keyword encoding with support for exact phrase

Description

For example, on Indeed, when you want to search for an exact phrase (multiple words) as a keyword, you put the phrase between double quotes.

When I try to use this feature with funnel, it removes the double quotes and returns wrong results.

Steps to Reproduce

  1. Use funnel with a multi-word keyword between double quotes
  2. Example: -kw "Data Distribution Service"

Expected behavior

Normally, when you enter these keywords on the Indeed website, this is the URL that is generated:
https://www.indeed.com/jobs?q=%22data+distribution+service%22&l=Saratoga%2C+CA&radius=25

Actual behavior

But funnel generates this URL:
getting indeed page 0 : http://www.indeed.com/jobs?q=Data Distribution Service&l=Saratoga%2C+CA&radius=25&limit=50&filter=0&start=0
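
The fix amounts to URL-encoding the keyword before the query string is assembled; a minimal sketch with the standard library (funnel's actual URL-building code is not shown here):

    from urllib.parse import quote_plus

    keyword = '"Data Distribution Service"'
    encoded = quote_plus(keyword.lower())
    # -> '%22data+distribution+service%22', matching what the Indeed site generates
    url = 'https://www.indeed.com/jobs?q=' + encoded + '&l=Saratoga%2C+CA&radius=25'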

Environment

  • Windows 10 Home

failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste'

I get an error when running funnel; an exception gets caught when scraping Indeed, and it then moves on to Monster...

 $ funnel -s JobFunnel/jobfunnel/config/settings.yaml

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 362
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
Found 291 indeed results for query=security
getting indeed page 0 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0
getting indeed page 1 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50
Starting new HTTP connection (1): www.indeed.ie
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 2 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 3 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 4 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 5 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 375
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 374
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 376
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 200 None
failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste'
jobfunnel monster to pickle running @ : 2020-01-05
Starting new HTTPS connection (1): www.monster.ie
https://www.monster.ie:443 "GET /jobs/search/?q=security&whe

Note that I've changed the location to xxxx for posting purposes.

The settings file looks like this

# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'
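
The crash suggests the scraper calls int() on Indeed's relative-date text, which can read "Just posted" or "Today" rather than "n days ago". A hedged sketch of a more tolerant parse (a stand-in illustrating the guard, not the project's actual post_date_from_relative_post_age):

    import re

    def parse_post_age_days(text: str) -> int:
        # "Just posted" / "Today" contain no digits; treat them as 0 days old
        match = re.search(r'\d+', text)
        return int(match.group()) if match else 0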

argument --no_scrape is broken

pauls-mbp $ funnel --no_scrape
jobfunnel initialized at 2019-07-07
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "/Users/paulmcinnis/JobFunnel/jobfunnel/__main__.py", line 30, in main
    jp.load_pickle()
  File "/Users/paulmcinnis/JobFunnel/jobfunnel/jobfunnel.py", line 62, in load_pickle
    pickle_filepath = os.path.join(args['data_path'], 'scraped',
NameError: name 'args' is not defined
pauls-mbp $

Integrate a Web-app

Description
It would be really nice to have some kind of web-app to use jobfunnel with, perhaps even just to provide a good demo experience.

i.e. https://pages.github.com/

Describe the solution you'd like
Ideally the user could run jobfunnel in-browser and review results with a simple web-app.

Describe alternatives you've considered
It may be more desirable to make something that runs on a user's local machine; limitations of GitHub Pages could prove a blocker.

[Feature Request] Export as JSON

I guess at some point dicts get converted to .csv, so maybe we could have a --json flag which exports the jobs in JSON format instead of CSV.
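
A minimal sketch of what the writer behind such a flag could look like (the function name is illustrative; the jobs dict shape follows the scraped data shown elsewhere on this page):

    import json

    def write_json(jobs: dict, path: str) -> None:
        # Mirror the CSV writeout, but dump the raw jobs dict as JSON
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(jobs, f, indent=2, ensure_ascii=False)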

Empty incoming jobs update()

Description

I seem to have botched this in the release... oops!

Everything seems to be working fine for the demo scrape, but suddenly you get no jobs added because we update it to be Nothing!

Steps to Reproduce

funnel load -s demo/settings.yaml

Expected behavior

We should be updating to accumulate incoming jobs.

Actual behavior

We never accumulate anything! But we get nice progress bars (and the scrape goes fine).

Environment

  • Build: 3.0.0

More Sites to Scrape

I attempted to adjust the 'providers' in settings.yaml, but I found a few that raised errors. The following would be great additions to the tool:

  • 'hire.google'
  • 'Angel.co'
  • 'greenhouse.io'
  • 'jobs.jobvite'
  • 'workable'

[Fix Included] Google Chrome Driver undefined

Description

Google Chrome Driver not defined

Steps to Correct

Replace line 158 of jobfunnel/tools/tools.py
webdriver.Chrome(ChromeDriverManager().install())
with
driver = webdriver.Chrome(ChromeDriverManager().install())
so that the function get_webdriver() has a value to return.
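
In context, the corrected tail of the function would look roughly like this (a sketch; the surrounding branches of get_webdriver are elided):

    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    def get_webdriver():
        # Assign the driver so the function has a value to return
        driver = webdriver.Chrome(ChromeDriverManager().install())
        return driver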

Why not a pull request? Too small of a fix

Environment

  • Installed Commit: cb9f152
  • Operating system and version: Windows 10
  • Browser: Google Chrome

ValueError: empty vocabulary

Description

Standard search produces web scrape error

Steps to Reproduce

Standard search with

  • 'Indeed'
  • 'Monster'
  • 'GlassDoor'

Expected behavior

Results of query

Actual behavior

No loglevel

Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

query_words is empty and therefore cannot be fit_transformed by the vectorizer.
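
A sketch of a guard that would avoid the crash when every blurb is empty (names follow the traceback; skipping the filter outright is an assumption about the desired behavior):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def tfidf_similarities(query_words):
        if not any(words.strip() for words in query_words):
            return None  # nothing to compare; skip TF-IDF duplicate detection
        vectorizer = TfidfVectorizer()
        return cosine_similarity(vectorizer.fit_transform(query_words))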

Debug Loglevel

GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381722
Finished Request
Found 8 glassdoor results for query=Advertising-Marketing-Coordinator-Account-Agency
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 144
Finished Request
getting glassdoor page 1 : https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/Job/allen-advertising-marketing-coordinator-account-agency-jobs-SRCH_IL.0,5_IC1139946_KE6,54.htm?radius=25"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 14
Finished Request
GET http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/source {}
http://127.0.0.1:50081 "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666
Finished Request
DELETE http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/window {}
http://127.0.0.1:50081 "DELETE /session/02b7e485dd5ae5ae4fb5c16bf406267a/window HTTP/1.1" 200 14
Finished Request
found 8 unique job ids and 0 duplicates from glassdoor
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Calculating delay...
Done! Starting scrape!
delay of 0.00s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770
Finished Request
delay of 22.19s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 22.34s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 24.76s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 27.24s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.04s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.64s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 18.15s, getting glassdoor search: https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733
POST http://127.0.0.1:50081/session/02b7e485dd5ae5ae4fb5c16bf406267a/url {"url": "https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733"}
http://127.0.0.1:50081 "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
glassdoor scrape job took 173.619s
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\funnel-script.py", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\__main__.py", line 55, in main
    jf.update_masterlist()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\jobfunnel.py", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\filters.py", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\text.py", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

webdriver manager returning 404 errors?

Variable Contents

prev_dict

None

cur_dict.values()

odict_values([{'status': 'new', 'title': 'Account Manager Digital Marketing - Professional Services - Entertainment and Media Industry Opportunity', 'company': 'Gannett', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=101&ao=68087&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_c26c12d6&cb=1591932436271&jobListingId=3596513699', 'id': '3596513699', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Marketing', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=102&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_4b2ba71c&cb=1591932436271&jobListingId=3593859227', 'id': '3593859227', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Marketing Coordinator', 'company': 'Gourmet Marketing LLC', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=103&ao=58033&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_9dd170bc&cb=1591932436271&jobListingId=3319079566', 'id': '3319079566', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group, Inc.', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=104&ao=926135&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_3f575b87&cb=1591932436271&jobListingId=3582441465', 'id': '3582441465', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'COLLEGE GRADS & INTERNS - Entry Level Marketing & Advertising', 'company': 'Millennium Events Management', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=105&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_98a55e04&cb=1591932436271&jobListingId=3584976096', 'id': '3584976096', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Senior Account Executive (Marketing/Advertising)', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=106&ao=85058&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_ca4062d5&cb=1591932436271&jobListingId=3579768726', 'id': '3579768726', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=107&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_64a152f4&cb=1591932436271&jobListingId=3504589748', 'id': '3504589748', 'provider': 'glassdoor', 'query': 
'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Digital Account Coordinator', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': 'https://www.glassdoor.com/partner/jobListing.htm?pos=108&ao=60939&s=58&guid=00000172a6913e06ac92fffcddc5bb23&src=GD_JOB_AD&t=SR&extid=1&exst=EL&ist=&ast=EL&slr=true&cs=1_81ad2932&cb=1591932436272&jobListingId=3543437733', 'id': '3543437733', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}]) 

query_ids

['3596513699', '3593859227', '3319079566', '3582441465', '3584976096', '3579768726', '3504589748', '3543437733']

query_words

['', '', '', '', '', '', '', '']

Environment

  • Operating system and version: Windows 10

beautifulsoup4>=4.6.3 (4.9.1)
lxml>=4.2.4 (4.5.1)
requests>=2.19.1 (2.23.0)
python-dateutil>=2.8.0 (2.8.1)
PyYAML>=5.1 (5.3.1)
scikit-learn>=0.21.2 (0.23.1)
nltk>=3.4.1 (3.5)
scipy>=1.4.1 (1.4.1)
selenium>=3.141.0 (3.141.0)
webdriver-manager>=2.4.0 (3.1.0)
soupsieve>1.2 (2.0.1)
certifi>=2017.4.17 (2020.4.5.2)
urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9)
chardet<4,>=3.0.2 (3.0.4)
idna<3,>=2.5 (2.9)
six>=1.5 (1.15.0)
threadpoolctl>=2.0.0 (2.1.0)
joblib>=0.11 (0.15.1)
numpy>=1.13.3 (1.18.5)
click (7.1.2)
tqdm(4.46.1)
atomicwrites>=1.0; (1.4.0)
packaging (20.4)
pluggy<1.0,>=0.12 (0.13.1)

Glassdoor.com is not working

Description

Just today I discovered that when scraping Glassdoor.com, JobFunnel fails.

Steps to Reproduce

  1. Comment out Indeed and Monster from the providers options in settings.yaml as such:
        # - 'Indeed'
        # - 'Monster'
        - 'GlassDoor'
  2. Run job funnel: funnel -s settings.yaml

Expected behavior

Scrape Glassdoor.com and store jobs in master_list.csv

Actual behavior

JobFunnel output:

jobfunnel initialized at 2020-05-05
no master-list, filter-list was not updated
jobfunnel glassdoor to pickle running @ 2020-05-05
failed to scrape GlassDoor: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel==2.1.6', 'console_scripts', 'funnel')()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/__main__.py", line 55, in main
    jf.update_masterlist()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/jobfunnel.py", line 291, in update_masterlist
    raise ValueError('No scraped jobs, cannot update masterlist')

Environment

  • Operating system and version: Linux Mint (Ubuntu 18.04)
  • Desktop Environment and/or Window Manager: Cinnamon
  • Tested on .com (United States domain) and .ca (Canada domain)
    NOTE: I also ran JobFunnel in an isolated docker container (Ubuntu 18.04) and the issue persisted.

I discovered this while inspecting glassdoor.py for testing. I will try my best to tackle this issue in the upcoming days. Hopefully we'll fix it soon!

Cheers!

TFIDF content matching should check inter-scrape

Description

Currently we remove duplicates everywhere, but we only remove duplicates by description (TFIDF) between the masterlist and the scrape data.

We should allow masterlist to perform a content match to itself.

Steps to Reproduce

  1. scrape some jobs to .pkl
  2. copy-paste a row a few times, only changing the key_id
  3. run again with --no-scrape

Expected behavior

We should be running TFIDF on inter-scrape data and on the inter-master CSV.

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

  • Build: 3.0.0

Significant lag in CLI

Description

Hi all,

Installing the latest version of JobFunnel and running funnel --help takes a few seconds the first time. Later calls are slightly faster but are arguably still a tad slow.

I know I could have looked into the cause of the problem right now, but I'm really low on time as of late. If this issue still persists when I have more time, I will come back to this (kind of a reminder for myself then).

Steps to Reproduce

  1. Fresh install of JobFunnel
  2. Call funnel --help
  3. Wait

Expected behavior

A few ms delay before showing CLI help information.

Actual behavior

Significant delay between function call and presented info.

Environment

  • Build: #75
  • Operating system and version: macOS 10.14

pip installer is broken

pip installer fails to install because it can't find pipenv.

Currently, none of the installation instructions in the README work.

Implement European locales

Is your feature request related to a problem? Please describe.
We currently only offer CANADA_ENGLISH and USA_ENGLISH scrapers, but we should also implement some European locales.

Describe the solution you'd like
Implement some European locales such as UK_ENGLISH and FRANCE_FRENCH.

Describe alternatives you've considered
n/a

Additional context
Existing request #45

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation

Description

Currently we are using the JobFunnel class for too much; I want to break it down into the following:

from abc import ABC, abstractmethod
from typing import List
import datetime

class Job(object):
    def __init__(self, title: str, company: str, location: str,
                 tags: List[str], post_date: datetime.date,
                 key_id: str, url: str) -> None:
        ...

class Scraper(ABC):

    @abstractmethod
    def scrape(self) -> List[Job]:
        pass

def main():
    # instantiate scrapers
    # run filter on list of Job
    # dump pickle
    # writeout CSV
    ...

Note: if I get to it, I'd also like our filters to be an ABC.

Steps to Reproduce

This is a structural technical debt issue. (n/a)

Expected behavior

The abstract base class should not be halfway abstract; we need separation between JobFunnel, main(), and the inherited scrapers.

Actual behavior

JobFunnel being monolithic and half-abstract has allowed us to implement three script-like scrapers which share too many methods, without an actual Job object.

Environment

n/a


Current Status:

  • Job Object
  • Support for Internationalization
  • BaseScraper with get/set scraping logic
  • New YAML and CLI implemented
  • Schema Validation with Cerberus
  • Caching
  • Filtering with lists
  • Indeed
  • Monster
  • GlassDoorStatic (works but seems like it has bugs so fixing this).
  • Wage Scraping
  • GlassDoor Dynamic/Driven
  • Duplicates list file support
  • Integrate TFIDF similarity filter (special case filter)
  • Prevent writing out empty CSVs in --no-scrape mode
  • Prevent delayed get/set for jobs which fail filters
  • Fix multi-page Monster scraping
  • Handle duplicated jobs special case
  • Make JobFilter class
  • Add TAG scraping to Monster
  • Implement job filtering as own class
  • Fix paths from -s yaml being overwritten with defaults from the CLI
  • Fix concurrency issue with dependencies for get/set
  • Monkey / general usability testing
  • Update main README
  • Update other READMEs + tutorials
  • Add versioning to cache files (i.e. wrapper for dict with metadata)
  • Review various FIXMEs in-code
  • Fix build (Travis CI)
  • Test setup.py
  • Fix demo GIF
  • Document how to write new scrapers with localization


Future work:

  • Google jobs scraper
  • Ycombinator job scraper
  • Assess the update experience from V2.0 --> V3.0, provide a guide
  • cut a release
  • Add WAGE scraping to Indeed
  • Add REMOTE scraping to Indeed
  • Add REMOTE scraping to Monster

Bad url search keyword encoding

Description

We currently pass keywords verbatim, even if they include "+" or other special characters; we should be encoding these so as not to disrupt the URL formation.

Steps to Reproduce

  1. Search with any word containing +s such as "C++"

Expected behavior

We would get a query containing "%"-encoded values in place of +s

Actual behavior

We build a search string with erroneous + chars that break the search query URL (the query is not as intended)
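
As with the exact-phrase issue above, the standard library already covers this; a one-line sketch:

    from urllib.parse import quote_plus

    quote_plus('C++')  # -> 'C%2B%2B'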

Environment

  • Build: current dev
  • Operating system Ubuntu 18.04

Implement Remote scraping

Is your feature request related to a problem? Please describe.
Currently we do not scrape the Remote field for Indeed or Monster, which provide many jobs.

Indeed provides this fairly visibly; it's just a bit buried in tags.
Glassdoor has this implemented already (tags as well).
Monster does not seem to have this capability, but it may be worth looking into whether we can infer it.

Describe the solution you'd like
We should scrape Remote = True if a job is fully-remote on Indeed. Temporarily remote doesn't count, sorry!
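
A minimal sketch of that inference from scraped tag strings (the exact tag wording is an assumption):

    def is_fully_remote(tags: list) -> bool:
        # "Temporarily remote" should not count as Remote = True
        lowered = [tag.lower() for tag in tags]
        has_remote = any('remote' in tag for tag in lowered)
        is_temporary = any('temporarily' in tag for tag in lowered)
        return has_remote and not is_temporary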

Describe alternatives you've considered
n/a

Additional context
[screenshot]

Bad Status

When changing the status to applied and rerunning JobFunnel, master_list.csv has no fields!
Yikes!
Every run of JobFunnel thereafter will result in an empty list of jobs.
If you delete master_list.csv, a new list will be created the next time JobFunnel runs, but the previous jobs are gone.

We need two things:

  1. To fix this issue.
  2. Make a way to dump existing pickles to form master_list.csv in case this ever happens again.

I think by default I will have the software search for existing pickles by changing:
load_pickle(self, args) to load_pickles(self, args).
This way, all pickles are important and not just today's pickle.
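
A sketch of what load_pickles could look like (the 'scraped' subdirectory name comes from the --no_scrape traceback elsewhere on this page; the merge strategy is an assumption):

    import glob
    import os
    import pickle

    def load_pickles(self, args):
        # Merge every cached day's scrape, not just today's pickle
        jobs = {}
        for path in sorted(glob.glob(os.path.join(args['data_path'], 'scraped', '*.pkl'))):
            with open(path, 'rb') as file:
                jobs.update(pickle.load(file))
        return jobs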

Check config after argparse

Description

If the user provides an invalid configuration file, then the program will throw unclear errors. I think it's good practice to check all settings of the config dictionary and show a helpful error message in case the config is invalid; a sketch of such a check follows the steps below.

Steps to Reproduce

  1. Provide an invalid configuration file (settings.yaml)
  2. Run funnel -s settings.yaml
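
A hedged sketch of such a post-parse check (key names are taken from the settings file shown elsewhere on this page; the project's real validator would live in config/validate.py per the test-coverage checklist below):

    class ConfigError(ValueError):
        """Raised when settings.yaml fails validation."""

    def check_config(config: dict) -> None:
        required = ('output_path', 'providers', 'search_terms')
        missing = [key for key in required if key not in config]
        if missing:
            raise ConfigError('settings file is missing keys: {}'.format(missing))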

Using JobFunnel with PyCharm

[screenshot]

I am opening this issue because I think we need a pycharm folder for debugging this project.
PyCharm has become a very popular IDE and can really help with debugging this project.
If we can make a folder that gets developers up and running in debug mode in just a few simple steps then I think it is worth doing.

Main benefits of using PyCharm for developing this project:

  • PyCharm has real time break points that are easy to set.
  • PyCharm has an intuitive interface for running expressions in debug mode.
  • PyCharm recognizes URLs in the log output which allows quick "one-click" access to job postings for validation.

Increase test coverage

Description

Testing was introduced to the JobFunnel software a few weeks ago, but it only covers a fraction of the entire code base. I think increasing the quality of the testing framework with unit and integration tests can make it easier for reviewers to assess new pull requests, and it gives new contributors more feedback on whether their changes to the code base break functionality.

I came up with two ideas to provide a clear and productive environment to increase test coverage:

  1. Adding test coverage (next to the build status) could be a good way to keep track of the current state of the test environment.
  2. Have a checklist of all functions in the code base and whether this function is covered by tests.

Note that checking whether a function is 'covered' by tests can be a bit tricky, because some functions are rather long, e.g. parse_config. Hence, true unit tests are very difficult for these long functions but can be easily implemented for shorter functions. Realising that unit tests are complicated for certain functions might also be a sign that these functions should be broken into smaller parts.

Test coverage

glassdoor.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • parse_blurb
  • scrape

indeed.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • get_blurb_with_delay
  • parse_blurb
  • scrape

monster.py

  • convert_radius
  • get_search_url
  • search_page_for_job_soups
  • search_joblink_for_blurb
  • get_blurb_with_delay
  • parse_blurb
  • scrape

jobfunnel.py

  • init_logging
  • load_pickle
  • load_pickles
  • dump_pickle
  • read_csv
  • write_csv
  • remove_jobs_in_filterlist
  • remove_blacklisted_companies
  • update_filterjson
  • pre_filter
  • delay_threader
  • update_masterlist

tools

delay.py

  • _c_delay
  • _lin_delay
  • _sig_delay
  • delay_alg

filters.py

  • id_filter
  • tfidf_filter

tools.py

  • filter_non_printables
  • post_date_from_relative_post_age
  • split_url
  • proxy_dict_to_url
  • change_nested_dict
  • config_factory

config

parser.py

  • parse_cli
  • cli_to_yaml
  • update_yaml
  • recursive_check_config_types
  • check_config_types
  • parse_config

validate.py

  • validate_region
  • validate_delay
  • validate_config

Want to contribute?

Do you like this project and do you want to make it even better? Feel free to discuss below if you want to contribute to this project. All help is welcome 👍.

Want to start with something (relatively) easy? The functions in validate.py, tools.py and delay.py are relatively easy to test (at first inspection)!
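
As a starting point, here is a sketch of what one such unit test could look like. The real signatures are not shown on this page, so the function under test is an illustrative stand-in for the tools.py helper of the same name:

    import pytest

    def filter_non_printables(text: str) -> str:
        # Illustrative stand-in, not the project's actual implementation
        return ''.join(ch for ch in text if ch.isprintable())

    @pytest.mark.parametrize('raw, expected', [
        ('plain', 'plain'),
        ('tab\tand\nnewline', 'tabandnewline'),
    ])
    def test_filter_non_printables(raw, expected):
        assert filter_non_printables(raw) == expected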

Implement wage scraping for Indeed and Monster

Is your feature request related to a problem? Please describe.
Currently Glassdoor performs wage scraping, but Indeed and Monster should do this as well.

Describe the solution you'd like
Add scraping for JobField.WAGE to indeed and monster scrapers

Apply NLP to condense description

Description

There are currently a number of state-of-the-art algorithms to summarize long passages of text.

We should apply one of these methods to produce the currently-unused short_description JobField

Example algorithms (whatever we use must provide a good license):

Steps to Reproduce

n/a

Expected behavior

We should populate short_description with a condensed (1-3 sentences?) description generated from the longer Job.description

Actual behavior

short_description is un-implemented.
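
Until a summarization method is chosen, a naive extractive baseline could populate the field; a sketch (leading sentences only, not a real NLP method):

    import re

    def make_short_description(description: str, max_sentences: int = 2) -> str:
        # Take the first sentences as a crude stand-in for a real summarizer
        sentences = re.split(r'(?<=[.!?])\s+', description.strip())
        return ' '.join(sentences[:max_sentences])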

Environment

n/a

Unknown Location Error

Files.zip

Description

I have 2 identical .yaml files. The only differences are the location where the information is stored, and that one looks for CFO positions and the other Human Resources. However, the Human Resources one has the following error whereas the CFO one does not:
Traceback (most recent call last):
  File "/home/pittsie/.local/bin/funnel", line 8, in <module>
    sys.exit(main())
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/__main__.py", line 28, in main
    job_funnel.run()
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/backend/jobfunnel.py", line 86, in run
    self.master_jobs_dict = self.read_master_csv()
  File "/home/pittsie/.local/lib/python3.8/site-packages/jobfunnel/backend/jobfunnel.py", line 405, in read_master_csv
    locale = locale.UNKNOWN
AttributeError: 'NoneType' object has no attribute 'UNKNOWN'

Both files have been attached in a zip

Expected behavior

Run

Actual behavior

The error shown above occurs only with Bill.yaml, not 2Mom.yaml.

Environment

  • Build: 3.0.0
  • Operating system and version: Ubuntu 20.04
