document-dl's People

Contributors

cyroxx, dependabot[bot], heeplr, sdx23

document-dl's Issues

Download only latest?

Is it possible to download only the latest invoice?
E.g. for an o2 cron job running every month.
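
To make the request concrete, here is a minimal sketch of "latest only" selection in Python. It assumes each scraped document is a dict with a "date" entry; that layout and the name latest_document are hypothetical and not taken from document-dl itself.

```python
from datetime import date
from typing import Iterable, Optional


def latest_document(documents: Iterable[dict], date_key: str = "date") -> Optional[dict]:
    """Return the document with the most recent date, or None if none is dated.

    Assumption: documents are plain dicts and `date_key` holds a datetime.date.
    """
    # Ignore documents without a usable date instead of failing on them.
    dated = [doc for doc in documents if doc.get(date_key) is not None]
    return max(dated, key=lambda doc: doc[date_key], default=None)


if __name__ == "__main__":
    invoices = [
        {"id": 0, "date": date(2021, 11, 3)},
        {"id": 1, "date": date(2022, 1, 4)},
        {"id": 2, "date": date(2021, 12, 2)},
    ]
    print(latest_document(invoices))
```

A monthly cron job could call such a helper once per run and download only the returned document.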

RFC: limited time-range

Let me first say that I totally agree with the statement in #4 that we should take care not to clutter the namespace [1]. After all, I see unification/standardization as one big aspect of this project, one that distinguishes it from me hacking a short standalone Selenium script for just the websites/services I need.

I also agree that adding a special "--year" switch or the like is redundant given the ability to query with jq (which also made me uneasy about my suggestion in #4). But I also see the point of dates being something special, since they are (possibly the main) restriction on which documents to process.

That is important on the one hand for speed, i.e. not doing a lot of useless work (see #4). This is relevant for regular downloading as mentioned in #1: when the script runs periodically once a month, it is surely fine to only (try to) download documents from within the last 60 days.
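
As an illustration of the "last 60 days" idea, the following sketch checks whether a document date falls inside a rolling window. The function name within_last_days and its parameters are assumptions made for this example; nothing here reflects an existing document-dl option.

```python
from datetime import date, timedelta
from typing import Optional


def within_last_days(doc_date: date, days: int, today: Optional[date] = None) -> bool:
    """Return True if doc_date lies within the last `days` days, inclusive.

    `today` can be passed explicitly to make the check reproducible in tests.
    """
    if today is None:
        today = date.today()
    earliest = today - timedelta(days=days)
    return earliest <= doc_date <= today


if __name__ == "__main__":
    run_day = date(2022, 1, 7)
    print(within_last_days(date(2022, 1, 1), 60, today=run_day))
    print(within_last_days(date(2021, 10, 1), 60, today=run_day))
```

A plugin could apply such a check before opening a document page, so that a monthly run skips everything older than the window.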

On the other hand, the scraped website itself may raise that question (and that's why I bring up this topic again). I'm currently developing a plugin for smartbroker [2], which displays the postbox by default as a search form [3]. It allows selecting predefined ranges (the last x days with x in [10,30,...360]) or alternatively specifying your own range. So I could either -- quite arbitrarily -- select "last 360 days" or do something (possibly stupid?) like setting the range to 1970-01-01 until today.

Now this is specific to the plugin in question, and in principle I'd just leave it as is (360 days) and possibly add an option that makes all documents available by forcing "1970-01-01 until today". But from a user perspective it might get confusing what one must do for each plugin to make it work as expected from experience with the others.

I'm not sure whether I'm overthinking this; it can still be changed at some later point. Nevertheless, I wanted to bring it to attention and ask for comments.

[1] Side note, off-topic here: this raises the question of whether short CLI options should be discouraged in plugins.
[2] https://github.com/sdx23/document-dl/tree/smartbroker
[3] Screenshot: 2022-01-07-180639_624x359_scrot

Testing Firefox

  • Mac OS 11.5.2
  • Firefox 91.0.1 + Selenium

I'm running this command:

 document-dl -b firefox -u NUMBER -p 'PASSWORD' --action download --jq 'contains({id: 0})' o2

It starts a plain Firefox without anything installed (a new profile, I think).
After a while it exits and I get these errors:

Traceback (most recent call last):
  File "/usr/local/bin/document-dl", line 8, in <module>
    sys.exit(documentdl())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/docdl/plugins/o2.py", line 133, in o2
    docdl.cli.run(ctx, O2)
  File "/usr/local/lib/python3.9/site-packages/docdl/cli.py", line 150, in run
    plugin = plugin_class(
  File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 147, in __init__
    self._init_webdriver(webdriver_opts, arguments['webdriver'])
  File "/usr/local/lib/python3.9/site-packages/docdl/__init__.py", line 272, in _init_webdriver
    self.webdriver = webdriver.Firefox(
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/webdriver.py", line 190, in __init__
    executor = ExtensionConnection("127.0.0.1", self.profile,
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/extension_connection.py", line 52, in __init__
    self.binary.launch_browser(self.profile, timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 73, in launch_browser
    self._wait_until_connectable(timeout=timeout)
  File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/firefox/firefox_binary.py", line 109, in _wait_until_connectable
    raise WebDriverException(
selenium.common.exceptions.WebDriverException: Message: Can't load the profile. Possible firefox version mismatch. 
You must use GeckoDriver instead for Firefox 48+. Profile Dir: /var/folders/td/x4r_b40s2r5bvlrlrby13ptw0000gn/T/tmp799ugx0k 
If you specified a log_file in the FirefoxBinary constructor, check it for details.
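
The error message points at GeckoDriver, so a first check is whether a geckodriver binary is reachable on PATH. The sketch below does that with the standard library and then, assuming Selenium 4 is installed, starts Firefox through that driver. The helper names find_geckodriver and start_firefox are hypothetical, and this is not how document-dl itself initialises the browser.

```python
import shutil
from typing import Optional


def find_geckodriver(name: str = "geckodriver") -> Optional[str]:
    """Return the full path of the driver binary if it is on PATH, else None."""
    return shutil.which(name)


def start_firefox():
    """Start Firefox via GeckoDriver (assumes the Selenium 4 Python bindings)."""
    driver_path = find_geckodriver()
    if driver_path is None:
        raise RuntimeError("geckodriver was not found on PATH; install it first")
    # Imported lazily so that the PATH check works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(executable_path=driver_path))


if __name__ == "__main__":
    print(find_geckodriver())
```

If find_geckodriver() prints None, installing geckodriver and rerunning the original command is the obvious next thing to try.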

Support for Remote selenium webdriver (Docker Version?)

Is it possible to configure a remote server for doing all the Selenium / browser work?
There is webdriver.Remote for Selenium in Python to configure the ip / hostname
and https://hub.docker.com/u/selenium Docker images for all browsers.

I thought about building a stack like this:
a document-dl container and a Selenium Chrome container,
connected through a Docker network, with
a volume bind for an invoice directory
(automatic sub-directories for all invoice providers).

Alternative: everything in one container (https://nander.cc/using-selenium-within-a-docker-container).
