dataverk's Introduction

Dataverk

Toolbox for code-based serverless ELT/ETL on NAIS with access control in Vault.

Getting started

pip install dataverk

Docs

About

Usage

dataverk's People

Contributors

dependabot-preview[bot], erikvatt, gorzan, lljorgeluis, mariacabrol, pbencze, snyk-bot, sonhal

dataverk's Issues

Error on dataverk-cli init -i on windows

(base) C:\projects\dataverktest\datasett-test>dataverk-cli init -i
dataverk-cli init completed
Traceback (most recent call last):
  File "c:\apps\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\apps\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\apps\Anaconda3\Scripts\dataverk-cli.exe\__main__.py", line 9, in <module>
  File "c:\apps\anaconda3\lib\site-packages\dataverk_cli\dataverk_cli_entrypoint.py", line 40, in main
    settings_dict, env_store = get_package_configuration(args, initialize=True)
  File "c:\apps\anaconda3\lib\site-packages\dataverk_cli\cli\cli_utils\package_config_handler.py", line 17, in get_package_configuration
    settings_dict = setting_store_functions.create_settings_dict(args=args, env_store=env_store)
  File "c:\apps\anaconda3\lib\site-packages\dataverk_cli\cli\cli_utils\setting_store_functions.py", line 22, in create_settings_dict
    settings_dict = settings_loader.load_settings_file_from_resource(resource_url)
  File "c:\apps\anaconda3\lib\site-packages\dataverk_cli\cli\cli_utils\settings_loader.py", line 16, in load_settings_file_from_resource
    settings_dict = _get_settings_dict_from_git_repo(url)
  File "c:\apps\anaconda3\lib\site-packages\dataverk_cli\cli\cli_utils\settings_loader.py", line 64, in _get_settings_dict_from_git_repo
    return settings_dict
  File "c:\apps\anaconda3\lib\tempfile.py", line 807, in __exit__
    self.cleanup()
  File "c:\apps\anaconda3\lib\tempfile.py", line 811, in cleanup
    _shutil.rmtree(self.name)
  File "c:\apps\anaconda3\lib\shutil.py", line 494, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\apps\anaconda3\lib\shutil.py", line 384, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "c:\apps\anaconda3\lib\shutil.py", line 384, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "c:\apps\anaconda3\lib\shutil.py", line 384, in _rmtree_unsafe
    _rmtree_unsafe(fullname, onerror)
  File "c:\apps\anaconda3\lib\shutil.py", line 389, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "c:\apps\anaconda3\lib\shutil.py", line 387, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Ingen tilgang: 'C:\\Users\\g153850\\AppData\\Local\\Temp\\tmph5te_rac\\.git\\objects\\pack\\pack-309141ef0ec03ac6e96825dc4e0fcfc9e8f7dd66.idx'
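
The traceback shows `shutil.rmtree` failing on a Git pack index file while the temporary clone directory is cleaned up. One common cause on Windows is that Git marks object files as read-only; that is an assumption here, since the issue does not confirm it (an open file handle would produce the same error). A minimal sketch of a removal helper that clears the read-only bit and retries, where `remove_tree` and `_make_writable_and_retry` are hypothetical names and not part of dataverk:

```python
import os
import shutil
import stat
import sys
import tempfile


def _make_writable_and_retry(func, path, _exc):
    """rmtree error handler: clear the read-only bit, then retry the failed call."""
    os.chmod(path, stat.S_IWRITE)
    func(path)


def remove_tree(path):
    """Delete a directory tree, including read-only files such as Git pack files."""
    # Python 3.12 renamed the handler argument from onerror to onexc.
    handler_kw = "onexc" if sys.version_info >= (3, 12) else "onerror"
    shutil.rmtree(path, **{handler_kw: _make_writable_and_retry})


# Demo: manage the temporary directory by hand instead of relying on the
# automatic cleanup of tempfile.TemporaryDirectory.
tmp_dir = tempfile.mkdtemp()
try:
    locked = os.path.join(tmp_dir, "pack.idx")
    with open(locked, "w") as fh:
        fh.write("demo")
    os.chmod(locked, stat.S_IREAD)  # simulate a read-only Git object file
finally:
    remove_tree(tmp_dir)
```

If the real cause is a file handle that is still open, this handler will not help; the handle has to be closed before the directory is removed.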

dataverk-cli init rework

  1. Remove dataverk factory step

  2. Remove dependency on remote and local repo

  3. Create an open-source settings repo

Fix errors/omissions according to the requirements in navikt/utvikling (found by the repo-linter robot)

This is an auto-generated issue, created by a script that goes through all of NAV's codebases on GitHub and runs various checks. Here is a list of things that need to be changed.

Description missing

On GitHub, each codebase can be given a short description. It should state what the codebase is called and a little about what it is used for. (Example: the codebase "veilarbportefoljeflatefs" has the description "Oversikt for veiledere over oppfølgingsbrukere".)
NB! This applies to the description in the metadata of the GitHub repo itself, not to a description in a README file.

Once all changes have been made, this issue can be closed.

Questions and answers

I have opinions about this advice. Can I give feedback?

Go ahead, in the Slack channel #open-source.

Our codebase is not open source, so there is no point

Even if the code is not open to the public today, allow for the possibility that it may become so in the future. Either way, the improvements will help, whether the codebase is open or not!

Who is responsible for fixing this?

In principle, the person, people, or team that owns the codebase must fix it.

There is an error in the advice

All robots make mistakes, this one included. Create an issue at https://github.com/navikt/repo-linter.

Integration with S3

I miss the Koala functionality that let me download/upload pickles from S3 with a one-liner in Dataverk. As it is now, I have to run a request, use BytesIO, and so on to pickle and upload a pandas DataFrame. It would be nice if this could be made simpler.
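
As an illustration of what such a one-liner could wrap, here is a minimal sketch. It assumes an S3 client that exposes boto3-style `put_object` and `get_object` methods; `upload_pickle` and `download_pickle` are hypothetical helper names, not existing Dataverk or Koala functions.

```python
import io
import pickle


def upload_pickle(s3_client, bucket, key, obj):
    """Pickle obj in memory and upload it as a single S3 object."""
    buffer = io.BytesIO()
    pickle.dump(obj, buffer, protocol=pickle.HIGHEST_PROTOCOL)
    # Assumes a boto3-style client: put_object(Bucket=..., Key=..., Body=...).
    s3_client.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())


def download_pickle(s3_client, bucket, key):
    """Download an S3 object and unpickle it. Only use this on trusted data."""
    # Assumes get_object returns a mapping whose "Body" has a read() method.
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return pickle.loads(response["Body"].read())
```

With helpers like these, storing a pandas DataFrame would be a single call such as `upload_pickle(client, "my-bucket", "df.pkl", df)`, where the bucket and key names are placeholders.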

gcloud publish bug: no priv_key for gcloud in settings object

Publishing package tester-med-ny-dv-versjon
The private_key field was not found in the service account info.
dataverk-cli publish completed
Traceback (most recent call last):
  File "/usr/local/bin/dataverk-cli", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_cli_entrypoint.py", line 67, in main
    publish_datapackage()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_publish.py", line 95, in publish_datapackage
    datapackage.publish()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_publish.py", line 87, in publish
    self.package_settings["package_name"]))
  File "/usr/local/lib/python3.6/site-packages/dataverk/utils/publish_data.py", line 6, in upload_to_storage_bucket
    conn.upload_blob(os.path.join(dir_path, 'datapackage.json'), datapackage_key_prefix + 'datapackage.json')
  File "/usr/local/lib/python3.6/site-packages/dataverk/connectors/google_storage.py", line 72, in upload_blob
    blob = self.bucket.blob(destination_blob_name)
AttributeError: 'GoogleStorageConnector' object has no attribute 'bucket'
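
The log suggests that the missing `private_key` was only reported as a message, and that the connector was then used without its `bucket` attribute ever being set. A minimal sketch of failing fast instead, assuming the settings are available as a plain dict; `validate_service_account_info` and `MissingCredentialError` are hypothetical names and do not reflect how dataverk is actually implemented:

```python
class MissingCredentialError(ValueError):
    """Raised when required service account fields are absent or empty."""


def validate_service_account_info(info, required=("private_key",)):
    """Return info unchanged, or raise if any required field is missing."""
    missing = [field for field in required if not info.get(field)]
    if missing:
        raise MissingCredentialError(
            "service account info is missing: " + ", ".join(missing)
        )
    return info
```

Calling a check like this before the connector is constructed would stop the publish step with a clear error rather than a later `AttributeError`.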

Error when trying to create new local project

An IndexError (tuple index out of range) is raised when running dataverk-cli init in a new local project that has no matching GitHub repository.

Location: line 68 in _getssh_url, /dataverk_base.py

dataverk init -i fails with ValueError 'dataverk_s3' is not a valid BucketStorage

/home/G153850/projects/datapakke-arbeidsledighet $ dataverk-cli init -i
Skriv inn ønsket pakkenavn: arbeidsledighet
Do you want to create the datapackage (arbeidsledighet)? [y/n] y
dataverk-cli init completed
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_init.py", line 29, in run
    self._create()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_init.py", line 40, in _create
    self._edit_package_metadata()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_init.py", line 90, in _edit_package_metadata
    package_metadata['path'] = self._determine_bucket_path()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_init.py", line 105, in _determine_bucket_path
    if BucketStorage(bucket_type) == BucketStorage.GITHUB:
  File "/usr/local/lib/python3.6/enum.py", line 291, in __call__
    return cls.__new__(cls, value)
  File "/usr/local/lib/python3.6/enum.py", line 533, in __new__
    return cls._missing_(value)
  File "/usr/local/lib/python3.6/enum.py", line 546, in _missing_
    raise ValueError("%r is not a valid %s" % (value, cls.__name__))
ValueError: 'dataverk_s3' is not a valid BucketStorage

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/dataverk-cli", line 11, in <module>
    load_entry_point('dataverk==0.0.17', 'console_scripts', 'dataverk-cli')()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_cli_entrypoint.py", line 46, in main
    dp.run()
  File "/usr/local/lib/python3.6/site-packages/dataverk_cli/dataverk_init.py", line 32, in run
    raise Exception(f'Klarte ikke generere datapakken {self._settings_store["package_name"]}')
Exception: Klarte ikke generere datapakken arbeidsledighet
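
The first traceback comes from calling an `Enum` class with a value that matches none of its members. A minimal sketch of validating the configured value up front and listing the accepted ones follows. `ExampleBucketStorage` and its values are hypothetical stand-ins; only the member name `GITHUB` appears in the traceback, and the real values in dataverk are not known from this issue.

```python
from enum import Enum


class ExampleBucketStorage(Enum):
    # Hypothetical members and values, for illustration only.
    GITHUB = "github"
    S3 = "s3"


def parse_bucket_storage(value):
    """Convert a settings string to an enum member with a descriptive error."""
    try:
        return ExampleBucketStorage(value)
    except ValueError:
        valid = ", ".join(member.value for member in ExampleBucketStorage)
        raise ValueError(
            f"Unsupported bucket storage {value!r}; expected one of: {valid}"
        ) from None
```

An error of this shape tells the user which setting to change, which the generic "Klarte ikke generere datapakken" exception does not.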

I get a FutureWarning when importing dataverk into a notebook

/usr/local/lib/python3.7/site-packages/dask/dataframe/utils.py:15: FutureWarning:

pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.

INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
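
Until the dependency is updated, the messages can be quieted on the notebook side. A minimal sketch, assuming it runs before numexpr is first imported in the session; `import_quietly` is a hypothetical helper name:

```python
import importlib
import os
import warnings


def import_quietly(module_name, max_threads=8):
    """Import a module while hiding FutureWarning raised during the import."""
    # numexpr reads this variable at import time; setdefault keeps any value
    # the user has already chosen.
    os.environ.setdefault("NUMEXPR_MAX_THREADS", str(max_threads))
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return importlib.import_module(module_name)
```

In a notebook this would be used as `dv = import_quietly("dataverk")`. The filter only applies during the import itself, so warnings raised later are still shown.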
