delta-io / delta-sharing Goto Github PK

View Code? Open in Web Editor NEW

739.0 30.0 164.0 2.45 MB

An open protocol for secure data sharing

Home Page: https://delta.io/sharing

License: Apache License 2.0

Scala 82.02% Shell 1.83% Python 15.79% Rust 0.36%

big-data spark pandas data-sharing delta-lake

delta-sharing's People

Contributors

Stargazers

Watchers

Forkers

psyoblade yaohua628 wassimtrifi skhatri zhuansunxt jackhoutw yangchenghuang kumarneetesh24 dobachi sandeep-katta0102 fx196 trendingtechnology yang040840219 zsxwing btello bbielicki andrey-puzyr laopeng2021 piaozhexiu psadusumilli sunny30 yukaizou2015 cuulee norazhaoo a70 admariner laashub-soa joserfjuniorllms goodwillpunning danghieu179 danielschulz felixsafegraph kuguobing nkvuong afsana-afzal aalbu ganeshrj78 qbeast-io gentijo zt30668 mekkinen hongyonggan lakecat-io nagomiko erdal-pb phial3 wchau hpeezmeral jacob-heldenbrand-cl yutoacts sa255304 riddopic amberz frankfanslc stikkireddy kristens-work alejoelpaisa reinaldo-ito icewater2020 2733284198 alexott kohei-tosshy buggythepirate magpierre toannhu96 yarikc harksin calixtofelipe wxl24life atgabe pelumi-aderinto ferbocato patrickjmccauley bochuxt hugovk airizarrydb carbirbal wangp-nhlab sandeep-katta mushtaq ksceriath ggwiebe emmajaneshaw sastafford keren123 sibyabin emmeneses zionrubin khoanguyen1806 spencerwilson chitralverma gupsydaisys seohyeon0322 agile-lab-dev rohankumardubey josh-seidel-db shailendrakk dialberg dennyglee bogao007

delta-sharing's Issues

Server returns next_page_token instead of nextPageToken

As defined in the protocol.

Unable to specify hadoop configuration for AWS S3 Access on Delta Server

We are working on standing up a delta sharing server on an AWS EC2 VM using docker and pulling data from S3. However, despite trying a variety of things, including passing in environmental variables to the docker run command and dropping a core-site.xml file with our access/secret keys in /opt/docker/conf with the HADOOP_CONF_DIR set to that folder, we continue to receive the below error:

46904 [armeria-common-worker-epoll-2-1] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
47620 [armeria-common-worker-epoll-2-1] ERROR io.delta.sharing.server.DeltaSharingServiceExceptionHandler - com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentExceptio
n: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties (respectively).
io.delta.sharing.server.DeltaInternalException: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified
 by setting the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties (respectively).
        at io.delta.sharing.server.DeltaSharingService.processRequest(DeltaSharingService.scala:104)
        at io.delta.sharing.server.DeltaSharingService.getMetadata(DeltaSharingService.scala:159)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.linecorp.armeria.internal.server.annotation.AnnotatedService.invoke(AnnotatedService.java:338)
        at com.linecorp.armeria.internal.server.annotation.AnnotatedService.lambda$serve0$2(AnnotatedService.java:285)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
        at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
        at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
        at com.linecorp.armeria.internal.server.annotation.AnnotatedService.serve0(AnnotatedService.java:289)
        at com.linecorp.armeria.internal.server.annotation.AnnotatedService.serve(AnnotatedService.java:262)
        at com.linecorp.armeria.server.RouteDecoratingService.serve(RouteDecoratingService.java:92)
        at com.linecorp.armeria.server.RouteDecoratingService.serve(RouteDecoratingService.java:67)
        at com.linecorp.armeria.server.auth.AuthService.handleSuccess(AuthService.java:118)
        at com.linecorp.armeria.server.auth.AuthService.lambda$serve$0(AuthService.java:99)
        at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
        at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
        at com.linecorp.armeria.common.RequestContext.lambda$makeContextAware$3(RequestContext.java:522)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId
and fs.s3.awsSecretAccessKey properties (respectively).
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3849)
        at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4688)
        at io.delta.standalone.internal.DeltaSharedTableLoader.loadTable(DeltaSharedTableLoader.scala:49)
        at io.delta.sharing.server.DeltaSharingService.$anonfun$getMetadata$1(DeltaSharingService.scala:162)
        at io.delta.sharing.server.DeltaSharingService.processRequest(DeltaSharingService.scala:101)
        ... 27 more
Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:74)

We are running the deltaio/delta-sharing-server:0.2.0 with the below command:

docker run -p 9999:9999 --mount type=bind,source=/home/ubuntu/config.yaml,target=/config/delta-sharing-server-config.yaml --mount type=bind,source=/home/ubuntu/core-siget=/opt/docker/conf/core-site.xml  -e HADOOP_CONF_DIR=/opt/docker/conf  deltaio/delta-sharing-server:0.2.0 -- --config /config/delta-sharing-server-config.yaml

We can successfully curl the all-tables endpoints, but any that attempt to connect to S3 return a 500 with the above error. We were hoping that the docker container would inherit IAM permissions from the EC2 instance, which can successfully retrieve from S3. However, when it appeared that was not the case, we tried using the IAM metadata provider as specified in the README. Is there something else that we're missing that needs to be set in order for the server to pick up the variables specified in our core-site.xml file?

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>XXXX</value>
  </property>

  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>XXXX</value>
  </property>
</configuration>

Docker image: deltaio/delta-sharing-server:0.2.0
EC2 image: ubuntu-bionic-18.04-amd64-server
IAM permissions on EC2:

"Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                ...
            ]

limit feature returning an empty dataframe

The delta_sharing limit feature doesn't seem to work properly. It returns an empty dataframe.

table = delta_sharing.load_as_pandas(table_url, limit=10)

Empty DataFrame
Columns: [a, b, c, ...]
Index: []

Error when configuring for azure datalake gen2

Hello, I am configuring delta-sharing for connecting with azure deltalake gen2, but when I call the method .../delta-sharing/shares/{share_name}/schemas/{schema_name}/tables/{table_name}/metadata

An error occurs indicating:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found

I'm using delta-sharing docker image deltaio/delta-sharing-server:0.2.0

Also tried copying libraries such as hadoop-azure-2.10.1.jar and hadoop-azure-3.3.1.jar but under /opt/docker/lib but the error persist

Not able to access GCS delta data using Python 3.6

Error : I am not able to access GCS delta data with Python 3.6 version with below message.

I am able to access parquet file printed in error message.

Solution: This issue has been resolved using python3.8

Slack Thread
https://delta-users.slack.com/archives/C023H5LS51P/p1645650761171099?thread_ts=1645650761.171099&cid=C023H5LS51P](https://urldefense.proofpoint.com/v2/url?u=https-3A__delta-2Dusers.slack.com_archives_C023H5LS51P_p1645650761171099-3Fthread-5Fts-3D1645650761.171099-26cid-3DC023H5LS51P&d=DwMFaQ&c=1hIq-C3ayh4zm6RZ7m4R2AxLax342Ba687_wO3ZmnvQ&r=BEU_jKcCgZCKXfyreR9nxUI6GR4XQvaAfWli5I_Owg4&m=O-HHCNT-RW4GxGhotR3_aZartO2vhdg2D_Om5mxXxl_d_9GQcOj9ghW2o7E-3tkL&s=w0od_Ql-k-9eeYpdfsY79h744TcoA4XxeN8M8c79dr4&e=)

Add support for AWS Web Identity Token File auth

The delta-sharing server does not support AWS Web Identity Token File authentication currently.

I've deployed a delta-sharing server to an AWS Kubernetes cluster (EKS) which uses IAM Roles for Service Accounts (IRSA) to grant permissions to an S3 bucket. This uses a Web Identity Token File for authentication.

This seems like it could potentially be a common use case.

The AWS Java SDK supports this with WebIdentityTokenFileCredentialsProvider which reads the environment variable AWS_WEB_IDENTITY_TOKEN_FILE and the token file.

Looking at the Hadoop docs, it should be supported by Hadoop, but it looks like delta-sharing server is using the default credential providers.

If unspecified, then the default list of credential provider classes,
    queried in sequence, is:
    1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
       Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
    2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
        configuration of AWS access key ID and secret access key in
        environment variables named AWS_ACCESS_KEY_ID and
        AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
    3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
        of instance profile credentials if running in an EC2 VM.

Would it be possible to add support to specify additional credential providers? Or add com.amazonaws.auth.WebIdentityTokenFileCredentialsProvider to the AWSCredentialProviderList?

Delta Sharing API for easy cloning as Delta

We would like to use Delta Sharing from one Region and Metastore to another as a type of replication between primary and secondary sites without sharing S3 buckets or doing cross-mounted buckets.

Data API adaptor

Especially, to serve the data as a performant API with the ability to query the data efficiently is quite a use-case and everyone has to develop their own solutions around it. This would turn the delta sharing into something like GraphQL Server (with the ability to query the data based on filtering conditions)

Pagination in load_as_pandas for large tables

The ability to pass limit=n to the delta_sharing.load_as_pandas function is very helpful, but it would be even better if you could also pass start_at=. This would allow large tables to be loaded and processed on a client machine without having to load the whole thing into memory.

delta_sharing.load_as_pandas(table_url, limit=1000) returns rows 0-999
delta_sharing.load_as_pandas(table_url, limit=1000, start_at=1000) returns rows 1000-1999

Support Google Cloud Storage in Delta Sharing Server

Google Cloud Storage supports Signed URLs. It's not hard to make Delta Sharing Server support it.

Errors trying the delta-sharing access with pandas and spark

Hello team,

I'm trying to use your delta-sharing library with python for both PANDAS and SPARK

SPARK METHOD

As you can see from the below screenshot I can access the columns or schema using the load_as_spark() method

However, when I try .show() I see the following Java certificate errors

javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

I've tried to use different Java versions and also added the certificate of the delta-sharing server hosted on Kubernetes, but nothing seems to be working. If I try this from a linux machine or kubernetes I seem to have the same problem

PANDAS METHOD

When I use the load_as_pandas() method I get a FileNotFoundError. The strange thing is that when I click on that s3 url link that you can see from the screenshot, I download the parquet file (therefore the link does work). Is the PyArrow library trying to look for a local file or what do you think the issue may be ?

Any ideas about the above 2 errors using spark and pandas ?

Thank you very much,

Peter

Expose table version in metadata to make polling for [data] updates faster & easier

Looking for table version # (or timestamp) so that we can build an airflow sensor to easily detect if a new version of the data is now available.

Presto connector

If the are several data sources in the organization (such as SQL Server, DWH, etc), the data needs to be accessed to be shared without copying it firstly into a Delta lake. Using a federated data access tool such as Presto to access such a diverse set of resources would be valuable for enterprise data-sharing use-cases.

as discussed in the delta-sharing slack channel with Itai Weiss.

Add pre-signed URL request logging to server

Currently there's not a great way to understand how many pre-signed URLs are being generated by the sharing server. It would be nice if the server logged at the DEBUG level every time it generated a new pre-signed S3 URL. Having access to these logs would aid in performance testing and debugging.

Using the delta-sharing library from Anaconda Python Installation

Hi,

I'm trying to use the delta-sharing library from JupyterHub

If I try to use a Python installation from a centos machine, load_as_pandas() works
If I use an Anaconda Python installation with load_as_pandas() I get a FileNotFoundError. I've tried installing the delta-sharing and pyarrow libraries with conda-forge rather than pip, but I still get the same error

Is there anything additional that I need for the delta-sharing Anaconda installation ?

Thank you very much

Add a limit parameter to load_as_pandas

Currently load_as_pandas will load the entire table to memory. Sometimes, the user may just want to take some data from a big table to explore. We can add a limit parameter to load_as_pandas so that the user can just read small data from a shared table.

Filtering does not work

Hi,
The filtering does work as suggested here. It brings all the records...
https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#read-data-from-a-table

I am trying the following:
curl -X POST ...../delta-sharing/shares/covidshare/schemas/conformed/tables/fact_covid/query -H “Accept: application/json” -H “Content-Type: application/json” -d ’{“predicateHints”: [“date_sk >= ‘2021-01-01’“,”date_sk <= ‘2021-01-02’“], “limitHint”: 1}'

OR
curl -X POST ...../delta-sharing/shares/covidshare/schemas/conformed/tables/fact_covid/query -H “Accept: application/json” -H “Content-Type: application/json” -d ’{“predicateHints”: [“date >= ‘2021-01-01’“,”date <= ‘2021-01-02’“], “limitHint”: 1}'

OR
curl -X POST ...../delta-sharing/shares/covidshare/schemas/conformed/tables/fact_covid/query -H “Accept: application/json” -H “Content-Type: application/json” -d ‘{“limitHint”: 1}’

OR
curl -X POST http://deltasharing-dev.westeurope.azurecontainer.io:8181/delta-sharing/shares/covidshare/schemas/conformed/tables/fact_covid/query -H "Accept: application/json" -H "Content-Type: application/json" -d '{"predicateHints": ["date_sk >= 2021-01-01","date_sk <= 2021-01-02"], "limitHint": 1}'

OR
curl -X POST http://deltasharing-dev.westeurope.azurecontainer.io:8181/delta-sharing/shares/covidshare/schemas/conformed/tables/fact_covid/query -H "Accept: application/json" -H "Content-Type: application/json" -d '{"predicateHints": ["date >= 2021-01-01","date <= 2021-01-02"], "limitHint": 1}'

The query endpoint should return a 400 on invalid input

Let's just say, hypothetically speaking some idiot were to pass the wrong type of JSON object for predicateHints, the API will 500.

curl -X POST "https://sharing.delta.io/delta-sharing/shares/delta_sharing/schemas/default/tables/COVID_19_NYT/query" -H  "accept: application/x-ndjson" -H  "Authorization: Bearer faaie590d541265bcab1f2de9813274bf233" -H  "Content-Type: a
pplication/json" -d "{\"predicateHints\":{},\"limitHint\":1000}"

Since I'm the idiot, I should be given a 400 error which means that the request was incorrectly formed.

deltaSharing Spark source doesn't use statistics for count operation

The simple code that I'm using to check sizes of the shares is timeouts because it's trying to load data, although that kind of operations could be answered directly from stats returned when we query the table

import delta_sharing
profile_file = "https://github.com/delta-io/delta-sharing/raw/main/examples/open-datasets.share"
client = delta_sharing.SharingClient(profile_file)
all_tables = client.list_all_tables()

for tbl in all_tables:
    table_name = f"{tbl.share}.{tbl.schema}.{tbl.name}"
    table_url = f"{profile_file}#{table_name}"
    print(f"Going to read {table_name}")
    df = spark.read.format("deltaSharing").load(table_url)
    print(f"{table_name} count: {df.count()}")

Python Connector GET Requests does not respect maxResults and nextPageToken

https://github.com/delta-io/delta-sharing/blob/main/python/delta_sharing/rest_client.py#L259

For get requests, the json kwarg does not work. It should be using the params kwarg instead.

load_as_pandas should return a DataFrame with correct schema for an empty table

When a table is empty, load_as_pandas will return a DataFrame without schema. We should create an empty DataFrame with the table schema instead. See: https://github.com/delta-io/delta-sharing/blob/v0.1.0/python/delta_sharing/reader.py#L65

Allow passing profile in through Spark read options instead of a profile-file

This option would be more convenient for those who do not want to store credentials directly in a file.

DataSharingRestClient.list_files_in_table doesn't return version of the table

Right now, when you call DataSharingRestClient.list_files_in_table doesn't return information about version of the table for which data is returned. Of course it's possible to have a separate call to the query_table_version, but this may lead to race condition when the table is updated between these two calls

Add possibility to set endpoint-url for s3

Unfortunately, default AWS variables doesn't support endpoint url for compatible s3 storages, only through CLI "--endpoint-url". Any chances it could be implemented?

Add user agent to include the client information in the http requests

Setting the client information (e.g., client type, version) is pretty useful. The server can use this information to track client metrics and it's also easier to debug issues.

Support Azure Storage in Delta Sharing Server

Azure supports shared access signature which is similar to signed url in other storage systems. It's not hard to make Delta Sharing Server support it.

Extend Delta Sharing Protocol to include unique IDs for Shares and Tables

Unique IDs of Share and its tables inside will help the data recipient or data consumer disambiguate the name of datasets as time passes. This is especially useful when data recipient is a large organization and wants to apply access control on the shared dataset within their organization.

For example:

at time 0 a data provider shares a Share with name foo
at time 1 the data provider removes that Share foo to the recipient
at time 2 the data provider creates another Share also with name foo and shares with the recipient again, this time with different tables inside.

As an admin on the data recipient side, I would expect to be able to distinguish the old and new Share in this case and decide with whom to further share the data within my organization. Table ID serve the same purpose.

Note that the IDs should be optional to Share and Tables for backward compatibility. The IDs should be immutable throughout the resource's lifecycle.

Delta sharing container on AWS ecs getting access denied error even with all s3 permissions there

Delta sharing container on ecs getting access denied error even with all Iam s3 and kms permission there for the bucket on the ecs service .

Error

io.delta.sharing.server.DeltaInternalException: java.util.concurrent.ExecutionException: java.nio.file.AccessDeniedException:  s3a://foo-lake/foo/foo_fact/_delta_log: getFileStatus on s3a://foo-lake/foo/foo_fact/_delta_log: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C0DZXXXYNWKQ9CWD; S3 
Extended Request ID: eDvJMbR8UtIRDg8nXD7+0ix04VN8UPsVSEJDIBosFC5u/YJPsnAGpm/hvGdrXteQBpeQNu5DW9Q=), S3 Extended Request ID: eDvJMbR8UtIRDg8nXD7+0ix04VN8UPsVSEJDIBosFC5u/YJPsnAGpm/hvGdrXteQBpeQNu5DW9Q=
--
@timestamp | 1635392335291

Also


(s3a://foo-lake/foo/foo_fact/_delta_log/_delta_log/_last_checkpoint is corrupted.
 Will search the checkpoint files directly,java.nio.file.AccessDeniedException: s3a://foo-lake/foo/foo_fact/_delta_log/_last_checkpoint: getFileStatus on 
s3a://wfg1stg-datahub-lake/ovdm/tran_fact/_delta_log/_last_checkpoint: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: C0DM6WHHRX0RR7DG; S3 Extended Request ID: yEdeFmnz49jLTXu+LzqoNZqFy0sK8X3ge0p7Gmp5ia9FVFjWN7/HLLZ5sWatqfi6cDH0ZRGGf9s=), S3 Extended Request ID: yEdeFmnz49jLTXu+LzqoNZqFy0sK8X3ge0p7Gmp5ia9FVFjWN7/HLLZ5sWatqfi6cDH0ZRGGf9s=)
--

Shouldn't delta-server already use EC2ContainerCredentialsProviderWrapper or if there is a way to still configure this

Pyspark RESOURCE_LIMIT_EXCEEDED error

Hi,
I am not sure this issue is related to delta-sharing, but I can not use any more delta-sharing from Spark because of the systematic error:

22/01/14 22:10:51 INFO DataSourceStrategy: Pruning directories with:
22/01/14 22:10:51 INFO FileSourceStrategy: Pushed Filters:
22/01/14 22:10:51 INFO FileSourceStrategy: Post-Scan Filters:
22/01/14 22:10:51 INFO FileSourceStrategy: Output Data Schema: struct<_id: struct<oid: string>, assetID: string, attributeID: string, count: int, data: array<struct<complexValue:struct<value:string>,ts:timestamp,value:string>> ... 6 more fields>
22/01/14 22:10:51 INFO CodeGenerator: Code generated in 181.774052 ms
22/01/14 22:10:51 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 194.5 KiB, free 434.2 MiB)
22/01/14 22:10:52 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 34.5 KiB, free 434.2 MiB)
22/01/14 22:10:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on skymac-c02z30nnlvcf.home:53134 (size: 34.5 KiB, free: 434.4 MiB)
22/01/14 22:10:52 INFO SparkContext: Created broadcast 0 from showString at NativeMethodAccessorImpl.java:0
Traceback (most recent call last):
  File "spark-delta-sharing_extract.py", line 30, in <module>
    df.show()
  File "/usr/local/Cellar/apache-spark/3.2.0/libexec/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 494, in show
  File "/usr/local/Cellar/apache-spark/3.2.0/libexec/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1309, in __call__
  File "/usr/local/Cellar/apache-spark/3.2.0/libexec/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/usr/local/Cellar/apache-spark/3.2.0/libexec/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('{' (code 123)): was expecting double-quote to start field name
 at [Source: (String)"{{"errorCode":"RESOURCE_LIMIT_EXCEEDED","message":"The table metadata size exceeded limits"}"; line: 1, column: 3]
	at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710)
	at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:635)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddName(ReaderBasedJsonParser.java:1814)
	at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:713)
	at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:191)
	at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:322)
	at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4593)
	at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3548)
	at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue(ScalaObjectMapper.scala:206)
	at com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue$(ScalaObjectMapper.scala:205)
	at io.delta.sharing.spark.util.JsonUtils$$anon$1.readValue(JsonUtils.scala:26)
	at io.delta.sharing.spark.util.JsonUtils$.fromJson(JsonUtils.scala:38)
	at io.delta.sharing.spark.DeltaSharingRestClient.$anonfun$getFiles$1(DeltaSharingClient.scala:216)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at io.delta.sharing.spark.DeltaSharingRestClient.getFiles(DeltaSharingClient.scala:216)
	at io.delta.sharing.spark.RemoteSnapshot.filesForScan(RemoteDeltaLog.scala:197)
	at io.delta.sharing.spark.RemoteDeltaFileIndex.listFiles(RemoteDeltaLog.scala:266)
	at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:234)
	at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions(DataSourceScanExec.scala:229)
	at org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions$lzycompute(DataSourceScanExec.scala:264)
	at org.apache.spark.sql.execution.FileSourceScanExec.dynamicallySelectedPartitions(DataSourceScanExec.scala:245)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:432)
	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:417)
	at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:495)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:526)
	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:454)
	at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:453)
	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:497)
	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:50)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:750)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:184)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:222)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:219)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:180)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:325)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:443)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)

22/01/14 22:11:07 INFO SparkContext: Invoking stop() from shutdown hook
22/01/14 22:11:07 INFO SparkUI: Stopped Spark web UI at http://skymac-c02z30nnlvcf.home:4040
22/01/14 22:11:07 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/01/14 22:11:07 INFO MemoryStore: MemoryStore cleared
22/01/14 22:11:07 INFO BlockManager: BlockManager stopped
22/01/14 22:11:07 INFO BlockManagerMaster: BlockManagerMaster stopped
22/01/14 22:11:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/01/14 22:11:07 INFO SparkContext: Successfully stopped SparkContext
22/01/14 22:11:07 INFO ShutdownHookManager: Shutdown hook called

The code looks like

delta_sharing_conf_path = "/path/conf.share"

share = "test"
schema = "schema"
table = "table"

table_path = delta_sharing_conf_path + f"#{share}.{schema}.{table}"

sc = SparkContext('local')
spark = SparkSession(sc)

df = spark.read.format("deltaSharing").load(table_path)
df.show()

The issue concerns all commands with spark and the packages io.delta:delta-sharing-spark_2.12:0.2.0. I have the same error with pyspark --packages io.delta:delta-sharing-spark_2.12:0.2.0 with the same basic commands.

I was able to request and access data from Spark delta-sharing on the 13th of January but not anymore.

I am using a MacBook Pro 2019 up-to-date, with apache-spark (3.2.0) and openjdk@11 (11.0.12).

Add a new API to Delta Sharing Protocol to list all tables in a share

Listing all the tables inside a Share is a common metadata retrieval operation. Given the current Delta Sharing Protocol, however, a client has to make one ListSchema API and multiple ListTables API calls in order to list all the tables due to the fact that ListTables API has the scope of a single schema. This may introduce higher traffic load to the sharing server as well as usage frictions to the client, especially when there is a high number of schemas inside a share.

Improve the error message in the client when a share/schema/table doesn't exist

When a share/schema/table doesn't exist, we currently report http 404 error. It would be great if our client can wrap the error with a user friendly message such as [Share/Schema/Table] XXX does not exist, please contact your share provider for further information.

class mis-spelling in rest_client.py

This class name ListSharesRespons is miss-spelled, missing ending 'e'.

Add GetShare API to Delta Sharing Protocol

Propose adding a GetShare REST API to the Delta Sharing Protocol.

Proposed API specification:

HTTP Parameters	Value
HTTP Method	GET
Header	Authorization: Bearer {token}
URL	{prefix}/shares/{share}
URL Parameters	{share}: The share name to query. It's case-insensitive.
Response Header	Content-Type: application/json; charset=utf-8
Response Body	{ "name": string "id": string }

Use case:

A setup pattern of data consumer in delta sharing is they will import all the tables under a Share into a special catalog or schema on their own data warehouse / lakehouse. When performing metadata level operation on those catalogs/schemas, the data consumer would like to know if the Share which the catalog/schema is based on is still being shared. A GetShare API serves for purpose of querying a Share by its name for a data consumer to know if they still have access to that Share. If the API returns 404 NOT FOUND, it means the Share is revoked from the current data consumer.

Dockerize Delta Sharing Server

We can dockerize Delta Sharing Server to simplify the server setup work.

Performance improvement

Currently 200 K rows, 100 columns can be fully loaded in 20-40 minutes using load_as_pandas python package.
Besides, Filtering is not working properly (predicate hints, etc) with this option.
Improving the performance is currently only possible by partitioning delta tables monthly rather than daily.

HDFS table path

Greetings,

I have configured the server yaml to serve this HDFS delta file:

name: "share1"
schemas:
- name: "schema1"
  tables:
  - name: "delta-table"
    location: "hdfs://localhost/user/juan/delta-table"

But when I try to load the file table, I get the following error:

160985 [armeria-common-worker-epoll-2-2] ERROR io.delta.sharing.server.DeltaSharingServiceExceptionHandler - com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalStateException: File system class org.apache.hadoop.hdfs.DistributedFileSystem is not supported

So, my question is if HDFS is really not support or I am missing something in the server configuration.

Thanks a lot!

Connection broken: Incomplete Read error for large datasets

If the data volumes grow, the load_as_pandas gives the following error:

client = delta_sharing.SharingClient(path)
df = delta_sharing.load_as_pandas(table_url)

The delta lake data volumes:

175 columns
330 K rows

java.lang.NullPointerException when reading delta file on Google Cloud Storage

I have config.yaml like this:

version: 1
shares:
- name: "share1"
  schemas:
  - name: "default"
    tables:
    - name: "delta-table"
      location: "gs://xxxx/delta-table" #"gs://<bucket-name>/<the-table-path>"
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
preSignedUrlTimeoutSeconds: 3600
deltaTableCacheSize: 10
stalenessAcceptable: false
evaluatePredicateHints: false
authorization:
  bearerToken: "abcabc"

, and already have Google Service Account key (which has permission to view Cloud Storage)

, and the gs://xxxx/delta-table is a delta table

, My docker run is like this (Docker Desktop on Mac M1):

docker run \
--volume=/tmp/keys/delta-sharing-9864a6b8425e.json:/tmp/keys/delta-sharing-9864a6b8425e.json \
--env=GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/delta-sharing-9864a6b8425e.json \
-p 8080:8080 -v /Users/lap02429/Downloads/delta_test/config:/mnt  --platform linux/amd64  deltaio/delta-sharing-server:0.4.0 -- --config /mnt/delta-sharing-server-config.yaml

And the image can run
but when I run this

table_url = profile_file + "#share1.default.delta-table"
data = delta_sharing.load_as_pandas(table_url)
print(data.head(10))

it has error like this:

io.delta.sharing.server.DeltaInternalException: java.lang.NullPointerException
	at io.delta.sharing.server.DeltaSharingService.processRequest(DeltaSharingService.scala:138)
	at io.delta.sharing.server.DeltaSharingService.listFiles(DeltaSharingService.scala:226)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.linecorp.armeria.internal.server.annotation.AnnotatedService.invoke(AnnotatedService.java:338)
	at com.linecorp.armeria.internal.server.annotation.AnnotatedService.lambda$serve0$2(AnnotatedService.java:285)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
	at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
	at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
	at com.linecorp.armeria.internal.server.annotation.AnnotatedService.serve0(AnnotatedService.java:289)
	at com.linecorp.armeria.internal.server.annotation.AnnotatedService.serve(AnnotatedService.java:262)
	at com.linecorp.armeria.server.RouteDecoratingService.serve(RouteDecoratingService.java:92)
	at com.linecorp.armeria.server.RouteDecoratingService.serve(RouteDecoratingService.java:67)
	at com.linecorp.armeria.server.auth.AuthService.handleSuccess(AuthService.java:118)
	at com.linecorp.armeria.server.auth.AuthService.lambda$serve$0(AuthService.java:99)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
	at com.linecorp.armeria.common.RequestContext.lambda$makeContextAware$3(RequestContext.java:522)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Caused by: java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:889)
	at com.google.cloud.storage.BlobId.of(BlobId.java:108)
	at io.delta.sharing.server.GCSFileSigner.sign(CloudFileSigner.scala:209)
	at io.delta.standalone.internal.DeltaSharedTable.$anonfun$query$2(DeltaSharedTableLoader.scala:165)
	at scala.collection.immutable.Stream.map(Stream.scala:418)
	at io.delta.standalone.internal.DeltaSharedTable.$anonfun$query$1(DeltaSharedTableLoader.scala:163)
	at io.delta.standalone.internal.DeltaSharedTable.withClassLoader(DeltaSharedTableLoader.scala:107)
	at io.delta.standalone.internal.DeltaSharedTable.query(DeltaSharedTableLoader.scala:133)
	at io.delta.sharing.server.DeltaSharingService.$anonfun$listFiles$1(DeltaSharingService.scala:232)
	at io.delta.sharing.server.DeltaSharingService.processRequest(DeltaSharingService.scala:135)
....

When I run with spark-shell:

val tablePath = "/Users/test.share#share1.default.delta-table"
val df = spark.read.format("deltaSharing").load(tablePath)
df.printSchema()

=> It still prints the schema of delta table in gcs, but can not print out the data, the error still the same that I pointed above: NullPointerException.

Can anyone help me with this? Thanks so so much!

Upgrade Delta Standalone to 0.3.0

Delta Standalone 0.3.0 is out. We can upgrade to this new version and use the new scan API to load Delta transaction logs.

Support writing to Delta Shares

Since the underlying method of sharing out data uses secure signed URLs, the same process can be used in reverse to collect files for ingestion. Leveraging an extended delta sharing protocol would allow secure, unlimited bandwidth fast uploads of data, from any platform to any cloud object store. Signed URLs can be used by curl, python requests module and any other http library.

This could be used by public health entities, SaaS/IoT companies, Pharma/ Research, Retailers, gas/oil companies to collect data from far flung locales for large scale, traceable, secure streaming data ingestion scenarios.

May need to limit semantics to Append only writes to the underlying share.

Read ".parquet" format files in S3

Hi,

I have a question about the delta-sharing project. I am using Python

I have some files stored in S3 in '.parquet' format. I want to be able to load the content of these files directly into a pandas dataframe without having to download them

The parquet format files I need to read, look as follows:

b'PAR1\x15\x00\x15\xb0' etc ...

Can the delta-sharing project read files in .parquet format or only in JSON format ? The reason why I'm asking is because in the code I see many "from_json" and "json.loads" methods

In the file "profile-file-path", for the "endpoint" key of the dictionary I'm using the S3 file location of the parquet file. Is that correct or do I need to use another endpoint ?

I've tried calling the "client.list_schemas()" or "client.list_shares()" or "client.list_tables()" to find out the final "table_url" that I need for the "load_as_pandas" method. The problem is that with the parquet S3 file location "endpoint" I get an error because this content fetched is parquet and not json

Thank you very much,

Alison

Load_as_Pandas gives defragmented frame as warning

2022-03-28T10:28:56 Welcome, you are now connected to log-streaming service.Starting Log Tail -n 10 of existing logs ----/appsvctmp/volatile/logs/runtime/f63aaea16e4517972f71ef9508dc154bdf3b8368adf4071191264aa28859fc79.log
2022-03-28T10:28:54.930985032Z: [INFO] }
2022-03-28T10:28:54.930990432Z: [INFO] }
2022-03-28T10:28:56.077294348Z: [INFO] warn: Host.Function.Console[0]
2022-03-28T10:28:56.077456950Z: [INFO] /home/site/wwwroot/.python_packages/lib/site-packages/delta_sharing/reader.py:125: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
2022-03-28T10:28:56.077596252Z: [INFO] info: Host.Function.Console[0]
2022-03-28T10:28:56.077606953Z: [INFO] pdf[col] = converter(add_file.partition_values[col])
2022-03-28T10:28:56.203313720Z: [INFO] warn: Host.Function.Console[0]
2022-03-28T10:28:56.203352220Z: [INFO] /home/site/wwwroot/.python_packages/lib/site-packages/delta_sharing/reader.py:125: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
2022-03-28T10:28:56.203359120Z: [INFO] info: Host.Function.Console[0]
2022-03-28T10:28:56.203363820Z: [INFO] pdf[col] = converter(add_file.partition_values[col])Ending Log Tail of existing logs ---Starting Live Log Stream ---ol])Ending Log Tail of existing logs ---Starting Live Log Stream ---22-03-28T10:28:56.338723923Z: [INFO] /home/site/wwwroot/.python_packages/lib/site-packages/delta_sharing/reader.py:125: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
2022-03-28T10:28:56.338754124Z: [INFO] info: Host.Function.Console[0]
2022-03-28T10:28:56.338761824Z: [INFO] pdf[col] = converter(add_file.partition_values[col])

Can a new release be published

While trying out delta-sharing:1.0.0 I ran into some bugs which are resolved by a few merged PR's in main.

The most important (to me) are:

Able to add hadoop core-site.xml config to the delta sharing server #42
Fix load_as_pandas on table without stats #30
Use docker image from sbt docker:publishLocal

would be lovely to have a new release (1.1.0 or 2.0.0??) ( for the python/scala packages at least. A docker image on docker hub would be even more awesome

Python 3.10

I can't install on Python 3.10, it seems this version of Python only fixed a bug, so I do not see the problem of restricting to version < 3.10.
I changed the restriction and install it on Python 3.10 and all worked perfectly splendid

Doc on how to create a "share"?

If I'm self hosting the server, getting it to read S3, how do I create a share? I've checked https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md - which focuses on list share, query table etc.

Connection Issue

Whenever i am trying to read the delta table from s3 using load_as_pandas function i am getting a connection issue in ec2 instance. Following is the issue:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
raise err
File "/usr/local/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/lib64/python3.7/http/client.py", line 1277, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib64/python3.7/http/client.py", line 1323, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib64/python3.7/http/client.py", line 1272, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib64/python3.7/http/client.py", line 1032, in _send_output
self.send(msg)
File "/usr/lib64/python3.7/http/client.py", line 972, in send
self.connect()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=5044): Max retries exceeded with url: /delta-sharing/test/shares/share1/schemas/schema1/tables/table1/query (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/delta_sharing.py", line 61, in load_as_pandas
rest_client=DataSharingRestClient(profile),
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/reader.py", line 62, in to_pandas
self._table, predicateHints=self._predicateHints, limitHint=self._limitHint
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 84, in func_with_retry
raise e
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 77, in func_with_retry
return func(self, *arg, **kwargs)
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 182, in list_files_in_table
f"/shares/{table.share}/schemas/{table.schema}/tables/{table.name}/query", data=data,
File "/usr/lib64/python3.7/contextlib.py", line 112, in enter
return next(self.gen)
File "/home/ec2-user/.local/lib/python3.7/site-packages/delta_sharing/rest_client.py", line 204, in _request_internal
response = request(f"{self._profile.endpoint}{target}", json=data)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 590, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=5044): Max retries exceeded with url: /delta-sharing/test/shares/share1/schemas/schema1/tables/table1/query (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fb6514c1c10>: Failed to establish a new connection: [Errno 111] Connection refused'))

Data sharing based on data access policies (Ranger, etc)

Ability to configure data sharing based on policies rather than Delta Sharing based configuration would be helpful.

fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties must be present

Even after having IAM role access to S3 and specifying AWS_ACCESS_KEY and AWS_SECRET_ACCESS_KEY. I am having this error in server logs.
As it uses hadoop-aws to read files, how do we authenticate and pass credentials in server?

Before running the below command, where should I specify fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties?
bin/delta-sharing-server -- --config conf/delta-sharing-server.yaml

Investigate how to test Delta Sharing connector with Delta Sharing Server without touching real cloud storage

Today when we want to write a test to use Delta Sharing connector to talk with Delta Sharing Server, we need to set up Delta Sharing Server with real cloud storage credentials. It would be great to investigate whether we can write such tests without touching real cloud storage so that it's much easier to run these tests.

For example, we can try to create a test mode for Delta Sharing Server which can become a proxy of local files. Then Delta Sharing connector can just request files from Delta Sharing Server to read local files.

Update the list of shared tables dynamically

We have a delta-lake bucket with thousands of tables, total set are updated several times a day(it is a multitenant system, each tenant has own bunch of separate delta-tables). Separate service(our internal python ml tool) needs to read this data.
Delta sharing server looks great, but we find it quite difficult to use static shares configuration for this case. Do you have any plan to support dynamic tables resolving?

I can think two possible ways to do this

Delta server periodically call webhook that returns the actual set of tables(each with own s3a:// path and full logical name #share.schema.table)
Define tables via yaml-config by wildcard, for example s3a://some-bucket/tenant_*/user_actions and update actual set periodically on delta-server side.

delta-io / delta-sharing Goto Github PK

delta-sharing's People

Contributors

Stargazers

Watchers

Forkers

delta-sharing's Issues

Recommend Projects

Recommend Topics

Recommend Org