# help
Hi there, I'm looking into the lakeFS integration with Delta Lake and wanted to know: is there a way to use the Delta Lake Python SDK together with the lakeFS Python SDK in pre-signed URL mode, so that my S3 files are read and written only by the SDKs running in my code, and not by the lakeFS server itself?
If it helps, I plan to use the Polars library for Python, which has Delta Lake Python SDK integration built in
Perhaps I could make the Delta Lake SDK write to the local file system and use lakefs-spec to simulate the file system? I saw this GitHub thread re. lakefs-spec support for pre-signed URL mode: https://github.com/treeverse/lakeFS/discussions/6469#discussioncomment-6885253
Hi @Eldar Sehayek, if you’re using Spark with the Delta Lake Python SDK, you can leverage the lakeFS Hadoop Filesystem to support exactly that. Basically the combination of 1. Hadoop conf in the Delta SDK and 2. the lakeFS Filesystem should do the job. If you’re not using Spark, what query engine do you use?
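For readers following this thread, here is a sketch of the Spark-side configuration the reply above describes, expressed as a plain dict of Spark conf entries. The key names follow the lakeFS Hadoop Filesystem documentation; the endpoint, credentials, repository, and table path are placeholders, not values from this conversation:

```python
# lakeFS Hadoop Filesystem configuration for Spark, as conf entries.
# All values below are placeholders -- substitute your own deployment's.
spark_conf = {
    "spark.hadoop.fs.lakefs.impl": "io.lakefs.LakeFSFileSystem",
    "spark.hadoop.fs.lakefs.access.key": "AKIAIOSFODNN7EXAMPLE",
    "spark.hadoop.fs.lakefs.secret.key": "lakefs-secret-placeholder",
    "spark.hadoop.fs.lakefs.endpoint": "https://lakefs.example.com/api/v1",
}

# With this filesystem registered, Delta tables are addressed via lakeFS URIs:
table_path = "lakefs://my-repo/main/tables/t1"
```

In this setup the lakeFS Hadoop Filesystem handles metadata through the lakeFS API while data reads and writes go directly against the underlying object store, which is what makes it relevant to the pre-signed/direct-access question asked above.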
Hi @Jonathan Rosenberg I'm not using Spark; I use AWS Batch to distribute data processing workloads between containers as part of a distributed job. For each distributed job there are N workers and the work is divided between them: all should read from table T1, process, and write to table T2. I plan to use the Polars library to handle DataFrames in every worker
I wish to use the Spark-independent Delta Lake Python SDK called delta-rs; Polars and pyarrow will be there as well in case using either of them helps. Both can perform writes in Delta Lake format, either to an S3 object store or to the local file system. I looked at the Delta Lake Python SDK's support for the pyarrow writer to the local file system, and it looks like plugging in lakefs-spec as the file system should work, since it uses pre-signed URLs by default. Any idea if that works?
if you’re using delta-rs you can configure the `storage_options` of the delta table constructor to point to lakeFS and use the lakeFS API key and secret: https://delta-io.github.io/delta-rs/python/usage.html The values are something like:
```
AWS_ACCESS_KEY_ID: lakefs_api_key,
AWS_ENDPOINT_URL: lakefs_endpoint,
AWS_SECRET_ACCESS_KEY: lakefs_secret_access_key,
AWS_REGION: "us-east-1"
```
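To make the snippet above concrete, here is a minimal sketch of how those `storage_options` could be passed to delta-rs's `DeltaTable`. The endpoint, credentials, repository, and table path are all placeholders (and, as noted later in this thread, requests through this route hit lakeFS's S3 gateway rather than using pre-signed URLs):

```python
# Placeholder endpoint and credentials -- substitute your own lakeFS values.
storage_options = {
    "AWS_ACCESS_KEY_ID": "AKIAIOSFODNN7EXAMPLE",           # lakeFS access key ID
    "AWS_SECRET_ACCESS_KEY": "lakefs-secret-placeholder",  # lakeFS secret key
    "AWS_ENDPOINT_URL": "https://lakefs.example.com",      # lakeFS server (S3 gateway)
    "AWS_REGION": "us-east-1",
}

# Against the S3 gateway, paths take the form s3://<repository>/<branch>/<path>.
table_uri = "s3://my-repo/main/tables/t1"

def open_table(uri: str, opts: dict):
    """Open a Delta table through lakeFS's S3 gateway (sketch only,
    not verified against a live server)."""
    # Imported lazily so the config above can be inspected without
    # the deltalake package installed.
    from deltalake import DeltaTable
    return DeltaTable(uri, storage_options=opts)
```

Polars can take the same `storage_options` dict, since its Delta Lake support is built on delta-rs.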
Thank you, this is very helpful! With this, will I be using a path like "lakefs://repo/branch/file", as with the lakeFS client directly?
you’ll need to use
but it will refer to your lakeFS server
Which generates a pre-signed URL for the client to use, got it. Thank you!
let me know if that helped 🙏
@Jonathan Rosenberg Thought you may wish to know: the delta-rs key AWS_STORAGE_ALLOW_HTTP seems to be called AWS_ALLOW_HTTP in the library's source code https://github.com/delta-io/delta-rs/blob/9264edea89a2fc1c35f4a6b9faab125748ff3651/crates/deltalake-aws/src/storage.rs#L363C7-L363C7 Also, I read that this flag allows HTTP instead of HTTPS; is it necessary to work in Prod, or did you add it for test code?
No, it’s not necessary at all… it’s just an example that I had. Sorry for the confusion
Hi @Eldar Sehayek, I was actually wrong when suggesting the mentioned configuration for pre-signed URLs: the request from delta-rs will reach lakeFS’s S3 gateway (the S3 compatibility layer), which doesn’t support pre-signed URLs, and currently we don’t have built-in support for delta-rs 😔
Hi @Jonathan Rosenberg Thank you for letting me know. Is there an open issue about it I can promote? This has an impact on our considerations re. using lakeFS
Hi guys, I'm one of the maintainers of delta-rs. I'm also looking to use lakeFS for our projects at work, and I'm wondering whether the S3 gateway client lakeFS uses requires a locking client as well. Basically, our storage is Azure, which doesn't have this notion since it supports concurrent writers by design. However, S3 requires locking clients like DynamoDB. So I'm wondering if you need the locking client as well when you use the S3 emulator, even if the underlying storage is Azure
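For context on the locking question above: when delta-rs commits to S3 or an S3-compatible endpoint, it expects either a locking provider or an explicit opt-out, configured through the same `storage_options` mechanism discussed earlier. A hedged sketch of the two relevant settings (key names per the delta-rs docs; the table name is a placeholder, and the exact key for it has varied across delta-rs versions, so check the docs for your release):

```python
# Option 1: delegate Delta commit locking to DynamoDB, delta-rs's
# S3 locking provider for safe concurrent writers.
locking_options = {
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log_lock",  # placeholder table name
}

# Option 2: opt out of locking entirely. Only safe when a single writer
# ever touches the table, since S3-style rename is not atomic.
single_writer_options = {
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
```

Whether lakeFS's S3 gateway changes this picture when the backing store is Azure is exactly the open question posed above; the sketch only shows what delta-rs itself expects on the S3 path.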