# help
m
Hey everyone! I'm working with lakeFS in a Glue/PySpark environment, so I've set the spark.hadoop.fs.s3a.endpoint property to the lakeFS endpoint. This works great for data already in a lakeFS repo, but is there any way to still access a "normal" S3 path within the same session? I'd like to create a dataframe from some data at a "normal" S3 path and use it alongside the lakeFS-managed data. If not, is the only reasonable path to first import the data at the normal S3 path into the lakeFS repo? Thanks for any insight you have.
n
@Michael Gaebel Hi, I'm not a Spark expert, but as far as I understand, once you've configured the s3a endpoint to point at lakeFS, all interaction via the S3-compatible API goes through lakeFS (its S3 gateway). Maybe my colleagues (@Ariel Shaqed (Scolnicov), @Yoni Augarten) can enlighten us on solutions I'm not aware of. Regarding the import solution: it's a possibility and will probably work if you plan on only reading the imported data. Keep in mind that if you intend to modify the data, it should be done only in the context of lakeFS, and once done, the changes will exist only in the lakeFS namespace (the import is a zero-copy operation and lakeFS does not write to the source).
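A sketch of that import route, in case it helps (repo, branch, and bucket names are placeholders, and the exact flags can differ between lakectl versions, so check lakectl import --help):
```
lakectl import \
  --from s3://raw-bucket/incoming/events/ \
  --to lakefs://example-repo/main/raw/events/
```
This only registers the objects' metadata in the repo; as Niro says, the data itself stays in the source bucket.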
a
Hi @Michael Gaebel, what Hadoop version are you on? Almost anything still in use should support per-bucket configuration (see link). Could you try configuring credentials and endpoint separately for your S3 buckets and for your lakeFS repos, which S3A just sees as other buckets?
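For illustration, a per-bucket setup in PySpark might look roughly like this (assuming Hadoop 2.8+ per-bucket options; example-repo, raw-bucket, the endpoint, and the keys are placeholders, and in Glue the same settings could go into the job's --conf parameters instead):
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Override the endpoint/credentials only for the lakeFS repo, which S3A treats as a bucket.
    .config("spark.hadoop.fs.s3a.bucket.example-repo.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.bucket.example-repo.access.key", "<lakeFS access key>")
    .config("spark.hadoop.fs.s3a.bucket.example-repo.secret.key", "<lakeFS secret key>")
    .config("spark.hadoop.fs.s3a.bucket.example-repo.path.style.access", "true")
    # The global fs.s3a.endpoint is left alone, so other buckets still go to plain S3.
    .getOrCreate()
)

# lakeFS-managed data via the S3 gateway: s3a://<repo>/<branch>/<path>
managed_df = spark.read.parquet("s3a://example-repo/main/tables/events/")
# "normal" S3 data in the same session
raw_df = spark.read.parquet("s3a://raw-bucket/incoming/events/")
```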
m
@Niro We would only be reading it, as this would be an upsert operation into a lakefs-managed table from the raw S3 files.
a
The alternative is RouterFS, which our @Tal Sofer and @Jonathan Rosenberg set up to do precisely this.
m
oh perfect! I'll do some reading and see what will work
a
Or (only 2 more, I promise!) just use LakeFSFileSystem to access lakeFS on lakefs:// URLs. See https://docs.lakefs.io/integrations/spark.html#lakefs-hadoop-filesystem for that.
👀 1
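If it helps, a minimal sketch of that option in PySpark, assuming the lakeFS filesystem jar (io.lakefs:hadoop-lakefs-assembly) is available on the cluster; the endpoint, keys, repo and bucket names are placeholders:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register the lakeFS Hadoop filesystem for lakefs:// URLs.
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakeFS access key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakeFS secret key>")
    # s3a:// keeps its default (plain S3) endpoint and credentials, e.g. the Glue job role,
    # which LakeFSFileSystem also uses to read the underlying objects directly from S3.
    .getOrCreate()
)

managed_df = spark.read.parquet("lakefs://example-repo/main/tables/events/")
raw_df = spark.read.parquet("s3a://raw-bucket/incoming/events/")
```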
Last one: I don't know if this works on all Spark versions, but you might try to configure s3b:// URLs (or any other prefix, even lakefs://...) to use the S3AFileSystem but with a different endpoint and credentials. That way those URLs will go to your lakeFS.
Which to choose? I think it usually boils down to 2 questions:
• Usability: do you want or need all URLs to start with "s3a://"? Usually I prefer not to, so I go and use LakeFSFileSystem. But we've had users who needed this, for instance because they had existing Spark programs and configurations that they wanted to change as little as possible. We produced RouterFS for this kind of usage; I don't think we see it too much.
• Can you install jars on your Spark cluster? If not, you need one of the two vanilla S3A solutions.
I hope one of these is enough to get you going. Please let us know how you get along... welcome :sunglasses: :lakefs: to the lake!
m
Awesome, thanks for all the options. I'll let you know which one we go with
n
Thanks @Ariel Shaqed (Scolnicov) for always coming through!
jumping lakefs 2