user — 03/23/2022, 1:19 PM
spark.hadoop.fs.s3a.endpoint
In that case you would be using lakeFS's S3-compatible endpoint through the S3A FileSystem that already ships with Databricks: instead of accessing S3, S3A accesses lakeFS, and all the data goes through your lakeFS server. For your Spark job to write the data directly to S3 (your underlying storage) instead, lakeFS provides the lakeFS-specific Hadoop FileSystem.
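A minimal sketch of the first mode (all traffic through the lakeFS server). The endpoint URL and credential placeholders below are assumptions for illustration, not values from this conversation:

```python
# Sketch: routing Spark's S3A traffic through the lakeFS server by
# overriding the S3A endpoint. Endpoint URL and credentials are
# placeholder assumptions.
gateway_conf = {
    # Point S3A at lakeFS instead of AWS S3; every read and write
    # then passes through the lakeFS server.
    "spark.hadoop.fs.s3a.endpoint": "https://lakefs.example.com",
    # lakeFS credentials stand in for AWS credentials here.
    "spark.hadoop.fs.s3a.access.key": "<lakefs-access-key-id>",
    "spark.hadoop.fs.s3a.secret.key": "<lakefs-secret-access-key>",
    # Repositories are addressed like buckets: s3a://<repo>/<branch>/path
    "spark.hadoop.fs.s3a.path.style.access": "true",
}

# Render the settings as spark-submit flags.
flags = [f"--conf {key}={value}" for key, value in gateway_conf.items()]
print("\n".join(flags))
```

The same keys can equally be set on a `SparkSession` builder; the point is only that S3A's endpoint is redirected at lakeFS.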
user — 03/24/2022, 8:38 AM
Given a path lakefs://repo/branch/path/to/obj, lakeFSFS will ask lakeFS to look up path /path/to/obj on branch branch of the repository named repo. lakeFS will reply with the object's metadata, which includes a path s3://storage/namespace/opaqueopaqueopaque ("opaque" means that portion of the path is essentially meaningless). Now lakeFSFS calls out to the Hadoop S3AFileSystem to open that path. Writing is a bit more complex, but the details are the same: ask lakeFS for somewhere to write, upload the file directly to S3, then tell lakeFS to link that file into its new location on whatever branch.
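The read path described above can be sketched in plain Python. `stat_object` and `s3a_open` are hypothetical stand-ins for the lakeFS metadata lookup and for handing the physical path to Hadoop's S3AFileSystem — not the real APIs:

```python
# Sketch of the lakeFSFS read path: metadata from lakeFS, data from S3.

def stat_object(repo: str, branch: str, path: str) -> dict:
    # Stand-in for asking lakeFS for the object's metadata. The reply
    # carries the physical location in the underlying store; the suffix
    # is opaque (essentially meaningless).
    return {"physical_address": "s3://storage/namespace/opaqueopaqueopaque"}

def s3a_open(physical_address: str) -> str:
    # Stand-in for the Hadoop S3AFileSystem opening the physical path:
    # the bytes come directly from S3, bypassing the lakeFS server.
    return f"reading {physical_address} via S3AFileSystem"

def lakefsfs_open(uri: str) -> str:
    # lakefs://repo/branch/path/to/obj -> (repo, branch, path/to/obj)
    repo, branch, path = uri.split("://", 1)[1].split("/", 2)
    meta = stat_object(repo, branch, path)      # metadata via lakeFS
    return s3a_open(meta["physical_address"])   # data directly from S3

print(lakefsfs_open("lakefs://repo/branch/path/to/obj"))
```

The write path inverts the last two steps: ask lakeFS for a physical address, upload there directly, then tell lakeFS to link the uploaded object to its logical path on the branch.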
user — 03/24/2022, 1:28 PM
s3a
you could find the code here
user — 03/24/2022, 1:51 PM
Makes sense to have both. @Edmondo Porcu this is certainly a future plan for us. We would love to hear about the use cases you and @Micha Kunze have!