# help
u
Hey! I have a couple questions: 1. Does the export functionality work for zero-copy repositories? 2. Is there any way to export via a Python API? I was looking at the spark-submit pip package
u
I might also be missing the best way to implement what I'm trying to do. Essentially, I want to do a sort of
git checkout
on an S3 bucket with zero-copy
u
Hey @Conor Simmons 👋 Let me check that for you
u
1. export should work for zero-copy repositories, have you experienced any issue? 2. we do have a Spark example in our export documentation; unfortunately, I don't think we support the export command using our Python SDK. I'm not sure I follow the use-case you mentioned: you'd like to export data out of lakeFS to be managed as plain objects (with zero-copy), or you'd like to manage an existing S3 bucket within lakeFS with zero-copy?
u
mind sharing more information?
u
or you'd like to manage an existing S3 bucket within lakeFS with zero-copy
I think it would be this. Here's an example:
commit 1:
• add hello_world.txt to s3://my/bucket/path
• commit to LakeFS with import from s3 bucket
commit 2:
• delete hello_world.txt from s3://my/bucket/path
• add hello_world_2.txt to s3://my/bucket/path
• commit to LakeFS with import from s3 bucket
Now, in my s3 bucket, I want to "checkout" commit 1, so I should see hello_world.txt instead of hello_world_2.txt in my s3 bucket
u
Thank you for the detailed explanation. One question to clarify: when you add or delete objects, you do this directly against the S3 bucket, right?
u
If that's the case, it's not recommended, as lakeFS can't guarantee the files will be reachable in the underlying storage. If you let lakeFS manage the files, I suggest not interacting directly with the object store, or accepting the risk of files not being found in the underlying storage. If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then once you check out commit 1, the files will be accessible (as long as they weren't deleted from the underlying object store). Does this answer your question?
u
you do this directly against the S3 bucket, right?
Yes, isn't this the only way to do zero-copy?
If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then once you check out commit 1, the files will be accessible (as long as they weren't deleted from the underlying object store).
Would this mean not using zero-copy? Or what does that use case look like?
u
If you upload the object to a bucket anyway, that step isn't considered zero-copy, so you could upload it directly using lakeFS, and every checkout using lakeFS is zero-copy
u
If the bucket is not under your ownership and you don't want to pay the storage costs, then yeah, you should import, but then you don't really upload or delete any objects
u
I see. I think that's where my confusion has been. We want to store on the bucket and we manage the bucket. I thought upload would store the data in the LakeFS server's database
u
What would the
lakectl fs upload
command and/or python API look like for using an s3 bucket?
u
lakeFS has different methods of communication, you can use the API, SDKs or S3 compatible endpoint, it's agnostic of the object store behind the scenes (S3, Google Cloud, Azure or MinIO)
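In case a concrete sketch helps with the earlier `lakectl fs upload` / Python API question: a minimal upload through the Python SDK might look roughly like this, assuming the `lakefs-client` pip package and its objects API. The endpoint, credentials, branch name, and object path below are placeholders, not values confirmed in this thread.

```python
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration()
configuration.host = "https://lakefs.example.com/api/v1"  # check whether your SDK version expects the /api/v1 suffix
configuration.username = "AKIA..."   # lakeFS access key ID (placeholder)
configuration.password = "..."       # lakeFS secret access key (placeholder)
client = LakeFSClient(configuration)

# Upload a local file into a branch; lakeFS writes the data to the bucket
# backing the repository and tracks the object in its metadata.
with open("/home/conor/cat.jpg", "rb") as f:
    client.objects.upload_object(
        repository="demo", branch="main", path="images/cat.jpg", content=f
    )
```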
u
See the CLI documentation, you can use the API or SDKs if you like too
u
And if you're interested in importing large data sets, the import command is the tool for you.
u
You can read more here on the import process and its limitations
u
So I think I'd be able to upload data from local storage, and it writes to S3 and uses LakeFS for versioning, right?
u
What do you mean local storage?
u
If you launch lakeFS with local storage, then the file system is your "object store". It's mainly for experimental use
u
I mean if I have

file://home/conor/cat.jpg

I can lakectl fs upload that, and it goes through S3? When I tried lakectl upload before I don't think I was able to use an S3 path? Or I was confused by the docs
u
See this guide on how to deploy lakeFS using S3 object store
u
I have it set up already, I can make repos with S3 buckets
u
Yes, if you configured lakeFS with S3 and upload an object from your local system using the fs upload command, it'll be uploaded to S3
u
Ok great I will test this. What about a
git checkout
equivalent? Say I want my local files and S3 bucket to check out a different commit
u
So basically in lakeFS there's no "checkout" process; all branches are accessible using the path. If you're referring to changing a branch to point to a different commit, that's possible using the branch revert command
u
Where you pass the commit reference
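If you end up doing that from Python rather than `lakectl`, a heavily hedged sketch follows; I haven't verified the exact signature against a specific SDK version, `client` is the LakeFSClient from the upload sketch above, and the commit ID is a placeholder.

```python
# Assumption: the branches API exposes revert_branch with a RevertCreation body
# (ref = the commit reference mentioned above; parent_number is only meaningful
# for merge commits). Double-check your lakeFS / lakefs-client version and the
# revert semantics before relying on this.
from lakefs_client.models import RevertCreation

client.branches.revert_branch(
    repository="demo",
    branch="main",
    revert_creation=RevertCreation(ref="<commit-id>", parent_number=0),
)
```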
u
I've seen that as well but it seems like that's only intended for large errors in the data
u
What do you mean by large errors in the data?
u
Isn't branch revert intended for rollback, which is described as this in the docs
A rollback operation is used to fix critical data errors immediately.
u
Revert is used to point a branch to a different commit, which is what you're looking for if I understand correctly. If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just like all branches are accessible, commits/refs and tags are accessible as well
u
Changing a branch ref with a large data operation is usually done when there are data issues; this is why it's mentioned that way in the docs
u
This might be a better way to phrase my potential use case as well, to put it in ML context. The goal is reproducibility. If I train a model on LakeFS commit x, and then have commits y, z, etc., and let's say I want to reproduce when I trained with commit x, what's the best way to do that? It seems like I want this:
If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just like all branches are accessible, commits/refs and tags are accessible as well
u
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
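For the reproducibility angle, a rough sketch of what "point your model at a ref or tag" could look like with the Python SDK (same `client` as above; the tag name, commit ID, and object path are placeholders, and the tag step is optional since a raw commit ID works as a ref too):

```python
from lakefs_client.models import TagCreation

# One-time: record the commit you trained on as a human-readable tag.
client.tags.create_tag(
    repository="demo",
    tag_creation=TagCreation(id="train-run-1", ref="<commit-x-id>"),
)

# Later: read the data exactly as it was at that commit; no revert/checkout needed.
obj = client.objects.get_object(
    repository="demo", ref="train-run-1", path="images/cat.jpg"
)
with open("cat.jpg", "wb") as out:
    out.write(obj.read())
```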
u
I think it'll make it clearer
u
Thanks. To do this
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
Is it a lakefs download?
u
Ideally, I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3
u
It's relevant to all listing/getting object commands; the only exception is upload, since commits are immutable
u
if you wish to read directly from S3 but leverage lakeFS mapping and versioning capabilities, you can too. That's the S3 gateway :)
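If it helps, reading through the S3 gateway from Python could look roughly like this with boto3: the repository shows up as the bucket, and the ref (branch, tag, or commit ID) is the first element of the key. The endpoint, keys, and prefix below are placeholders.

```python
import boto3

# Point a standard S3 client at the lakeFS S3 gateway; lakeFS resolves the ref
# ("main" here, but a commit ID or tag works too) to the right object versions.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # your lakeFS endpoint (placeholder)
    aws_access_key_id="AKIA...",                 # lakeFS access key ID (placeholder)
    aws_secret_access_key="...",                 # lakeFS secret access key (placeholder)
)

resp = s3.list_objects_v2(Bucket="demo", Prefix="main/images/")
for item in resp.get("Contents", []):
    print(item["Key"], item["Size"])
```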
u
If you use the API, lakeFS downloads the object for you (so traffic goes through lakeFS)
u
Ultimately I just want to stream data over the internet, but I think other software dependencies would require an S3 path
u
For the downloading to local storage use case, would that be python
ObjectsAPI
? Can I download recursively from a folder in lakefs?
u
Note that you can also use the objects API presigned URLs in order to generate S3 URLs. You can also use the objects API to list objects under a specific path and eventually get these objects
u
Once you list the objects you can iterate over them and download
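A rough sketch of that list-and-download loop with the Python SDK (same `client` as above; the pagination field names follow the lakeFS API as I understand it, and the repository, ref, prefix, and destination directory are placeholders — the presigned-URL route mentioned above isn't shown here):

```python
import os

def download_prefix(client, repo, ref, prefix, dest_dir):
    """Download every object under `prefix` at the given ref into dest_dir."""
    after = ""
    while True:
        page = client.objects.list_objects(
            repository=repo, ref=ref, prefix=prefix, after=after, amount=1000
        )
        for stat in page.results:
            local_path = os.path.join(dest_dir, stat.path)
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            obj = client.objects.get_object(repository=repo, ref=ref, path=stat.path)
            with open(local_path, "wb") as out:
                out.write(obj.read())
        if not page.pagination.has_more:
            break
        after = page.pagination.next_offset

download_prefix(client, "demo", "train-run-1", "images/", "./data")
```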
u
Ok let's say I want to mirror an old commit on local storage. If 99% of the objects are unchanged, would I still need to download 100% or just 1%?
u
Please explain what you mean by mirroring an old commit on local storage; up until now we discussed lakeFS running against S3
u
Creating a branch, regardless of the source branch, is a zero-copy operation, so no files will be copied. If you change 1% of the files, then only that 1% of files will be written, using copy-on-write. Now, you wish to "sync" this branch locally? Is this the use case?
u
Yes, because I may have multiple machines training via local storage. I need a
git pull
equivalent for a whole branch and not individual objects
u
I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3
To satisfy the first option here
u
Or directly from S3 is also there :) You can use native Linux rsync/rclone commands if you like; I think the better approach would be to work directly with S3, to be honest
u
I just use the term mirror as a
git checkout && git pull
equivalent
u
Due to the fact datasets might be very big, I'm not sure that's the common use case, as it can take a very long time to download
u
The reason it may be undesirable to work directly with S3 is because the network bottleneck is much worse than local reads when loading data for training
u
If they're small, rsync should do the trick anyway
u
Network transfer will happen anyway, as you'll need to "git pull"/rsync before running your model
u
Reading from S3 would be intended for larger datasets when the download cost outweighs the slower data loading
u
Yeah it's a large/small dataset tradeoff. One time download cost versus continuous data loading cost during a training job
u
Argh, sorry, that's the other way around
u
But it doesn't really matter, you should just change the source / destination
u
See here the rsync reference
u
Thanks rsync looks good
u
Can airbyte sync locally too?
u
Do you have some recommendation between the 2?
u
Yes, but I think it depends on the data type (CSV and JSON are supported, if I remember correctly)
u
Ah, I'm working with images 😅
u
And also JSON
u
Not really, it's a matter of preference. If that's the only use case, I'd stick with rsync as it covers everything you need. Airbyte has a lot of other integrations
u
but I think the onboarding experience will take more time and it might not support your use case
u
Thank you. I think hopefully my last question:
If they're small, rsync should do the trick anyway
Do you happen to have an idea of what small means here quantitatively?
u
I don't have any number to give on this, as it depends on the effectiveness of the model and the number of objects downloaded. I suggest testing both methods and picking the one which is faster or more reliable for your needs
u
Ok, thank you! I appreciate your help
u
Sure :) happy to help! Feel free to reach out with further questions
u
@Or Tzabary I'm not sure if you can help me with this issue. I was able to use rclone sync between my S3 bucket (actually Backblaze) and local storage. When I try to set it up with lakeFS following the docs, it seems to start out well...
rclone lsd lakefs:
yields
-1 2023-01-27 12:36:40        -1 demo
which is my only lakefs repository right now. However, even when trying
rclone ls lakefs:
I get
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": dial tcp: lookup demo.lakefs.example.com on 127.0.0.53:53: no such host
with similar errors for rclone sync, etc. Any idea on this? Could it be related to using Backblaze?
u
Hi @Conor Simmons, let me try and find an idea from a member of the team. Not sure we have something immediate, but will look into this.
u
Thank you!
u
You bet
u
@Conor Simmons is demo.lakefs.example.com really your lakeFS host? It might indicate a wrong rclone configuration, where you need to set lakefs to your lakeFS endpoint.
u
The
demo
part is the name of the repository (you can see in the
ls
command). It seems to add that into the hostname. I substituted "example" in the actual endpoint name. The config looks like this
[lakefs]
type = s3
provider = AWS
env_auth = false
access_key_id = xxxxxxxxxxxxxxxxxx
secret_access_key = xxxxxxxxxxxxxxxxx
endpoint = https://lakefs.example.com
no_check_bucket = true
u
While lakeFS should work with that endpoint URL, it seems that the DNS is not configured to work with it, based on the error. IIRC, you mentioned this is running locally, right?
u
If so, make sure to add demo.lakefs.example.com to your hosts file, pointing to 127.0.0.1. If it's not running locally, you should point that endpoint to your server
u
It's running on a server but thanks, I'll share this info
u
Let me know how it went
u
We are running this on a server. We've set up a reverse proxy with nginx using certificates from CloudFlare, but only for the subdomain lakefs.example.com, so that we can serve over https. Is there a particular reason that lookup is being made to repo_name.lakefs.example.com, and is there some other way to handle that instead of through sub-subdomains?
u
@Matija Teršek Hi, just reading through the thread real quick and I'll try to assist
u
Hi, I would guess that you're using the S3 gateway to access lakeFS using the S3 protocol. This is host-style addressing in the S3 protocol; you'll want to change the client accessing lakeFS to use path-style addressing. We might be able to help with that; which client are you using? Alternatively, of course, if you can configure CloudFlare to also serve the wildcard DNS *.lakefs.example.com, using a similar wildcard certificate, you will be able to solve it on the server side.
u
You can read about the two styles in the AWS documentation; for instance, this is a good summary. It's one really annoying aspect of the S3 protocol that makes it hard to configure a server. Amusingly, AWS has also been trying to remove path-style addressing for many years now, so far with little success.
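For completeness, if anyone hits the same thing from Python against the gateway (rather than rclone), the boto3 equivalent of forcing path-style is a client config option; the endpoint and keys below are placeholders.

```python
import boto3
from botocore.config import Config

# Path-style addressing puts the repository in the URL path
# (https://lakefs.example.com/demo/...) instead of the hostname
# (https://demo.lakefs.example.com/...), so no wildcard DNS/certificate is needed.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
    config=Config(s3={"addressing_style": "path"}),
)
```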
u
To add to what @Ariel Shaqed (Scolnicov) correctly pointed out, I believe the specific flag you're looking for in
rclone
is
--s3-force-path-style
u
Also available as a config option in Rclone. I think you may have missed this integration guide for Rclone.
u
Just to add a couple of notes: 1. I tried
rclone ls --s3-force-path-style lakefs:
and got the same error 2. Our rclone config is based exactly on that integration guide
u
Ouch! Sorry to bug you, but could you try rclone with the -vv flag and attach all output, please?
u
<7>DEBUG : rclone: Version "v1.61.1" starting with parameters ["rclone" "ls" "-vv" "lakefs:"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "lakefs:"
<7>DEBUG : Using config file from "/home/conor/.config/rclone/rclone.conf"
<7>DEBUG : 3 go routines active
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": x509: certificate is valid for lakefs.example.com, not demo.lakefs.example.com
Not much helpful info imo 😅 But note the error is a bit different since @Matija Teršek changed something on the server side
u
I've found this issue on the rclone forums. As far as I recall, you said you aren't using AWS, right? Maybe this will be relevant
u
It's an S3 store using Backblaze. Here's the config (provider is set to AWS): https://lakefs.slack.com/archives/C02CV7MUV4G/p1675017392529639?thread_ts=1674838530.053799&cid=C02CV7MUV4G
u
If I follow this guide, I can access the backblaze store fine
u
Plus, it seems like the config is working somewhat since
rclone lsd lakefs:
yields
-1 2023-01-27 14:42:27        -1 demo
which is our 1 lakefs repo right now
u
Would you mind trying the suggestion in the post I linked to? They suggest using
provider = Other
in place of
provider = AWS
I want to rule that out before we look elsewhere
u
Yeah, listing buckets doesn't perform host-style addressing. That's one of my favourite s3 weirdnesses.
u
Also, make sure to add
--s3-force-path-style=true
, that's still needed even with
provider = Other
u
@Ariel Shaqed (Scolnicov) @Elad Lachmi it works! Thanks a ton!
u
@Conor Simmons Awesome! Glad we were able to assist. If you have any further questions, feel free to reach out. Happy lakeFS`ing 😎
u
Thanks! Looking forward to using lakefs