Conor Simmons

01/27/2023, 4:55 PM
Hey! I have a couple of questions:
1. Does the export functionality work for zero-copy repositories?
2. Is there any way to export via a Python API? I was looking at the spark-submit pip package
I might also be missing the best way to implement what I'm trying to do. Essentially, I want to do a sort of
git checkout
on an S3 bucket with zero-copy

Or Tzabary

01/27/2023, 4:58 PM
Hey @Conor Simmons 👋 Let me check that for you
👋 1
👍 1
1. Export should work for zero-copy repositories, have you experienced any issue?
2. We do have a Spark example in our export documentation; unfortunately, I don't think we support the export command using our Python SDK.
I'm not sure I follow the use case you mentioned: would you like to export data out of lakeFS to be managed as plain objects (with zero copy), or would you like to manage an existing S3 bucket within lakeFS with zero-copy?
Mind sharing more information?

Conor Simmons

01/27/2023, 5:42 PM
or would you like to manage an existing S3 bucket within lakeFS with zero-copy
I think it would be this. Here's an example:
commit 1:
• add hello_world.txt to s3://my/bucket/path
• commit to lakeFS with import from the S3 bucket
commit 2:
• delete hello_world.txt from s3://my/bucket/path
• add hello_world_2.txt to s3://my/bucket/path
• commit to lakeFS with import from the S3 bucket
Now, in my S3 bucket, I want to "checkout" commit 1, so I should see hello_world.txt instead of hello_world_2.txt in my S3 bucket

Or Tzabary

01/27/2023, 5:55 PM
Thank you for the detailed explanation. One question to clarify: when you add or delete objects, you do this directly against the S3 bucket, right?
If that's the case, it's not recommended, as lakeFS can't guarantee the files will be reachable in the underlying storage. If you let lakeFS manage the files, I suggest not interacting directly with the object store, or otherwise accepting the risk of files not being found in the underlying storage. If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then when you check out commit 1 the files will be accessible (as long as they weren't deleted from the underlying object store). Does this answer your question?

Conor Simmons

01/27/2023, 6:09 PM
you do this directly against the S3 bucket, right?
Yes, isn't this the only way to do zero-copy?
If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then when you check out commit 1 the files will be accessible (as long as they weren't deleted from the underlying object store).
Would this mean not using zero-copy? Or what would that use case look like?

Or Tzabary

01/27/2023, 6:10 PM
If you upload the object to a bucket anyway, that step isn't considered zero-copy, so you could just as well upload it directly through lakeFS, and every checkout using lakeFS is zero-copy
If the bucket is not under your ownership and you don't want to pay the storage costs, then yeah, you should import, but then you don't really upload or delete any objects

Conor Simmons

01/27/2023, 6:12 PM
I see. I think that's where my confusion has been. We want to store the data in the bucket, and we manage the bucket. I thought upload would store it in the lakeFS server's database
What would the
lakectl fs upload
command and/or Python API look like when using an S3 bucket?

Or Tzabary

01/27/2023, 6:16 PM
lakeFS has different methods of communication: you can use the API, the SDKs, or the S3-compatible endpoint. It's agnostic of the object store behind the scenes (S3, Google Cloud, Azure or MinIO)
See the CLI documentation; you can use the API or SDKs too if you like
And if you're interested in importing large data sets, import is the tool for you.
You can read more here on the import process and its limitations
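For illustration, a rough sketch of an upload through the Python SDK (assuming the lakefs_client package and placeholder endpoint, credentials, repository, and paths; the equivalent lakectl fs upload takes a lakefs://<repo>/<branch>/<path> destination URI):

import lakefs_client
from lakefs_client.client import LakeFSClient

# Placeholder endpoint and credentials -- replace with your lakeFS server and access keys
configuration = lakefs_client.Configuration()
configuration.host = "https://lakefs.example.com"
configuration.username = "<lakeFS access key id>"
configuration.password = "<lakeFS secret access key>"
client = LakeFSClient(configuration)

# Upload a local file; lakeFS stores the data in the S3 bucket backing the repository
with open("/home/conor/cat.jpg", "rb") as f:
    client.objects.upload_object(repository="demo", branch="main", path="images/cat.jpg", content=f)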

Conor Simmons

01/27/2023, 6:20 PM
So I think I should be able to upload data from local storage, and it writes to S3 and uses lakeFS for versioning, right?

Or Tzabary

01/27/2023, 6:20 PM
What do you mean by local storage?
If you launch lakeFS with local storage, then the file system is your "object store". It's mainly for experimental use

Conor Simmons

01/27/2023, 6:21 PM
I mean if I have

file://home/conor/cat.jpg

I can lakectl fs upload that, and it goes through S3? When I tried lakectl upload before, I don't think I was able to use an S3 path? Or maybe I was confused by the docs

Or Tzabary

01/27/2023, 6:22 PM
See this guide on how to deploy lakeFS using S3 object store

Conor Simmons

01/27/2023, 6:22 PM
I have it set up already, I can make repos with S3 buckets

Or Tzabary

01/27/2023, 6:23 PM
Yes, if you configured lakeFS with S3 and upload an object from your local system using the fs upload command, it'll be uploaded to S3
👍 1

Conor Simmons

01/27/2023, 6:24 PM
Ok great I will test this. What about a
git checkout
equivalent? Say I want my local files and S3 bucket to checkout a different commit

Or Tzabary

01/27/2023, 6:27 PM
So basically in lakeFS there's no "checkout" process; all branches are accessible using the path. If you mean changing a branch to point to a different commit, that's possible using the branch revert command
where you pass the commit reference

Conor Simmons

01/27/2023, 6:28 PM
I've seen that as well but it seems like that's only intended for large errors in the data

Or Tzabary

01/27/2023, 6:28 PM
What do you mean by large errors in the data?

Conor Simmons

01/27/2023, 6:29 PM
Isn't branch revert intended for rollback, which is described like this in the docs:
A rollback operation is used to fix critical data errors immediately.

Or Tzabary

01/27/2023, 6:31 PM
Revert is used to point a branch to a different commit, which is what you're looking for if I understand correctly. If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just as all branches are accessible, commits/refs and tags are accessible as well
Changing a branch ref in a large data operation is usually done when there are data issues, which is why it's mentioned in the docs
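To make that concrete, a tiny sketch with the Python SDK client from the earlier example (the commit ID and path are placeholders): reading an object as it existed at an older commit or tag, without moving the branch.

# 'ref' can be a branch name, a tag, or a commit ID -- the branch itself is not modified
obj = client.objects.get_object(repository="demo", ref="a1b2c3d4e5", path="hello_world.txt")
data = obj.read()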

Conor Simmons

01/27/2023, 6:33 PM
This might be a better way to phrase my potential use case as well, to put it in an ML context. The goal is reproducibility. If I train a model on lakeFS commit x, then have commits y, z, etc., and later want to reproduce the training run that used commit x, what's the best way to do that? It seems like I want this:
If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just as all branches are accessible, commits/refs and tags are accessible as well

Or Tzabary

01/27/2023, 6:34 PM
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
I think it'll make it clearer

Conor Simmons

01/27/2023, 6:36 PM
Thanks. To do this
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
Is it a lakefs download?
Ideally, I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3

Or Tzabary

01/27/2023, 6:38 PM
It's relevant to all listing/getting-objects commands; the only exception is upload, as commits are immutable
If you wish to read directly from S3 but leverage lakeFS mapping and versioning capabilities, you can do that too. That's the S3 gateway :)
👍 1
If you use the API, lakeFS downloads the object for you (so traffic goes through lakeFS)
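As an illustration of the S3-gateway route (a sketch only, with placeholder endpoint and credentials): a standard S3 client such as boto3 can be pointed at the lakeFS server, where the repository acts as the bucket and the key starts with the ref.

import boto3
from botocore.client import Config

# Point a standard S3 client at the lakeFS S3 gateway (placeholder endpoint/credentials)
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakeFS access key id>",
    aws_secret_access_key="<lakeFS secret access key>",
    config=Config(s3={"addressing_style": "path"}),  # path-style avoids per-repository subdomains
)

# Bucket = repository, key = "<ref>/<path>" where ref is a branch, tag, or commit ID
body = s3.get_object(Bucket="demo", Key="main/images/cat.jpg")["Body"].read()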

Conor Simmons

01/27/2023, 6:41 PM
Ultimately I just want to stream data over the internet, but I think other software dependencies would require an S3 path
For the downloading-to-local-storage use case, would that be the Python
ObjectsAPI
? Can I download recursively from a folder in lakefs?

Or Tzabary

01/27/2023, 6:46 PM
Note that you can also use the objects API presigned URLs in order to generate S3 URLs. You can also use the objects API to list objects under a specific path and then get those objects
Once you've listed the objects, you can iterate over them and download them
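Roughly, a sketch of that loop with the Python SDK client from the earlier examples (the repository, ref, prefix, and destination directory are made up):

import os

repo, ref, prefix, dest = "demo", "main", "datasets/train/", "/tmp/train"

# List everything under the prefix at the given ref
# (for more than 1000 objects, keep paging with after=listing.pagination.next_offset while listing.pagination.has_more)
listing = client.objects.list_objects(repository=repo, ref=ref, prefix=prefix, amount=1000)
for stat in listing.results:
    target = os.path.join(dest, os.path.relpath(stat.path, prefix))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    # get_object returns a file-like object; write it to local disk
    obj = client.objects.get_object(repository=repo, ref=ref, path=stat.path)
    with open(target, "wb") as out:
        out.write(obj.read())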

Conor Simmons

01/27/2023, 6:49 PM
Ok, let's say I want to mirror an old commit on local storage. If 99% of the objects are unchanged, would I still need to download 100% or just 1%?

Or Tzabary

01/27/2023, 6:52 PM
Please explain what you mean by mirroring on local storage; up until now we discussed lakeFS running against S3
Creating a branch, regardless of the source branch, is a zero-copy operation, so no files will be copied. If you change 1% of the files, then only that 1% of files will be written, using copy-on-write. Now, you wish to "sync" this branch locally? Is this the use case?

Conor Simmons

01/27/2023, 6:59 PM
Yes, because I may have multiple machines training via local storage. I need a
git pull
equivalent for a whole branch and not individual objects
I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3
To satisfy the first option here

Or Tzabary

01/27/2023, 7:01 PM
Or reading directly from S3 is also an option there :) You can use native Linux rsync/rclone commands if you like; to be honest, I think the better approach would be to work directly with S3
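For completeness, the "pull only what changed" idea can also be sketched on top of the listing/get calls shown earlier. One assumption here is that the checksum lakeFS reports matches an MD5 of the content, which may not hold for multipart uploads:

import hashlib
import os

def pull(client, repo, ref, prefix, dest):
    """Download objects under prefix at ref, skipping files whose local copy already matches."""
    after = ""
    while True:
        page = client.objects.list_objects(repository=repo, ref=ref, prefix=prefix, after=after, amount=1000)
        for stat in page.results:
            target = os.path.join(dest, os.path.relpath(stat.path, prefix))
            if os.path.exists(target):
                with open(target, "rb") as f:
                    if hashlib.md5(f.read()).hexdigest() == stat.checksum:
                        continue  # unchanged since the last pull, skip the download
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as out:
                out.write(client.objects.get_object(repository=repo, ref=ref, path=stat.path).read())
        if not page.pagination.has_more:
            break
        after = page.pagination.next_offset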

Conor Simmons

01/27/2023, 7:01 PM
I just use the term mirror as a
git checkout && git pull
equivalent

Or Tzabary

01/27/2023, 7:02 PM
Since datasets might be very big, I'm not sure that's the common use case, as downloading can take a very long time

Conor Simmons

01/27/2023, 7:02 PM
The reason it may be undesirable to work directly with S3 is that the network bottleneck is much worse than local reads when loading data for training

Or Tzabary

01/27/2023, 7:02 PM
If they're small, rsync should do the trick anyway
The network will be involved anyway, as you'll need to "git pull"/rsync before running your model
👍 1

Conor Simmons

01/27/2023, 7:03 PM
Reading from S3 would be intended for larger datasets when the download cost outweighs the slower data loading
Yeah it's a large/small dataset tradeoff. One time download cost versus continuous data loading cost during a training job

Or Tzabary

Argh, sorry, that's the other way around
But it doesn't really matter, you should just change the source / destination
See the rsync reference here

Conor Simmons

01/27/2023, 7:11 PM
Thanks, rsync looks good
Can Airbyte sync locally too?
Do you have a recommendation between the two?

Or Tzabary

01/27/2023, 7:13 PM
Yes, but I think it depends on the data type (CSV and JSON are supported, if I remember correctly)

Conor Simmons

01/27/2023, 7:14 PM
Ah, I'm working with images 😅
And also JSON

Or Tzabary

01/27/2023, 7:14 PM
Not really, it's a matter of preference. If that's the only use case, I'd stick with rsync, as it covers everything you need. Airbyte has a lot of other integrations,
but I think the onboarding experience will take more time and it might not support your use case

Conor Simmons

01/27/2023, 7:15 PM
Thank you. Hopefully this is my last question:
If they're small, rsync should do the trick anyway
Do you happen to have an idea of what small means here quantitatively?

Or Tzabary

01/27/2023, 7:18 PM
I don't have a number to give on this, as it depends on the effectiveness of the model and the number of objects downloaded. I suggest testing both methods and picking the one that is faster or more reliable for your needs

Conor Simmons

01/27/2023, 7:19 PM
Ok, thank you! I appreciate your help

Or Tzabary

01/27/2023, 7:19 PM
Sure :) happy to help! Feel free to reach out with further questions
💯 1

Conor Simmons

01/27/2023, 9:24 PM
@Or Tzabary I'm not sure if you can help me with this issue. I was able to use rclone sync between my S3 bucket (actually Backblaze) and local storage. When I try to set it up with lakeFS following the docs, it seems to start out well...
rclone lsd lakefs:
yields
-1 2023-01-27 12:36:40        -1 demo
which is my only lakefs repository right now. However, even when trying
rclone ls lakefs:
I get
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": dial tcp: lookup demo.lakefs.example.com on 127.0.0.53:53: no such host
with similar errors for rclone sync, etc. Any idea on this? Could it be related to using Backblaze?

Iddo Avneri

01/27/2023, 9:31 PM
Hi @Conor Simmons, let me try to find an idea from a member of the team. Not sure we have something immediate, but we'll look into this.
👍 1

Conor Simmons

01/27/2023, 9:32 PM
Thank you!

Iddo Avneri

01/27/2023, 9:43 PM
You bet

Or Tzabary

01/29/2023, 10:04 AM
@Conor Simmons is demo.lakefs.example.com really your lakeFS host? It might indicate a wrong rclone configuration, where you need to set the lakefs remote to your lakeFS endpoint.

Conor Simmons

01/29/2023, 6:36 PM
The
demo
part is the name of the repository (you can see it in the
ls
command). It seems to add that to the host name. I substituted "example" for the actual endpoint name. The config looks like this:
[lakefs]
type = s3
provider = AWS
env_auth = false
access_key_id = xxxxxxxxxxxxxxxxxx
secret_access_key = xxxxxxxxxxxxxxxxx
endpoint = https://lakefs.example.com
no_check_bucket = true

Or Tzabary

01/29/2023, 6:39 PM
While lakeFS should work with that endpoint URL, it seems that the DNS is not configured to work with it, judging by the error. IIRC, you mentioned this is running locally, right?
If so, make sure to add demo.lakefs.example.com to your hosts file pointing to 127.0.0.1. If it's not running locally, you should point that endpoint to your server

Conor Simmons

01/29/2023, 6:41 PM
It's running on a server but thanks, I'll share this info
🙏 1

Or Tzabary

01/29/2023, 6:42 PM
Let me know how it went
👍 1

Matija Teršek

01/30/2023, 5:59 PM
We are running this on a server. We've set up a reverse proxy with nginx using certificates from Cloudflare, but only for the subdomain lakefs.example.com, so that we can serve over HTTPS. Is there a particular reason that the lookup is being made to repo_name.lakefs.example.com, and is there some other way to handle that instead of through sub-subdomains?

Elad Lachmi

01/30/2023, 6:16 PM
Hi @Matija Teršek, just reading through the thread real quick and I'll try to assist

Ariel Shaqed (Scolnicov)

01/30/2023, 6:29 PM
Hi, I would guess that you're using the S3 gateway to access lakeFS using the S3 protocol. This is host-style addressing in the S3 protocol; you'll want to change the client accessing lakeFS to use path-style addressing. We might be able to help with that; which client are you using? Alternatively, of course, if you can configure Cloudflare to also serve the wildcard DNS *.lakefs.example.com, using a similar wildcard certificate, you will be able to solve it on the server side.
You can read about the two styles in the AWS documentation; for instance, this is a good summary. It's one really annoying aspect of the S3 protocol that makes it hard to configure a server. Amusingly, AWS have also been trying to remove host-based addressing for many years now. So far with little success.
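For readers following along, the two styles for a repository called demo behind lakefs.example.com look roughly like this (illustrative URLs only; the repository is treated as the S3 bucket and the key starts with the ref):

host-style: https://demo.lakefs.example.com/main/images/cat.jpg
path-style: https://lakefs.example.com/demo/main/images/cat.jpg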

Elad Lachmi

01/30/2023, 6:39 PM
To add to what @Ariel Shaqed (Scolnicov) correctly pointed out, I believe the specific flag you're looking for in
rclone
is
--s3-force-path-style

Ariel Shaqed (Scolnicov)

01/30/2023, 6:43 PM
Also available as a config option in Rclone. I think you may have missed this integration guide for Rclone.

Conor Simmons

01/30/2023, 6:45 PM
Just to add a couple of notes: 1. I tried
rclone ls --s3-force-path-style lakefs:
and got the same error
2. Our rclone config is based exactly on that integration guide

Ariel Shaqed (Scolnicov)

01/30/2023, 6:48 PM
Ouch! Sorry to bug you, but could you try rclone with the -vv flag and attach all output, please?

Conor Simmons

01/30/2023, 6:50 PM
<7>DEBUG : rclone: Version "v1.61.1" starting with parameters ["rclone" "ls" "-vv" "lakefs:"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "lakefs:"
<7>DEBUG : Using config file from "/home/conor/.config/rclone/rclone.conf"
<7>DEBUG : 3 go routines active
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": x509: certificate is valid for lakefs.example.com, not demo.lakefs.example.com
Not much helpful info imo 😅 But note the error is a bit different since @Matija Teršek changed something on the server side

Elad Lachmi

01/30/2023, 6:52 PM
I've found this issue on the rclone forums. As far as I recall, you said you aren't using AWS, right? Maybe this will be relevant

Conor Simmons

01/30/2023, 6:53 PM
It's an S3 store using Backblaze. Here's the config (provider is set to AWS): https://lakefs.slack.com/archives/C02CV7MUV4G/p1675017392529639?thread_ts=1674838530.053799&cid=C02CV7MUV4G
If I follow this guide, I can access the Backblaze store fine
Plus, it seems like the config is working somewhat since
rclone lsd lakefs:
yields
-1 2023-01-27 14:42:27        -1 demo
which is our one lakeFS repo right now

Elad Lachmi

01/30/2023, 6:57 PM
Would you mind trying the suggestion in the post I linked to? They suggest using
provider = Other
in place of
provider = AWS
I want to rule that out before we look elsewhere
👍🏼 1

Ariel Shaqed (Scolnicov)

01/30/2023, 6:58 PM
Yeah, listing buckets doesn't perform host-style addressing. That's one of my favourite s3 weirdnesses.

Elad Lachmi

01/30/2023, 6:59 PM
Also, make sure to add
--s3-force-path-style=true
, that's still needed even with
provider = Other
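Putting the thread's two suggestions together, the remote would presumably end up looking like the config below. force_path_style is a guess at the config-file equivalent of the --s3-force-path-style flag (Ariel noted the setting is also available as a config option); the rest is the config shared earlier:

[lakefs]
type = s3
provider = Other
env_auth = false
access_key_id = xxxxxxxxxxxxxxxxxx
secret_access_key = xxxxxxxxxxxxxxxxx
endpoint = https://lakefs.example.com
no_check_bucket = true
force_path_style = true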

Conor Simmons

01/30/2023, 7:13 PM
@Ariel Shaqed (Scolnicov) @Elad Lachmi it works! Thanks a ton!
🤩 1

Elad Lachmi

01/30/2023, 7:15 PM
@Conor Simmons Awesome! Glad we were able to assist. If you have any further questions, feel free to reach out. Happy lakeFS'ing :sunglasses_lakefs:
💯 1
:lakefs: 1

Conor Simmons

01/30/2023, 7:17 PM
Thanks! Looking forward to using :lakefs: