# help
u
Hey! I have a couple questions: 1. Does the export functionality work for zero-copy repositories? 2. Is there any way to export via a Python API? I was looking at the spark-submit pip package
u
I might also be missing the best way to implement what I'm trying to do. Essentially, I want to do a sort of
git checkout
on an S3 bucket with zero-copy
u
Hey @Conor Simmons 👋 Let me check that for you
u
1. export should work for zero-copy repositories, have you experienced any issue? 2. we do have a Spark example in our export documentation; unfortunately, I don't think we support the export command using our Python SDK. I'm not sure I follow the use-case you mentioned: you'd like to export data out of lakeFS to be managed as plain objects (with zero-copy), or you'd like to manage an existing S3 bucket within lakeFS with zero-copy?
u
mind sharing more information?
u
or you'd like to manage an existing S3 bucket within lakeFS with zero-copy
I think it would be this. Here's an example:
commit 1:
• add hello_world.txt to s3://my/bucket/path
• commit to LakeFS with import from s3 bucket
commit 2:
• delete hello_world.txt from s3://my/bucket/path
• add hello_world_2.txt to s3://my/bucket/path
• commit to LakeFS with import from s3 bucket
Now, in my s3 bucket, I want to "checkout" commit 1, so I should see hello_world.txt instead of hello_world_2.txt in my s3 bucket
u
Thank you for the detailed explanation. One question to clarify: when you add or delete objects, you do this directly against the S3 bucket, right?
u
If that's the case, it's not recommended, as lakeFS can't guarantee the files will be reachable in the underlying storage. If you let lakeFS manage the files, I suggest not interacting directly with the object store, or accepting the risk of files not being found in the underlying storage. If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then once you check out commit 1, the files will be accessible (as long as they weren't deleted from the underlying object store). Does this answer your question?
u
you do this directly against the S3 bucket, right?
Yes, isn't this the only way to do zero-copy?
If you upload/delete the files using lakeFS and let lakeFS manage the underlying storage, then once you check out commit 1, the files will be accessible (as long as they weren't deleted from the underlying object store).
Would this mean not using zero-copy? Or what does that use case look like?
u
If you upload the object to a bucket anyway, that step isn't considered zero-copy, so you could upload it directly using lakeFS, and every checkout using lakeFS is zero-copy
u
If the bucket is not under your ownership and you don't want to pay the storage costs, then yeah, you should import, but then you don't really upload or delete any objects
u
I see. I think that's where my confusion has been. We want to store on the bucket and we manage the bucket. I thought upload would store the data in the LakeFS server's database
u
What would the
lakectl fs upload
command and/or python API look like for using an s3 bucket?
u
lakeFS has different methods of communication, you can use the API, SDKs or S3 compatible endpoint, it's agnostic of the object store behind the scenes (S3, Google Cloud, Azure or MinIO)
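In case a concrete sketch helps with the earlier `lakectl fs upload` / Python API question: a minimal upload through the Python SDK might look roughly like this, assuming the `lakefs-client` pip package and its objects API. The endpoint, credentials, branch name, and object path below are placeholders, not values confirmed in this thread.

```python
import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration()
configuration.host = "https://lakefs.example.com/api/v1"  # check whether your SDK version expects the /api/v1 suffix
configuration.username = "AKIA..."   # lakeFS access key ID (placeholder)
configuration.password = "..."       # lakeFS secret access key (placeholder)
client = LakeFSClient(configuration)

# Upload a local file into a branch; lakeFS writes the data to the bucket
# backing the repository and tracks the object in its metadata.
with open("/home/conor/cat.jpg", "rb") as f:
    client.objects.upload_object(
        repository="demo", branch="main", path="images/cat.jpg", content=f
    )
```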
u
See the CLI documentation, you can use the API or SDKs if you like too
u
And if you're interested in importing large data sets, the import command is the tool for you.
u
You can read more here on the import process and its limitations
u
So I think I'd be able to upload data from local storage, and it writes to S3 and uses LakeFS for versioning, right?
u
What do you mean local storage?
u
If you launch lakeFS with local storage, then the file system is your "object store". It's mainly for experimental use
u
I mean if I have

file://home/conor/cat.jpg

I can lakectl fs upload that, and it goes through S3? When I tried lakectl upload before I don't think I was able to use an S3 path? Or I was confused by the docs
u
See this guide on how to deploy lakeFS using S3 object store
u
I have it set up already, I can make repos with S3 buckets
u
Yes, if you configured lakeFS with S3 and upload an object from your local system using the fs upload command, it'll be uploaded to S3
u
Ok great I will test this. What about a
git checkout
equivalent? Say I want my local files and S3 bucket to check out a different commit
u
So basically in lakeFS there's no "checkout" process; all branches are accessible using the path. If you're referring to changing a branch to point to a different commit, that's possible using the branch revert command
u
Where you pass the commit reference
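If you end up doing that from Python rather than `lakectl`, a heavily hedged sketch follows; I haven't verified the exact signature against a specific SDK version, `client` is the LakeFSClient from the upload sketch above, and the commit ID is a placeholder.

```python
# Assumption: the branches API exposes revert_branch with a RevertCreation body
# (ref = the commit reference mentioned above; parent_number is only meaningful
# for merge commits). Double-check your lakeFS / lakefs-client version and the
# revert semantics before relying on this.
from lakefs_client.models import RevertCreation

client.branches.revert_branch(
    repository="demo",
    branch="main",
    revert_creation=RevertCreation(ref="<commit-id>", parent_number=0),
)
```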
u
I've seen that as well but it seems like that's only intended for large errors in the data
u
What do you mean by large errors in the data?
u
Isn't branch revert intended for rollback, which is described as this in the docs
A rollback operation is used to fix critical data errors immediately.
u
Revert is used to point a branch to a different commit, which is what you're looking for if I understand correctly. If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just like all branches are accessible, commits/refs and tags are accessible as well
u
Changing a branch ref with a large data operation is usually done when there are data issues; this is why it's mentioned that way in the docs
u
This might be a better way to phrase my potential use case as well, to put it in ML context. The goal is reproducibility. If I train a model on LakeFS commit x, and then have commits y, z, etc., and let's say I want to reproduce when I trained with commit x, what's the best way to do that? It seems like I want this:
If you'd like to access objects from a different commit without changing the branch reference, you can do that too: just like all branches are accessible, commits/refs and tags are accessible as well
u
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
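For the reproducibility angle, a rough sketch of what "point your model at a ref or tag" could look like with the Python SDK (same `client` as above; the tag name, commit ID, and object path are placeholders, and the tag step is optional since a raw commit ID works as a ref too):

```python
from lakefs_client.models import TagCreation

# One-time: record the commit you trained on as a human-readable tag.
client.tags.create_tag(
    repository="demo",
    tag_creation=TagCreation(id="train-run-1", ref="<commit-x-id>"),
)

# Later: read the data exactly as it was at that commit; no revert/checkout needed.
obj = client.objects.get_object(
    repository="demo", ref="train-run-1", path="images/cat.jpg"
)
with open("cat.jpg", "wb") as out:
    out.write(obj.read())
```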
u
I think it'll make it clearer
u
Thanks. To do this
You can point your model to work with a specific ref or tag, you don't need a revert/checkout
Is it a lakefs download?
u
Ideally, I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3
u
It's relevant to all listing/getting object commands; the only exception is upload, since commits are immutable
u
if you wish to read directly from S3 but leverage lakeFS mapping and versioning capabilities, you can too. That's the S3 gateway :)
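If it helps, reading through the S3 gateway from Python could look roughly like this with boto3: the repository shows up as the bucket, and the ref (branch, tag, or commit ID) is the first element of the key. The endpoint, keys, and prefix below are placeholders.

```python
import boto3

# Point a standard S3 client at the lakeFS S3 gateway; lakeFS resolves the ref
# ("main" here, but a commit ID or tag works too) to the right object versions.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # your lakeFS endpoint (placeholder)
    aws_access_key_id="AKIA...",                 # lakeFS access key ID (placeholder)
    aws_secret_access_key="...",                 # lakeFS secret access key (placeholder)
)

resp = s3.list_objects_v2(Bucket="demo", Prefix="main/images/")
for item in resp.get("Contents", []):
    print(item["Key"], item["Size"])
```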
u
If you use the API, lakeFS downloads the object for you (so traffic goes through lakeFS)
u
Ultimately I just want to stream data over the internet, but I think other software dependencies would require an S3 path
u
For the downloading to local storage use case, would that be python
ObjectsAPI
? Can I download recursively from a folder in lakefs?
u
Note that you can also use the objects API presigned URLs in order to generate S3 URLs. You can also use the objects API to list objects under a specific path and eventually get these objects
u
Once you list the objects you can iterate over them and download
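A rough sketch of that list-and-download loop with the Python SDK (same `client` as above; the pagination field names follow the lakeFS API as I understand it, and the repository, ref, prefix, and destination directory are placeholders — the presigned-URL route mentioned above isn't shown here):

```python
import os

def download_prefix(client, repo, ref, prefix, dest_dir):
    """Download every object under `prefix` at the given ref into dest_dir."""
    after = ""
    while True:
        page = client.objects.list_objects(
            repository=repo, ref=ref, prefix=prefix, after=after, amount=1000
        )
        for stat in page.results:
            local_path = os.path.join(dest_dir, stat.path)
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            obj = client.objects.get_object(repository=repo, ref=ref, path=stat.path)
            with open(local_path, "wb") as out:
                out.write(obj.read())
        if not page.pagination.has_more:
            break
        after = page.pagination.next_offset

download_prefix(client, "demo", "train-run-1", "images/", "./data")
```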
u
Ok let's say I want to mirror an old commit on local storage. If 99% of the objects are unchanged, would I still need to download 100% or just 1%?
u
Please explain what you mean by mirroring an old commit on local storage; up until now we discussed lakeFS running against S3
u
Creating a branch, regardless of the source branch, is a zero-copy operation, so no files will be copied. If you change 1% of the files, then only that 1% of files will be written, using copy-on-write. Now, you wish to "sync" this branch locally? Is this the use case?
u
Yes, because I may have multiple machines training via local storage. I need a
git pull
equivalent for a whole branch and not individual objects
u
I want to point to a specific ref or tag when training and have the option to read data from local disk or directly from S3
To satisfy the first option here
u
Or directly from S3 is also there :) You can use native Linux rsync/rclone commands if you like; I think the better approach would be to work directly with S3, to be honest
u
I just use the term mirror as a
git checkout && git pull
equivalent
u
Due to the fact datasets might be very big, I'm not sure that's the common use case, as it can take a very long time to download
u
The reason it may be undesirable to work directly with S3 is because the network bottleneck is much worse than local reads when loading data for training
u
If they're small, rsync should do the trick anyway
u
Network transfer will happen anyway, as you'll need to "git pull"/rsync before running your model
u
Reading from S3 would be intended for larger datasets when the download cost outweighs the slower data loading
u
Yeah it's a large/small dataset tradeoff. One time download cost versus continuous data loading cost during a training job
u
Argh, sorry, that's the other way around
u
But it doesn't really matter, you should just change the source / destination
u
See here the rsync reference
u
Thanks rsync looks good
u
Can airbyte sync locally too?
u
Do you have some recommendation between the 2?
u
Yes, but I think it depends on the data type (CSV and JSON are supported, if I remember correctly)
u
Ah, I'm working with images 😅
u
And also JSON
u
Not really, it's a matter of preference. If that's the only use case, I'd stick with rsync as it covers everything you need. Airbyte has a lot of other integrations
u
but I think the onboarding experience will take more time and it might not support your use case
u
Thank you. I think hopefully my last question:
If they're small, rsync should do the trick anyway
Do you happen to have an idea of what small means here quantitatively?
u
I don't have any number to give on this, as it depends on the effectiveness of the model and the number of objects downloaded. I suggest testing both methods and picking the one which is faster or more reliable for your needs
u
Ok, thank you! I appreciate your help
u
Sure :) happy to help! Feel free to reach out with further questions
u
@Or Tzabary I'm not sure if you can help me with this issue. I was able to use rclone sync between my S3 bucket (actually Backblaze) and local storage. When I try to set it up with lakeFS following the docs, it seems to start out well...
rclone lsd lakefs:
yields
-1 2023-01-27 12:36:40        -1 demo
which is my only lakefs repository right now. However, even when trying
rclone ls lakefs:
I get
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": dial tcp: lookup demo.lakefs.example.com on 127.0.0.53:53: no such host
with similar errors for rclone sync, etc. Any idea on this? Could it be related to using Backblaze?
u
Hi @Conor Simmons, let me try and find an idea from a member of the team. Not sure we have something immediate, but will look into this.
u
Thank you!
u
You bet
u
@Conor Simmons is demo.lakefs.example.com really your lakeFS host? It might indicate a wrong rclone configuration, where you need to set lakefs to your lakeFS endpoint.
u
The
demo
part is the name of the repository (you can see in the
ls
command). It seems to add that into the hostname. I substituted "example" in the actual endpoint name. The config looks like this
[lakefs]
type = s3
provider = AWS
env_auth = false
access_key_id = xxxxxxxxxxxxxxxxxx
secret_access_key = xxxxxxxxxxxxxxxxx
endpoint = https://lakefs.example.com
no_check_bucket = true
u
While lakeFS should work with that endpoint URL, it seems that the DNS is not configured to work with it, based on the error. IIRC, you mentioned this is running locally, right?
u
If so, make sure to add demo.lakefs.example.com to your hosts file, pointing to 127.0.0.1. If it's not running locally, you should point that endpoint to your server
u
It's running on a server but thanks, I'll share this info
u
Let me know how it went
u
We are running this on a server. We've set up a reverse proxy with nginx using certificates from CloudFlare, but only for the subdomain lakefs.example.com, so that we can serve over https. Is there a particular reason that lookup is being made to repo_name.lakefs.example.com, and is there some other way to handle that instead of through sub-subdomains?
u
@Matija Teršek Hi, just reading through the thread real quick and I'll try to assist
u
Hi, I would guess that you're using the S3 gateway to access lakeFS using the S3 protocol. This is host-style addressing in the S3 protocol; you'll want to change the client accessing lakeFS to use path-style addressing. We might be able to help with that; which client are you using? Alternatively, of course, if you can configure CloudFlare to also serve the wildcard DNS *.lakefs.example.com, using a similar wildcard certificate, you will be able to solve it on the server side.
u
You can read about the two styles in the AWS documentation; for instance, this is a good summary. It's one really annoying aspect of the S3 protocol that makes it hard to configure a server. Amusingly, AWS has also been trying to remove path-style addressing for many years now, so far with little success.
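For completeness, if anyone hits the same thing from Python against the gateway (rather than rclone), the boto3 equivalent of forcing path-style is a client config option; the endpoint and keys below are placeholders.

```python
import boto3
from botocore.config import Config

# Path-style addressing puts the repository in the URL path
# (https://lakefs.example.com/demo/...) instead of the hostname
# (https://demo.lakefs.example.com/...), so no wildcard DNS/certificate is needed.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
    config=Config(s3={"addressing_style": "path"}),
)
```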
u
To add to what @Ariel Shaqed (Scolnicov) correctly pointed out, I believe the specific flag you're looking for in
rclone
is
--s3-force-path-style
u
Also available as a config option in Rclone. I think you may have missed this integration guide for Rclone.
u
Just to add a couple of notes: 1. I tried
rclone ls --s3-force-path-style lakefs:
and got the same error 2. Our rclone config is based exactly on that integration guide
u
Ouch! Sorry to bug you, but could you try rclone with the -vv flag and attach all output, please?
u
<7>DEBUG : rclone: Version "v1.61.1" starting with parameters ["rclone" "ls" "-vv" "lakefs:"]
<7>DEBUG : rclone: systemd logging support activated
<7>DEBUG : Creating backend with remote "lakefs:"
<7>DEBUG : Using config file from "/home/conor/.config/rclone/rclone.conf"
<7>DEBUG : 3 go routines active
Failed to ls: RequestError: send request failed
caused by: Get "https://demo.lakefs.example.com/?delimiter=&encoding-type=url&list-type=2&max-keys=1000&prefix=": x509: certificate is valid for lakefs.example.com, not demo.lakefs.example.com
Not much helpful info imo 😅 But note the error is a bit different since @Matija Teršek changed something on the server side
u
I've found this issue on the rclone forums. As far as I recall, you said you aren't using AWS, right? Maybe this will be relevant
u
It's an S3 store using Backblaze. Here's the config (provider is set to AWS): https://lakefs.slack.com/archives/C02CV7MUV4G/p1675017392529639?thread_ts=1674838530.053799&cid=C02CV7MUV4G
u
If I follow this guide, I can access the backblaze store fine
u
Plus, it seems like the config is working somewhat since
rclone lsd lakefs:
yields
-1 2023-01-27 14:42:27        -1 demo
which is our 1 lakefs repo right now
u
Would you mind trying the suggestion in the post I linked to? They suggest using
provider = Other
in place of
provider = AWS
I want to rule that out before we look elsewhere
u
Yeah, listing buckets doesn't perform host-style addressing. That's one of my favourite s3 weirdnesses.
u
Also, make sure to add
--s3-force-path-style=true
, that's still needed even with
provider = Other
u
@Ariel Shaqed (Scolnicov) @Elad Lachmi it works! Thanks a ton!
u
@Conor Simmons Awesome! Glad we were able to assist. If you have any further questions, feel free to reach out. Happy lakeFS`ing 😎
u
Thanks! Looking forward to using lakefs