# help
n
Hi all, I am new to lakeFS and enjoying it so far. I am evaluating it on a DL dataset in S3 that will grow over time (~200TB). I created the repo in lakeFS with S3 as the storage backend. If the underlying data changes, how does lakeFS deal with it? The reason I ask is that we plan to add data to S3 directly from a pipeline process. The pipeline may delete data, change the contents of files, or add new files from time to time. What's the best way to handle my use case in lakeFS? I see the documentation suggests using rclone to sync a lakeFS repo with an S3 bucket; if that's the preferred way, does it scale to ~1 billion objects every time there's a change? Thanks in advance!!
l
Hi @Narendra Nath and welcome! With lakeFS the best practice is to have lakeFS as part of your pipeline, and not make changes directly to your bucket. Can you please tell me more about your use case and your pipeline?
n
Hi Lynn, we upload data to S3 from external systems (on-prem computers, external annotation systems like Scale, SageMaker Ground Truth, etc.). The pipeline does some data quality checks before the upload from the on-prem systems (I can potentially use lakeFS to upload here). But the annotation systems upload data directly to S3, and we do not control that upload.
i
Hey @Narendra Nath, one option would be to import the data into lakeFS. The import doesn't copy the data into lakeFS; it just points to it, so it scales much better than having to copy everything.
Since the data isn't managed by lakeFS, lakeFS won't delete it. If a user tries to retrieve an object that was deleted from its origin, she'll receive a 410 Gone response. If the object was changed in the origin, reads will fail because the lakeFS metadata contains a different content-length. The best practice is to re-import (which is cheap) whenever the origin data has changed.
Files newly added to the origin will only become available after another import, i.e. imports into lakeFS are not continuous.
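To make the re-import practice concrete, here is a minimal sketch (not from the thread; it assumes boto3, a placeholder lakeFS endpoint and credentials, and illustrative bucket/repo/branch names) that compares an object's size in the origin bucket against what the lakeFS S3 gateway reports, as a cheap signal that a re-import of that prefix is due:

```python
import boto3

# Plain S3 client for the origin bucket (uses default AWS credentials).
origin = boto3.client("s3")

# Client pointed at the lakeFS S3 gateway; endpoint and keys are placeholders.
lakefs = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # hypothetical lakeFS endpoint
    aws_access_key_id="<lakefs-access-key>",
    aws_secret_access_key="<lakefs-secret-key>",
)

def origin_changed(origin_bucket: str, key: str, repo: str, branch: str) -> bool:
    """True if the origin object's size no longer matches the lakeFS metadata."""
    src = origin.head_object(Bucket=origin_bucket, Key=key)
    # Through the lakeFS S3 gateway, the repository acts as the bucket and the
    # branch is the first component of the object key.
    dst = lakefs.head_object(Bucket=repo, Key=f"{branch}/{key}")
    return src["ContentLength"] != dst["ContentLength"]

if origin_changed("my-datalake-bucket", "images/0001.jpg", "my-repo", "main"):
    print("Origin drifted; re-import this prefix into lakeFS")
```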
o
One thing to keep in mind about importing, though: since lakeFS has no control over the files imported into it, if a file gets overwritten in its original location, lakeFS has no way to magically return the older version 🙂 @Narendra Nath could you share a bit more info about the annotation system? If it can write to S3, perhaps it could write to the lakeFS S3-compatible endpoint?
h
Can you tell your "annotation systems upload" to use the lakeFS S3 endpoint once you import the data into lakeFS? You may not be able to control how, when, etc. the annotation system uploads, but you must surely be able to change the S3 endpoint/bucket, right?
👀 1
n
That is correct, we currently allow the annotation system (3rd party) to upload to S3 by granting an IAM role from their account cross-account permissions in our bucket policy. How do I authenticate if I use lakeFS instead?
l
Let me check that and I'll get back to you tomorrow 🙂
h
lakeFS uses an access key and secret key pair, which I am pretty sure is what your annotation system is using (to get the IAM role). So from the annotation system's point of view, it's just a matter of using a different pair of access and secret keys. The only thing that may be tricky is that you need to use the lakeFS endpoint URL rather than the default Amazon S3 endpoint URL. Depending on the tool your annotation system uses to connect to S3, this may be easy (e.g. in rclone it's super simple) or harder (e.g. aws cli: https://github.com/aws/aws-cli/issues/1270). Caveat: I have only been using lakeFS recently and I am not part of the lakeFS team, so some of this may be wrong. Please correct me.
n
Hi HT, we do not provide an access key and secret key to annotation vendors; we just whitelist a role from their account in our bucket policy. If an access key and secret access key work, then I think I might be able to use that, depending on the vendor systems. Thank you!
h
Right... the vendor has an AWS account, and you just add a bucket policy to allow that role in that account to access your bucket (aka cross-account access). Gotcha.
a
@HT is correct about how to access lakeFS from an S3 application such as your annotation server. I'd like to lay out the two issues explicitly:
• Endpoint URL. You have to be able to tell the annotation server to access lakeFS rather than AWS S3. This is usually called an "endpoint" or an "endpoint URL"; as pointed out above, it can be tricky to configure on the AWS CLI, but it is possible.
• Authentication. You also have to tell the annotation server how to authenticate itself to lakeFS. lakeFS supports access key authentication using secret keys, which is usually easier to configure. The tricky point is that you will not be able to authenticate using an AWS IAM role: there is no way for an application running with some AWS role to prove, outside of AWS, that it has that role. So you will need to provide the annotation server with an identity on lakeFS.
lakeFS pretty much shares these issues with any other object store that provides an S3 interface. One approach that often works is to figure out how to configure an application to work with MinIO, and then do the same for lakeFS. Good luck!
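As a non-authoritative sketch of those two points, here is roughly what the annotation server's S3 client configuration would look like with boto3 (the endpoint URL, access/secret keys, repository, and branch names are all placeholders, not values from this thread):

```python
import boto3

# S3 client aimed at lakeFS instead of AWS S3: custom endpoint + lakeFS keys.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",   # the lakeFS endpoint URL
    aws_access_key_id="<lakefs-access-key>",     # lakeFS credentials, not AWS IAM keys
    aws_secret_access_key="<lakefs-secret-key>",
)

# With the lakeFS S3 gateway, the repository name takes the place of the bucket
# and the branch is the first component of the object key.
s3.upload_file(
    "annotations/0001.json",          # local file produced by the annotation system
    "my-repo",                        # lakeFS repository
    "main/annotations/0001.json",     # branch/path inside the repository
)
```

The same two settings (endpoint URL and key pair) are what you would change in rclone, the AWS CLI, or any other S3-compatible tool.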
n
Thanks for all the help, I will try it out.