# help
u
I’m a new user of lakeFS: a friend told me about it, which led to me trying it out and deploying it. I’ve been deploying it on AWS following the docs and have run into a few blockers, specifically around setting up the S3 blockstore with the lakeFS server. My setup is a lakeFS server deployed on an EC2 instance, which talks to a Postgres RDS instance, and I’m currently trying to connect it to an existing S3 bucket. My question is: how exactly does a lakeFS server connect to an existing S3 bucket? The ideal scenario for me would be to create an instance profile with a minimum-access policy and attach it to the EC2 instance so that the server can talk to my S3 storage bucket. I’ve looked at the architecture documentation as well as the diagrams, and they don’t really explain this scenario.
u
Also, can someone please explain the difference between data files stored in an unmanaged bucket and in a lakeFS-managed bucket in this diagram?
u
Hi @ali, welcome to lakeFS! You might want to take a look at this document, which explains how to deploy lakeFS in an AWS environment, specifically on EC2. You are right that you do not have to pass credentials to lakeFS if you configure the correct AWS profile on your EC2 instance.

Regarding the diagram: data stored in lakeFS-managed buckets is versioned data that was uploaded to the bucket via lakeFS. These are the buckets and prefixes used to create lakeFS repositories; the data in them is managed entirely by lakeFS, and reading and writing should be done via lakeFS (reading directly from these prefixes will not make much sense, and writing directly to objects there can cause data loss). Unmanaged buckets are simply buckets (and prefixes) that lakeFS was not configured on, i.e. no lakeFS repository was created on them.

One exception to this rule is objects (and prefixes) in unmanaged buckets that were imported into lakeFS via our import process (which is the single green line you see going from the metadata block to the unmanaged buckets block). Since the import process is a zero-copy operation, we only keep "pointers" in lakeFS to the imported data. Therefore it is important not to modify these objects (or prefixes) directly after importing them, and to use lakeFS for any modification. Please let us know if you have any further questions.
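One more note on the instance-profile part: the blockstore section of the lakeFS configuration would look roughly like the sketch below. The RDS endpoint, region and exact database keys are placeholders (they vary a bit between lakeFS versions), but the point is that when no credentials are configured, lakeFS falls back to the default AWS credential chain, so the EC2 instance profile is picked up automatically:

```yaml
# Sketch of a lakeFS server config (e.g. /etc/lakefs/config.yaml).
# Trimmed to the relevant parts; other required settings
# (such as auth.encrypt.secret_key) are omitted for brevity.
database:
  type: postgres
  postgres:
    connection_string: "postgres://<user>:<password>@<rds-endpoint>:5432/lakefs"

blockstore:
  type: s3
  s3:
    region: us-east-1   # placeholder region
    # No credentials block here: lakeFS uses the default AWS credential chain,
    # which includes the instance profile attached to the EC2 machine.
```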
u
Hello @Niro, thank you so much for shedding light on this and for sharing the link on deploying lakeFS on AWS. I did take a look at it, and what wasn’t clear to me is whether an S3 bucket is a requirement for deploying a lakeFS server on AWS, because the document describes preparing the S3 bucket after the creation of the lakeFS server. And is this the same S3 bucket as the one in the previous diagram?
u
When deploying on AWS you'd want to use the native underlying blockstore, which is an S3 bucket. You could use Azure Blob Storage, for example, if you really wanted to, but that wouldn't make much sense for an AWS deployment. You don't need to create a new bucket for lakeFS; you can use an existing one. The important thing is to make sure the IAM role used by lakeFS has the required permissions on the bucket, as described in the documentation.
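For reference, the policy attached to that role looks roughly like the one in the deployment docs. This is a sketch (`<BUCKET_NAME>` is a placeholder, and you should double-check the current docs for the exact action list), but it shows the split between object-level and bucket-level permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "lakeFSObjects",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::<BUCKET_NAME>/*"]
    },
    {
      "Sid": "lakeFSBucket",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": ["arn:aws:s3:::<BUCKET_NAME>"]
    }
  ]
}
```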
u
Got it, thanks! Another question, regarding the policy I need to define for the lakeFS server: I’d want a new S3 bucket for every new repository I create in lakeFS. Would that mean that every time I add a new repository, I’d need to update the policy to add the new bucket’s ARN so the lakeFS server can talk to it, or is there something I’m missing here?
u
You could optionally provide a wildcard for the S3 resource in the policy:

```json
"Resource": ["arn:aws:s3:::*"],
```
I think this would work for your setup; however, I would highly recommend avoiding it, as it gives lakeFS access to all of your buckets (even the ones you don't intend to use with lakeFS). Instead, I would suggest using a single bucket for lakeFS and using prefixes for the repositories. For example, if your bucket is named `bucket1`, then when creating a repository named `repo1` I would use the storage namespace `s3://bucket1/lakefs-repos/repo1/` or `s3://bucket1/repo1`.
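Concretely, repository creation with per-repository prefixes would look something like this (a sketch; `bucket1`, `repo1` and `repo2` are placeholder names):

```sh
# Each repository gets its own prefix under the same bucket,
# so a single bucket-scoped IAM policy covers all repositories.
lakectl repo create lakefs://repo1 s3://bucket1/lakefs-repos/repo1/
lakectl repo create lakefs://repo2 s3://bucket1/lakefs-repos/repo2/
```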
u
Perfect, that’s what I was looking for. Thank you for the explanation.
u
You're very welcome! Hope you enjoy your lakeFS journey