# help
c
Hey! So in using LakeFS I'd want to be able to:
• Take the table of files and query the raw file paths for a specific version, then left join the actual content to these paths within Spark. Is it possible to do this without transacting the binary file content through LakeFS? I'm concerned about the potential IO bottleneck with Go. Ideally, we'd get the file metadata from the PG database, return the file paths for a specific version, and retrieve those files from S3.
• Query files filtered by the name of each dataset (indicated by a directory) and the latest commit hash of that folder.
• Delete the KV stores and Postgres and rebuild everything in LakeFS from scratch using only the raw data.
• Soft delete files (which I assume is implicit in your versioning strategy), and if required hard delete those files too.
Does all this sound doable?
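On the first bullet, here is a minimal sketch of what the list-paths-then-join flow could look like, assuming a lakeFS server at `http://lakefs.example.com:8000`, a repository called `datasets`, and a ref `my-ref` (all placeholders). The `objects/ls` endpoint and its `physical_address` field follow the lakeFS OpenAPI as generally documented, so double-check them against your server version:

```python
# Sketch: fetch the file paths (and underlying object-store addresses) for one
# version from the lakeFS API, then left join content read directly from S3.
import requests
from pyspark.sql import SparkSession

LAKEFS = "http://lakefs.example.com:8000"   # placeholder lakeFS endpoint
AUTH = ("<access-key-id>", "<secret-key>")  # lakeFS credentials
REPO, REF = "datasets", "my-ref"            # placeholder repo and commit/ref

def list_objects(prefix=""):
    """Page through the object listing for a single ref."""
    after, results = "", []
    while True:
        resp = requests.get(
            f"{LAKEFS}/api/v1/repositories/{REPO}/refs/{REF}/objects/ls",
            params={"prefix": prefix, "after": after, "amount": 1000},
            auth=AUTH,
        )
        resp.raise_for_status()
        body = resp.json()
        results.extend(body["results"])
        if not body["pagination"]["has_more"]:
            return results
        after = body["pagination"]["next_offset"]

spark = SparkSession.builder.getOrCreate()

# Logical path in the repo + the raw object-store location for that version.
paths = spark.createDataFrame(
    [(o["path"], o["physical_address"]) for o in list_objects()],
    ["path", "physical_address"],
)

# Read the binary content straight from S3 (never through the lakeFS server)
# and left join it onto the per-version paths. physical_address typically uses
# the s3:// scheme while Spark's binaryFile source reports s3a:// URLs, so the
# join key may need normalizing.
content = (
    spark.read.format("binaryFile")
    .load("s3a://my-bucket/")                 # placeholder storage namespace
    .withColumnRenamed("path", "object_url")
)
joined = paths.join(content, paths.physical_address == content.object_url, "left")
```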
e
It is. You need to use a lakeFS client; for Spark, the lakeFS Spark client is the most suitable. See the documentation here: https://docs.lakefs.io/v0.52/reference/spark-client.html
You can use lakeFS over Delta tables. Postgres is used as a key-value store for some of the metadata lakeFS creates, for performance reasons. lakeFS is built on top of metadata saved in S3 using SSTables. You can read more about it here: https://docs.lakefs.io/understand/how/versioning-internals.html
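As a rough illustration of reading through lakeFS from Spark, one documented pattern is to point `s3a` at the lakeFS S3 gateway and address data as `s3a://<repository>/<ref>/<path>`. The endpoint, keys, repo, and table path below are placeholders, and reading Delta requires the delta-spark package on the classpath:

```python
# Sketch: read a table for a given branch or commit through the lakeFS S3
# gateway by treating the repository as the bucket. Placeholders throughout.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "http://lakefs.example.com:8000")  # lakeFS S3 gateway
    .config("spark.hadoop.fs.s3a.access.key", "<lakefs-access-key-id>")
    .config("spark.hadoop.fs.s3a.secret.key", "<lakefs-secret-key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# s3a://<repository>/<branch-or-commit>/<path> -- the "bucket" is the repo name.
df = spark.read.format("delta").load("s3a://datasets/main/events/")
df.show()
```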
c
Yes, so if the Postgres instance or the KV store gets deleted, does all the S3 raw data versioning history become corrupt? Or can LakeFS rebuild the PG/SSTables from the folder structure or raw S3 data?
e
No, since their content is persisted to S3 in a production-grade environment
c
So to be clear, 'production grade' here means: if I install LakeFS, add a bunch of data in S3, delete your Postgres instance, and flush all the DynamoDB tables, LakeFS can look at the S3 buckets and rebuild all of that lost information so it can be queried?
What's the impact on the system in the event of that data loss, basically?
FWIW, the system we're running right now can do this, so it builds in this idea of self-healing; we're considering whether lakeFS is viable as an alternative.
e
That depends on many things. If you are asking if lakeFS is an additional availability layer to the data, the answer is yes. If you are asking if it can be highly durable, the answer is also yes. S3 also fails with some probability, right? There are thousands of lakeFS installations out there, and we have never had a data loss incident reported in any of the community channels, or the commercial ones.
c
Yes, not criticising LakeFS, more worried about user-level issues.
So the short answer is no, and there is something that gets lost in this case, or?
e
I believe you can configure it to meet the needs of your users
c
Well. Part of the reason we use Delta Lake and Iceberg is so the system isn't entirely online. You can't 'hack' into a Delta Lake in the same way you can break into an always available Postgres cluster or instance.
Doesn't have to be our users, can be any bad actor.
The self-healing is important because we soon could* have several hundred million files, closing in on a billion. Deltas scale quite well for this because Spark can just map-reduce over their partitions, so I'm also a bit worried about having to manage TB-scale PG instances. Being able to guarantee integrity is important.
A backfill would suck, but it's better than not being able to rebuild our trees, essentially.
e
Most metadata is saved directly to S3. The small amount saved to the KV store scales with the branches you create, not the number of objects you manage. If availability is important, using DynamoDB with AWS's availability guarantees is advised.
gratitude thank you 1
c
Yeah. I presumed this should be supported because it should be the same operation as migrating off LakeFS, losing all the Postgres and DynamoDB data, and then migrating back onto it.
šŸ‘ 1
Right now our 'zero-copy' model is to just create new versions of datasets which are just pointers to older datasets. Those pointers get saved in S3 as the datasets. One of my worries is that because the LakeFS model is zero-copy, if you migrate away from lakeFS by just rcloning the raw data, you lose versioning information for files, because that versioning information might not be represented in S3 but in Postgres or in the SSTables. Essentially, how LakeFS represents 'a file was deleted' on the filesystem is a key concern. For instance, consider a scenario where a file `file1.txt` exists in version 1, gets updated in version 2, and then deleted in version 3. In S3, you might have the data from version 1 and version 2, but without the versioning information, there's no way to know that `file1.txt` was deleted in version 3. This lack of context would make it impossible to accurately represent the state of a dataset after migration.
So our current solution to that is to literally store metadata in parquet files in S3 that point to file changes. The union of updates can be used to build a merkle tree of a dataset's lineage. These live with the datasets themselves, but it's a fairly complex system we might be better off buying instead, because it seems like LakeFS is a bit more flexible and closer to the longer-term needs.
e
It's very cool that you have built this internal system. lakeFS is built slightly differently so it can scale better and have a very small impact on performance. I think you're in the right place :-)
c
Yea. I mean, it's a bit more robust than this because the S3 files are the file changes and can be re-processed from scratch. Each dataset has this idea of a delete table, and so you can just take the union of the files a dataset has and the files that you have deleted to filter out soft-deleted files. For the edge case where a file is added again, it's just a matter of looking at the id of a file (which is k-sortable) and seeing if the file id at that path is more recent.
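As a rough sketch of the approach described above (the internal system, not lakeFS), the soft-delete filter with k-sortable ids could look like this in Spark; the table locations and column names (`path`, `file_id`) are purely illustrative:

```python
# Sketch of the soft-delete filter described above. Keep only the newest
# (k-sortable) file_id per path, then drop anything whose latest id also
# appears in the delete table. Table paths and columns are illustrative.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

files = spark.read.parquet("s3a://datasets/my-dataset/_files/")      # path, file_id, ...
deletes = spark.read.parquet("s3a://datasets/my-dataset/_deletes/")  # path, file_id

# Edge case: a path that was deleted and later re-added. Because file ids are
# k-sortable, the newest id per path wins.
latest = (
    files
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("path").orderBy(F.col("file_id").desc())))
    .filter("rn = 1")
    .drop("rn")
)

# Anti-join against the delete table leaves only live (non-soft-deleted) files.
live = latest.join(deletes, on=["path", "file_id"], how="left_anti")
```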
This is all very close to something that lakeFS seems to be doing, but being able to update and re-version datasets is where this gets trickier, because the backbone of being able to stage files to commit into a dataset isn't quite there.
@einat.orr and thank you!
jumping lakefs 1
e
But it is. I'm really not sure what you feel is missing. Maybe after you get to know lakeFS a little better, you'll see you have that, just not with the implementation you are thinking about, due to the way your internal system works.
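For reference, the stage-then-commit flow being discussed looks roughly like this against the lakeFS REST API. This is a sketch only: the server address, repository, branch, and file names are placeholders, and the endpoint shapes follow the public lakeFS OpenAPI, so verify them for your version:

```python
# Sketch: stage files on a branch and commit them as one new dataset version.
# All names and addresses below are placeholders.
import requests

LAKEFS = "http://lakefs.example.com:8000"
AUTH = ("<access-key-id>", "<secret-key>")
REPO = "datasets"

# 1. Branch off the current state of the dataset.
requests.post(
    f"{LAKEFS}/api/v1/repositories/{REPO}/branches",
    json={"name": "update-my-dataset", "source": "main"},
    auth=AUTH,
).raise_for_status()

# 2. Stage (upload) objects onto the branch; nothing is visible on main yet.
with open("part-0001.parquet", "rb") as f:
    requests.post(
        f"{LAKEFS}/api/v1/repositories/{REPO}/branches/update-my-dataset/objects",
        params={"path": "my-dataset/part-0001.parquet"},
        files={"content": f},
        auth=AUTH,
    ).raise_for_status()

# 3. Commit the staged changes -- this is the new version of the dataset.
requests.post(
    f"{LAKEFS}/api/v1/repositories/{REPO}/branches/update-my-dataset/commits",
    json={"message": "re-version my-dataset"},
    auth=AUTH,
).raise_for_status()
```

Merging that branch back into main then publishes the new dataset version atomically.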
c
Well that's what I'm hoping!
šŸ™Œ 1
heart lakefs 1
a
Small correction: @einat.orr linked to docs for some older version of the lakeFS Spark Metadata client. Please use this documentation instead.
šŸ™ 1
c
I created a repository with my data in it, deleted the repository, and when I look in MinIO the files are garbled up into the LakeFS format, and I can't create a repository from the old one. Is there a way to recover my data from this state?
a
I am sorry to hear that you deleted your repo by mistake. The objects you committed may be possible to recover, but that will be difficult. The uncommitted objects will probably be lost. (Well, each of them is of course readable individually, but you just deleted the mapping of paths to these objects, so all paths are lost.)
I once wrote a blog post that briefly explains these internals. Please note that this is relevant if you're curious, or if you want to develop lakeFS itself. You will never need to know any of these internal implementation details in order to use lakeFS.
c
No, I deleted it on purpose to test vendor lock-in šŸ˜›
will give the post a read
a
You're right that lakeFS clobbers pathnames on the object store, for valid technical reasons. We offer multiple ways to export data from lakeFS; that blog post above links to several, IIRC. Any of these is of course a great way to go if you need to leave lakeFS. Obviously OSS is not a good way to lock in our users šŸ˜€.

What if you want to give us money? In lakeFS Enterprise you'd also be holding the KV on your own resources, so obviously it couldn't be yanked. With lakeFS Cloud and a seriously confrontational operator you'd be at some low risk of losing uncommitted data. Here's how you could control that risk.

You're right that the only thing I can take away from you on Cloud is the KV - you actually control the object store (S3). As detailed in that box, it holds uncommitted data and the database of commit records - a small value holding the commit metadata (message, author, date) and the metarange digest. The contents of the metarange itself are under your control, in the object store. So I can cover your uncommitted data, but as soon as you commit I really control very little - just the connection from the commit to its metarange id.

If you're really worried about us locking you in with the KV store, you could simply record the commit record into storage that you control whenever you commit. If I had to do that, I'd probably write a commit hook to do this, or just periodically copy commit records out. Hopefully this is just a thought experiment...
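If it helps make that concrete, "periodically copy commit records out" could be as small as the following sketch, which pages through the commit log for a branch via the lakeFS API and writes each commit record to a bucket you own. Endpoint, credentials, repo, and bucket names are placeholders, and the endpoint shape follows the lakeFS OpenAPI, so check it for your version:

```python
# Sketch: back up commit records (message, committer, date, metarange id)
# to storage you control, so the KV store is never the only copy.
import json
import boto3
import requests

LAKEFS = "http://lakefs.example.com:8000"
AUTH = ("<access-key-id>", "<secret-key>")
REPO, BRANCH = "datasets", "main"

s3 = boto3.client("s3")

after = ""
while True:
    resp = requests.get(
        f"{LAKEFS}/api/v1/repositories/{REPO}/refs/{BRANCH}/commits",
        params={"after": after, "amount": 100},
        auth=AUTH,
    )
    resp.raise_for_status()
    body = resp.json()
    for commit in body["results"]:
        # One small JSON object per commit, keyed by commit id, in a bucket you own.
        s3.put_object(
            Bucket="my-backup-bucket",
            Key=f"lakefs-commits/{REPO}/{commit['id']}.json",
            Body=json.dumps(commit).encode("utf-8"),
        )
    if not body["pagination"]["has_more"]:
        break
    after = body["pagination"]["next_offset"]
```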