# dev
t
can anyone tell me what is in the _lakefs directory within a repo? I am guessing it's rocksdb related?
o
that directory contains commit metadata. information about which objects are contained in which commit. see https://docs.lakefs.io/understand/versioning-internals.html#constructing-a-consistent-view-of-the-keyspace-ie-a-commit - there's a storage layout example there that illustrates that
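if you want to peek at it yourself, listing that prefix in the underlying bucket shows the layout. a minimal sketch with boto3 - the bucket name and repo prefix below are just placeholders for wherever your repository's storage namespace points:

```python
import boto3

# List whatever lakeFS keeps under the _lakefs/ prefix of a repository's
# storage namespace. "my-lakefs-bucket" and "my-repo/" are placeholders;
# substitute your repo's actual storage namespace.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-lakefs-bucket", Prefix="my-repo/_lakefs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```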
t
@Oz Katz you beat me by 30 seconds. Just came across that article. Thanks! having a read
o
👍😊
t
So far I am really liking lakeFS. I wonder why not write a plugin for the git CLI? Maybe that exists and I haven't read enough yet.
o
there's actually an open issue on git integration - would love for you to join the discussion: https://github.com/treeverse/lakeFS/issues/2073
t
awesome! will have a look!
@Oz Katz the run options say AWS or Azure. Will the GC job not run on self-hosted Spark clusters?
i
The GC job is capable of running on self-hosted Spark clusters too. It needs the aws/azure options because some of its functionality communicates directly with the underlying object store, e.g. batch deletes of objects.
t
@Itai Admi I am trying to understand the implications. If I understand you correctly, the Azure/AWS storage backends implement CRUD operations used by the Spark job that the GCP backend does not?
i
Are you asking why garbage collection for GCP is currently not supported? Because we have plans to support that. Some of the GC work can be shared and some is specific to the storage. That's the reason we have partial support at the moment. If you're interested in GC for GCP feel free to elaborate on your use case here or on the issue so we'll know to prioritize it better.
t
Thanks @Itai Admi. I am asking why, but from a technical perspective (what functionality is missing or incompatible), not a philosophical or prioritization perspective.
will have a look at that issue
👍 1
a
Hi @taylor schneider! I'd like to try to answer this. IIUC, you're asking why the GC job depends on the object store and connects to it directly, rather than going through Hadoop. (I really hope that is your question, because the following is a long rant against Hadoop FileSystems, which will be boring if it's not what you wanted to hear... 😊) Ironically, it is really hard to use a Hadoop FileSystem such as S3A (the other object stores are pretty much the same) to perform operations on the underlying object store! GC needs to delete files en masse. But the Hadoop FileSystem contract forces these implementations to pretend that there is a directory structure. They do this by adding an empty "directory marker" object. Delete deletes the object, then looks to see whether there are other objects with the same directory marker prefix - if there are none, it attempts to delete that directory marker (and walk up the fake "tree"). What that means is that every delete-file operation in S3A is at least a DELETE object call on S3 followed by a HEAD object call on S3! For a job that aims to delete a huge number of files, this is 2x slower and about as wasteful. What makes it particularly galling on S3 is that there is a perfectly good "bulk delete" operation that lets us delete 1,000 files in a single S3 API call. So S3A uses around 2,000 API calls where it could use just one.
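to make the numbers concrete, here's a rough sketch of the two approaches with boto3 (bucket name and keys are placeholders, and the directory-marker HEAD/LIST traffic isn't shown) - one DELETE per object the way a Hadoop FileSystem ends up issuing them, versus a single bulk-delete call for up to 1,000 keys:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"                                  # placeholder
keys = [f"data/part-{i:05d}" for i in range(1000)]    # placeholder keys

# FileSystem-style deletion: one DELETE request per object (plus the extra
# directory-marker bookkeeping described above, which isn't shown here).
for key in keys:
    s3.delete_object(Bucket=bucket, Key=key)

# What a GC job really wants: up to 1,000 keys in a single DeleteObjects call.
s3.delete_objects(
    Bucket=bucket,
    Delete={"Objects": [{"Key": k} for k in keys], "Quiet": True},
)
```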
t
@Ariel Shaqed (Scolnicov) thanks for the explanation above. I am starting to understand the issues but have a few more clarifying questions: if the user requests a lakefs:// path through Spark, I was thinking the process was that the lakeFS server has the metadata and translates the path into underlying S3 paths. Is this accurate?
*and the client speaks to the underlying s3
e
That is correct.
The metadata resides in S3 as well, and lakeFS server is responsible for retrieving the mapping and returning it to the client.
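as a rough sketch of that round trip (endpoint and field names are from my reading of the lakeFS OpenAPI spec - double-check against your server's version; host, credentials and paths are placeholders):

```python
import boto3
import requests

LAKEFS = "http://localhost:8000/api/v1"   # placeholder lakeFS endpoint
AUTH = ("ACCESS_KEY_ID", "SECRET_KEY")    # placeholder credentials

# 1. Ask the lakeFS server where the object physically lives.
resp = requests.get(
    f"{LAKEFS}/repositories/my-repo/refs/main/objects/stat",
    params={"path": "tables/events/part-00000.parquet"},
    auth=AUTH,
)
resp.raise_for_status()
physical = resp.json()["physical_address"]   # assumed here to be an s3:// URI

# 2. Read the data directly from the underlying object store.
bucket, key = physical.removeprefix("s3://").split("/", 1)
body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```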
a
Yes! This is exactly how lakeFSFS (my nickname for The LakeFSFileSystem) reads and writes.
❤️ 1
The difference is that reading and writing are data operations of the underlying filesystem, therefore perfectly supported for us by S3A. But object store delete is a metadata operation, and S3A faffs around a lot more than it should. So we can't use S3A to delete for us.
t
@Ariel Shaqed (Scolnicov), taking Ceph as an example, wouldn't it be possible to mount Ceph as a local directory (in the same place on the client and the lakeFS server) using Ceph FUSE drivers, and then have the data operations take place through POSIX rather than S3? (Possible in theory, I know the code base is not there.)
a
Oh, absolutely! A network file system with POSIX semantics has a much superior API. But as you rightly point out, the code isn't there, certainly not for S3. And S3 is the most popular object store we see. I believe there is a sound reason why. Object stores intentionally limit their API. They are atomic only on single paths. This allows sharding the path space between servers, giving the kind of coarse-grained parallelism that SREs dream about. Everything follows from that decision. There is no "rename" or "move" operation, because it would have to be atomic in 2 places to be useful (you can copy-and-delete, which can be atomic on each file separately, but then there's no point in the API replacing those 2 calls with 1). There is no "directory", because all the interesting operations (rename directory, delete directory) cannot be atomic. Now Ceph and Hadoop HDFS can do these things. (To a degree; I'd have to check up carefully on the entire chain, including FUSE, to see that they do indeed guarantee atomicity!) The tradeoff is between cost/performance and scale. And lakeFS targets object store architectures more than it does NFS-type architectures. (Don't tell my bosses, but if you want to version data on NFS I'd recommend Git or even ClearCase...)
tl;dr: what you say would work! But it would come with a high price/performance ratio at even moderate scale.
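for example, here's what a "rename" has to look like against plain S3 - two calls, each atomic on its own key, with no atomicity across the pair (names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"   # placeholder

# There is no rename in the S3 API: it has to be a copy followed by a delete.
# Each call is atomic on its own object, but a reader can observe the state
# in between (both keys present, or the copy done and the delete not yet).
s3.copy_object(
    Bucket=bucket,
    Key="new/path/object",
    CopySource={"Bucket": bucket, "Key": "old/path/object"},
)
s3.delete_object(Bucket=bucket, Key="old/path/object")
```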
t
@Ariel Shaqed (Scolnicov) thanks for that detailed explanation. So if I understand you, we are saying: "The S3 API implements an atomic CRUD API while POSIX does not. While Ceph may provide atomic operations, the NFS (POSIX) API does not. Supporting Ceph would be a hack job."
Also, I haven't heard of ClearCase, I'll have a look. But I'd rather host MinIO and run a local S3 than use Git for data.
(speaking from experience hahaha)
a
Yeah, that's about it. NFS actually kinda provides atomic operations. It's just that the API is too good: it gives atomic operations on directories (mv). That makes everything possible and even convenient for programmers. It's just incredibly expensive to scale. That's why most data lakes use an object store rather than a proper filesystem. And we follow suit. (Azure gen2 has some interesting semantics, and I need to read up on it more carefully. But we see so much data on S3...) If you can operate Ceph up to the sizes you need, I can certainly imagine Git/LFS/Ceph being a great combo for users. I mean, if anyone could, I'd expect Microsoft to be able to host a really big Git/LFS for Data on top of Azure gen2 - they have expertise in all 3 parts.
... I'd rather host minio and run a local s3...
I guess this is where I say "welcome to lakeFS!" 😎 :lakefs:
🙌 1