# help
Hey! I am trying to understand lakeFS. Could someone official get in touch with me so I can discuss my use cases? I've read through the documentation, but coming from a traditional SWE background there are still quite a few bits that are unclear to me.

Currently, we have all our data organized neatly in S3. We use Databricks Autoloader to scan and categorize that data into Delta Lake tables, and we have several ETL post-processors that iterate through these datasets (also using Autoloader). This is all custom code. When new datasets arrive, they are added to our files and dataset tables, and several ETL jobs for metadata tables run in tandem over all new files. This append-only mechanism works well for our use case. We want to introduce dataset versioning, which might involve replacing the current crawling service with something like lakeFS, with outputs to Delta Lake.

I have some concerns about ingesting everything into lakeFS and storing it in S3:

• Does lakeFS have a way to inspect the diff of files ingested via a checkpoint? Could I point Autoloader at your Delta Lake tables and ingest from there?
• Does lakeFS run on a Spark server/instance?
    ◦ Would we need to keep that instance up all the time, or can lakeFS ingestion be done as a job? How is this triggered? We have petabytes of files, literally billions. Autoloader is great because Spark clusters can parallelize ingestion, so backfilling doesn't take long.
    ◦ Alternatively, if lakeFS ingestion is done by the same machine as the uploader, does this result in parallel writes to a Delta Lake table, as you have documented? Or is the idea to create a branch from the 'main' branch for each dataset and then ingest all of the mutually exclusive datasets (around 10,000) into the main branch? (I've sketched roughly what I mean in the P.S. below.) In that case, without a centralized server, I don't really see how we can avoid multiple writers unless we have a merge queue for main-branch ingestion.
• Does lakeFS retain metadata, such as which datasets the files came from? It doesn't seem like the way it stores data in S3 will be human-readable, so how do we know which files are part of which datasets (currently organized by external relational Delta Lake tables in S3)?

Thank you for reading! Looking forward to your response.
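P.S. To make the branch-per-dataset question more concrete, here's a rough sketch of the flow I have in mind, so we have something to discuss on a call. This assumes the high-level `lakefs` Python SDK and the lakeFS S3 gateway; the repository name, endpoint, paths, and function names on our side are all placeholders, and I may well be misusing the API:

```python
# Sketch only: branch-per-dataset ingestion, then merge into main.
# Assumes `pip install lakefs` (high-level Python SDK) and that credentials /
# endpoint come from the usual lakectl config or environment variables.
# Also assumes the Spark cluster has per-bucket S3A config pointing the
# "datasets" repo at the lakeFS gateway, e.g.
#   spark.hadoop.fs.s3a.bucket.datasets.endpoint = https://lakefs.example.com
# so that s3a://datasets/... goes through lakeFS while other buckets stay on S3.
import lakefs

REPO = "datasets"  # hypothetical lakeFS repository name


def ingest_dataset(spark, dataset_id: str, source_path: str) -> None:
    """Ingest one dataset on its own branch, then merge it into main."""
    repo = lakefs.repository(REPO)
    branch_name = f"ingest-{dataset_id}"
    branch = repo.branch(branch_name).create(source_reference="main", exist_ok=True)

    # Write the dataset through the S3 gateway: s3a://<repo>/<branch>/<path>.
    # Each dataset lands under its own prefix, so branches don't touch the
    # same objects.
    (spark.read.format("parquet").load(source_path)   # source format is a placeholder
          .write.format("delta")
          .mode("append")
          .save(f"s3a://{REPO}/{branch_name}/datasets/{dataset_id}"))

    # Commit with dataset-level metadata, then merge the branch into main.
    branch.commit(
        message=f"Ingest dataset {dataset_id}",
        metadata={"dataset_id": dataset_id, "source": source_path},
    )
    branch.merge_into(repo.branch("main"))
```

My worry is whether merging ~10,000 of these branches into main (each touching only its own prefix) behaves sanely without a central merge queue; that is essentially the multiple-writers question above. And if this works, I'd presumably point Autoloader at `s3a://datasets/main/...` through the gateway for the downstream ETL.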
Hi @Callum Dempsey Leach - welcome to the lake
Shall we get on a quick call so I can answer your questions?