# help
g
I have third-party CSV data that gets a new version published every month. This data needs to be loaded into a database, and lineage needs to be maintained. I'm wondering what the best way to achieve this is. Is this something lakeFS can do?
h
If it's a single CSV, it's probably not worth lakeFS. If it's a set of CSVs, then it's worth it. Monthly, you can pull those CSVs and upload them to lakeFS. Do a commit, and potentially add some metadata to your commit and/or tag the commit.
Then you have a magic time machine that allows you to retrieve those CSVs at different points in time corresponding to your commits.
Note: if it's just a couple of CSV files, you can even use git ...
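A minimal sketch of that monthly workflow using the lakeFS high-level Python SDK (the `lakefs` package); the repository name, branch, file names, and metadata below are made-up placeholders, and exact method names may vary between SDK versions:

```python
# Hypothetical monthly ingestion: upload this month's CSVs, commit, and tag.
# Repo name, branch, file set, and metadata are assumptions for illustration;
# lakeFS connection details are taken from the usual environment/config.
from datetime import date
import lakefs

repo = lakefs.repository("third-party-data")   # assumed repository name
branch = repo.branch("main")

month = date.today().strftime("%Y-%m")
for name in ["customers.csv", "orders.csv"]:   # assumed file set
    with open(f"downloads/{name}", "rb") as f:
        branch.object(f"monthly/{name}").upload(data=f.read())

# Commit with metadata describing the drop, then tag it for easy retrieval.
ref = branch.commit(
    message=f"Monthly third-party drop {month}",
    metadata={"source": "vendor-x", "month": month},
)
repo.tag(f"drop-{month}").create(ref.get_commit().id)
```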
g
Thank you @HT. How does the lakeFS-to-database part work in such a scenario? The CSVs are a set of files that are updated monthly. Some records change and some may not, and the database needs to have complete data for all records (even the historical values). Is there any way to automatically ingest the new or changed records into the database with the new commit ID, or should that be done manually? What is the usual practice for lakeFS -> DB ingestion?
h
Sorry, but I don't really understand your use case here ... Does each database ingestion need access to all versions of the CSVs? With the lakeFS S3 interface, you can do something like this:
s3://a_repo_name/<commit1_hash>/path/my_file.csv : returns my_file.csv as it was when committed at commit1
s3://a_repo_name/<commit2_hash>/path/my_file.csv : returns my_file.csv as it was when committed at commit2
... So at each ingestion, you can list all existing commits and tell your database to fetch all versions of your files. Is this what you want?
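For illustration, a hedged boto3 sketch of reading the same file at two different commits through lakeFS's S3-compatible gateway; the endpoint URL, repository name, commit hashes, and credentials here are placeholders:

```python
# Hypothetical example: read one CSV at two commits via the lakeFS S3 gateway.
# Endpoint, repo name, commit hashes, and keys are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # assumed lakeFS S3 gateway URL
    aws_access_key_id="AKIA...",                # lakeFS access key (placeholder)
    aws_secret_access_key="...",                # lakeFS secret key (placeholder)
)

repo = "a_repo_name"
for ref in ["<commit1_hash>", "<commit2_hash>"]:  # any branch, tag, or commit ID
    obj = s3.get_object(Bucket=repo, Key=f"{ref}/path/my_file.csv")
    csv_bytes = obj["Body"].read()
    print(ref, len(csv_bytes), "bytes")
```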
g
My question was more about loading from lakeFS into a database (e.g. Postgres). Postgres should have all records from all CSVs across all commits (as a complete source of truth for the data). How do I achieve that from lakeFS? When a new commit is made, is there an automated way to ingest the fresh commit into the database, or should it be handled manually (through code, S3 links, etc.)?
h
The ingestion part is done manually via your custom code. lakeFS has webhooks that can trigger your code after each commit.
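As a sketch of what that custom code could look like: a small HTTP receiver that a lakeFS post-commit hook calls, which reads the committed CSV through the S3 gateway and appends the records to Postgres together with the commit ID. The payload field names, CSV path, table schema, endpoint, and credentials below are assumptions, not a definitive implementation:

```python
# Hypothetical webhook receiver for a lakeFS post-commit hook.
# Payload field names, the CSV path, the table schema, the Postgres DSN,
# and the gateway endpoint/credentials are all assumptions for illustration.
import csv
import io

import boto3
import psycopg2
from flask import Flask, request

app = Flask(__name__)

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # assumed lakeFS S3 gateway
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

@app.route("/lakefs/post-commit", methods=["POST"])
def ingest_commit():
    event = request.get_json()
    repo = event["repository_id"]   # assumed payload field names
    commit_id = event["commit_id"]

    # Read this commit's version of the CSV through the S3 gateway.
    obj = s3.get_object(Bucket=repo, Key=f"{commit_id}/monthly/customers.csv")
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    # Append every record with the commit ID, so Postgres keeps full history.
    conn = psycopg2.connect("dbname=warehouse user=ingest")  # placeholder DSN
    with conn, conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO customer_history (record_id, payload, commit_id) "
                "VALUES (%s, %s, %s)",
                (row["id"], str(row), commit_id),
            )
    conn.close()
    return "", 204
```

An actions file under `_lakefs_actions/` in the repository would then point a post-commit webhook at this receiver's URL.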
g
Ok. Thank you.