# help
g
I have third-party CSV data that gets a new version published every month. This data needs to be loaded into a database, and lineage needs to be maintained. I'm wondering what the best way to achieve this is. Is this something lakeFS can do?
h
If it's a single CSV, it's probably not worth lakeFS. If it's a set of CSVs, then it's worth it. Monthly, you can pull those CSVs and upload them to lakeFS. Do a commit, and potentially add some metadata to your commit and/or tag the commit.
Then you have a magic time machine that allows you to retrieve those CSVs at different points in time corresponding to your commits.
Note: if it's just a couple of CSV files, you can even use git ...
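A minimal sketch of that monthly workflow using the lakeFS high-level Python SDK (the `lakefs` package); the repository name, branch, file names, and metadata below are made-up placeholders, and exact method names may vary between SDK versions:

```python
# Hypothetical monthly ingestion: upload this month's CSVs, commit, and tag.
# Repo name, branch, file set, and metadata are assumptions for illustration;
# lakeFS connection details are taken from the usual environment/config.
from datetime import date
import lakefs

repo = lakefs.repository("third-party-data")   # assumed repository name
branch = repo.branch("main")

month = date.today().strftime("%Y-%m")
for name in ["customers.csv", "orders.csv"]:   # assumed file set
    with open(f"downloads/{name}", "rb") as f:
        branch.object(f"monthly/{name}").upload(data=f.read())

# Commit with metadata describing the drop, then tag it for easy retrieval.
ref = branch.commit(
    message=f"Monthly third-party drop {month}",
    metadata={"source": "vendor-x", "month": month},
)
repo.tag(f"drop-{month}").create(ref.get_commit().id)
```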
g
Thank you @HT. How does the lakeFS-to-database part work in such a scenario? The CSVs are a set of files that are updated monthly. Some records change and some may not, and the database needs to have complete data for all records (even the historical values). Is there any way to automatically ingest the new or changed records into the database with the new commit ID, or should that be done manually? What is the usual practice for lakeFS -> DB ingestion?
h
Sorry, but I don't really understand your use case here ... Does each database ingestion need access to all versions of the CSVs? With the lakeFS S3 interface, you can do something like this:
s3://a_repo_name/<commit1_hash>/path/my_file.csv : returns my_file.csv as it was when committed at commit1
s3://a_repo_name/<commit2_hash>/path/my_file.csv : returns my_file.csv as it was when committed at commit2
... So at each ingestion, you can list all existing commits and tell your database to fetch all versions of your files. Is this what you want?
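For illustration, a hedged boto3 sketch of reading the same file at two different commits through lakeFS's S3-compatible gateway; the endpoint URL, repository name, commit hashes, and credentials here are placeholders:

```python
# Hypothetical example: read one CSV at two commits via the lakeFS S3 gateway.
# Endpoint, repo name, commit hashes, and keys are placeholders, not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # assumed lakeFS S3 gateway URL
    aws_access_key_id="AKIA...",                # lakeFS access key (placeholder)
    aws_secret_access_key="...",                # lakeFS secret key (placeholder)
)

repo = "a_repo_name"
for ref in ["<commit1_hash>", "<commit2_hash>"]:  # any branch, tag, or commit ID
    obj = s3.get_object(Bucket=repo, Key=f"{ref}/path/my_file.csv")
    csv_bytes = obj["Body"].read()
    print(ref, len(csv_bytes), "bytes")
```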
g
My question was more about loading from lakeFS into a database (e.g. Postgres). Postgres should have all records from all CSVs across all commits (as a complete source of truth for the data). How do I achieve that from lakeFS? When a new commit is made, is there an automated way to ingest the fresh commit into the database, or should it be handled manually (through code, S3 links, etc.)?
h
The ingestion part is done manually via your custom code. lakeFS has webhooks that can trigger your code after each commit.
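As a sketch of what that custom code could look like: a small HTTP receiver that a lakeFS post-commit hook calls, which reads the committed CSV through the S3 gateway and appends the records to Postgres together with the commit ID. The payload field names, CSV path, table schema, endpoint, and credentials below are assumptions, not a definitive implementation:

```python
# Hypothetical webhook receiver for a lakeFS post-commit hook.
# Payload field names, the CSV path, the table schema, the Postgres DSN,
# and the gateway endpoint/credentials are all assumptions for illustration.
import csv
import io

import boto3
import psycopg2
from flask import Flask, request

app = Flask(__name__)

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # assumed lakeFS S3 gateway
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

@app.route("/lakefs/post-commit", methods=["POST"])
def ingest_commit():
    event = request.get_json()
    repo = event["repository_id"]   # assumed payload field names
    commit_id = event["commit_id"]

    # Read this commit's version of the CSV through the S3 gateway.
    obj = s3.get_object(Bucket=repo, Key=f"{commit_id}/monthly/customers.csv")
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    # Append every record with the commit ID, so Postgres keeps full history.
    conn = psycopg2.connect("dbname=warehouse user=ingest")  # placeholder DSN
    with conn, conn.cursor() as cur:
        for row in rows:
            cur.execute(
                "INSERT INTO customer_history (record_id, payload, commit_id) "
                "VALUES (%s, %s, %s)",
                (row["id"], str(row), commit_id),
            )
    conn.close()
    return "", 204
```

An actions file under `_lakefs_actions/` in the repository would then point a post-commit webhook at this receiver's URL.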
g
Ok. Thank you.