
Aviator

03/14/2023, 8:47 PM
Having read previous posts, I still feel I need clarification on a particular use case: constant data change in an S3 bucket. I have data uploaded to my S3 bucket, and I used lakeFS to import this data into my repository. Connected to PySpark, I performed some data transformations on this data, which I finally committed back to my repository. Now there is a change in the data uploaded to my S3 bucket. How do I read this new data and compare it with the historical data already committed to my repository? Will lakeFS take note of this data change that took place in my S3 bucket?

Barak Amar

03/14/2023, 8:49 PM
The import gives lakeFS a reference to your data as it currently exists. From that point on, working through lakeFS will not change any data in your bucket, and you can read, write, and commit changes on top of it. New data is stored under the repository's configured bucket without modifying the original imported objects.
The original data is supposed to be immutable once imported: lakeFS keeps the metadata it captured at import time while referencing the original objects, so it does not expect them to be modified in place.
Import helps you access existing data, and it covers the use case where new data added to an existing bucket should be referenced and used through lakeFS.
You benefit from accessing the same data without making a copy, while enabling lakeFS capabilities such as commit, diff, and branch (see the sketch below).
Constant data change in an S3 bucket is supported through import: as you keep importing the same dataset, lakeFS identifies the changes (additions and deletions) and reflects them in each import.
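A minimal sketch of the read/transform/commit loop discussed above, assuming a repository named my-repo with a main branch, Spark's S3A filesystem pointed at the lakeFS S3 gateway, and the lakefs_sdk Python package for the commit call; the endpoint, credentials, and paths here are placeholders, not part of the original thread:
```python
from pyspark.sql import SparkSession

import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

# Point Spark's S3A filesystem at the lakeFS S3 gateway
# (placeholder endpoint and credentials).
spark = (
    SparkSession.builder.appName("lakefs-transform")
    .config("spark.hadoop.fs.s3a.endpoint", "https://lakefs.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "<LAKEFS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<LAKEFS_SECRET_ACCESS_KEY>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the imported data through lakeFS: s3a://<repository>/<branch>/<path>.
df = spark.read.parquet("s3a://my-repo/main/imported/dataset/")

# Transform and write the result back to the branch; the original
# imported objects in the S3 bucket are never modified.
df.dropDuplicates().write.mode("overwrite").parquet(
    "s3a://my-repo/main/transformed/dataset/"
)

# Commit the uncommitted changes on the branch via the lakeFS API.
client = LakeFSClient(
    lakefs_sdk.Configuration(
        host="https://lakefs.example.com/api/v1",
        username="<LAKEFS_ACCESS_KEY_ID>",
        password="<LAKEFS_SECRET_ACCESS_KEY>",
    )
)
client.commits_api.commit(
    repository="my-repo",
    branch="main",
    commit_creation=lakefs_sdk.CommitCreation(message="Transform imported dataset"),
)
```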

Aviator

03/14/2023, 8:56 PM
OK, so basically I should run an import regularly whenever there is a change in my storage bucket?

Barak Amar

03/14/2023, 8:57 PM
True, but keep in mind that import references the existing data and will not copy it.
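A sketch of inspecting what a re-import changed, assuming the same dataset was imported again and committed onto main (e.g. via the UI or lakectl import); the repository name, refs, and client configuration below are placeholders, using the lakefs_sdk Python package:
```python
import lakefs_sdk
from lakefs_sdk.client import LakeFSClient

client = LakeFSClient(
    lakefs_sdk.Configuration(
        host="https://lakefs.example.com/api/v1",
        username="<LAKEFS_ACCESS_KEY_ID>",
        password="<LAKEFS_SECRET_ACCESS_KEY>",
    )
)

# Compare the branch head (after the re-import commit) with its first
# parent (the state before the re-import). Objects added to or deleted
# from the bucket show up as "added"/"removed" entries in the diff.
diff = client.refs_api.diff_refs(
    repository="my-repo",
    left_ref="main~1",  # git-style ref expression: first parent of main
    right_ref="main",
)
for entry in diff.results:
    print(entry.type, entry.path)
```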

Aviator

03/14/2023, 9:00 PM
OK, I get it now. Thanks for clarifying!