# help
f
Hi, is there a way to perform zero-copy of objects within lakeFS? (i.e. is the copy_object API zero-copy/metadata-only, or does it make an actual copy?)
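(For reference, a minimal sketch of the copy_object call being asked about, assuming the generated lakefs_client Python SDK; host, credentials, repository and paths are placeholders, and method/model names may differ between SDK versions.)

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Copy within the same repository and branch; as the reply below explains,
# this is an actual copy rather than a zero-copy/metadata-only operation.
client.objects.copy_object(
    repository="your-repo",
    branch="main",
    dest_path="ingested_data/archived/date=2023-01-01/most_likely_csv.csv",
    object_copy_creation=models.ObjectCopyCreation(
        src_path="ingested_data/latest/most_likely_csv.csv",
        src_ref="main",
    ),
)
```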
i
@Florentino Sainz, do you mind sharing the use case? For example, do you want to copy a subset of files from one repository to another?
f
hi Iddo, it's just copying within the same repository and branch. We want to keep something like:
ingested_data/latest/most_likely_csv.csv
ingested_data/archived/date=2023-01-01/most_likely_csv.csv
ingested_data/archived/date=2023-01-02/most_likely_csv.csv
I know it's somehow visible from the lakeFS history as well, but if it's free, having those copies would help with some debug queries on the input CSVs.
g
Hi @Florentino Sainz, lakeFS doesn’t allow this: the garbage collection process assumes there is no zero copy, so running GC with zero-copy data may accidentally delete the origin. There may be a way to achieve this with deprecated commands, but this isn’t recommended.
Not sure it helps, but you can use tagging for this. If, for example, you have a tag v20230101, you can access the archived 2023-01-01 data via lakefs://your-repo/v20230101/ingested_data/latest/most_likely_csv.csv. That way you don’t need to copy the data or run any process for it, and you also benefit from garbage collection removing your archived data according to the GC rules you configure.
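(A minimal sketch of that tag-per-day approach, assuming the generated lakefs_client Python SDK; host, credentials and names are placeholders.)

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Pin today's state of the branch with a tag instead of copying objects into
# an archived/ prefix; the data stays reachable through the tag until your
# garbage collection rules expire it.
client.tags.create_tag(
    repository="your-repo",
    tag_creation=models.TagCreation(id="v20230101", ref="main"),
)

# The "archived" file is then addressable as:
#   lakefs://your-repo/v20230101/ingested_data/latest/most_likely_csv.csv
```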
f
hmm, didn't think of tags. I think it might be too many tags to keep manageable, but maybe we can use it for big sources (and for the small ones just copy). Thanks for that!
a
How many objects are we talking about? How large are they? (This is me being curious about how much we could save here; I don't have a suggestion within the current product, though.)
i
This might be irrelevant, but other users have solved this type of query by saving a metadata file that includes the timestamp of when something changed, i.e. you can store:
ingested_data/latest/most_likely_csv.csv
and
ingested_data/latest/most_likely_csv_dataChanged.txt
Will that help?
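(A minimal sketch of that metadata-file idea, assuming the generated lakefs_client Python SDK; the file name mirrors the example above, and host, credentials and repo/branch names are placeholders.)

```python
from datetime import datetime, timezone

import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Write the "last changed" timestamp to a small local file, then upload it
# next to the data so debug queries can read when the CSV last changed
# instead of keeping dated copies of the file itself.
local_path = "most_likely_csv_dataChanged.txt"
with open(local_path, "w") as f:
    f.write(datetime.now(timezone.utc).isoformat())

with open(local_path, "rb") as f:
    client.objects.upload_object(
        repository="your-repo",
        branch="main",
        path="ingested_data/latest/most_likely_csv_dataChanged.txt",
        content=f,
    )
```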
f
@Ariel Shaqed (Scolnicov) typically up to 6 objects per day, around 10 GB per day (compressed). And then multiple other separate "folders" (i.e. sources) which each contain small files. @Iddo Avneri We already have the "intelligent" file in another layer, but in this phase we just don't process the file; it's what we get from our ingestion system (sometimes even gzips). Also, the big one changes every day. Anyway, no worries about it, I was just looking at the technical options and will figure out a way (I'm not even sure we'll need it, but it's always nice to have them for a while just in case).
👍 1