# help
f
Hi, is there a way to perform zero-copy of objects within lakeFS? (i.e. is the copy_object API zero-copy/metadata-only, or does it make an actual copy?)
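(For reference, a minimal sketch of the copy_object call being asked about, assuming the generated lakefs_client Python SDK; host, credentials, repository and paths are placeholders, and method/model names may differ between SDK versions.)

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Copy within the same repository and branch; as the reply below explains,
# this is an actual copy rather than a zero-copy/metadata-only operation.
client.objects.copy_object(
    repository="your-repo",
    branch="main",
    dest_path="ingested_data/archived/date=2023-01-01/most_likely_csv.csv",
    object_copy_creation=models.ObjectCopyCreation(
        src_path="ingested_data/latest/most_likely_csv.csv",
        src_ref="main",
    ),
)
```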
i
@Florentino Sainz, do you mind sharing the use case? For example, do you want to copy a subset of files from one repository to another?
f
hi Iddo, it's just copying within the same repository and branch. We want to keep something like:
ingested_data/latest/most_likely_csv.csv
ingested_data/archived/date=2023-01-01/most_likely_csv.csv
ingested_data/archived/date=2023-01-02/most_likely_csv.csv
I know it's somehow visible from the lakeFS history as well, but if it's free, having those copies would help with some debug queries on the input CSVs.
g
Hi @Florentino Sainz, lakeFS doesn’t allow this: the garbage collection process assumes there is no zero copy, so running GC with zero-copy data may accidentally delete the origin. There may be a way to achieve this with deprecated commands, but this isn’t recommended.
Not sure it helps, but you can use tagging for this. If, for example, you have a tag v20230101, you can access the archived 2023-01-01 data via lakefs://your-repo/v20230101/ingested_data/latest/most_likely_csv.csv. That way you don’t need to copy the data or run any process for it, and you also benefit from garbage collection removing your archived data according to the GC rules you configure.
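(A minimal sketch of that tag-per-day approach, assuming the generated lakefs_client Python SDK; host, credentials and names are placeholders.)

```python
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Pin today's state of the branch with a tag instead of copying objects into
# an archived/ prefix; the data stays reachable through the tag until your
# garbage collection rules expire it.
client.tags.create_tag(
    repository="your-repo",
    tag_creation=models.TagCreation(id="v20230101", ref="main"),
)

# The "archived" file is then addressable as:
#   lakefs://your-repo/v20230101/ingested_data/latest/most_likely_csv.csv
```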
f
hmm, didn't think of tags. I think it might be too many tags to keep manageable, but maybe we can use it for big sources (and for the small ones just copy). Thanks for that!
a
How many objects are we talking about? How large are they? (This is me being curious about how much we could save here; I don't have a suggestion within the current product, though.)
i
This might be irrelevant, but other users have solved this type of query by saving a metadata file that includes the timestamp of when something changed, i.e. you can store:
ingested_data/latest/most_likely_csv.csv
and
ingested_data/latest/most_likely_csv_dataChanged.txt
Will that help?
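(A minimal sketch of that metadata-file idea, assuming the generated lakefs_client Python SDK; the file name mirrors the example above, and host, credentials and repo/branch names are placeholders.)

```python
from datetime import datetime, timezone

import lakefs_client
from lakefs_client.client import LakeFSClient

configuration = lakefs_client.Configuration(host="https://lakefs.example.com/api/v1")
configuration.username = "AKIAEXAMPLE"      # access key id (placeholder)
configuration.password = "example-secret"   # secret access key (placeholder)
client = LakeFSClient(configuration)

# Write the "last changed" timestamp to a small local file, then upload it
# next to the data so debug queries can read when the CSV last changed
# instead of keeping dated copies of the file itself.
local_path = "most_likely_csv_dataChanged.txt"
with open(local_path, "w") as f:
    f.write(datetime.now(timezone.utc).isoformat())

with open(local_path, "rb") as f:
    client.objects.upload_object(
        repository="your-repo",
        branch="main",
        path="ingested_data/latest/most_likely_csv_dataChanged.txt",
        content=f,
    )
```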
f
@Ariel Shaqed (Scolnicov) typically up to 6 objects per day, around 10 GB per day (compressed). And then multiple other separate "folders" (i.e. sources) which each contain small files. @Iddo Avneri We already have the "intelligent" file in another layer, but in this phase we just don't process the file; it's what we get from our ingestion system (sometimes even gzips). Also, the big one changes every day. Anyway, no worries about it, I was just looking at the technical options and will figure out a way (I'm not even sure we'll need it, but it's always nice to have them for a while just in case).
👍 1