Robin Moffatt

03/01/2023, 10:05 AM
thinking out loud… if the lakeFS UI will show the contents of a CSV file that's been added, could it also do the same for a Parquet file that's been added, using the same DuckDB-powered viewer that's available in the main object view? I get that an actual diff on a changed file wouldn't be possible, but if the file is brand new it might be nice for the user.
same file seen in the object page

Oz Katz

03/01/2023, 10:53 AM
That's a valid point! However... the majority (I believe) of systems that read/write Parquet follow the Hive conventions, which typically care more about partitions. In that case, removing a record translates into dropping a Parquet file in a prefix and adding a differently named one into the same prefix. I agree that the simple case where I only added a file would be easier, but it's not enough to look into the file itself: we'd also need to look for deletion of neighboring files to know whether it's an addition or an update (and actually scan both).
For the same reason, I'm not the biggest fan of the current object-content-diff that we have in place atm. I think it could be a bit misleading for CSVs and other text-based formats as well.
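
To make the partition-rewrite point concrete, here is a minimal sketch of how a diff listing could be grouped by prefix to tell a plain addition from a rewrite. It is not how lakeFS does anything: the `(change_type, path)` tuples, the `classify_prefixes` helper, and the example paths are all hypothetical, and it assumes Hive-style layouts where the parent directory of a file is its partition prefix.

```python
from collections import defaultdict
from posixpath import dirname


def classify_prefixes(changes):
    """Label each partition prefix by the kind of change it saw.

    `changes` is a hypothetical iterable of (change_type, path) tuples,
    where change_type is "added" or "removed". Other types are ignored.
    """
    by_prefix = defaultdict(lambda: {"added": [], "removed": []})
    for change_type, path in changes:
        if change_type not in ("added", "removed"):
            continue
        if not path.endswith(".parquet"):
            continue
        # Assumption: the parent directory is the partition prefix.
        by_prefix[dirname(path)][change_type].append(path)

    labels = {}
    for prefix, files in by_prefix.items():
        if files["added"] and files["removed"]:
            # A file was dropped and a differently named one was added in the
            # same prefix: likely an update, so both files would need scanning.
            labels[prefix] = "rewrite"
        elif files["added"]:
            labels[prefix] = "addition"
        else:
            labels[prefix] = "deletion"
    return labels


changes = [
    ("added", "events/dt=2023-01-01/part-0002.parquet"),
    ("removed", "events/dt=2023-01-01/part-0001.parquet"),
    ("added", "events/dt=2023-01-02/part-0001.parquet"),
]
labels = classify_prefixes(changes)
```

Only the prefixes labelled "addition" would be safe to preview as brand-new data. A "rewrite" prefix would need a record-level comparison of the removed and added files before anything could be shown as new.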