
Alvaro Rodrigo Alonso

01/04/2023, 1:49 PM
Hi everyone, Alvaro here from Gothenburg, Sweden. I am learning lakeFS from @mishraprafful! I just put [this comment](https://github.com/RaRe-Technologies/smart_open/discussions/751) in [smart_open](https://github.com/RaRe-Technologies/smart_open) to add a new lakeFS transport to that project.
:jumping-lakefs: 2

Adi Polak

01/04/2023, 2:08 PM
Very interesting, and welcome aboard, Alvaro! I am completely new to smart_open and curious about file format dependency: is there any? Or can it stream/move any big file?

Ariel Shaqed (Scolnicov)

01/04/2023, 2:19 PM
Hi @Alvaro Rodrigo Alonso! Welcome to lakeFS! smart_open seems like a neat project. Please note that you have a workaround if you need a solution today: you can use the S3 gateway directly. When/if you decide to go for a lakeFS solution, please ask any lakeFS API questions on #help or #dev, and we'll be happy to help!
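The S3 gateway workaround can be sketched roughly as follows, assuming `smart_open` (5.0 or later) and `boto3` are installed. Through the gateway the repository takes the place of the bucket and the key starts with the branch or ref. The helper names, endpoint URL, and credentials are placeholders invented for illustration, not part of either library.

```python
def gateway_uri(repository, ref, path):
    """Build the s3:// URI for the lakeFS S3 gateway.

    The repository acts as the bucket; the ref (branch, tag, or commit)
    is the first component of the key.
    """
    return f"s3://{repository}/{ref}/{path.lstrip('/')}"


def open_via_gateway(repository, ref, path, mode="rb", *,
                     endpoint_url, access_key, secret_key):
    """Open a lakeFS object with smart_open through the S3 gateway.

    endpoint_url is the address of your lakeFS server (placeholder, e.g.
    "https://lakefs.example.com"); the keys are lakeFS credentials.
    """
    # Imported lazily so gateway_uri() works without these packages.
    import boto3
    import smart_open

    # A plain boto3 S3 client, pointed at lakeFS instead of AWS.
    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    return smart_open.open(
        gateway_uri(repository, ref, path),
        mode,
        transport_params={"client": client},
    )
```

For example, `open_via_gateway("my-repo", "main", "data/a.csv", endpoint_url=..., access_key=..., secret_key=...)` would stream `data/a.csv` from branch `main`, with all bytes flowing through the lakeFS server.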

Ankit Srinivas

01/04/2023, 2:47 PM
Welcome to the lakeFS community! :jumping-lakefs:
🙏 1

Alvaro Rodrigo Alonso

01/04/2023, 3:00 PM
@Adi Polak Thanks! From what I have seen in their codebase, there is no file format dependency, and their README says:
“`smart_open` is a Python 3 library for efficient streaming of very large files from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem.” … “`smart_open` is a drop-in replacement for Python’s built-in `open()`: it can do anything `open` can (100% compatible, falls back to native `open` wherever possible), plus lots of nifty extra stuff on top.”
@Ariel Shaqed (Scolnicov) That is a good idea, thanks! What lakeFS functionality would be lost when using the S3 gateway?

Ariel Shaqed (Scolnicov)

01/05/2023, 7:44 AM
There is no functional difference between using the S3 gateway and the native lakeFS API. But there are nonfunctional differences:
• The lakeFS API allows for much more scalable code. The S3 gateway has to transfer all data (S3 protocol traffic is impossible to redirect), so all your data will have to flow through your lakeFS servers. The lakeFS API lets us transfer just metadata through lakeFS if you have credentials to read from underlying storage (e.g. AWS IAM keys): use the statObject lakeFS API to stat the lakeFS object, then read it directly from ObjectStats.physical_address on underlying storage.
• I may be biased, but I find the lakeFS REST API nicer to program against than the S3 API.
• The S3 gateway can be easier for end-users because it is obviously supported by more tools. That's the workaround I proposed >:-) .
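The metadata-only flow described above (stat through lakeFS, then read straight from underlying storage) might look roughly like this. It is a sketch under assumptions: the REST route and Basic authentication reflect my reading of the lakeFS OpenAPI spec for `statObject` and should be checked against your server version; the helper names are made up; only `s3://` physical addresses are handled; and `boto3` is assumed to find AWS credentials on its own.

```python
import base64
import json
import urllib.parse
import urllib.request


def stat_url(api_base, repository, ref, path):
    """URL of the statObject call (assumed route; verify against your lakeFS version).

    api_base is a placeholder such as "https://lakefs.example.com/api/v1".
    """
    repo = urllib.parse.quote(repository, safe="")
    ref_quoted = urllib.parse.quote(ref, safe="")
    query = urllib.parse.urlencode({"path": path})
    return (f"{api_base.rstrip('/')}/repositories/{repo}"
            f"/refs/{ref_quoted}/objects/stat?{query}")


def stat_object(api_base, repository, ref, path, access_key, secret_key):
    """Fetch ObjectStats as a dict. Only metadata passes through lakeFS."""
    token = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    request = urllib.request.Request(
        stat_url(api_base, repository, ref, path),
        headers={"Authorization": f"Basic {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def split_physical_address(physical_address):
    """Split an s3:// physical address into (bucket, key)."""
    parsed = urllib.parse.urlparse(physical_address)
    if parsed.scheme != "s3":
        raise ValueError(f"unsupported storage scheme: {parsed.scheme!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def read_direct(stats):
    """Read the object bytes directly from S3, bypassing the lakeFS servers."""
    import boto3  # imported lazily; needs AWS credentials for the bucket

    bucket, key = split_physical_address(stats["physical_address"])
    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

Usage would be `read_direct(stat_object(api_base, "my-repo", "main", "data/a.csv", key, secret))`. The design point is that lakeFS handles one small JSON response while the object body travels directly between the client and S3.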

Alvaro Rodrigo Alonso

01/05/2023, 8:20 AM
Nice explanation, thanks. I think I understand. I will try to implement the lakeFS transport. I will keep you updated and, if you can, you can let me know whether the implementation makes sense. Thanks in advance!
:lakefs: 1

Ariel Shaqed (Scolnicov)

01/05/2023, 11:11 AM
Sure! I will be happy to advise. Also please feel free to put me as a reviewer on the smart_open PR you open. If GitHub somehow doesn't let you put reviewers (I honestly don't remember) then please post me the PR and I will be happy to comment on it.
Hi! Not sure how the discussion will turn out 😞 But looking at smart_open it seems possible to extend it by registering an external package. (Hope you'll get by without having to do that, of course!)
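As a rough idea of what the external-package route could involve: to my understanding, `smart_open.transport.register_transport` accepts a module (or its import name) that exposes `SCHEME`, `parse_uri`, `open_uri`, and `open`. The skeleton below is a hypothetical stand-alone module, not the code from any actual PR; the `lakefs://repo/ref/key` URI layout, the module name, and the unimplemented `open` are all assumptions.

```python
"""Skeleton of a hypothetical external smart_open transport for lakeFS."""
import urllib.parse

SCHEME = "lakefs"  # assumed URI scheme: lakefs://<repo>/<ref>/<key>


def parse_uri(uri_as_string):
    """Split lakefs://repo/ref/key into its components."""
    parsed = urllib.parse.urlsplit(uri_as_string)
    if parsed.scheme != SCHEME:
        raise ValueError(f"expected a {SCHEME}:// URI, got {uri_as_string!r}")
    ref, _, key = parsed.path.lstrip("/").partition("/")
    return dict(scheme=SCHEME, repo=parsed.netloc, ref=ref, key=key)


def open_uri(uri_as_string, mode, transport_params):
    """Entry point smart_open is expected to call for lakefs:// URIs."""
    parts = parse_uri(uri_as_string)
    return open(parts["repo"], parts["ref"], parts["key"], mode,
                **(transport_params or {}))


def open(repo, ref, key, mode, **kwargs):
    """Return a file-like object for the lakeFS object (not implemented here).

    Shadowing the builtin is deliberate: register_transport looks for an
    attribute named `open` on the module.
    """
    raise NotImplementedError("the actual lakeFS reader/writer goes here")


def register(module_name="lakefs_transport"):
    """Register this module with smart_open.

    module_name is a hypothetical import name for this file. When given a
    string, register_transport appears to skip silently if the import fails.
    """
    import smart_open.transport  # imported lazily; requires smart_open

    smart_open.transport.register_transport(module_name)
```

After calling `register()`, `smart_open.open("lakefs://my-repo/main/data/a.csv")` should dispatch to `open_uri` above.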

Alvaro Rodrigo Alonso

01/09/2023, 8:14 AM
Thank you @Ariel Shaqed (Scolnicov) for your comment in the discussion! I think you made it very clear. I hope we can integrate it into smart_open but, of course, if the maintainer is not OK with it we will have to find other options. I am going on vacation soon, so I will come back to this in February (just wanted to let you know in case I am not responsive to messages).

Ariel Shaqed (Scolnicov)

01/09/2023, 10:19 AM
Have a great time! Let's talk when you get back...
👍 1

Alvaro Rodrigo Alonso

03/13/2023, 12:53 PM
Hi @Ariel Shaqed (Scolnicov) and @mishraprafful, I just opened this PR to add a lakeFS transport to smart_open: https://github.com/RaRe-Technologies/smart_open/pull/764
:gratitude-obrigado: 2
Sorry for the delay, and thanks in advance for checking it out!