
Selva

12/06/2022, 5:20 AM
Hi guys. I have two questions on lakeFS and thought you could help me. 1. How can I use the lakeFS URL in MATLAB and C# to read the content? 2. Is there any GUI to check out a file, edit it (in Notepad), and then check it in? We are used to using SourceTree and would like to know if lakeFS has something similar.

Guy Hardonag

12/06/2022, 5:53 AM
Hi @Selva, I would love to hear more about your use case 😃. Regarding your questions:
1. lakeFS is S3 compatible, so you can use the S3 SDK for MATLAB or C# and read the content via the S3 gateway.
2. There currently isn't a way to edit online; you will need to download, edit, and upload.
For more on lakeFS components, including its S3 API gateway, see the lakeFS architecture overview.
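For illustration, here is a minimal MATLAB sketch of reading a file from lakeFS. It goes through the lakeFS REST API with plain webread rather than through an S3 SDK (a different route than the S3 gateway mentioned above, but it needs no extra toolboxes); the endpoint URL, repository, branch, object path, and keys are all placeholder assumptions:

% Sketch: read a text object from lakeFS over its REST API.
% All names below (endpoint, repo, branch, path, keys) are placeholders.
lakefsUrl = 'http://localhost:8000';
repo      = 'my-repo';
ref       = 'main';                              % branch, tag, or commit id
objPath   = 'data/prep_results/tabledata.txt';

% lakeFS authenticates REST calls with HTTP basic auth (access key / secret key)
opts = weboptions('Username','<ACCESS_KEY_ID>','Password','<SECRET_ACCESS_KEY>');

apiUrl  = sprintf('%s/api/v1/repositories/%s/refs/%s/objects', lakefsUrl, repo, ref);
content = webread(apiUrl, 'path', objPath, opts);   % returns the file body
disp(content)

The same GET can be issued from C# with any HTTP client; only the base URL and the basic-auth header matter.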

Selva

12/06/2022, 6:28 AM
Thanks @Guy Hardonag. I am using Azure; how can I resolve the lakefs:// URL to an az:// URL automatically?
Can I download only the needed files, edit them, and upload? Are you suggesting using the web UI for this? My team is more accustomed to the git way of working, so they wanted some method of accessing and modifying the text files using Notepad, then checking in their changes. Consider this scenario:
1. We have a repo which contains a parameter file.
2. Users check out and modify the parameters in the file, then check in with a commit.
3. They then launch the MATLAB or C# application, which takes the parameter file path as an argument (this MATLAB or C# application creates huge amounts of data).
4. The data is then uploaded to the repository as a commit.

Guy Hardonag

12/06/2022, 6:55 AM
Regarding editing: I suggest using the client that suits you best. If you prefer working with a UI then yes, go to the webUI, download the parameter file, edit it on your local machine, and upload it back via the webUI.
Regarding the MATLAB/C# application: can you please provide an example of how you access data today (just a small code snippet will do)? I will try to provide the lakeFS (or S3 gateway) alternative.
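To make the download → edit → upload → check-in loop concrete, here is a hedged MATLAB sketch that scripts it against the lakeFS REST API instead of the webUI. The endpoint, repository, branch, file path, and keys are placeholders, and it assumes the standard lakeFS API v1 object and commit endpoints:

import matlab.net.URI
import matlab.net.http.RequestMessage
import matlab.net.http.HeaderField
import matlab.net.http.field.ContentTypeField
import matlab.net.http.io.MultipartFormProvider
import matlab.net.http.io.FileProvider

lakefsUrl = 'http://localhost:8000';            % placeholder lakeFS server
repo    = 'my-repo';
branch  = 'main';
objPath = 'params/parameters.txt';              % placeholder parameter file
auth = HeaderField('Authorization', ...
    ['Basic ' matlab.net.base64encode('<ACCESS_KEY>:<SECRET_KEY>')]);

% 1. download the parameter file to a local copy
getUri = URI(sprintf('%s/api/v1/repositories/%s/refs/%s/objects?path=%s', ...
    lakefsUrl, repo, branch, urlencode(objPath)));
req  = RequestMessage('GET', auth);
resp = req.send(getUri);
localFile = fullfile(tempdir, 'parameters.txt');
fid = fopen(localFile, 'w'); fwrite(fid, resp.Body.Data); fclose(fid);

% 2. edit localFile in Notepad (or any editor), then...

% 3. upload the edited file back to the same path on the branch
putUri = URI(sprintf('%s/api/v1/repositories/%s/branches/%s/objects?path=%s', ...
    lakefsUrl, repo, branch, urlencode(objPath)));
req = RequestMessage('POST', auth, ...
    MultipartFormProvider('content', FileProvider(localFile)));
req.send(putUri);

% 4. "check in": commit the staged change on the branch
commitUri = URI(sprintf('%s/api/v1/repositories/%s/branches/%s/commits', ...
    lakefsUrl, repo, branch));
req = RequestMessage('POST', [auth, ContentTypeField('application/json')], ...
    struct('message', 'update parameters'));
req.send(commitUri);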

Selva

12/13/2022, 7:45 AM
Sorry for the delayed reply @Guy Hardonag. First, my repository structure is as follows:
repo
|_code
  |_prep.m
  |_scale.m
  |_split.m
  |_train.m
  |_val.m
|_data
  |_prep_results
    |_<txt files>
  |_scale_results
    |_<txt files>
  |_split_results
    |_<txt files>
  |_train_results
    |_<model files>
My users want the experience of a git GUI for code changes and local debugging. That is: clone the repo to c://repo, make changes to the script, and run the MATLAB script on limited inputs. Once satisfied, commit the script, and our powerful server machine (HPC) runs on the full data (creating millions of files). Git is not useful for this case because I cannot commit millions of files, and I cannot clone them to my local machine later. I need some kind of feature where I clone only the code folder and not the data folder.
DVC looked promising, since it stores only references, but it cannot handle millions of files, so I had to ditch it. Then I came up with a second alternative of breaking my repo into two (which I hate because of the higher possibility of manual errors by users):
repo_git
|_code
  |_prep.m
  |_scale.m
  |_split.m
  |_train.m
  |_val.m
repo_lakefs
|_data
  |_prep_results
  |_scale_results
  |_split_results
  |_train_results
So now my users can clone repo_git alone, make code changes, and then run. Once satisfied, they can commit the code change and our HPC will run on the full data. So the output-writing code in prep.m, which is currently:
Pitch = [0.7;0.8;1;1.25;1.5];
Price = [10.0;13.59;10.50;12.00;16.69];
Stock = [376;502;465;1091;562];
T = table(Pitch,Price,Stock)
writetable(T,'../data/prep_results/tabledata.txt');
should be changed to:
Pitch = [0.7;0.8;1;1.25;1.5];
Price = [10.0;13.59;10.50;12.00;16.69];
Stock = [376;502;465;1091;562];
T = table(Pitch,Price,Stock)
%Todo: create a LakeFS branch with the same name as current git branch
%Todo: Get the azure path of the data folder in this new branch and store in lakefspath variable
txtpath = fullfile(lakefspath,'prep_results','tabledata.txt');
writetable(T,txtpath);
Could you please fill in the TODOs? Also, if you have any other alternatives, please suggest them.
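A hedged sketch that fills in the two TODOs. Everything here (endpoint, repo name, credentials) is a placeholder, and note one deviation from the second TODO: lakeFS stores objects at opaque physical addresses that it manages itself, so rather than resolving an az:// path for the branch, the usual pattern is to write the file locally and push it through the lakeFS API (as in the upload sketch earlier in the thread):

% TODO 1: create a lakeFS branch named after the current git branch
[status, gitBranch] = system('git rev-parse --abbrev-ref HEAD');
assert(status == 0, 'not inside a git repository');
gitBranch = strtrim(gitBranch);

lakefsUrl = 'http://localhost:8000';           % placeholder lakeFS server
repo = 'repo_lakefs';
opts = weboptions('Username','<ACCESS_KEY>','Password','<SECRET_KEY>', ...
                  'MediaType','application/json');
branchUrl = sprintf('%s/api/v1/repositories/%s/branches', lakefsUrl, repo);
try
    webwrite(branchUrl, struct('name', gitBranch, 'source', 'main'), opts);
catch
    % branch likely exists already from an earlier run; reuse it
end

% TODO 2: write locally, then upload into the new branch via the lakeFS API
localTxt = fullfile(tempdir, 'tabledata.txt');
writetable(T, localTxt);
% ...then POST localTxt to
%   <lakefsUrl>/api/v1/repositories/repo_lakefs/branches/<gitBranch>/objects?path=data/prep_results/tabledata.txt
% with a multipart 'content' part (see the upload sketch above), and commit.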

Lynn Rozen

12/13/2022, 1:06 PM
Hi Selva, unfortunately I didn't find a compatible solution for you after my first check. As Guy mentioned, lakeFS is S3 compatible, so we thought you could use an S3 SDK for MATLAB and point it at lakeFS. This is the relevant information I found. However, I didn't find information about how to connect the S3 SDK to lakeFS (that way, all S3 operations would go to lakeFS, and lakeFS would handle them against your underlying storage, Azure in your case). I'll see if we can think of another solution for you.
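For what it's worth, the general pattern with any S3 SDK is: point the client's endpoint URL at the lakeFS server instead of AWS, authenticate with the lakeFS access key/secret, and address objects as bucket = repository, key = branch/path; lakeFS then translates those S3 calls onto the underlying Azure storage. A loudly hedged MATLAB sketch, assuming (unverified) that your MATLAB release's built-in s3:// support honors the standard AWS_ENDPOINT_URL environment variable:

% Assumption: MATLAB's s3:// support honors AWS_ENDPOINT_URL; if your
% release does not, the REST-API sketches earlier in the thread still apply.
setenv('AWS_ACCESS_KEY_ID',     '<lakeFS access key>');
setenv('AWS_SECRET_ACCESS_KEY', '<lakeFS secret key>');
setenv('AWS_DEFAULT_REGION',    'us-east-1');             % lakeFS ignores the region
setenv('AWS_ENDPOINT_URL',      'http://localhost:8000'); % the lakeFS server, not AWS

% bucket = repository, key = <branch>/<path within the repo>
T = readtable('s3://repo_lakefs/main/data/prep_results/tabledata.txt');

Note that no az:// URL appears anywhere: the S3 calls hit lakeFS, and lakeFS handles the Azure storage behind the scenes.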

Selva

12/13/2022, 7:19 PM
Thanks @Lynn Rozen. Could you please elaborate on "you can use an S3 SDK for MATLAB and point it at lakeFS"? Does it mean I create a random Azure blob path, write my MATLAB content to that path, and then use the web UI (or Python API) to ingest that Azure blob content into my repo? Or does it mean I can use the lakeFS path directly in the Azure Blob API?