Hi :wave: My team is currently working on a LakeFS...
# help
p
Hi 👋 My team is currently working on a lakeFS implementation for our pipeline. However, we are struggling when using multiprocessing to download files with boto3 (using lakeFS as the endpoint). The same code works with standard boto3, but as soon as we use boto3 + lakeFS, the content of some files goes missing, gets cut off, or is even appended to other files. Eventually this leads to errors. As soon as we reduce the pool to a single process, everything works well. Is this familiar to anyone? It seems the processes are racing against each other, but then I would expect it to happen with standard boto3 as well (and it does not). For context, this is a very small PoC and we currently run lakeFS on a very simple AWS EC2 instance. It has no load balancer attached; it is just an EC2 instance with the binary installed. We are also only using a "dataset" of 25 files for testing this. Thank you!
n
Hi, could you please share the code snippet you are using to download from lakeFS?
p
Hi Nadav, it is the base boto3 download_file function, where "client" is a boto3.resource initialized to fetch from the lakeFS URL. For context, this is a function inside a @classmethod method. destination_path is a local path and key_string is the path to the file in lakeFS.
client.Bucket(bucket_name).download_file(key_string, destination_path)
And to start the processes we use:
with multiprocessing.Pool(self.njobs) as pool:
    return list(tqdm(pool.imap(self.callback_function, data), total=len(data), desc=desc))
Here self.callback_function is the function that calls class.download_function(*args), which contains the previous snippet.
n
@Pedro Vide Simoes The Python and boto3 versions would also be very helpful in understanding the issue
p
So the boto3 version is 1.34.69 and Python is 3.10.14.
Update: This seems to be related to the boto3 client and how it is used with multiprocessing. I changed some code and for now it seems to be working 👍 By the way, the documentation shows boto3 being used with .client. There should be no problem if, instead of client, we use the high-level "resource" approach. I assume the S3 gateway is also fully functional when requests go through lakeFS?
o
@Pedro Vide Simoes Please feel free to open a documentation bug in GitHub for this
n
@Pedro Vide Simoes if I'm not mistaken both create the same resource so it should be fine
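The thread does not show which code was changed to make the downloads work. The following is only a minimal sketch of one common pattern, assuming the problem came from a single boto3 resource being shared by all worker processes: the boto3 documentation advises against sharing sessions and resources across threads or processes, so here each worker builds its own in a pool initializer. The names make_lakefs_resource, init_worker, download_one and download_all are hypothetical and not taken from the original code, and the endpoint URL in the usage note is a placeholder.

```python
import multiprocessing

# One S3 resource per worker process, filled in by init_worker.
_resource = None


def make_lakefs_resource(endpoint_url):
    """Build a fresh boto3 S3 resource that talks to the given endpoint."""
    import boto3  # third-party dependency, only needed for a real resource

    # A new Session per process avoids reusing state from the parent.
    session = boto3.session.Session()
    return session.resource("s3", endpoint_url=endpoint_url)


def init_worker(factory):
    """Pool initializer: runs once inside every worker process."""
    global _resource
    _resource = factory()


def download_one(item):
    """Download one object with the resource owned by this worker."""
    bucket_name, key_string, destination_path = item
    _resource.Bucket(bucket_name).download_file(key_string, destination_path)
    return destination_path


def download_all(items, njobs, factory):
    """Run the downloads over njobs processes, one resource per process."""
    with multiprocessing.Pool(
        njobs, initializer=init_worker, initargs=(factory,)
    ) as pool:
        return list(pool.imap(download_one, items))
```

It could be called as download_all(items, 4, functools.partial(make_lakefs_resource, "https://lakefs.example.com")), where items is a list of (bucket_name, key_string, destination_path) tuples. The factory is passed in rather than hard-coded so that nothing boto3-related is created in the parent process before the workers start.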