# help
Hey guys, kind of new to the field, sorry if this doesn't make sense. I want to write parquet files with partitioning. I am currently using AWS Data Wrangler (awswrangler); how can I integrate lakeFS with this?
```python
import asyncio
import awswrangler as wr

partitioning = ['cust_id', 'file_name', 'added_year', 'added_month', 'added_date']
loop = asyncio.get_event_loop()
s3_path = f"s3://{AWS_BUCKET_NAME}/parquet_data"
await loop.run_in_executor(None, lambda: wr.s3.to_parquet(
    df=batch.to_pandas(),
    path=s3_path,
    dataset=True,
    max_rows_by_file=MAX_ROWS_PER_FILE,
    use_threads=True,
    partition_cols=partitioning,
    mode='append',
    boto3_session=s3_session,
    filename_prefix=basename_template,
))
```
Like this.
Hi @Nethsara Siyum, welcome to the lake! If I understand correctly, you are using boto3 to perform the actual writes to the blockstore. To integrate with lakeFS you'll need to configure the boto3 client to use the lakeFS S3 gateway endpoint (see this doc). Also, when writing the file, instead of the AWS bucket, you'll need to provide the repository and branch names you wish to write to.
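For illustration, a client configured against lakeFS could look something like this (a sketch; the endpoint and credentials are placeholders for your own lakeFS installation):

```python
import boto3

# Sketch: a boto3 client that talks to lakeFS's S3 gateway instead of AWS.
# The endpoint URL and credentials are placeholders, not real values.
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:8000',        # lakeFS S3 gateway
    aws_access_key_id='<lakeFS access key>',
    aws_secret_access_key='<lakeFS secret key>',
)

# With lakeFS, paths address a repository and branch rather than an AWS bucket:
#   s3://<repository>/<branch>/path/to/object
```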
Hello @Niro, thank you for the reply. Yeah, I tried but failed. This is my path:

```python
s3_path = "s3://customer-data/parquet_data"
```
I created the boto3 client using this:

```python
s3 = boto3.client(
    's3',
    endpoint_url=lakefsEndPoint,
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey,
)
s3_session = boto3.Session(
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey,
)
```

The issue is that I'm getting this error:

```
ERROR:app.config:Error occurred during writing to Parquet file: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
```
Your S3 path seems invalid. What are the repository and branch you are trying to write to in lakeFS? Have you created them already?
Yes, locally.
Ok, so in order to write to the main branch, for example, your S3 path should be:

```
s3://customer-data/main/parquet_data
```
Still the same issue:

```
ERROR:app.config:Error occurred during writing to Parquet file: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
```
I suggest trying to write something directly with the boto client to check that the configuration is indeed correct, and only then trying your code.
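For example, a minimal sanity check could be (a sketch, assuming the repository `customer-data` and branch `main` from this thread already exist):

```python
# Sketch: write one small object through the client configured earlier.
# If this succeeds, the client is talking to lakeFS; if it raises
# InvalidAccessKeyId, the request is still going to AWS.
s3.put_object(
    Bucket='customer-data',        # lakeFS repository name
    Key='main/sanity-check.txt',   # branch name is the first path segment
    Body=b'hello lakeFS',
)
```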
These are the keys:

```python
lakefsEndPoint = 'http://localhost:8000'
lakefsAccessKey = 'AKIAJxxMVQ'
lakefsSecretKey = 'I1V0Mp26GxxtJYUYF4TL'
```
The error you are referencing is an AWS error, not a lakeFS error, which means you are not communicating with lakeFS.
So you mean my boto configuration is wrong, I guess.
My guess is that a client is created from the session you provided to the code, and that client uses the default S3 endpoint. You need to somehow configure the session to use the lakeFS endpoint, or alternatively, if possible, provide an already configured client instead of a session. I can't guarantee the following will work, but you might want to try this option.
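Something like this, for example (a sketch; I'm assuming awswrangler's global `s3_endpoint_url` configuration is available in your version):

```python
import awswrangler as wr
import boto3

# Sketch: override the S3 endpoint for all clients awswrangler creates
# internally, so calls made through the plain Session below reach lakeFS.
wr.config.s3_endpoint_url = 'http://localhost:8000'

s3_session = boto3.Session(
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey,
)

wr.s3.to_parquet(
    df=batch.to_pandas(),
    path='s3://customer-data/main/parquet_data',  # repository/branch/prefix
    dataset=True,
    boto3_session=s3_session,
)
```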
Hey @Niro, thank you so much for your help. I figured out that the session was not configured correctly for lakeFS; now it's fixed. https://stackoverflow.com/questions/78110521/boto3-change-the-endpoint-of-session/78110647#78110647