
Sam Gaw

04/18/2023, 11:23 AM
I'm feeling very stupid not being able to follow the idiot's guide but here I am :) Using Cloudflare's R2 (S3) as the backend, I've created a bucket at:
https://<id>.r2.cloudflarestorage.com/my_bucket
Created a lakefs config file:
$ cat .config/lakefs.yaml
database:
  type: "postgres"
  postgres:
    connection_string: "postgresql://sg@127.0.0.1:5432/lakefs_dev"

auth:
  encrypt:
    secret_key: "a4d333b733f7b63d69fe9065701a4cf8"

blockstore:
  type: s3
  s3:
     force_path_style: true
     endpoint: https://<id>.r2.cloudflarestorage.com
     discover_bucket_region: false
     credentials:
        access_key_id: <key>
        secret_access_key: <secret>
Running
lakefs --config ~/.config/lakefs.yaml run
does what's expected. I create an admin account and attempt to create a new repo with the following settings:
Repository ID: local_dev
Storage Namespace: s3://<id>.r2.cloudflarestorage.com/my_bucket
Default Branch: trunk
That gives me a "failed to create repository: failed to access storage" error.
ERROR  [2023-04-18T12:03:06+01:00]lakeFS/pkg/api/auth_middleware.go:351 pkg/api.userByAuth authenticate error="2 errors occurred:\n\t* built in authenticator: credentials not found\n\t* email authenticator: crypto/bcrypt: hashedSecret too short to be a bcrypted password\n\n" service=api_gateway user=""
ERROR  [2023-04-18T12:04:39+01:00]lakeFS/pkg/block/s3/adapter.go:309 pkg/block/s3.(*Adapter).Get failed to get S3 object bucket <id>.r2.cloudflarestorage.com/my_bucket key dummy  error="InvalidBucketName: The specified bucket name is not valid.\n\tstatus code: 400, request id: , host id: " host="127.0.0.1:8000" method=POST operation=GetObject operation_id=CreateRepository path=/api/v1/repositories request_id=11d96aa3-7c8e-42f2-8f64-bb35429258a9 service_name=rest_api user=sg
WARNING[2023-04-18T12:04:39+01:00]lakeFS/pkg/api/controller.go:1576 pkg/api.(*Controller).CreateRepository Could not access storage namespace error="InvalidBucketName: The specified bucket name is not valid.\n\tstatus code: 400, request id: , host id: " reason=unknown service=api_gateway storage_namespace="s3://<id>.r2.cloudflarestorage.com/my_bucket"
ERROR  [2023-04-18T12:05:20+01:00]lakeFS/pkg/block/s3/adapter.go:275 pkg/block/s3.(*Adapter).streamToS3 bad S3 PutObject response error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>NotImplemented</Code><Message>STREAMING-AWS4-HMAC-SHA256-PAYLOAD not implemented</Message></Error>" host="127.0.0.1:8000" method=POST operation=PutObject operation_id=CreateRepository path=/api/v1/repositories request_id=061e03f2-a2e2-471b-9dfe-b22f98fb0cb8 service_name=rest_api status_code=501 url="https://<id>.r2.cloudflarestorage.com/my_bucket/dummy" user=sg
WARNING[2023-04-18T12:05:20+01:00]lakeFS/pkg/api/controller.go:1576 pkg/api.(*Controller).CreateRepository Could not access storage namespace error="s3 error: <?xml version=\"1.0\" encoding=\"UTF-8\"?><Error><Code>NotImplemented</Code><Message>STREAMING-AWS4-HMAC-SHA256-PAYLOAD not implemented</Message></Error>" reason=unknown service=api_gateway storage_namespace="s3://my_bucket"
ERROR  [2023-04-18T12:05:36+01:00]lakeFS/pkg/block/s3/adapter.go:309 pkg/block/s3.(*Adapter).Get failed to get S3 object bucket <id>.r2.cloudflarestorage.com/my_bucket key dummy  error="InvalidBucketName: The specified bucket name is not valid.\n\tstatus code: 400, request id: , host id: " host="127.0.0.1:8000" method=POST operation=GetObject operation_id=CreateRepository path=/api/v1/repositories request_id=c01cf2d9-5be1-4bff-be29-467f082de1d4 service_name=rest_api user=sg
WARNING[2023-04-18T12:05:36+01:00]lakeFS/pkg/api/controller.go:1576 pkg/api.(*Controller).CreateRepository Could not access storage namespace error="InvalidBucketName: The specified bucket name is not valid.\n\tstatus code: 400, request id: , host id: " reason=unknown service=api_gateway storage_namespace="s3://<id>.r2.cloudflarestorage.com/my_bucket"
If I click the info icon in the create repo modal, it takes me to a dead link: https://docs.lakefs.io/setup/create-repo.html#create-the-repository so I've no idea what format "Storage Namespace" is expecting.

Or Tzabary

04/18/2023, 11:27 AM
hey @Sam Gaw 🙂 let me have a look
I haven’t tested Cloudflare specifically, but I think that for the storage namespace you should drop the https://<id>.r2.cloudflarestorage.com/ prefix, as it’s already configured in the configuration file. can you please retry with s3://my_bucket only? regarding the broken link, sorry to hear that, I opened an issue to address this
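for example, assuming the default local endpoint and the admin credentials you created at setup (placeholders below), the same create-repo call can be made directly against the POST /api/v1/repositories endpoint that shows up in your logs -- an untested sketch:
import requests

# Sketch only: <access-key>/<secret-key> are placeholders for the admin
# credentials created during lakeFS setup. The storage namespace is just
# the bucket, since the R2 endpoint is already set under
# blockstore.s3.endpoint in lakefs.yaml.
resp = requests.post(
    "http://127.0.0.1:8000/api/v1/repositories",
    auth=("<access-key>", "<secret-key>"),
    json={
        "name": "local_dev",
        "storage_namespace": "s3://my_bucket",
        "default_branch": "trunk",
    },
)
print(resp.status_code, resp.text)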

Sam Gaw

04/18/2023, 11:35 AM
That moved things in the right direction. The new error message uncovers:
STREAMING-AWS4-HMAC-SHA256-PAYLOAD not implemented
Seems CF doesn't support the streaming SigV4 signing used for chunked uploads.
The only way around it is to disable payload signing when putting objects.
So atm CF R2 isn't supported.
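For reference, this is roughly what "disable payload signing" looks like client-side with boto3 -- a sketch I'd expect to work given botocore's payload_signing_enabled option, with the same <id>/<key>/<secret> placeholders as the config above:
import boto3
from botocore.config import Config

# Sketch: send a plain (non-streaming) PutObject to R2. SigV4 still signs
# the headers, but the body goes up unsigned, so the
# STREAMING-AWS4-HMAC-SHA256-PAYLOAD encoding that R2 rejects is never used.
s3 = boto3.client(
    "s3",
    endpoint_url="https://<id>.r2.cloudflarestorage.com",
    aws_access_key_id="<key>",
    aws_secret_access_key="<secret>",
    region_name="auto",  # the region name R2 expects
    config=Config(signature_version="s3v4", s3={"payload_signing_enabled": False}),
)
s3.put_object(Bucket="my_bucket", Key="dummy", Body=b"hello")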

Or Tzabary

04/18/2023, 11:40 AM
that’s true. mind sharing a bit about your use case? if that’s something you’re still interested in, you can open an issue in the lakeFS repository to add Cloudflare support, so we can examine it and see whether other community members are looking for the same capability.

Sam Gaw

04/18/2023, 11:42 AM
CF have no egress fees on reading objects, so for training data that's regularly accessed it's the most cost-effective option for us. ~2000 files, largest ~30GB, all updating daily
We're also planning on migrating 180TB from BigQuery to parquet files stored in R2.

Or Tzabary

04/18/2023, 11:47 AM
interesting, thank you for sharing. let me check to see if there’s a workaround for you to try. I wonder if CF supports SigV2 or if we could leverage presigned URLs to make this work

Sam Gaw

04/18/2023, 11:50 AM
Seems they only support SigV4, they just don't support the streaming variant. So SigV4 still needs to be used for presigned URLs.
👍 1
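A presigned PUT should dodge the streaming encoding entirely, though, since the signature lives in the URL's query string rather than the body. A rough boto3 sketch, same placeholders as before:
import boto3
import requests
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://<id>.r2.cloudflarestorage.com",
    aws_access_key_id="<key>",
    aws_secret_access_key="<secret>",
    region_name="auto",
    config=Config(signature_version="s3v4"),
)
# The SigV4 signature is carried in the query string; the upload itself
# is a plain HTTP PUT with no streaming signature on the body.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my_bucket", "Key": "dummy"},
    ExpiresIn=3600,
)
requests.put(url, data=b"hello")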

Or Tzabary

04/18/2023, 11:53 AM
I wonder, have you used lakeFS with GCP before, or were you experimenting with Cloudflare first?

Sam Gaw

04/18/2023, 11:54 AM
We're making the conscious decision to move away from all Google infra. GCP isn't an option.

Or Tzabary

04/18/2023, 11:57 AM
I understand. I wonder whether you were already using lakeFS with GCP and are now trying to migrate it to Cloudflare, or whether you’re testing new technologies as part of onboarding to Cloudflare

Sam Gaw

04/18/2023, 12:00 PM
Haven't used lakefs before. I'm brand new.
Going to try setting up with Wasabi to see if they have better compatibility. CF was just because we already had an existing account.

Or Tzabary

04/18/2023, 12:08 PM
Got it… let me know 🙏

einat.orr

04/18/2023, 12:21 PM
@Sam Gaw if you're choosing storage, how about Min.io? It's OSS with great performance, and fully compatible with lakeFS 🙂.

Sam Gaw

04/18/2023, 12:28 PM
@einat.orr The problem with minio is hosting costs. Rather than juggling multiple systems, I'd like to get the BigQuery data under the same roof, and when you get into the 200TB range the costs become crippling. Colo would be the only option to make that viable long term.
👍 1
So Wasabi works well and has comparable costs -- $6 per TB/month and no egress charges.
:jiggling-lakefs: 2
Although something else I noticed is that lakefs seems to be ignoring the search_path param given to pgx in the postgres connection string. If I try to set the postgres user to a specific schema it barfs
error="setup failed: ERROR: no schema has been selected to create in (SQLSTATE 3F000)"
on startup. If I try to enforce it only from the DB side with ALTER USER, it ignores the search_path set in the session and explicitly tries to create tables in public --
error="setup failed: ERROR: permission denied for schema public (SQLSTATE 42501)"
Appreciate I seem to be QAing a lot of edge cases all at once 🙂

Or Tzabary

04/18/2023, 1:17 PM
no worries 🙂
if I understand correctly, following the configuration you shared, you’d expect lakeFS to use the lakefs_dev schema, but it seems to be trying to use the public schema regardless of the connection string, is that the case?

Sam Gaw

04/18/2023, 1:28 PM
Generally we try to have a monolith DB in Postgres, splitting responsibilities by schemas. To get it running I had to use:
postgresql://lakefs:password@127.0.0.1:5432/lakefs_dev?application_name=lakefs
But after creating a 'datalake' schema and pinning the lakefs user to it via search_path, the config that produced the errors was:
postgresql://lakefs:password@127.0.0.1:5432/world_dev?application_name=lakefs&search_path=datalake
Note: application_name is respected and I can confirm with
SELECT * FROM pg_stat_activity WHERE application_name = 'lakefs';
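For completeness, a quick psycopg2 sketch to check what search_path a session actually ends up with server-side (same placeholder password; note libpq wants runtime params passed via "options" rather than a bare ?search_path=, which is assumed here -- how pgx forwards that param is exactly what's in question):
import psycopg2

# Sketch: connect with the same DSN shape lakeFS uses and ask the server
# which search_path actually took effect. %3D is the percent-encoded "="
# required inside the options value of a libpq URI.
conn = psycopg2.connect(
    "postgresql://lakefs:password@127.0.0.1:5432/world_dev"
    "?application_name=lakefs&options=-csearch_path%3Ddatalake"
)
with conn.cursor() as cur:
    cur.execute("SHOW search_path;")
    print(cur.fetchone())  # expect ('datalake',) if the param is applied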

Or Tzabary

04/18/2023, 1:31 PM
I see, let me check
can you try that without the application_name filter?