# dev
Hi Everyone, I have a question about the database storage sizing guide in the lakeFS documentation. According to the documentation, the storage requirements are about 150 MiB per 100,000 uncommitted writes, which is roughly 1,500 bytes per write. Looking at the code, I see that lakeFS writes the following struct per write:
ent := &Entry{
	Address:      entry.PhysicalAddress,
	AddressType:  addressTypeToProto(entry.AddressType),
	Metadata:     entry.Metadata,
	LastModified: timestamppb.New(entry.CreationDate),
	ETag:         entry.Checksum,
	Size:         entry.Size,
	ContentType:  ContentTypeOrDefault(entry.ContentType),
}
Making a rough calculation taking field limits into account:
Address - per AWS guidelines, does not exceed 1024 bytes
AddressType - int32, 4 bytes
Metadata - AWS limits user metadata to 2 KB, 2048 bytes
LastModified - int64, 8 bytes
ETag - AWS limitation, 1024 bytes
Size - int64, 8 bytes
ContentType - let's use the worst-case scenario, 1024 bytes
Summing this up, we get over 5,000 bytes, which is far from the given estimate, and that is without taking into account other data that is saved, such as the entry key and checksum. Am I missing something here? (Keep in mind these are rough approximations; I'm not trying to do exact math, just get a sense of the size.)
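The worst-case arithmetic above can be sketched as a small Go program (the field sizes are the assumed upper bounds from the message, not values taken from the lakeFS code):

```go
package main

import "fmt"

func main() {
	// Assumed worst-case sizes in bytes, per the AWS limits quoted above.
	worstCase := map[string]int{
		"Address":      1024, // AWS object key length limit
		"AddressType":  4,    // int32
		"Metadata":     2048, // AWS 2 KB user-metadata limit
		"LastModified": 8,    // int64 timestamp
		"ETag":         1024, // assumed upper bound
		"Size":         8,    // int64
		"ContentType":  1024, // assumed worst case
	}
	total := 0
	for _, size := range worstCase {
		total += size
	}
	fmt.Println(total) // 5140
}
```

So the upper-bound sum lands at 5,140 bytes, well over the documented ~1,500 bytes per write.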
I think you may be summing upper bounds, which is good for estimating the worst case but not so good for estimating the average case. Address is typically <<256 bytes (you'd need a 224-byte storage namespace prefix to get there), metadata barely exists, the etag is 32 bytes or so, and the content type typically fits in 32 bytes. (For Graveler SSTables, of course, the column compression will really smash these numbers down!) If a user wants to DoS themselves, I imagine they could set a 1000-byte storage namespace prefix, invent content types, and use huge metadata. So you're right, 1,500 bytes is very little for those users - but I'm not sure we would consider that "typical". Do you think we should clarify that this is an expected number rather than a hard limit?
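Redoing the same sum with the typical values from this reply (these are the reply's rough estimates, not measured figures) shows why ~1,500 bytes per write is plausible once the entry key and per-row overhead are added on top:

```go
package main

import "fmt"

func main() {
	// Assumed typical sizes in bytes, per the estimates in the reply above.
	typical := map[string]int{
		"Address":      256, // usually well under this
		"AddressType":  4,   // int32
		"Metadata":     0,   // "barely exists"
		"LastModified": 8,   // int64 timestamp
		"ETag":         32,  // e.g. a hex digest
		"Size":         8,   // int64
		"ContentType":  32,  // typical content type string
	}
	total := 0
	for _, size := range typical {
		total += size
	}
	fmt.Println(total) // 340
}
```

That leaves roughly 1,000 bytes of headroom within the documented ~1,500-byte estimate for the key, checksum, and storage overhead.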
I was evaluating possible changes to the sizing guide for the transition to a key-value store, and was trying to understand how the numbers were derived. So I guess if we are talking about the average case, there isn't a significant difference between implementations.
👍🏼 3