cweng
10/18/2022, 2:59 PM
Ranjeetha Raja
10/18/2022, 3:35 PM
CC
10/19/2022, 2:18 PM
Adi Polak
Cristian Caloian
10/28/2022, 10:10 AM
<https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar>
<https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.12.178/aws-java-sdk-1.12.178.jar>
<https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.1/hadoop-aws-3.3.1.jar>
<https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.375/aws-java-sdk-bundle-1.11.375.jar>
<https://repo1.maven.org/maven2/io/lakefs/hadoop-lakefs-assembly/0.1.8/hadoop-lakefs-assembly-0.1.8.jar>
<https://repo1.maven.org/maven2/io/lakefs/lakefs-spark-client-301_2.12/0.5.1/lakefs-spark-client-301_2.12-0.5.1.jar>
I am running the following command:
spark-submit --class io.treeverse.clients.GarbageCollector \
--packages org.apache.hadoop:hadoop-aws:3.3.1 \
-c spark.hadoop.lakefs.api.url="https://<my-lakefs-api-url>/api/v1" \
-c spark.hadoop.lakefs.api.access_key="<my-lfs-access-key>" \
-c spark.hadoop.lakefs.api.secret_key="<my-lfs-secret-key>" \
-c spark.hadoop.fs.s3a.access.key="<my-s3-access-key>" \
-c spark.hadoop.fs.s3a.secret.key="<my-s3-secret-key>" \
$SPARK_HOME/jars/lakefs-spark-client-301_2.12-0.5.1.jar \
<my-repo> eu-west-1
and I get the following error
Exception in thread "main" com.google.common.util.concurrent.ExecutionError: java.lang.NoClassDefFoundError: dev/failsafe/function/CheckedSupplier
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at io.treeverse.clients.ApiClient$.get(ApiClient.scala:43)
at io.treeverse.clients.GarbageCollector$.main(GarbageCollector.scala:273)
at io.treeverse.clients.GarbageCollector.main(GarbageCollector.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Sander Hartlage
10/31/2022, 1:11 PM
--allow-empty option? We're ingesting third-party data into lakeFS and making a daily commit; it would be helpful to commit even without changes, so that the history shows there were no new files that day.
Temilola Onaneye
11/08/2022, 3:31 PM
# Reading data
repo = "<my-repo>"
branch = "main"
dataPath = "s3a://<my-repo>/main/bronze/inp_claimsk_lds_5_2020-sample.csv"
# df = spark.read.(dataPath)
df = spark.read.format("csv").load(dataPath, inferSchema=True)
And I get the following error
Py4JJavaError: An error occurred while calling o85.load.
: java.nio.file.AccessDeniedException: smartlens: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: The requested metadata is not found at http://169.254.169.254/latest/meta-data/iam/security-credentials/
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:187)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:375)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:311)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider InstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: The requested metadata is not found at http://169.254.169.254/latest/meta-data/iam/security-credentials/
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:159)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1257)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.runBeforeRequestHandlers(AmazonHttpClient.java:833)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:783)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5227)
at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6189)
at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:6162)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5211)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5173)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1438)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1374)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:376)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
... 30 more
Caused by: com.amazonaws.SdkClientException: The requested metadata is not found at http://169.254.169.254/latest/meta-data/iam/security-credentials/
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:89)
at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:70)
at com.amazonaws.internal.InstanceMetadataServiceResourceFetcher.readResource(InstanceMetadataServiceResourceFetcher.java:75)
at com.amazonaws.internal.EC2ResourceFetcher.readResource(EC2ResourceFetcher.java:66)
at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsEndpoint(InstanceMetadataServiceCredentialsFetcher.java:58)
at com.amazonaws.auth.InstanceMetadataServiceCredentialsFetcher.getCredentialsResponse(InstanceMetadataServiceCredentialsFetcher.java:46)
at com.amazonaws.auth.BaseCredentialsFetcher.fetchCredentials(BaseCredentialsFetcher.java:112)
at com.amazonaws.auth.BaseCredentialsFetcher.getCredentials(BaseCredentialsFetcher.java:68)
at com.amazonaws.auth.InstanceProfileCredentialsProvider.getCredentials(InstanceProfileCredentialsProvider.java:165)
at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:137)
... 48 more
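A minimal sketch of one way to address the error above: the trace shows S3A found no credentials from SimpleAWSCredentialsProvider, the environment variables, or the instance profile, so supplying them explicitly when building the Spark session may help. All endpoints and keys below are placeholders, and if the s3a:// path is meant to go through the lakeFS S3 gateway, the endpoint and keys would be the lakeFS server URL and lakeFS access/secret keys rather than AWS ones.

from pyspark.sql import SparkSession

# Sketch only: placeholder endpoint and keys, not confirmed settings from this thread.
spark = (
    SparkSession.builder
    # For a plain S3 bucket, use AWS credentials and drop the endpoint/path-style lines;
    # for the lakeFS S3 gateway, use the lakeFS server URL and lakeFS access/secret keys.
    .config("spark.hadoop.fs.s3a.endpoint", "https://<my-lakefs-or-s3-endpoint>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<access-key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret-key>")
    .getOrCreate()
)

df = (
    spark.read.format("csv")
    .option("inferSchema", True)
    .load("s3a://<my-repo>/main/bronze/inp_claimsk_lds_5_2020-sample.csv")
)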
Ranjeetha Raja
11/17/2022, 10:16 AM
Dylan Butler
11/21/2022, 5:48 PM
Alessandro Mangone
11/29/2022, 9:45 AM
Alessandro Mangone
11/29/2022, 11:27 AM
The Delta log is an auto-generated sequence of text files used to keep track of transactions on a Delta table sequentially. Writing to one Delta table from multiple lakeFS branches is possible, but note that it would result in conflicts if later attempting to merge one branch into the other. For this reason, production workflows should ideally write to a single lakeFS branch that could then be safely merged into main.
Is this also true when the table is concurrently updated by different processes, but they work on different partitions?
Ideally I would like to create one branch per writer and merge into main when the writer finishes processing the data.
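A minimal sketch of that branch-per-writer flow with the Python LakeFSClient, under assumptions: the lakefs_client package is used, and the repository, branch names, endpoint, and keys are placeholders. This only illustrates the branching workflow; it does not by itself resolve Delta log conflicts between branches.

import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

# Placeholder connection details for illustration only.
configuration = lakefs_client.Configuration()
configuration.host = "https://<my-lakefs-endpoint>"  # may need the /api/v1 suffix depending on client version
configuration.username = "<lakefs-access-key>"
configuration.password = "<lakefs-secret-key>"
client = LakeFSClient(configuration)

# One branch per writer, created from main.
client.branches.create_branch(
    repository="<my-repo>",
    branch_creation=models.BranchCreation(name="writer-1", source="main"),
)

# ... the writer job writes and commits its output on lakefs://<my-repo>/writer-1 ...

# When the writer finishes, merge its branch back into main.
client.refs.merge_into_branch(
    repository="<my-repo>", source_ref="writer-1", destination_branch="main"
)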
Alessandro Mangone
12/02/2022, 10:39 AM
LakeFSClient class.
I've noticed that while the commits API is exposed in the Python client, it is not in the Java counterpart.
Is this something missing, or is there a reason for the CommitsApi not being exposed there?
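For reference, the commit call as exposed by the Python client mentioned above, as a hedged sketch: it assumes the lakefs_client package and a client configured as in the earlier branch-per-writer sketch, with placeholder names.

from lakefs_client import models

# Assumes `client` is a LakeFSClient configured as in the earlier sketch.
client.commits.commit(
    repository="<my-repo>",
    branch="writer-1",
    commit_creation=models.CommitCreation(message="Daily ingest"),
)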
Adi Polak
CC
12/13/2022, 5:16 PM
Viktor Kövesd
12/14/2022, 12:49 PM
Iddo Avneri
12/14/2022, 1:16 PM
Juarez Rudsatz
12/22/2022, 1:01 PM
Quentin Nambot
12/30/2022, 3:23 PM
spark.hadoop.fs.s3a.impl shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.endpoint s3.eu-west-1.amazonaws.com
spark.hadoop.fs.lakefs.secret.key ...
spark.hadoop.fs.lakefs.access.key ...
spark.hadoop.fs.lakefs.endpoint http://...:8000/api/v1
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
and io.lakefs:hadoop-lakefs-assembly:0.1.9
installed on the cluster
And when I try to write any data with Spark I have the following error:
java.io.IOException: get object metadata using underlying wrapped s3 client
In the stacktrace I also see:
Caused by: java.lang.NoSuchMethodException: com.databricks.sql.acl.fs.CredentialScopeFileSystem.getWrappedFs()
I am wondering if Databricks changed something 🤔
(Note that using Spark locally, or the Python client on Databricks, I can upload objects, so it seems really related to Spark on Databricks.)
Quentin Nambot
01/02/2023, 10:06 AM
Miguel Rodríguez
01/04/2023, 10:19 PM
Missing Credential Scope error, which I guess comes from lakeFS not being able to authenticate to the storage account where the files actually are.
The ingest command worked and could authenticate because I set the Azure Storage Account key in the AZURE_STORAGE_ACCESS_KEY environment variable locally, but how can I make lakeFS authenticate later to read the files when I need them?
CC
01/11/2023, 9:36 PM
Adi Polak
main_repo_path = "lakefs://adi-test/main/"
This is the exception I get:
Py4JJavaError: An error occurred while calling o625.load.
: java.io.IOException: statObject
at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:731)
at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:43)
at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1777)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:59)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:407)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:369)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:325)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:238)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
at py4j.Gateway.invoke(Gateway.java:306)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
at java.lang.Thread.run(Thread.java:750)
Caused by: io.lakefs.shaded.api.ApiException: Content type "text/html; charset=utf-8" is not supported for type: class io.lakefs.shaded.api.model.ObjectStats
at io.lakefs.shaded.api.ApiClient.deserialize(ApiClient.java:822)
at io.lakefs.shaded.api.ApiClient.handleResponse(ApiClient.java:1020)
at io.lakefs.shaded.api.ApiClient.execute(ApiClient.java:944)
at io.lakefs.shaded.api.ObjectsApi.statObjectWithHttpInfo(ObjectsApi.java:1115)
at io.lakefs.shaded.api.ObjectsApi.statObject(ObjectsApi.java:1089)
at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:727)
I configured the Spark cluster with the same credentials and endpoint as the programmable lakeFS client. Any idea what is missing?
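One hedged observation, not a confirmed diagnosis: the ApiException shows statObject received an HTML page instead of JSON, which commonly happens when fs.lakefs.endpoint points at the lakeFS UI rather than the API path ending in /api/v1. A minimal sketch of the relevant settings, with placeholder host and keys:

from pyspark.sql import SparkSession

# Sketch only: the detail that matters here is the /api/v1 suffix on
# fs.lakefs.endpoint, so the Hadoop client talks to the API and gets JSON back.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://<my-lakefs-host>/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

main_repo_path = "lakefs://adi-test/main/"
df = spark.read.load(main_repo_path)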
Zdenek Hruby
01/16/2023, 4:15 PM
Py4JJavaError: An error occurred while calling o407.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
I used the Hadoop fs settings as mentioned in the docs:
spark.hadoop.fs.lakefs.impl io.lakefs.LakeFSFileSystem
Please 🙏, any idea how to deal with that?
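A hedged sketch of one common remedy: "Class io.lakefs.LakeFSFileSystem not found" usually means the lakeFS Hadoop filesystem assembly (the io.lakefs:hadoop-lakefs-assembly artifact mentioned earlier in the thread) is not on the classpath, so fs.lakefs.impl has nothing to load. Pulling it in via spark.jars.packages when the session is first created is one option; the version, endpoint, and keys below are placeholders.

from pyspark.sql import SparkSession

# Sketch only: spark.jars.packages takes effect when the SparkContext is first
# created, so set it before any session exists (or install the jar on the cluster).
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.lakefs:hadoop-lakefs-assembly:0.1.9")
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://<my-lakefs-host>/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "<lakefs-access-key>")
    .config("spark.hadoop.fs.lakefs.secret.key", "<lakefs-secret-key>")
    .getOrCreate()
)

df = spark.read.parquet("lakefs://<my-repo>/main/<path-to-data>")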
Miguel Rodríguez
01/18/2023, 3:44 PM
Ohad GR
01/22/2023, 12:55 PM
Robin Moffatt
01/25/2023, 5:39 PM
make test. I've logged an issue here - LMK if I can add any more details to help 🙂
Jonas
01/27/2023, 1:56 PM
Ants Young
01/29/2023, 2:31 AM
Ohad GR
01/30/2023, 11:11 AM
Cristian Caloian
01/30/2023, 2:08 PM
creation_date and status 201). I am using the same id and policy statement. This happens when using the lakectl api 0.55.0, the Golang api v0.89.0, and curl. I expected to get a 409 status code, as described in the API. Is this intended behaviour? Thanks!