# help
v
Hi Team, I want to use lakeFS in debugger mode. Can you please tell me how to set one up? I want to debug the Hadoop filesystem.
l
Hi @Vaibhav Kumar, you want to debug the Hadoop filesystem code? Run the lakeFS server with the debugger?
v
yes @Lynn Rozen
l
Yes on both?(:
v
Actually I am working on this issue: https://github.com/treeverse/lakeFS/issues/2801. So I need to put a debugger in place, and I am not sure yet whether I need both or not.
l
There are different ways to debug the HadoopFS client. For example, I'm using a local Spark job with IntelliJ. You can read more here: https://sparkbyexamples.com/spark/how-to-debug-spark-application-locally-or-remote/
v
I checked it, but that is the way to put a debugger on the Spark app. If we want to put a debugger on the HadoopFS client, don't you think we need to run lakeFS in some mode (I'm thinking binary mode)? When we call spark.read.parquet("lakefs://repo-that-doesnt-exist/main/path/to/data"), for control to reach the HadoopFS client, lakeFS should be running somewhere, and that is where I will have to attach the debugger. Does that make sense?
y
Hey @Vaibhav Kumar, if I understand correctly, you would like to run the lakeFS Golang server in debug mode. That is, while using/developing the Hadoop FileSystem, you want to be able to put breakpoints in the server-side code. Is that correct?
v
Yes, correct @Yoni Augarten. As I am working on this issue, I feel there are two parts to it: 1. spark.read.parquet("lakefs://repo-that-doesnt-exist/main/path/to/data") - I know I can put a debugger on the Spark app side. 2. The HadoopFS / server side - I am not sure how to run it in debugger mode, and how the call will land there.
y
Great! I'm sorry that we don't have an up-to-date guide on this one, but here is the gist of it:
1. Clone the lakeFS project.
2. Make sure you have Golang installed.
3. Run make all.
4. Import the project into your IDE.
5. Run cmd/lakefs/cmd/main.go with the run --local-settings arguments (you will later change these to connect your installation to the storage).
6. Come back here and let me know how it works 🙂
v
Sure, thanks.
During make all I got the below error:
go: downloading golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1
go: downloading golang.org/x/sys v0.0.0-20210510120138-977fb7262007
Makefile:103: *** "Missing dependency - no docker in PATH". Stop.
y
Yep - so you also need Docker for the tests. For now maybe try make gen instead - but you may still need Docker.
v
@Yoni Augarten I was able to run the lakeFS server using go run main.go run --local-settings. Since main.go is a Go file and our HadoopFS client is in Java, how do I link these things together? Eventually I have to put breakpoints in the HadoopFS client code.
y
Great progress! So the client is a little trickier to debug. The simplest way is to:
1. Import clients/hadoopfs as a separate IDE project.
2. Write a main method to call the code that you want to test, setting the fs.lakefs.* configuration to point to your local instance of lakeFS (a minimal sketch follows below).
This is simpler than running a Spark program and debugging it - although that's also possible. Let me know if that makes sense.
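(For reference, a minimal sketch of such a main method - the class name is hypothetical, the value are placeholders, and only fs.lakefs.impl is shown; the remaining fs.lakefs.* keys come up later in this thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LakeFSFSDebugMain {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the lakeFS FileSystem for lakefs:// URIs; the remaining fs.lakefs.*
        // properties (endpoint, access.key, secret.key) point it at the local lakeFS instance.
        conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem");

        Path p = new Path("lakefs://my-repo/main/1.txt");  // placeholder repo/branch/object
        FileSystem fs = FileSystem.get(p.toUri(), conf);   // resolves to LakeFSFileSystem for lakefs:// URIs
        FileStatus status = fs.getFileStatus(p);           // breakpoints inside the client code are hit from here
        System.out.println(status);
    }
}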
v
The way you suggested seems better to me as well; using Spark complicates things. Now, reading the trace below from the issue, I see that getFileStatus (line 749) is the first function the call goes to. This function expects a path as a param, so I assume this is the same path we pass from spark.read.parquet("lakefs://repo-that-doesnt-exist/main/path/to/data"). Kindly confirm if my observation is correct. Trace from issue 2801:
java.io.IOException: listObjects
at io.lakefs.LakeFSFileSystem$ListingIterator.readNextChunk(LakeFSFileSystem.java:901)
	at io.lakefs.LakeFSFileSystem$ListingIterator.hasNext(LakeFSFileSystem.java:881)
	at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:707)
	at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:40)
	at org.apache.hadoop.fs.FileSystem.isDirectory(FileSystem.java:1439)
y
You are correct: you would use lakefs:// paths when calling getFileStatus on the LakeFSFileSystem.
v
Thank you for confirming. I will start my work from here and will let you know if I see any issues.
y
Sure thing, we're lucky to have you as a contributor! 🙂 Let me know if there's anything else.
v
To start using the HadoopFS client as a JAR in another project, I tried building the JAR from this POM. I am getting this weird error; it seems like something is wrong with the Java version. Do you know what is causing it?
Caused by: java.lang.IllegalArgumentException: Unsupported class file major version 63
        at net.bytebuddy.jar.asm.ClassReader.<init>(ClassReader.java:196)
        at net.bytebuddy.jar.asm.ClassReader.<init>(ClassReader.java:177)
        at net.bytebuddy.jar.asm.ClassReader.<init>(ClassReader.java:163)
        at net.bytebuddy.utility.OpenedClassReader.of(OpenedClassReader.java:86)
        at net.bytebuddy.dynamic.scaffold.TypeWriter$Default$ForInlining.create(TypeWriter.java:3889)
        at net.bytebuddy.dynamic.scaffold.TypeWriter$Default.make(TypeWriter.java:2166)
        at net.bytebuddy.dynamic.scaffold.inline.RedefinitionDynamicTypeBuilder.make(RedefinitionDynamicTypeBuilder.java:224)
        at net.bytebuddy.dynamic.scaffold.inline.AbstractInliningDynamicTypeBuilder.make(AbstractInliningDynamicTypeBuilder.java:123)
        at net.bytebuddy.dynamic.DynamicType$Builder$AbstractBase.make(DynamicType.java:3659)
        at org.mockito.internal.creation.bytebuddy.InlineBytecodeGenerator.transform(InlineBytecodeGenerator.java:391)
        at java.instrument/java.lang.instrument.ClassFileTransformer.transform(ClassFileTransformer.java:244)
        at java.instrument/sun.instrument.TransformerManager.transform(TransformerManager.java:188)
        at java.instrument/sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:541)
        at java.instrument/sun.instrument.InstrumentationImpl.retransformClasses0(Native Method)
        at java.instrument/sun.instrument.InstrumentationImpl.retransformClasses(InstrumentationImpl.java:169)
        at org.mockito.internal.creation.bytebuddy.InlineBytecodeGenerator.triggerRetransformation(InlineBytecodeGenerator.java:276)
        ... 46 more

Running io.lakefs.FSConfigurationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec

Results :

Tests in error: 
  testGetFileStatus_ExistingFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_ExistsAsDirectoryInSecondList(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_NotExistsNoPrefix(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingDirToExistingFileName(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch120(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch123(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testCreateExistingDirectory(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_ExistsAsDirectoryContents(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_srcEqualsDst(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingDirToNonExistingDirWithParent(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  getUri(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testOpen(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_FileNotExists(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingDirToExistingNonEmptyDirName(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingFileToExistingDirName(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testListStatusDirectory(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_NotExistsPrefixWithNoSlashTwoLists(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_nonExistingSrcFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_NotExistsPrefixWithNoSlash(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testListStatusNotFound(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_srcAndDstOnDifferentBranch(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_NotExistsRecursive(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_FileExists(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_EmptyDirectoryExists(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_ExistsAsObject(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch1(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch2(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch3(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDeleteDirectoryRecursiveBatch5(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testGetFileStatus_NoFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testCreateExistingFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingDirToNonExistingDirWithoutParent(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testListStatusFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testAppend(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testCreate(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_fallbackStageAPI(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testOpen_NotExists(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testMkdirs(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testGetFileStatus_DirectoryMarker(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testGlobStatus_SingleFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingFileToExistingFileName(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_DirectoryWithFile(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testDelete_DirectoryWithFileRecursive(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testExists_ExistsAsDirectoryMarker(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testRename_existingFileToNonExistingDst(io.lakefs.LakeFSFileSystemSimpleModeTest): (..)
  testGetFileStatus_ExistingFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_ExistsAsDirectoryInSecondList(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_NotExistsNoPrefix(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingDirToExistingFileName(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch120(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch123(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testCreateExistingDirectory(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_ExistsAsDirectoryContents(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_srcEqualsDst(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingDirToNonExistingDirWithParent(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  getUri(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testOpen(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_FileNotExists(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingDirToExistingNonEmptyDirName(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingFileToExistingDirName(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testListStatusDirectory(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_NotExistsPrefixWithNoSlashTwoLists(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_nonExistingSrcFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_NotExistsPrefixWithNoSlash(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testListStatusNotFound(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_srcAndDstOnDifferentBranch(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_NotExistsRecursive(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_FileExists(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_EmptyDirectoryExists(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_ExistsAsObject(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch1(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch2(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch3(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDeleteDirectoryRecursiveBatch5(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testGetFileStatus_NoFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testCreateExistingFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingDirToNonExistingDirWithoutParent(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testListStatusFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testAppend(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testCreate(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_fallbackStageAPI(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testOpen_NotExists(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testMkdirs(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testGetFileStatus_DirectoryMarker(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testGlobStatus_SingleFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingFileToExistingFileName(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_DirectoryWithFile(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testDelete_DirectoryWithFileRecursive(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testExists_ExistsAsDirectoryMarker(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)
  testRename_existingFileToNonExistingDst(io.lakefs.LakeFSFileSystemPresignedModeTest): (..)

Tests run: 105, Failures: 0, Errors: 90, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  05:05 min
[INFO] Finished at: 2023-04-02T00:19:55+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project hadoop-lakefs: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/simar/lakeFS/clients/hadoopfs/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
y
@Vaibhav Kumar, the FileSystem is compiled and tested with Java 8, since it needs to be compatible with legacy Hadoop versions. You should have Java 8 installed, and set your JAVA_HOME to it when running the tests.
v
Ok sure
I tried building it with Java 8. The build still failed, but with a different error:
[INFO] Building jar: /Users/simar/lakeFS/clients/hadoopfs/target/hadoop-lakefs-0.1.0.jar
[INFO] 
[INFO] --- gpg:1.5:sign (sign-artifacts) @ hadoop-lakefs ---
Downloading from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.15/plexus-utils-3.0.15.pom
Downloaded from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.15/plexus-utils-3.0.15.pom (3.1 kB at 51 kB/s)
Downloading from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.15/plexus-utils-3.0.15.jar
Downloaded from central: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-utils/3.0.15/plexus-utils-3.0.15.jar (239 kB at 4.1 MB/s)
/bin/sh: gpg: command not found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:15 min
[INFO] Finished at: 2023-04-02T20:55:03+05:30
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-gpg-plugin:1.5:sign (sign-artifacts) on project hadoop-lakefs: Exit code: 127 -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
y
There is a signing step there that you should skip by adding -P'!treeverse-signing' to the Maven command.
v
Done.
I tried importing the JAR into my test project, but I see some builder function in LakeFSFileStatus and I am not sure how to use it.
y
What are you trying to achieve?
v
As we discussed above, I am trying to call getFileStatus and pass it the lakeFS file path to read.
y
getFileStatus is a method on the LakeFSFileSystem. So you want to have an instance of the LakeFSFileSystem.
FileSystem.get is your friend here
In pseudo code, you want to do something like:
Path p = new Path("lakefs://my-repo/main/1.txt");
LakeFSFileSystem fs = FileSystem.get(hadoopConf, p);
fs.getFileStatus(p);
You don't necessarily need a Spark session around
v
My bad, I was calling the wrong file above. 😶 Apart from this, I should just have my lakeFS server running like this, right: go run main.go run --local-settings?
y
Yes. Take a look at the LakeFSFileSystem to learn how to connect it to lakeFS
v
I am facing some issue while creating the lakeFS client. I had expected spark.sparkContext.hadoopConfiguration.set to set up the lakeFS client, but it seems the syntax is not working. In the doc it was mentioned to use it the below way:
def main(args : Array[String]) {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExample")
    .getOrCreate();
  val path = new Path("<S3a://test-repo/main/sample.json>")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "AKIAJLDV6JMK2R5TRQSQ")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "oOal7VcsJQQnGoPcM9AEYXCe1Q76PHMpX4R1+Ai+")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", "http:localhost:8000")
  spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

  val y = new LakeFSFileSystem()
  y.getFileStatus(path)
Error
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "io.lakefs.LakeFSClient.getObjectsApi()" because "this.lfsClient" is null
y
@Vaibhav Kumar, please read about the distinction between using the S3-compatible API and using the LakeFSFileSystem. Your configuration is for the former, while it should be for the latter. Also, you shouldn't simply instantiate a FileSystem instance; you should use FileSystem.get like I mentioned above.
v
Yes, I used the s3 one because I have exhausted the AWS account free credits. Please confirm if this test will work on local storage as well.
y
LakeFSFileSystem will only work with S3 as the backing storage.
Working with the fs.s3a configuration will not use LakeFSFileSystem, so it's not relevant for solving this issue.
You can try to use MinIO or another S3-compatible storage instead of S3.
Again, please read the attached doc to understand how LakeFSFileSystem works.
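(A sketch of that distinction in Hadoop configuration terms, assuming a local lakeFS at localhost:8000 backed by an S3-compatible store such as MinIO; the key names are the ones used elsewhere in this thread, the values and host/port are placeholders.)

import org.apache.hadoop.conf.Configuration;

final class LakeFSConfSketch {
    static Configuration lakeFSConf() {
        Configuration conf = new Configuration();

        // fs.lakefs.* configures the LakeFSFileSystem itself: the handler for lakefs:// URIs,
        // the lakeFS API endpoint, and the lakeFS credentials.
        conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem");
        conf.set("fs.lakefs.endpoint", "http://localhost:8000/api/v1");
        conf.set("fs.lakefs.access.key", "<lakefs-access-key-id>");      // placeholder
        conf.set("fs.lakefs.secret.key", "<lakefs-secret-access-key>");  // placeholder

        // fs.s3a.* configures the underlying S3-compatible store that lakeFSFS
        // reads and writes the actual data through (e.g. MinIO).
        conf.set("fs.s3a.endpoint", "http://<minio-host>:<s3-api-port>"); // the S3 API port, not the console port
        conf.set("fs.s3a.access.key", "<store-access-key>");              // placeholder
        conf.set("fs.s3a.secret.key", "<store-secret-key>");              // placeholder
        conf.set("fs.s3a.path.style.access", "true");

        return conf;
    }
}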
v
Yes, I have gone through the doc; I was just confirming it with you 🙂
Path p = new Path("lakefs://my-repo/main/1.txt");
LakeFSFileSystem fs = FileSystem.get(hadoopConf, p);
fs.getFileStatus(p);
In the above function, shall I pass hadoopConf as a map of fs.* params?
v
I have tried the implementation the below way. Now I just need to make it work with the lakefs:// prefix instead of s3a://. Please confirm if I have used it the correct way. One observation: FileSystem.get works with the URI option, not directly with the path variable.
def main(args : Array[String]) {
  val conf = new Configuration()
  conf.set("fs.s3a.access.key", "AKIAJLDV6JMK2R5TRQSQ")
  conf.set("fs.s3a.secret.key", "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")
  conf.set("fs.s3a.endpoint", "http://localhost:8000")
  conf.set("fs.s3a.path.style.access", "true")
  val uri = "s3a://test-repo/main/sample.json"
  val path = new Path("s3a://test-repo/main/sample1.json")
  val fs = FileSystem.get(URI.create(uri), conf)
  fs.getFileStatus(path)
e
Hi @Vaibhav Kumar, yes, that looks about right. You'll need to use a lakefs:// URI, as you mentioned.
v
Ok, and the rest looks good, right - the way I have used path and URI? This is in reference to solving issue 2801, as mentioned above.
e
Yes, looks about right
v
One more thing: I have used fs.getFileStatus(path) from Hadoop's FileSystem above, but according to the issue's stack trace the function call should come from LakeFSFileSystem?
e
The lakefs:// URI tells the FileSystem instance to use lakeFSFS - that's why you need the lakefs:// URI.
v
After changing it to lakefs://, it doesn't look like it is referring to the HadoopFS client from lakeFS. The trace below still shows Hadoop's FileSystem error.
0    [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "lakefs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.example.SparkSessionTest$.main(SparkSessionTest.scala:27)
at org.example.SparkSessionTest.main(SparkSessionTest.scala)
e
Did you set fs.lakefs.impl? That's what registers a handler for lakefs:// URIs.
v
I have set that. I am using MinIO as storage. Shall I now read the file from S3 or lakefs in the uri and path? Below is my code:
def main(args : Array[String]) {
  val conf = new Configuration()
  conf.set("fs.s3a.access.key", "0yfZnzCeJdB9Y2i1")
  conf.set("fs.s3a.secret.key", "uhYMtk6s97qLKs8jnJhrIMLfBs3uGkv6")
  conf.set("fs.lakefs.access.key", "AKIAJLDV6JMK2R5TRQSQ")
  conf.set("fs.lakefs.secret.key", "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")
  conf.set("fs.lakefs.endpoint", "http://localhost:8000/api/v1")
  conf.set("fs.s3a.endpoint", "http://127.0.0.1:9090")
  conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
  val uri = "lakefs://s3bucket/sample1.json"
  val path = new Path("lakefs://s3bucket/sample1.json")
  val fs = FileSystem.get(URI.create(uri), conf)
  fs.getFileStatus(path)
Error
805  [main] WARN  org.apache.hadoop.fs.FileSystem  - Failed to initialize fileystem lakefs://test-repo/main/sample1.json: java.lang.RuntimeException: lakeFS blockstore type local unsupported by this FileSystem
Exception in thread "main" java.lang.RuntimeException: lakeFS blockstore type local unsupported by this FileSystem
	at io.lakefs.storage.PhysicalAddressTranslator.translate(PhysicalAddressTranslator.java:29)
	at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:153)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:110)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.example.SparkSessionTest$.main(SparkSessionTest.scala:29)
	at org.example.SparkSessionTest.main(SparkSessionTest.scala)
a
Sorry to hear it's not working. I suspect we still don't have a correct configuration. According to the error message your lakeFS blockstore type is "local". If you configured lakeFS to use an s3 compatible blockstore, that type should be "s3". I can offer 2 options... 1. If you can send your lakeFS configuration I can take a look. But please make sure to remove all secret information before sending! 2. This old guide explains how to configure lakeFS to work with minio. I went over it and it seems still relevant. (Supporting MinIO in production is a roadmap item; afair mostly we're missing continuous integration).
v
I have lakeFS running on Docker and the storage namespace was local. Is this what you mean by incorrectly configured?
e
lakeFSFS only works with S3-compatible block store adaptors
v
Ok, so what shall I change in the lakeFS config before restarting the container?
e
Please take a look at Ariel's last message - those are the options. If you'd like to work locally, you'll need to set up MinIO.
a
If you enjoy using Docker, another easy way to spin something up is to load the Everything Bagel docker-compose. That's a docker-compose file that's in the lakeFS repo that brings up lakeFS with MinIO and a bunch of popular data science tools. You might bring it up, then go over the Docker compose file and get rid of all containers except lakefs, minio, and minio-setup. It's probably a question of what's most convenient for you, mostly in terms of how you like to build and develop.
v
Thanks @Ariel Shaqed (Scolnicov) @Elad Lachmi, this was helpful. I am, however, facing an issue uploading a file to lakeFS. The 'choose file' button seems not to be working.
a
That's odd! Can you try to enter a destination path? Does it still stay grayed out? Or, can you try to use "lakectl fs upload"? I'll only be able to take a look tomorrow morning, sorry.
v
@Ariel Shaqed (Scolnicov) it seems like something is wrong with my system; I am facing this in other apps as well. Never mind - the file has been uploaded to the lakeFS file system backed by MinIO. Now when I run it I am getting the below error. Code:
object SparkSessionTest {
  def main(args : Array[String]) {
    val conf = new Configuration()
        conf.set("fs.s3a.access.key", "minioadmin")
        conf.set("fs.s3a.secret.key", "minioadmin")
        conf.set("fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")
        conf.set("fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
        conf.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>")
        conf.set("fs.s3a.endpoint", "<http://127.0.0.1:9090>")
        conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")

    val uri = "<lakefs://example/main/sample1.json>"
    val path = new Path("<lakefs://example/main/sample1.json>")
    val  fs =  FileSystem.get(URI.create(uri), conf)
    fs.getFileStatus(path)
 }}
Error
0    [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
682  [main] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
754  [main] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - Scheduled Metric snapshot period at 10 second(s).
754  [main] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - s3a-file-system metrics system started
1200 [shutdown-hook-0] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - Stopping s3a-file-system metrics system...
1200 [shutdown-hook-0] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - s3a-file-system metrics system stopped.
1200 [shutdown-hook-0] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - s3a-file-system metrics system shutdown complete.
1201 [Thread-2] WARN  org.apache.hadoop.util.ShutdownHookManager  - ShutdownHook 'ClientFinalizer' failed, java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'void org.apache.hadoop.fs.statistics.IOStatisticsLogging.logIOStatisticsAtLevel(org.slf4j.Logger, java.lang.String, java.lang.Object)'
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'void org.apache.hadoop.fs.statistics.IOStatisticsLogging.logIOStatisticsAtLevel(org.slf4j.Logger, java.lang.String, java.lang.Object)'
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:124)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
Caused by: java.lang.NoSuchMethodError: 'void org.apache.hadoop.fs.statistics.IOStatisticsLogging.logIOStatisticsAtLevel(org.slf4j.Logger, java.lang.String, java.lang.Object)'
	at org.apache.hadoop.fs.s3a.S3AFileSystem.close(S3AFileSystem.java:3963)
	at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:3678)
	at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:3695)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1623)
a
Glad to hear things are starting to improve! These lines
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'void org.apache.hadoop.fs.statistics.IOStatisticsLogging.logIOStatisticsAtLevel(org.slf4j.Logger, java.lang.String, java.lang.Object)'
java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: 'void org.apache.hadoop.fs.statistics.IOStatisticsLogging.logIOStatisticsAtLevel(org.slf4j.Logger, java.lang.String, java.lang.Object)'
make me suspect that the lakefs-spark-client assembly version might not match the Spark version that you are using, or possibly the Hadoop version.
• Which Spark version are you using? Is this what you get from the Everything Bagel docker-compose, or something else?
• Which lakeFS client are you using? That is, I need either the full Maven coordinates (they will look something like "io.lakefs:lakefs-spark-client-301_2.12:0.6.5", and we need to look at the entire name) or the name of the jar (it will look something like ".../lakefs-spark-client-312-hadoop3-assembly-0.6.5.jar", and we need to look at the entire name). THANKS!
v
@Ariel Shaqed (Scolnicov) is it possible for us to connect sometime on this issue and resolve things in one go? I feel it would also cut down on the iterations. What do you suggest 🙂?
I am using the below POM.xml, which shows my Spark, Hadoop and lakeFS versions. Within the Bagel I removed everything else and just kept minio, minio-setup and lakefs.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>test-scala</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2010</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <properties>
    <maven.compiler.source>1.5</maven.compiler.source>
    <maven.compiler.target>1.5</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.13.0</scala.version>
  </properties>


  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.13</artifactId>
      <version>3.2.1</version>
      <scope>compile</scope>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.13</artifactId>
      <version>3.2.1</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>io.lakefs</groupId>
      <artifactId>hadoop-lakefs</artifactId>
      <version>0.1.0</version>
    </dependency>
    <!-- <https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common> -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.3.5</version>
    </dependency>
    <!-- <https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws> -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.3.5</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
    </plugins>
  </build>
</project>
a
There are some mismatched versions in there 😞
• All Spark versions should be the same, and compatible with what you use for the Jar. I would consider using spark-3.1.2 for everything.
• The FS version here seems very odd! Our guide recommends using version 0.1.13, so something like:
<dependency>
      <groupId>io.lakefs</groupId>
      <artifactId>hadoop-lakefs-assembly</artifactId>
      <version>0.1.13</version>
    </dependency>
v
I changed to the versions you suggested. Now my program hangs:
0    [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
489  [main] INFO  org.apache.commons.beanutils.FluentPropertyBeanIntrospector  - Error when creating PropertyDescriptor for public final void org.apache.commons.configuration2.AbstractConfiguration.setProperty(java.lang.String,java.lang.Object)! Ignoring this property.
498  [main] WARN  org.apache.hadoop.metrics2.impl.MetricsConfig  - Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
588  [main] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - Scheduled Metric snapshot period at 10 second(s).
588  [main] INFO  org.apache.hadoop.metrics2.impl.MetricsSystemImpl  - s3a-file-system metrics system started
Revised POM
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>test-scala</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderfull scala app</description>
  <inceptionYear>2010</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <properties>
    <maven.compiler.source>1.5</maven.compiler.source>
    <maven.compiler.target>1.5</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.13.0</scala.version>
  </properties>


  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>3.1.2</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.12</artifactId>
      <version>3.1.2</version>
      <scope>provided</scope>
    </dependency>


    <dependency>
      <groupId>io.lakefs</groupId>
      <artifactId>hadoop-lakefs-assembly</artifactId>
      <version>0.1.13</version>
    </dependency>
    <!-- <https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common> -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>3.1.2</version>
    </dependency>
    <!-- <https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws> -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-aws</artifactId>
      <version>3.1.2</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
    </plugins>
  </build>
</project>
a
The fact that it doesn't do anything is worrying 😞 Spark is running, no worrying messages 🙂 I am really not sure where to go from here -- I would expect to see some outputs from spark-submit, somewhere. Does Docker see any interesting log lines from any of the Spark containers?
v
I know it is a bit frustrating - I have spent more than a week setting this up and am still not able to replicate it 😑 I am using IntelliJ and running the app directly. Do I need to use spark-submit?
a
I'll find someone who can advise with running in IntelliJ. I literally never run IntelliJ (or Spark in the debugger).
v
I have a Spark container running as well. If you help me with how to run the JAR in that container, maybe I can give it another try. It is running Spark 3.0 there, not 3.1.2.
Some good news: I tried spark-submit and got a different error. I hope the below trace helps, @Ariel Shaqed (Scolnicov). I feel conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem") is doing something fishy - it is not letting io.lakefs do things.
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class io.lakefs.LakeFSFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at com.sparkbyexamples.spark.SparkSessionTest$.main(SparkSessionTest.scala:34)
        at com.sparkbyexamples.spark.SparkSessionTest.main(SparkSessionTest.scala)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Code
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExample")
  .getOrCreate();

println("First SparkContext:")
println("APP Name :"+spark.sparkContext.appName);
println("Deploy Mode :"+spark.sparkContext.deployMode);
println("Master :"+spark.sparkContext.master);

val conf = new Configuration()
conf.set("fs.s3a.access.key", "minioadmin")
conf.set("fs.s3a.secret.key", "minioadmin")
conf.set("fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")
conf.set("fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
conf.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>")
conf.set("fs.s3a.endpoint", "<http://127.0.0.1:9090>")
conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")

val uri = "<lakefs://example/main/sample1.json>"
val path = new Path("<lakefs://example/main/sample1.json>")
val  fs =  FileSystem.get(URI.create(uri), conf)
fs.getFileStatus(path)
Pom
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">


    <groupId>com.sparkbyexamples</groupId>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>spark-scala-examples</artifactId>

    <version>1.0-SNAPSHOT</version>
    <inceptionYear>2008</inceptionYear>
    <packaging>jar</packaging>
    <properties>
        <scala.version>2.12.12</scala.version>
        <spark.version>3.0.0</spark.version>
    </properties>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.specs</groupId>
            <artifactId>specs</artifactId>
            <version>1.2.5</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>com.thoughtworks.xstream</groupId>
            <artifactId>xstream</artifactId>
            <version>1.4.11</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>${spark.version}</version>
            <scope>compile</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>${spark.version}</version>
         <scope>compile</scope>
        </dependency>

        <dependency>
           <groupId>com.databricks</groupId>
           <artifactId>spark-xml_2.11</artifactId>
           <version>0.4.1</version>
       </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>io.lakefs</groupId>
            <artifactId>hadoop-lakefs-assembly</artifactId>
            <version>0.1.13</version>
        </dependency>
        <!-- <https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common> -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.0.0</version>
        </dependency>


    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <resources><resource><directory>src/main/resources</directory></resource></resources>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.5</arg>
                    </args>
                </configuration>
            </plugin>

        </plugins>
    </build>
    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>
y
Hey @Vaibhav Kumar, how are you running your code? I was able to use your pom file to successfully run the attached code. I used the following command:
mvn package exec:java -Dexec.mainClass=Main
(Assuming your code is in a main method in an object called Main)
v
@Yoni Augarten I am using spark-submit --class com.sparkbyexamples.spark.SparkSessionTest target/spark-scala-examples-1.0-SNAPSHOT.jar. Below is my whole code:
package com.sparkbyexamples.spark
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.spark.sql.SparkSession
import io.lakefs
object SparkSessionTest {

  def main(args:Array[String]): Unit ={


    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExample")
      .getOrCreate();
    
    println("First SparkContext:")
    println("APP Name :"+spark.sparkContext.appName);
    println("Deploy Mode :"+spark.sparkContext.deployMode);
    println("Master :"+spark.sparkContext.master);

    val conf = new Configuration()
    conf.set("fs.s3a.access.key", "minioadmin")
    conf.set("fs.s3a.secret.key", "minioadmin")
    conf.set("fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")
    conf.set("fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
    conf.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>")
    conf.set("fs.s3a.endpoint", "<http://127.0.0.1:9090>")
    conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")

    val uri = "<lakefs://example/main/sample1.json>"
    val path = new Path("<lakefs://example/main/sample1.json>")
    val  fs =  FileSystem.get(URI.create(uri), conf)
    fs.getFileStatus(path)

  }
}
y
I don't think your Jar contains any of your dependencies, that's why LakeFSFileSystem cannot be found. Make sure you are running your program with a classpath that includes all of the Maven dependencies. This can be done either using Maven like I did above, or by creating an "uber-jar", or by using an IDE to manage dependencies, or by some other way.
a
Thanks, Yoni, that really sounds like a Spark issue! Vaibhav, are you using a downloaded jar or did you build it yourself? You'll want to run "mvn package" to assemble a jar that brings in all its dependencies.
v
I can see the jar under External Libraries .PFB the screenshots.
y
Then if you run it from your IDE it should work. When you simply spark-submit a jar, these external dependencies are not included.
v
But when we package a project into a JAR, all code and dependencies come together in a single place, right - the JAR? Another point: if the lib was not found, my import statements would have failed, but those are passing. Let me know if my understanding is correct.
y
In what way are they "passing"?
Import statements in Java are not executed in the same way code is
Regarding the JAR, it depends how you created it. Since your pom doesn't include any plugin that knows to create an uber-jar, your jar probably only includes your compiled classes, and not other dependencies.
v
@Yoni Augarten I think you are right - when I run from the IDE directly it now throws a different error:
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2479)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3254)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3286)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:154)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:110)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
	at com.sparkbyexamples.spark.SparkSessionTest$.main(SparkSessionTest.scala:34)
	at com.sparkbyexamples.spark.SparkSessionTest.main(SparkSessionTest.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2383)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2477)
	... 16 more
aws-sdk is missing, I think.
y
@Vaibhav Kumar - thanks for updating. Developing Spark applications does have its learning curve when trying to work with external libraries. Some are easier to import using pom files, while with others there is no other choice but downloading the Jar from the internet. Unfortunately there is no one-size-fits-all approach as there are just too many moving parts. It's a matter of trial and error in the end.
v
No problem, in the end we are all learning and growing 🙂
I reran after importing the below module and now my program is in a hung state:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.0.0</version>
</dependency>
While copying, I get the below error most of the time:
lakeFS % ./lakectl fs upload -s /Users/simar/Downloads/sample1.json -d lakefs://example/main/
request failed: parameter "path" in query has an error: value is required but missing: value is required but missing
a
Can you try adding a path at the end of the destination URL?
v
Path as in? lakefs://example/main/ is the path where I want to add the file.
a
v
@Ariel Shaqed (Scolnicov) @Yoni Augarten I saw a different error this time. This time there is some connection issue with MinIO. Please see the log below:
Exception in thread "main" org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on example: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to 127.0.0.1:9090 [/127.0.0.1] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to 127.0.0.1:9090 [/127.0.0.1] failed: Connection refused (Connection refused)
	at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:144)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:332)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:275)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
	at io.lakefs.LakeFSFileSystem.initializeWithClientFactory(LakeFSFileSystem.java:154)
	at io.lakefs.LakeFSFileSystem.initialize(LakeFSFileSystem.java:110)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3288)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
	at com.sparkbyexamples.spark.SparkSessionTest$.main(SparkSessionTest.scala:34)
	at com.sparkbyexamples.spark.SparkSessionTest.main(SparkSessionTest.scala)
Minio is running already at http://127.0.0.1:9001/ in my browser
a
I think you have a configuration issue with your S3 endpoint somewhere: it says
Connect to 127.0.0.1:9090 [/127.0.0.1] failed: Connection refused (Connection refused):
and it looks to be using another port.
v
@Ariel Shaqed (Scolnicov) it has worked - I changed the port to the API one.
But how do I print anything out of it? I mean, when the file was present it worked without any error, but when the file is not present it didn't give me exactly the error shown in the issue's stack trace.
Exception in thread "main" java.io.FileNotFoundException: lakefs://example/main/sample2.json not found
	at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:784)
	at io.lakefs.LakeFSFileSystem.getFileStatus(LakeFSFileSystem.java:74)
	at com.sparkbyexamples.spark.SparkSessionTest$.main(SparkSessionTest.scala:35)
	at com.sparkbyexamples.spark.SparkSessionTest.main(SparkSessionTest.scala)
a
Yay, now we're cooking! Yeah, that's the good flow. Now try a repo that doesn't exist, or a branch that doesn't exist.
v
Yup, it worked fully.
Now, moving on to solving the problem, I need to gather the response from the lakeFS server that is received in LakeFSFileSystem, right?
a
Yeah, swagger.yml can give all possible return status codes and messages. This issue is to translate these exceptions into e.g. FileNotFoundException, which is more informative to the user.
v
a
Yup! lakeFSFS performs a bunch of calls, each of which can fail. I would start with the open and create methods; they hold many of the most common failures. After that, list, getFileStatus and listStatus might also have similar failure modes.
v
@Ariel Shaqed (Scolnicov) I went through the swagger doc and the responses we are getting using Postman. I think I have figured out where we need to add the code: it should be somewhere in readNextChunk() in LakeFSFileSystem.java. What I am not getting is how to get the response code in LakeFSFileSystem.java.
e
Hi @Vaibhav Kumar, To make sure I understand what you’re trying to do… The idea is to handle errors better instead of simply throwing them?
v
I want to get the response directly from the lakeFS server for clearer messages. If you look at the issue I am working on, the error thrown is not very clear, while if you hit the listObjects API you get a proper error like "repository not found". I already have a setup ready to reproduce the same error, but I am not sure how to get those response codes @Elad Lachmi.
e
It depends on how the client is implemented. If the client throws an error for any non-200 HTTP response status, then you'll need to handle it in the catch block. If it throws only for e.g. 5xx, then it might require checking the response status on the resp.
If the client throws an error for any non-200 HTTP response status, then you'll need to handle it in the catch block
👆🏻 Looking through some of the existing code, it looks like this is the case.
v
Yes, you are correct, but I am not getting any suggestions around the response code in the below screenshot.
e
You need to look at e. I believe resp is out of scope when you're in the catch block.
v
Same with e as well.
e
ApiException has a getCode() method and a few more you might want to use for this purpose.
v
Ok, can you please share some document for my reference? I am not getting how I will use it.
e
You can find the ApiException class here:
clients/java/src/main/java/io/lakefs/clients/api/ApiException.java
You can look at it and see which methods are available.
v
Yes, I can see now that getCode() is there. So all I have to do is call e.getCode() in the catch block, right?
e
Yes, and you have the HttpStatus enum to compare against for different HTTP statuses.
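(A sketch of the kind of translation discussed here, assuming the generated io.lakefs.clients.api.ApiException from the path above. The helper name and functional interface are hypothetical - the real change would live inside the existing catch blocks of LakeFSFileSystem methods such as getFileStatus, open and create - and the JDK's HttpURLConnection.HTTP_NOT_FOUND stands in for whichever HttpStatus constant the project uses.)

import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.HttpURLConnection;

import org.apache.hadoop.fs.Path;
import io.lakefs.clients.api.ApiException;

final class ApiErrorTranslation {

    // Hypothetical wrapper for a single lakeFS API call.
    interface ApiCall<T> {
        T run() throws ApiException;
    }

    static <T> T withTranslatedErrors(Path path, ApiCall<T> call) throws IOException {
        try {
            return call.run();
        } catch (ApiException e) {
            // getCode() exposes the HTTP status returned by the lakeFS server.
            if (e.getCode() == HttpURLConnection.HTTP_NOT_FOUND) {
                // Repository, branch or object does not exist - surface a clearer error.
                throw new FileNotFoundException(path + ": " + e.getMessage());
            }
            // Anything else stays a generic IOException, keeping the status code and cause.
            throw new IOException("lakeFS API call failed with HTTP " + e.getCode(), e);
        }
    }

    private ApiErrorTranslation() {}
}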
v
Ok sure. So after making the changes in this code, I can just run the lakeFS Go server and test my changes, right? To replicate the error until now I was using the Hadoop client (pasted below) from my IntelliJ; it talks to my lakeFS Docker container and MinIO as of now.
def main(args : Array[String]) {
//    val spark = SparkSession.builder.master("local[1]").appName("SparkByExample").getOrCreate
    val conf = new Configuration()
        conf.set("fs.s3a.access.key", "minioadmin")
        conf.set("fs.s3a.secret.key", "minioadmin")
        conf.set("fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")
        conf.set("fs.lakefs.secret.key", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY")
        conf.set("fs.lakefs.endpoint", "<http://localhost:8000/api/v1>")
        conf.set("fs.s3a.endpoint", "<http://127.0.0.1:9090>")
        conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")


    val uri = "<lakefs://test1/main/sample1.json>"
    val path = new Path("<lakefs://test1/main/sample1.json>")
    val  fs =  FileSystem.get(URI.create(uri), conf)
    fs.getFileStatus(path)
e
Sounds like a good plan. Good luck!
v
Thank you @Elad Lachmi, I appreciate the help. Will let you know my findings.
e
Sure, happy to help 🙂
v
I have made the changes to my
LakeFSFileSystem.java
and I am running the server with the Go command
lakefs % go run main.go run --local-settings
I have created a Hadoop client locally and I am trying to test my changes (the response code in the exception) to the lakeFS code. However, I cannot see my changes reflected in the logs when I run the client shown below. Do you know what could be the issue here?
Copy code
package org.example

import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import java.net.URI

object SparkSessionTest {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.s3a.access.key", "minioadmin")
    conf.set("fs.s3a.secret.key", "minioadmin")
    conf.set("fs.lakefs.access.key", "AKIAJLDV6JMK2R5TRQSQ")
    conf.set("fs.lakefs.secret.key", "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")
    conf.set("fs.lakefs.endpoint", "http://localhost:8000/api/v1")
    conf.set("fs.s3a.endpoint", "http://127.0.0.1:9090")
    conf.set("fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")

    val uri = "lakefs://test1/main/sample1.json"
    val path = new Path(uri)
    val fs = FileSystem.get(URI.create(uri), conf)
    fs.getFileStatus(path)
  }
}
i
Hi, I would first make sure that I’m setting the correct log level; second, I would make sure that the code changes you made are reflected in the code that is actually running.
v
@Isan Rivkin 1. I am using LOG.info in
LakeFSFileSystem.java
. Let me know if that is incorrect. 2. I have the server running from this changed code, and I have the client running separately. This I have checked.
I have observed something unusual. Earlier, when I was running lakeFS and MinIO in a docker setup (the Everything Bagel), as soon as I uploaded a file to a lakeFS repo it got reflected in MinIO as well. Now that I am running lakeFS from the terminal using the go command, with MinIO still in docker, I cannot see the same behaviour; the files are not syncing. Instead I am getting an error like
lakeFS blockstore type local unsupported by this FileSystem
i
I can’t tell if the INFO log level is the right one; it all depends on the logs you are trying to see. If you’re logging at DEBUG then you should probably use that. Regarding your error: looking at the PhysicalAddressTranslator.java class, the Hadoop client supports s3 and does not support local storage.
v
Yes, but I am using MinIO storage and I have set the properties accordingly in the client. I have tried with
LOG.debug
as well, but it is not working. Can someone from lakeFS connect with me for some time on this? It is not a really complex PR, but setting things up for testing has taken me a lot of time.
i
It would be more beneficial for the community to be able to search different questions. I’m not sure what issue you are currently facing, outside of the PR you are trying to do. What is the issue you are asking about? This thread contains a lot of conversations. Is your current problem that you run lakeFS and expect to see the logs you added, but don’t actually see them?
v
I am working on this issue. Upon further investigation I learned that I have to work on catching some server responses in this file. So now I am trying to add some debug statements in the same file, under the
getFileStatus
function. I am trying to hit lakeFS from outside by creating the Hadoop client and then checking whether I get those logs or not.
i
Thanks for the context. To solve the logs issue, I would suggest adding a simple log statement at the very start of the Java code you are running and seeing whether it logs; if not, debug it from there.
v
I have tried almost everything around logs, but it is not working.
i
If you see other logs, then I guess either the Java code you are running is not the code you think is running (and therefore your logs are not appearing), or something overrides the log level. Try making a change that’s not related to logs and see if the behavior changes at all.
v
@Isan Rivkin I think I found the issue. I am running this compose file, which obviously pulls the latest lakeFS image and not my changed code. When I added some logs to the main.go file and ran it locally, I could see the logs from
main.go
on my terminal. Now, to narrow down the problem: how should I set the blockstore to MinIO when I run using the go run command? In other words, what arguments are required to set the below properties when running
go run main.go run --local-settings
Copy code
## Environment variables from the docker compose file that set the blockstore to MinIO
- LAKEFS_DATABASE_TYPE=local
- LAKEFS_BLOCKSTORE_TYPE=s3
- LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
- LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://minio:9000
- LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin
- LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin
- LAKEFS_AUTH_ENCRYPT_SECRET_KEY=some random secret string
- LAKEFS_STATS_ENABLED
- LAKEFS_LOGGING_LEVEL
- LAKECTL_CREDENTIALS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
- LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
- LAKECTL_SERVER_ENDPOINT_URL=http://localhost:8000
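For reference, one hedged option is to export the same LAKEFS_-prefixed variables in the shell that runs the server, since that is how the compose file passes them. The variable names below are copied from the compose snippet above; whether --local-settings must be dropped once explicit blockstore settings are supplied is an assumption.
Copy code
# Hedged sketch: reuse the compose file's environment variables for a locally
# run server. Adjust the MinIO endpoint to one reachable from the host.
export LAKEFS_DATABASE_TYPE=local
export LAKEFS_BLOCKSTORE_TYPE=s3
export LAKEFS_BLOCKSTORE_S3_FORCE_PATH_STYLE=true
export LAKEFS_BLOCKSTORE_S3_ENDPOINT=http://127.0.0.1:9000
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_ACCESS_KEY_ID=minioadmin
export LAKEFS_BLOCKSTORE_S3_CREDENTIALS_SECRET_ACCESS_KEY=minioadmin
export LAKEFS_AUTH_ENCRYPT_SECRET_KEY="some random secret string"
# --local-settings applies local defaults, so it may need to be dropped once
# explicit configuration is provided (assumption)
go run main.go run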
i
Glad the logs issue is solved.
v
yes, but I still have the other part left to be solved 😄
a
I believe you can change your compose file to run treeverse/lakefs:dev. Now if you
make build-docker
and rm the lakeFS container and start a new one, you should be able to see your code. Alternatively, you might be able to
docker cp
a lakefs executable into the container on /app/lakefs. Restart that container and your code should run. (This one can be trickier if your container runs a sufficiently different version of Linux than your physical machine.)
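A rough sketch of that second option, assuming the server binary builds with a plain go build and that the running container is named lakefs (both the build path and the container name are placeholders):
Copy code
# Hedged sketch: build the lakefs binary locally, copy it over the executable
# inside the running container, and restart the container.
go build -o lakefs ./cmd/lakefs        # build path is an assumption
docker cp ./lakefs lakefs:/app/lakefs  # overwrite the executable inside the container
docker restart lakefs                  # restart so the new binary is picked up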
v
As of now I have found this config doc, so I was creating a `.lakefs.yaml` file and adding configs this way:
database.type="local"
Let me know if I am going in the right direction. Sorry, but I didn't exactly follow the docker-related stuff you suggested. I think you are saying to build the local image using pom.xml? Can you share some doc for me to look into as well?
I have set the config the below way in
.lakefs.yaml
but I am getting some syntax issues 🫣
Copy code
database.type="local"
blockstore.type="s3"
blockstore.s3.force_path_style=true
blockstore.s3.endpoint="http://minio:9000"
blockstore.s3.credentials.access_key_id="minioadmin"
blockstore.s3.credentials.secret_access_key="minioadmin"
a
That's not YAML. I think you are having all of these configuration issues that have nothing to do with the issue you are trying to resolve. You might consider taking some time off from fixing the issue to read up on how lakeFS configuration works. We have some examples on the configuration page (here's one). We try to make our configuration very similar to that of other systems, so hopefully it will also be helpful for configuring other systems.
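For illustration only, the same settings written as nested YAML might look roughly like the sketch below, assuming each dotted key maps to a nested section; verify the exact key names against the configuration reference.
Copy code
# Hedged sketch of a .lakefs.yaml; key names follow the dotted keys above,
# mapped to nested YAML sections (verify against the configuration reference).
database:
  type: local
blockstore:
  type: s3
  s3:
    force_path_style: true
    endpoint: http://127.0.0.1:9000   # a MinIO address reachable from the host
    credentials:
      access_key_id: minioadmin
      secret_access_key: minioadmin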
v
Yes, I think I am not using the configs the right way, so the links between the lakeFS server, MinIO, and the Hadoop client are broken.