Hi, I am running into some issues regarding Hive m...
# help
s
Hi, I am running into some issues regarding Hive metastore copy - thread
I've created a table in Hive metastore called test_spark.lakefs_table. The table location is pointed to the parquet data files in the master branch. I want to create a table in the same schema, except called test_spark.branch_lakefs_table which points to a new lakefs branch called metastore_test.
I issued the following command
lakectl metastore copy --from-schema test_spark --from-table lakefs_table --to-schema test_spark --to-table branch_lakefs_table --to-branch metastore_test
which does create the new table branch_lakefs_table, but the table location still appears to be pointing to the master branch, not metastore_test branch.
For example, I insert more data into that table, and the new data files show up in master, the metastore_branch directory appears to be unaffected.
b
Hi Sid
When you wrote that it still appears to be pointing to the master table - this the branched table, before you did any change to the table, right?
Can you dump the location you see on the branch_lakefs_table?
Copy code
DESCRIBE FORMATTED branch_lakefs_table;
s
Yes the branch_lakefs_table is the new table that I created using the metastore copy command, which I would think should point to the data files in the metastore_test branch, but appears to be pointing to the data files in the master branch
b
can you run the above command from Hive
s
You can see that it is pointing to master
Clearer image
👀 1
b
Copy code
lakectl --version
s
lakectl version 0.40.3
b
thanks, checking
can you try running the same command just with a branch that does not exists?
Copy code
lakectl metastore copy --from-schema test_spark --from-table lakefs_table --to-schema test_spark --to-table branch_lakefs_table_test --to-branch no_such_branch
and
Copy code
DESCRIBE FORMATTED lakefs_table;
s
image.png
👀 1
Still pointing to master, even though the branch doesnt even exist
b
Tried to reproduce the two scenarios about without success. Will try more tomorrow and if we fail - I'll send you a version with more diagnostics to help us capture as must information as possible to understand how this flow works for you.
👍 1
Hi, looks like we found the reason that explains the issue you had with
lakectl metadata copy
. The
copy
command does copy the information from one table to another, but it also holds a merge functionality. When the destination table exists the merge operation tries to copy all information from the source table. This is for cases that the metadata was updated (like partitions and columns) and you like to update the destination. The problem is that when you merge, you can't move the location - which is based on the target branch. This means that when you like to
copy
the branch name (and the location) of the new table has meaning and when you merge information it has no effect. Regarding the issue you experienced. Is it possible you performed a copy and didn't specify the branch name the first time you run the command and when we run it again we expected it to get updated, which explains all the pending request where we specified a missing branch and still the operation completed the update and didn't report an error (without location update)? I will open an issue that explains the issue and suggest splitting the two operations, for a better user experience.
If you delete the table and run the copy again it should work for you as the location will be created based on the target branch.
s
I deleted the table and reran the metastore copy command, and I'm getting the same behavior. It creates the new table but still points to the master branch directory.
New strange behavior that I noticed - 1. I create table_1 in master branch and insert values 1,2,3,4 2. I do metastore copy command to new table_2 on new feature branch 3. I insert value 5 into table_2 SELECT * FROM table_2 contains 1,2,3,4,5 and SELECT * FROM table_1 still contains 1,2,3,4, even though they both point to the master branch (as can be seen in the DESCRIBE FORMATTED command)
b
Can you check the lakeFS UI for changes (changes tab) on the new feature branch and on master branch? do you see any modified or new file there?
s
There are no changes in the new feature branch, but there are in the master branch.
b
When you had the table on master branch and insert 1,2,3,4 - did you commit the changes?
s
yes they were committed
When I rollback "insert value 5" from the master branch, it no longer appears when I select from either table_1 or table_2
b
Can you add into table_1 and see if it updated 'master' branch like it did for table_2
Lets try something like the above - if possible
can you paste here the command you used to create the table?
s
Hope this helps
CREATE TABLE test_spark.lakefs_table_1 (  id INT ) USING PARQUET LOCATION 'lakefs://origin/master/test_spark/lakefs_table' INSERT INTO test_spark.lakefs_table_1 VALUES (1), (2), (3), (4) DESCRIBE table_1 —> path = 'lakefs://origin/master/test_spark/lakefs_table_1' Lakefs commit changes to master Created new branch metastore_test_3 in the UI ./lakefs/lakectl metastore copy --from-schema test_spark --from-table lakefs_table_1 --to-schema test_spark --to-table lakefs_table_2 --to-branch metastore_test_3 New table created, shows up in Hive registry DESCRIBE table_2 —> path = 'lakefs://origin/master/test_spark/lakefs_table_1' INSERT INTO test_spark.lakefs_table_2 VALUES (5) SELECT * FROM test_spark.lakefs_table_1 —> 1,2,3,4 SELECT * FROM test_spark.lakefs_table_2 —> 1,2,3,4,5 INSERT INTO test_spark.lakefs_table_1 VALUES (6) SELECT * FROM test_spark.lakefs_table_1 —> 1,2,3,4,5,6  !! Note that the value 5 now shows up in table_1 SELECT * FROM test_spark.lakefs_table_2 —> 1,2,3,4,5 Lakefs UI shows two uncommitted changes in master - one parquet for value 5, and one parquet file for value 6. No changes shown in branch metastore_test_3
b
Thank you - will it be possible to online for a quick talk?
s
Yes I am, Zoom or Slack would work
b
I'll share zoom link