# help
@Iddo Avneri is there a tutorial on how to use Spark tables with lakeFS? I was wondering what's the right approach. Should one create a clone of a database but point it to a different branch?
Are you on Databricks?
hey @Edmondo Porcu ! mind sharing some context on the use-case?
We have our Spark integration guide, which contains a high-level introduction on how to use lakeFS with Spark; you can also read Similarweb's case study. Looking forward to hearing more about your use case.
This uses the S3 API directly, while people often use the Hive metastore and interact with data through tables.
In practice, if you want to restore a previous version of a database, you need to change the location the database points to.
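To make the "change the location" idea concrete, here's a minimal sketch that builds the Hive/Spark SQL DDL to repoint a table at data under a specific lakeFS branch. It relies only on the documented lakeFS path layout (`s3a://<repo>/<branch>/<key>`); the table and repo names are hypothetical.

```python
def repoint_table_ddl(table: str, repo: str, branch: str, key: str) -> str:
    """Build an ALTER TABLE statement pointing `table` at the data stored
    under a lakeFS branch (lakeFS exposes objects as s3a://<repo>/<branch>/<key>)."""
    location = f"s3a://{repo}/{branch}/{key}"
    return f"ALTER TABLE {table} SET LOCATION '{location}'"

# Hypothetical usage with a SparkSession:
#   spark.sql(repoint_table_ddl("sales", "my-repo", "experiment", "tables/sales"))
print(repoint_table_ddl("sales", "my-repo", "experiment", "tables/sales"))
```

Reverting the table to a previous version would then just mean generating the same statement with a different branch (or a branch reset to an older commit).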
You can use lakeFS to revert a branch to a specific commit or tag, so you won't need to change the location.
You can see it in the first example here: https://docs.lakefs.io/usecases/production.html
@Edmondo Porcu You can use lakeFS with Hive Metastore. Instructions on how to do so are here: https://docs.lakefs.io/integrations/hive.html
🙏 1
hi @Edmondo Porcu I wonder, did you manage to solve your issue?
nope 😞
I don't understand how you switch the data underneath a table transparently, but maybe that's the wrong thing to do. However, that's how people on Spark are used to dealing with tables.
I'm not sure I understand, are you referring to lakeFS in general? Mind clarifying?
So, if you have a pipeline that works with "file API" Spark, you have a prefix that points to the current branch. You create a new branch, you change the prefix, great. But what about when you work with the tables API?
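The "file API" workflow described above can be sketched as a one-line prefix change, since lakeFS encodes the branch as the first path element under the repository. The repo/path names below are hypothetical:

```python
def lakefs_prefix(repo: str, branch: str, path: str = "") -> str:
    """lakeFS exposes branches as the first path element under the repo,
    so switching branches for file-API Spark jobs is just a prefix change."""
    return f"s3a://{repo}/{branch}/{path}"

main_path = lakefs_prefix("my-repo", "main", "events/")         # current branch
branch_path = lakefs_prefix("my-repo", "new-feature", "events/")  # after branching
# Hypothetical usage: spark.read.parquet(branch_path)
print(main_path, branch_path)
```

With the tables API there is no such prefix in user code, which is the gap being discussed: the location lives in the metastore instead.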
lakeFS does support Hive integration. Have you seen the link Einat shared?
yes, but it sounds like the workflow would not be what people expect
care to share how you would expect it to work? we're open source and open to changes 🙂
You would probably want to change the branch of a data source and have all the Hive tables updated automatically...
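That expectation could be sketched as a bulk version of the table-repointing idea: generate one location change per table so an entire database follows a branch switch. This is a hypothetical illustration (table names and keys invented), not a lakeFS feature:

```python
def repoint_database(tables: dict, repo: str, branch: str) -> list:
    """Generate one ALTER TABLE ... SET LOCATION statement per table so a
    whole database follows a branch switch. `tables` maps table name to its
    key (path) inside the lakeFS repo; all names here are hypothetical."""
    return [
        f"ALTER TABLE {name} SET LOCATION 's3a://{repo}/{branch}/{key}'"
        for name, key in tables.items()
    ]

stmts = repoint_database(
    {"sales": "tables/sales", "users": "tables/users"},
    "my-repo", "new-feature",
)
# Hypothetical usage: for s in stmts: spark.sql(s)
print(stmts)
```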