• GR (7 months ago)
    Can someone please explain how lakeFS is better than Nessie? https://www.dremio.com/subsurface/nessie-git-for-data-lakes/
    6 replies (GR, Oz Katz, +1)
  • GR (6 months ago)
    Can lakeFS be used for Data Versioning in ML?
    4 replies (GR, einat.orr)
  • Iddo Avneri (6 months ago)
    I have a question about a use case that just came up in a conversation: I have a pipeline that runs twice a day. Its output is around 4K Parquet files covering data from the last 3 months. The only difference between the output of the 7:00 AM run and the 7:00 PM run should be the first and last 12 hours of that 3-month window.
    1. What is the best way in lakeFS to confirm that this is the case?
    2. Is there a way to confirm it visually, in a commit-comparison view? Meaning, not only which objects changed, but also whether the changes are confined to the first and last 12 hours?
    3 replies (Iddo Avneri, Oz Katz, +1)
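One programmatic way to check the invariant above: compute, from the two run timestamps, exactly which hourly partitions should differ, then compare that set against the object paths reported by a commit diff (e.g. `lakectl diff` between the two commits). A minimal stdlib sketch; the `YYYY-MM-DD-HH` partition-key format is an assumption for illustration, not something lakeFS prescribes:

```python
from datetime import datetime, timedelta

def expected_changed_partitions(run_a: datetime, run_b: datetime,
                                window_days: int = 90) -> set:
    """Hourly partition keys expected to differ between two runs of a
    rolling-window pipeline: the hours covered by one run's window but
    not the other's (symmetric difference)."""
    def window_hours(run: datetime) -> set:
        start = run - timedelta(days=window_days)
        hours = set()
        t = start.replace(minute=0, second=0, microsecond=0)
        while t < run:
            hours.add(t.strftime("%Y-%m-%d-%H"))
            t += timedelta(hours=1)
        return hours
    return window_hours(run_a) ^ window_hours(run_b)

# Two runs 12 hours apart over a 90-day window: only the leading
# 12 hours (dropped) and trailing 12 hours (added) should differ.
am = datetime(2022, 3, 1, 7, 0)
pm = datetime(2022, 3, 1, 19, 0)
diff = expected_changed_partitions(am, pm)   # 24 hourly keys
```

Any path in the commit diff that falls outside this set would signal an unexpected change. The compare view in the lakeFS UI answers the visual part (which objects changed); the "is it only the first and last 12 hours?" check has to be done against the partition keys, as above.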
  • Mike Dwell Siegmund (6 months ago)
    Hi all, I am focusing on Data Quality Assessment. I have read the blog article (https://lakefs.io/data-quality-testing/) and have the following questions. There are 6 dimensions I would like to measure:
    • Accuracy: How well does a piece of information reflect reality?
    • Completeness: Does it fulfill your expectations of what's comprehensive?
    • Consistency: Does information stored in one place match relevant data stored elsewhere?
    • Timeliness: Is your information available when you need it?
    • Validity (aka Conformity): Is the information in a specific format, type, or size? Does it follow business rules/best practices?
    • Integrity: Can different data sets be joined correctly to reflect a larger picture? Are relations well defined and implemented?
    I know it is hard to fully automate all of these and measure them quantitatively; for example, "timeliness" can probably only be evaluated subjectively and qualitatively. Still, I am looking for open-source software tools that can help with the quantitative part, i.e. producing plots/graphs, benchmarks, and scores. I am also looking for tools that can help conduct the statistical tests mentioned in the article above. Could anybody give me some pointers about the software tools? Thanks a lot!
    3 replies (Mike Dwell Siegmund, einat.orr)
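For the quantitative dimensions, open-source tools such as Great Expectations and Deequ implement exactly these kinds of checks at scale, but the core measurements are simple enough to sketch in plain Python. A minimal, hypothetical example scoring completeness and validity; the field names and the email rule are made up for illustration:

```python
import re

def completeness(records, required_fields):
    """Fraction of required field slots that are non-empty."""
    total = len(records) * len(required_fields)
    filled = sum(1 for r in records for f in required_fields
                 if r.get(f) not in (None, ""))
    return filled / total if total else 1.0

def validity(records, field, pattern):
    """Fraction of non-empty values in `field` matching a format rule."""
    values = [r[field] for r in records if r.get(field)]
    if not values:
        return 1.0
    ok = sum(1 for v in values if re.fullmatch(pattern, v))
    return ok / len(values)

records = [
    {"id": "1", "email": "a@example.com"},
    {"id": "2", "email": "not-an-email"},
    {"id": "3", "email": ""},
]
c = completeness(records, ["id", "email"])                   # 5 of 6 slots filled
v = validity(records, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+")  # 1 of 2 values valid
```

Scores like these can be tracked per commit and plotted over time; consistency and integrity checks follow the same shape but compare across datasets (e.g. key-overlap ratios for joins).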
  • Edmondo Porcu (6 months ago)
    Hello world 🙂
    14 replies (Edmondo Porcu, Paul Singman, +2)
  • Lynn Rozen (6 months ago)
    Hi all 🙂 I have an implementation question I'd love to share and hear your thoughts about. Say we have a Spark cluster that runs Spark jobs against multiple lakeFS installations (one Spark job per installation). I want to create a flow that runs those Spark jobs. However, the set of installations can change dynamically. What do you think would be the best way to do that? 🤔 I thought about Airflow, but I'm not sure how to create that kind of dynamic DAG, or whether it's possible.
    10 replies (Lynn Rozen, Oz Katz, +3)
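On the Airflow question above: because Airflow re-parses DAG files regularly, a common pattern is to look up the current set of installations at parse time and create one task per installation in a loop. Stripped of Airflow, the fan-out shape looks like this; the installation names and job logic are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def discover_installations():
    """Stand-in for a dynamic lookup (e.g. a config store or API call)
    that is re-evaluated on every run, so the set can change."""
    return ["lakefs-eu", "lakefs-us", "lakefs-ap"]

def run_spark_job(installation: str) -> str:
    """Stand-in for submitting the Spark job against one installation."""
    return f"submitted:{installation}"

def run_all():
    installations = discover_installations()
    # One concurrent job per installation, results in input order.
    with ThreadPoolExecutor(max_workers=len(installations)) as pool:
        return list(pool.map(run_spark_job, installations))

results = run_all()
```

In an actual DAG file, the loop over `discover_installations()` would instantiate one operator per installation instead of calling the function directly; each parse of the file then yields a DAG matching the current set.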
  • Yusuf K (6 months ago)
    Hey, I'm trying to use rclone to copy directly into lakeFS branches (awesome, btw). It throws a 'no such host' error when I run the rclone sync command against the lakeFS endpoint. I swapped out 'example' for my repository name here. I'm using a local Postgres installation; should I be putting the Postgres connection string here? Sorry for the dumb question. I'm also including what the lakeFS part of my rclone config looks like.
    15 replies (Yusuf K, Eden Ohana, +1)
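On the rclone question above: the endpoint should be the address the lakeFS server itself listens on (its S3-compatible gateway), not the Postgres connection string; Postgres is only lakeFS's internal metadata store and is never dialed by clients. A sketch of the relevant rclone remote, with a placeholder endpoint and the lakeFS access keys (all values here are hypothetical):

```ini
[lakefs]
type = s3
provider = Other
env_auth = false
access_key_id = <lakefs-access-key-id>
secret_access_key = <lakefs-secret-access-key>
endpoint = http://localhost:8000
no_check_bucket = true
```

With a remote like this, `rclone sync ./data lakefs:example/main/data/` would target the `main` branch of repository `example`. A 'no such host' error usually means the hostname in `endpoint` does not resolve, e.g. a placeholder domain left in the config.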
  • Ariel Shaqed (Scolnicov) (6 months ago)
    Wow, Kafka managed the great move to Raft! Amazing news for the users, and a stellar accomplishment by the devs! https://www.confluent.io/blog/why-replace-zookeeper-with-kafka-raft-the-log-of-all-logs/
    3 replies (Ariel Shaqed (Scolnicov), Oz Katz, +1)
  • GR (6 months ago)
    How does the Dev/Test/Prod environment for the data lake work when using lakeFS? For example, we currently have separate data lakes for the Dev/Test/Prod environments, and we promote changes to Prod through pull requests and code review. When using lakeFS, do we just use one single data lake and maintain Dev/Test/Prod branches?
    16 replies (GR, Tal Sofer)
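Broadly, yes: with lakeFS the Dev/Test/Prod split becomes branches of a single repository, so environments are created by branching (zero-copy) and promotion is a reviewed merge rather than copying data between lakes. A sketch with a hypothetical repository name, using `lakectl`:

```shell
# Create dev and test environments as zero-copy branches of production
lakectl branch create lakefs://my-repo/dev --source lakefs://my-repo/main
lakectl branch create lakefs://my-repo/test --source lakefs://my-repo/main

# Promote changes: inspect the diff, then merge up the chain
lakectl diff lakefs://my-repo/main lakefs://my-repo/dev
lakectl merge lakefs://my-repo/dev lakefs://my-repo/test
lakectl merge lakefs://my-repo/test lakefs://my-repo/main
```

The review step that pull requests provided can then happen on the diff between branches before each merge.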
  • donald (5 months ago)
    Does lakeFS support using HDFS as block storage? I just looked at the configuration reference page, and it seems to support only local and Amazon/Google/Azure cloud storage, as the available types are local, s3, gs, azure, and mem.
    7 replies (donald, Barak Amar, +2)
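As the message notes, the configuration reference lists only local, s3, gs, azure, and mem as blockstore types, so HDFS is not a supported blockstore. Any S3-compatible object store can, however, be used through the s3 type with a custom endpoint. A sketch of that part of the lakeFS config (endpoint and credentials are hypothetical):

```yaml
blockstore:
  type: s3          # one of: local, s3, gs, azure, mem
  s3:
    force_path_style: true
    endpoint: http://object-store.local:9000
    credentials:
      access_key_id: <access-key>
      secret_access_key: <secret-key>
```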