data-discussion
  • f

    Farman Pirzada

    08/24/2022, 3:15 PM
    thank you so much for this!
  • f

    Farman Pirzada

    08/24/2022, 3:15 PM
    super helpful
    :60fps_parrot: 1
    :jumping-lakefs: 1
  • h

    Hugh Nolan

    08/29/2022, 4:42 PM
    Hi, I'm looking at options for "rewriting history" for branches/tags/commits. Context: we have tabular files containing data from multiple users together. We're using lakeFS to version these files, and they then get merged to act as complementary data/metadata for bigger, less easily structured data. We're running on top of S3 storage. I'm looking to use tags/commits on data as immutable "data release" points in time, so we can do analysis on a specific version and revert if necessary, for data traceability and lineage reasons. HOWEVER, if a user wants to withdraw their data, or we need to remove a data column/feature/etc. for some other reason (accidental commit, change of laws, data expiry, whatever), the data needs to be removed from history. I'm looking for thoughts on practicalities.
    Most straightforward (to me, a naive user) working option:
    • Iterate all commits, find all historical versions of a file that needs updating (stat the object, look at the physical address), then load/redact/save all of these back to S3 (sidestepping lakeFS). This seems to work in a test, but I'm not sure whether there are checksum/data-integrity checks that will break, or that are on a roadmap to break. This is a useful flow, as all the data loading/tagging/history works as it is, the data appears immutable, and historical analysis can be recreated (so long as redactions don't affect a specific analysis). Commit IDs remain unchanged, so I don't have to mess around with tags, branches, etc.
    Alternatively:
    • Iterate all tags ("immutable" versions), add an extra commit with the updated/redacted file(s), delete/recreate the tag, then delete the old file from physical storage. This is "more legit" as the change is tracked in lakeFS history, but branch histories get all messed up - you can't revert or check out old commits, as the file will be missing in history.
    Any thoughts or ideas, things I've missed, or other architectures/setups that might support this better?
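A minimal sketch of the first option above, assuming the lakefs_client Python SDK and boto3 (method and exception names may vary by SDK version); redact_bytes is a hypothetical placeholder for the actual redaction logic. Note that lakeFS stores a checksum per object, so a payload rewritten in place will no longer match it.

```python
from urllib.parse import urlparse

import boto3
from lakefs_client import Configuration
from lakefs_client.client import LakeFSClient
from lakefs_client.exceptions import ApiException

client = LakeFSClient(Configuration(host="http://localhost:8000/api/v1",
                                    username="<access-key-id>", password="<secret-key>"))
s3 = boto3.client("s3")


def redact_bytes(data: bytes) -> bytes:
    """Hypothetical: drop a user's rows / a column from the file contents."""
    return data


def redact_history(repo: str, ref: str, path: str) -> None:
    seen = set()
    # Walk the commit log and rewrite every distinct physical object backing `path`.
    for commit in client.refs.log_commits(repository=repo, ref=ref).results:
        try:
            stats = client.objects.stat_object(repository=repo, ref=commit.id, path=path)
        except ApiException:
            continue  # path did not exist at this commit
        if stats.physical_address in seen:
            continue  # several commits may share the same underlying object
        seen.add(stats.physical_address)
        loc = urlparse(stats.physical_address)  # e.g. s3://bucket/prefix/object
        body = s3.get_object(Bucket=loc.netloc, Key=loc.path.lstrip("/"))["Body"].read()
        s3.put_object(Bucket=loc.netloc, Key=loc.path.lstrip("/"), Body=redact_bytes(body))
```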
  • e

    einat.orr

    08/29/2022, 4:56 PM
    Hi @Hugh Nolan, when thinking about this challenge I had the same two options in mind. For the first option, which I prefer, you can use our commit log to find the commits in which a file changed. The reason I prefer the first option is that it preserves reproducibility, to the best of your data governance ability. What you must remove is removed, but other aspects of the version are kept.
    👍 1
  • e

    einat.orr

    08/29/2022, 4:56 PM
    Is that approximated reproducibility valuable in your opinion?
    3 replies · 2 participants
  • a

    Adi Polak

    09/07/2022, 11:37 PM
    Hi <!here>, I am researching managed Hadoop offerings. I'm curious, from your experience, what are the pain points of EMR and S3 solutions? This is what I have learned so far.
    EMR pain: Upgrading Hadoop/Spark/Hive/Presto etc. requires a new EMR release, which often upgrades more than what you need - this can trigger a full migration. Some Docker images are supported, but not all of them. EMR bootstrap actions seem like an interesting solution for installing system dependencies.
    1. Installing system dependencies doesn't always work as expected with bootstrap.
    2. Spot - sometimes AWS will kill all your spot instances and cause clusters and jobs to fail (this happens infrequently enough that it is hard to justify paying for on-demand instances, but frequently enough that it causes a significant and draining level of operational toil).
    3. Configuring Ranger authorization requires installing it outside of the EMR cluster and creates conflicts with AWS's internal record-reader authorization. Conflicting security rules can be resolved but require more effort.
    S3 pain:
    1. S3 consistency used to be a pain and was solved with strong read-after-write, which made EMRFS obsolete. File consistency is not a pain anymore, but the lack of atomic directory operations is - copying a large set of Parquet files (10k) with multiple readers and writers at the same time can be a challenge when you need consistency for the directory as a whole and not file by file.
    2. AWS will often return HTTP 503 Slow Down errors for request rates that are much, much lower than their advertised limits.
    3. You are supposed to code clients with backoffs/retries adequate to absorb arbitrary levels of HTTP 503 Slow Down until they finish scaling (which is entirely unobservable and could take minutes, hours, or days).
    4. Many readers on a table with 1K Parquet files and one update - the update is not an atomic action, and readers might read the wrong data.
    5. Storage optimization - there is no performance optimization at the table level.
    4 replies · 3 participants
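On the S3 throttling point above: a minimal, hedged sketch of the client-side backoff it describes, using boto3's built-in retry configuration (nothing EMR- or lakeFS-specific here).

```python
import boto3
from botocore.config import Config

# "adaptive" adds client-side rate limiting on top of exponential backoff;
# throttling errors such as SlowDown (HTTP 503) are retried automatically.
s3 = boto3.client(
    "s3",
    config=Config(retries={"mode": "adaptive", "max_attempts": 10}),
)

# Calls made through this client now absorb 503 Slow Down responses up to the
# configured attempt budget instead of failing immediately.
buckets = s3.list_buckets()
```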
  • a

    Adi Polak

    10/04/2022, 10:56 AM
    🥇 Some people say that old is gold? An interesting reference architecture from Oracle for big data processing. They broke the system down into parts and articulated that, for a business to stay competitive, investments in a foundation data layer and an access and performance layer are a must:
    Foundation Data Layer: abstracts the data away from the business process through the use of a business process-neutral canonical data model. This gives the data longevity, so that changes in source systems or the interpretation placed on the data by current business processes does not necessitate model or data changes.
    Access and Performance Layer: allows for multiple interpretations (e.g. past, present and future) over the same data as well as to simplify the navigation of the data for different user communities and tools. Objects in this layer can be rapidly added and changed as they are derived from the Foundation Data Layer.
    The Foundation Data Layer and Access and Performance Layers offer two additional levels of abstraction that further reduce the impact of schema changes in the data platform while still presenting a single version of the truth to consumers.
    It seems like with the shift to data lakes and object stores these two are lost, and there is a need to adopt better tools for these layers - or for the tech under the hood that enables them, which is a data versioning engine. A penny for your thoughts...
    1 reply · 2 participants
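A tiny, hedged illustration of the two layers described above, using pandas; all table and column names are made up. The point is that access-layer objects are derived views that can be added or redefined cheaply, while the foundation layer stays business-process-neutral.

```python
import pandas as pd

# Foundation data layer: canonical, business-process-neutral events.
foundation_events = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "event_type": ["order_placed", "order_paid", "order_placed"],
    "event_ts": pd.to_datetime(["2022-10-01", "2022-10-02", "2022-10-03"]),
    "amount": [100.0, 100.0, 50.0],
})

# Access and performance layer: one interpretation of the same data for a specific
# user community, rebuilt from the foundation layer whenever the definition changes.
orders_per_entity = (
    foundation_events[foundation_events["event_type"] == "order_placed"]
    .groupby("entity_id", as_index=False)
    .agg(orders=("event_type", "count"), revenue=("amount", "sum"))
)
```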
  • a

    Adi Polak

    12/05/2022, 1:41 PM
    Many of us stroll the world trying to understand: how can we optimize our data systems to achieve more? More deliverables, more volume, more valuable INSIGHTS. One valuable practice is CI/CD for data systems.
    🤔 What is CI/CD for data? CI/CD stands for continuous integration and continuous delivery, and it is a software development practice that involves regularly merging code changes into a central repository, building and testing the code automatically, and deploying it to production. In the context of data, CI/CD can be used to automate the process of integrating, validating, and deploying data pipelines and models. This can help ensure that data is consistently processed and made available in a timely and reliable manner, enabling data-driven applications and services to function smoothly and effectively.
    ✅ How do I implement CI/CD for data? To achieve CI/CD for data, all you have to do is lakeFS your data. Yes, just lakeFS it! Curious to learn more? Join our O'Reilly course on CI/CD for data.
    🛑 Don't have an O'Reilly subscription? Not a problem. Comment with "lakeFS it" and I will organize a dedicated online meetup where we will go through what it is and how to implement it.
    💡 8
    🔥 7
    💯 7
    💪 5
    💗 5
    🤩 3
    7 replies · 8 participants
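A minimal sketch of the branch-validate-merge flow behind "CI/CD for data" with lakeFS, assuming the lakefs_client Python SDK (method names may differ between SDK versions); the validation step and all repository/branch names are illustrative.

```python
from lakefs_client import Configuration, models
from lakefs_client.client import LakeFSClient

client = LakeFSClient(Configuration(host="http://localhost:8000/api/v1",
                                    username="<access-key-id>", password="<secret-key>"))
repo = "example-repo"
run_branch = "etl-run-42"

# 1. Isolate the pipeline run on its own branch.
client.branches.create_branch(
    repository=repo,
    branch_creation=models.BranchCreation(name=run_branch, source="main"),
)

# ... the pipeline writes its outputs to `run_branch` instead of main ...


def data_checks_pass() -> bool:
    """Hypothetical validation step: schema, row counts, null ratios, etc."""
    return True


# 2. Commit and merge only if validation passes; otherwise main is never touched.
if data_checks_pass():
    client.commits.commit(
        repository=repo, branch=run_branch,
        commit_creation=models.CommitCreation(message="ETL run 42 validated"),
    )
    client.refs.merge_into_branch(
        repository=repo, source_ref=run_branch, destination_branch="main",
    )
```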
  • a

    Adi Polak

    12/09/2022, 9:33 AM
    Very interesting - data size to technology scale
    2 replies · 2 participants
  • b

    Beegee Alop

    12/12/2022, 10:00 AM
    How much of the cloud do you use for data engineering? My company uses AWS, and the data engineers doubled down on the cloud. We use Lambdas, DynamoDB, S3 (of course), Kinesis, Firehose, SQS, EventBridge, etc. Our Airflow can invoke: • Lambdas • containers • Databricks • EMR
    👀 2
    3 replies · 2 participants
  • a

    Adi Polak

    12/20/2022, 9:17 AM
    The newest StackOverflow survey for 2022 is out. Some interesting highlights: 🐳 Docker is becoming a similarly fundamental tool for professional developers, increasing from 55% to 69%. 🦀 Rust is in its seventh year as the most loved language. 📊 PostgreSQL became the most loved and wanted database, winning over Redis. 💼 Big-data skills are well compensated, with Apache Spark, Apache Kafka, and Hadoop all in the top three.
    👀 3
    4 replies · 2 participants
  • j

    Jonathan Rosenberg

    12/20/2022, 12:54 PM
    Hi everyone! Are there any Delta Lake users in the audience? I would appreciate your help answering a few questions 🙏
    1 reply · 2 participants
  • a

    Adi Polak

    12/21/2022, 11:32 AM
    Some drama about dbt Cloud raising their prices today on HN -
    👀 1
  • a

    Adi Polak

    12/21/2022, 11:57 AM
    Nick Schrock, the creator of Dagster (a cloud-native orchestration tool that competes with Airflow) and a legend in the big data community, is stepping down as CEO. 💡 Some highlights: • Pete Hunt steps in as CEO; he was a founding member of the React team, one of the most widely used JavaScript tools among developers today. • Rapid growth in the data orchestration field has made this change necessary. Quote from Nick:
    "It just became clear to me that he was really good at stuff that I'm not as passionate about: people management, organizational design, designing goals, operations, things like that," Schrock said. "I, as a solo founder, by having to work on all that, wasn't able to spend as much time as I wanted doing what my zone of expertise is."
    🧐 3
    1 reply · 2 participants
  • v

    Vino

    12/22/2022, 7:24 PM
    Here is the list of all the tools built on top of ChatGPT. Have fun!! 🙂
    :jumping-lakefs: 3
    😲 1
  • a

    Adi Polak

    12/25/2022, 8:44 AM
    The virtual info broke down the various trends in 2022, where we are today, and some predictions for the future. 📉 Trends covered are: 1. data mesh 2. metrics layer 3. reverse ETL 4. active metadata & 3rd-gen data catalogs (wow, 3rd gen already 😲) 5. data teams as product teams 6. data observability
    🤔 2
  • a

    Adi Polak

    12/26/2022, 2:16 PM
    It seems like more people are aware of Delta Lake ¯\_(ツ)_/¯
  • a

    Adi Polak

    12/26/2022, 6:15 PM
    Neptune discusses the need for a tool to manage the machine learning lifecycle and its artifact versions - data, code, hyperparameters, models, and so on - for model reproducibility, transparency/availability, and collaboration. An interesting article breaking down the requirements: https://neptune.ai/blog/git-for-ml-model-version-control
    👀 2
  • a

    Adi Polak

    12/27/2022, 12:09 PM
    💭 Some interesting thoughts on possibilities with ChatGPT and GPT-4. Due to its ability to build a model based on a huge corpus of text [aka document data types], many of us leverage it to find answers to questions fast. This somewhat competes with the Google search engine, especially when ChatGPT ELEGANTLY served 5 million users on its first day of being public. 🤯 Here is a thread from multiple CEOs of companies that aim to compete with Google. They shared their bold, optimistic predictions for the future. 📊 Some of them refer to managing multiple ML experiments with 1 trillion parameters [we already had that in 2021], which will be possible because of fine-tuning of smaller models - aka more experiments. 🔓 The hidden secret of #LLMs? How much training data you have matters as much as model size. Aka scalable data, big data, and so on. I wonder what you think about ChatGPT and how it might impact how we build data systems today.
    🤯 2
  • o

    Oz Katz

    12/28/2022, 3:55 PM
    Who needs fancy tools like DBT when we have M4 from 1977? 🙂 I heard @Ariel Shaqed (Scolnicov) really likes his autoconf - apparently there's another use for it! https://emiruz.com/post/2022-12-28-composable-sql/
    🤣 2
    🧟 1
    3 replies · 3 participants
  • a

    Ariel Shaqed (Scolnicov)

    01/04/2023, 5:42 AM
    Nice piece about the state of Iceberg. https://www.theregister.com/2023/01/03/apache_iceberg/
    👍 2
    1 reply · 2 participants
  • v

    Vino

    01/05/2023, 5:51 AM
    Instead of Googling for an answer, you might #Bing for it. 😀 That's because Microsoft and OpenAI are working on incorporating #ChatGPT into the #Bing search engine. OpenAI's services are also available for commercial use exclusively on the #Azure cloud, supporting models such as #GPT3 (which #ChatGPT is based on) and #Codex (which #CoPilot is based on). The Azure OpenAI Service brings together the OpenAI API and Azure's enterprise-level security, compliance, and regional availability. https://www.msn.com/en-us/news/technology/microsoft-working-with-openai-to-incorporate-chatgpt-into-bing/ar-AA15XvJV
    3 replies · 3 participants
  • v

    Vino

    01/06/2023, 7:52 PM
    Flash news! Confluent is acquiring Immerok - a startup offering a fully managed service for Apache Flink!! 😱😱😱 https://www.confluent.io/blog/cloud-kafka-meets-cloud-flink-with-confluent-and-immerok/
    🤯 5
    1 reply · 2 participants
  • a

    Adi Polak

    01/12/2023, 9:03 AM
    We often spend hours, days, and even weeks analyzing whether a metric is correct! 🤯 Sandeep Uttamchandani (Engineering @ Unravel) shared an interesting approach: circuit breakers (quality gates) for production data pipelines!
    TL;DR
    💡 When data quality issues occur, the circuit opens, preventing low-quality data from propagating to downstream processes
    🎯 Result? The data is simply missing from reports for the time periods of low quality
    👉 You're automating data availability to be directly proportional to data quality
    🔥 No more fire-fighting and verifying-and-fixing metrics/reports on a case-by-case basis
    This probably reads familiar to you, as in the :lakefs: world we refer to it as CI/CD for data pipelines. I wonder what you think about this approach - any insights?
    💡 4
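A minimal, hedged sketch of the circuit-breaker idea above: a quality gate sits between a batch and its downstream load, and when the checks fail the batch is withheld rather than published. All names and checks are illustrative.

```python
import pandas as pd


def quality_gate(batch: pd.DataFrame) -> bool:
    """Return True only when the batch is safe to publish downstream."""
    return (
        len(batch) > 0                       # non-empty batch
        and batch["order_id"].notna().all()  # no missing keys
        and (batch["amount"] >= 0).all()     # no negative amounts
    )


def run(batch: pd.DataFrame) -> None:
    if quality_gate(batch):
        batch.to_parquet("fact_orders.parquet")  # downstream load step
    else:
        # Circuit open: the affected period is absent from reports rather than
        # silently wrong; alerting / parking the batch for repair would go here.
        print("quality gate failed; batch withheld from downstream")
```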
  • a

    Adi Polak

    01/17/2023, 11:39 AM
    :ninja: PyTorch vs. 💂 TensorFlow - we've all heard about this ongoing battle. TensorFlow is the most widespread deep learning framework, but it stopped growing around 2018. PyTorch, on the other hand, is steadily gaining traction. 📈 Ari Joury argues in this article that TensorFlow's decline will become more pronounced in the next few years, especially in the world of Python, because PyTorch: 💫 is more Pythonic 💫 has more available models 💫 is better for students and research 💫 has an ecosystem that has grown faster. But let's not forget that TensorFlow has a better deployment infrastructure and is not all about Python, which is great news for all those who don't have 🐍 skills. Are you team PyTorch or TensorFlow? Do you think TensorFlow is really slowly dying?
    ✅ 1
  • v

    Vino

    01/18/2023, 10:19 PM
    Zhamak Dehghani announced her company, nextdata.com, which offers a data-mesh-native platform. Data mesh, for me, has always been a concept/idea, and I struggle to understand how I could leverage it as a data engineer. Curious to see how the concept evolves into a platform/product 🤞🏼 She says,
    Our vision is to build a world where AI/ML and analytics are powered by decentralized, responsible, and equitable data ownership, across boundaries of organizations, technology, and most importantly boundaries of trust.
    
    Our purpose is to change the experience of creating, sharing, discovering, and using data forever, to be connected, fast, and fair based on data mesh principles.
    
    Our technology is designed to empower data developers, users and owners with a delightful experience where data products are a first-class primitive, with trust built-in.
    
    We are here to accept the reality, that the world of data is complex and messy; data models are out-of-date the moment they are created; data is owned across trust boundaries; data is stored on different platforms; data is used in many different modes and most importantly data can't protect itself. We are here to recognize that past approaches to tackle these complexities with centralized data collection, modeling and governance are ineffective at best and pathologically unfair at worst. We are here to reimagine, with you.
    👏 4
  • a

    Adi Polak

    02/06/2023, 2:00 PM
    Late-arriving data (often referred to as late-arriving dimensions) is an issue we often encounter in data warehouse solutions. The solution depends on the team, and there are many different strategies for tackling this. I read this article from 2021 that shows how to handle late-arriving dimensions while protecting data integrity, with a few different design approaches:
    • Never Process Fact - essentially discard the record rather than loading it into the fact table.
    • Park and Retry - insert the unmatched transactional record into a landing table and retry loading it into the fact table in the next batch process.
    • Inferred Flag - use a flag to state that this dimension is not available yet.
    Have you dealt with late-arriving dimensions? How did you make it work for you?
    3 replies · 2 participants
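A small, hedged sketch of the "Park and Retry" and "Inferred Flag" approaches above, using pandas; table and column names are illustrative.

```python
import pandas as pd


def load_facts(facts: pd.DataFrame, dim_customer: pd.DataFrame) -> pd.DataFrame:
    """Join incoming facts to the customer dimension and handle late arrivals."""
    joined = facts.merge(dim_customer[["customer_id", "customer_key"]],
                         on="customer_id", how="left")
    late = joined["customer_key"].isna()

    # Park and Retry: unmatched rows go to a landing table and are retried next batch.
    joined.loc[late].drop(columns="customer_key").to_parquet("landing_unmatched.parquet")

    # Inferred Flag (alternative): keep the late rows now, point them at a placeholder
    # dimension member and mark them, to be corrected when the real row arrives:
    #   joined.loc[late, "customer_key"] = -1
    #   joined["inferred_dimension"] = late

    return joined.loc[~late]  # only matched rows reach the fact table this run
```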
  • o

    Oz Katz

    02/08/2023, 6:04 PM
    Interesting move by dbt: https://www.getdbt.com/blog/dbt-acquisition-transform/
    😲 2
    🤔 1
    1 reply · 2 participants
  • r

    Richa Kukar

    02/09/2023, 11:21 AM
    https://www.linkedin.com/feed/update/urn:li:activity:7029408029200580608?utm_source=share&utm_medium=member_desktop
    5 replies · 3 participants
  • r

    Robin Moffatt

    02/10/2023, 4:36 PM
    I came across 12 Factor today via a Reddit thread, after someone on r/dataengineering asked for good primers on software engineering practice. It got me thinking - has anyone 'translated' this into a data engineering context?