• Paul Singman

    1 year ago
    Started by a former Databricks engineer: https://www.datamechanics.co/ (essentially Spark on Kubernetes). Has anyone tried it out?
    Paul Singman
    1 reply
  • Paul Singman

    1 year ago
    Interesting and relevant trio of articles from the DAGshub blog:
      • Why Git is not enough for data science (4/29)
      • Datasets should behave like git repositories (1/18)
      • Comparing Data Version Control Tools - 2020 (10/31), comparing DVC, Delta, Git LFS, Dolt, Pachyderm, and lakeFS!
    Paul Singman
    1 reply
  • Xubo Fei

    1 year ago
    Hello, any thoughts or insights on using lakeFS with Delta Lake / Iceberg / Hudi? Those are pretty popular, and all of them provide some version control functionality. Would there be any benefit to using lakeFS together with them? How should we architect things if we have a data lake with some of the tabular data in those formats, plus some non-tabular files?
    Xubo Fei
    Oz Katz
    2 replies
  • Karen

    1 year ago
    Atlan: The Rise of the Metadata Lake
    Modern business operations increasingly depend on data to drive their business. As data takes the central role in business operations, the stakeholders interacting with the data are more diverse than ever. In this increasingly diverse data world, metadata holds the key to the elusive promised land. Is it time to think about a metadata lake? The blog narrates the role of the metadata lake in the modern data stack. https://towardsdatascience.com/the-rise-of-the-metadata-lake-1e95127594de
    Karen
    1 reply
  • Yoni Augarten

    1 year ago
    Hey all, I'm trying to stress-test lakeFS using Spark. I'm looking for ideas on how to generate "heavy" Spark jobs. By heavy I mean lots of reads, writes, and compute, plus a large amount of data written. Would love to get your input.
    Yoni Augarten
    1 reply
  • Paul Singman

    1 year ago
    Thoughtful piece from Snowflake on open source/standards: https://www.snowflake.com/blog/choosing-open-wisely/ Main takeaway for me is that open API access is more important than direct access to low-level files.
    Paul Singman
    1 reply
  • Paul Singman

    1 year ago
    $20M raised by Ahana — a managed Presto offering on AWS marketplace. Curious if it is lakeFS-friendlier than Athena: https://www.datanami.com/2021/08/03/ahana-grabs-20m-to-grow-presto-biz/
    Paul Singman
    1 reply
  • Paul Singman

    1 year ago
  • Paul Singman

    1 year ago
    🐦 Tweet of the Week: A question from Erik Bernhardsson that touches on themes of unbundling and ETL connectors discussed above.
    Paul Singman
    Pasha Finkelshteyn
    +1
    3 replies
  • Paul Singman

    11 months ago
    🎥 Video of the Week:

    Advanced Apache Spark Training

    Sameer Farooqui

    This 6-hour video (sorry!) from 2015 is outdated in some respects, but it still gives the clearest explanations of Spark fundamentals, which remain relevant today. Sameer covers:
    – Spark’s history
    – RDD fundamentals
    – Runtime architecture & resource managers
    – Memory & persistence
    …and more!

    One thing I found interesting is when he shares stats about the Spark project and contributions (500 total contributors, 370k lines of code, 500 active production deployments). For some of those figures it is hard to get updated numbers (there are certainly a lot more than 500 production deployments), but I can say from GitHub that there are now 1,720 total contributors. Not as high as I might have thought.

    Anyway, if you’re a Spark user, hop around to a section that interests you and I’m sure you’ll learn something new.
    Paul Singman
    1 reply