• Romain

    Romain

    11 months ago
    Hello! I don't have any experience in data science, although I can clearly see the value. What architecture would you recommend for a startup that wants a minimum viable data collection pipeline? So far this is my plan:
      • Metrics: Prometheus > Thanos > S3
      • Logs: Loki > S3
      • Clickstream: RudderStack > S3
    We are not going to have a data scientist for a while, but I'd like to:
      • Adopt best practices from the start
      • Have historical data at hand when a data scientist joins
    I came across lakeFS and, as a software engineer using Git and GitOps, it appeals to me a lot. However, I am not sure whether it is worth setting up now or whether it is useless until someone actually works on the data. Thanks for helping a data novice understand this world.
    Romain
    Barak Amar
    +1
    5 replies
  • Karen

    Karen

    11 months ago
    This is an interesting excerpt from the latest https://www.blef.fr/ newsletter: 📢 Why Machine Learning Engineers are replacing Data Scientists. This article busts myths about ML engineers: how they fit into a data team and which skills they need. Without spoiling anything, it doesn't mean data scientists will disappear, but rather that their daily job will change. To simplify, machine learning engineers are here to bridge the gap between data engineers and data scientists.
    Karen
    einat.orr
    2 replies
  • Karen

    Karen

    10 months ago
    The 10 Hottest Big Data Startups Of 2021 (according to CRN): 1. Airbyte 2. Bigeye 3. Cribl 4. Firebolt 5. Grafana Labs 6. Molecula 7. Monte Carlo 8. Speedata 9. Syncari 10. Yugabyte More details: https://www.crn.com/slide-shows/applications-os/the-10-hottest-big-data-startups-of-2021
    Karen
    1 reply
  • Oz Katz

    Oz Katz

    8 months ago
    An interesting write-up from Uber Engineering about the different optimizations they've made to Apache Parquet. It's not common to see companies invest engineering resources into file formats, so it's easy to disregard them or think of them as "solved problems". In Uber's case, the ROI for the optimizations they've made ended up being very substantial. Also, ZSTD is really cool 😃
    Oz Katz
    Paul Singman
    2 replies
  • Paul Singman

    Paul Singman

    7 months ago
    Davis writes really detailed breakdowns of different data topics after talking to many companies. Loved this article on how different companies have implemented experimentation platforms (or struggled to do so). From my experience, I can agree it’s really hard!
    Paul Singman
    1 reply
  • t

    tgosselin

    3 months ago
    Hello, I am not sure if this belongs here, but I have a use case that I am not sure how to implement with lakeFS (or externally, if necessary). Say I have container version A and 10 objects in a lakeFS main branch. I deploy the container as an AWS Batch job, with each job pulling an object from main, applying some transformation, creating a new branch on lakeFS, committing the data to the new branch, and merging the new branch into main. During the commit, I add a key-value pair to the commit message, "container_hash":"a1234jsjdf", to represent the hash of container version A. I then create container version B and execute the same process. Is there a way to "query" the repository for all objects with the container_hash value for container B? Or do I need to create a separate data structure in order to query this information?
    t
    Oz Katz
    +1
    4 replies
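    One way to picture the "separate data structure" from the question above is a small index built from the commit log. The sketch below is only an illustration under stated assumptions: the commit records are plain dictionaries with hypothetical `id`, `message`, `metadata`, and `paths` fields (with `paths` standing in for the objects each commit wrote), and the hash for container B is made up. It does not call any lakeFS API; how the records are fetched is left open.

```python
import re
from collections import defaultdict


def tag_value(commit, key):
    """Return the value recorded for `key` in a commit record, or None.

    Looks first in a `metadata` dict (hypothetical field), then falls back
    to a "key":"value" fragment embedded in the commit message.
    """
    metadata = commit.get("metadata") or {}
    if key in metadata:
        return metadata[key]
    pattern = r'"%s"\s*:\s*"([^"]*)"' % re.escape(key)
    match = re.search(pattern, commit.get("message", ""))
    return match.group(1) if match else None


def objects_by_tag(commits, key):
    """Map each value of `key` to the sorted object paths committed with it."""
    index = defaultdict(set)
    for commit in commits:
        value = tag_value(commit, key)
        if value is not None:
            index[value].update(commit.get("paths", []))
    return {value: sorted(paths) for value, paths in index.items()}


# Hypothetical commit records; real ones would come from the repository log.
commits = [
    {"id": "c1", "message": 'transform "container_hash":"a1234jsjdf"',
     "paths": ["obj/1", "obj/2"]},
    {"id": "c2", "message": "transform",
     "metadata": {"container_hash": "b5678hypothetical"},
     "paths": ["obj/1"]},
    {"id": "c3", "message": "manual fix", "paths": ["obj/3"]},
]

index = objects_by_tag(commits, "container_hash")
print(index.get("b5678hypothetical", []))
```

    Commits without the key (like `c3` here) are simply left out of the index, so a lookup for container B returns only the objects written by jobs that ran that container version.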
  • Adi Polak

    Adi Polak

    2 months ago
    The Data and AI Summit was a great conference with lots of innovation. Here is a great recap of the announcements: https://www.advancinganalytics.co.uk/blog/2022/7/4/data-and-ai-summit-2022-announcements
    Adi Polak
    1 reply
  • Oz Katz

    Oz Katz

    2 months ago
    Spotted lakeFS in Tikal's technology radar, as a technology to try 🙂 (@Chaim Turkel 💙)
    Oz Katz
    1 reply
  • c

    Comte Frédéric

    1 month ago
    Hello, I develop an open-source datalab platform (https://onyxia.sh) which offers many services based on Kubernetes and S3. I recently added lakeFS to my catalog (https://datalab.sspcloud.fr/catalog). Is Spark the only client that can read/write metadata in lakeFS and access the S3 storage directly (MinIO in my case)? I am asking because I am afraid of lakeFS becoming the bottleneck in front of MinIO.
    c
    Adi Polak
    +1
    12 replies
  • Oz Katz

    Oz Katz

    1 month ago
    Just came across this: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ViewFs.html It's pretty interesting and seems close in concept to our Hadoop Router FS. Has anyone here come across it or used it?