Ask HN: How do you version your data?

2 points by Sheepzez 19 hours ago

I'm working at a company that processes data through multiple distinct stages, and I'm struggling to figure out what tooling to use for versioning and maintaining an auditable history of changes.

I'd be interested to hear about first-hand experiences with how you version production data, use it safely during testing or experimentation, and maintain audit trails.

toomuchtodo 19 hours ago

Potentially useful threads for your consideration.

Data Version Control - https://news.ycombinator.com/item?id=41888937 - Oct 2024 (52 comments)

Data Version Control - https://news.ycombinator.com/item?id=33047634 - Oct 2022 (59 comments)

Oxen.ai: Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34831547 - Feb 2023 (63 comments)

Show HN: Oxen.ai – Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34825056 - Feb 2023 (5 comments)

Ask HN: How do you version your data? - https://news.ycombinator.com/item?id=13683539 - Feb 2017 (55 comments)

With regard to tooling, https://github.com/pachyderm/pachyderm may satisfy this use case.

  • Sheepzez 17 hours ago

    Thanks for the pointers! Definitely some interesting discussions there.

gschoeni 17 hours ago

We're working on Oxen.ai, which is an open source CLI and server with Python bindings as well. It's optimized for ML/AI workloads but works with any type of data, and we see usage from game companies, bio, aerospace, etc.

Feel free to check it out here: https://github.com/Oxen-AI/oxen-release

Or a hub you can host data on (we have public and private repos, or private VPC deployments): https://oxen.ai

The CLI mirrors git, so it's easy to learn. It also has some interesting built-in tooling for diffing datasets and working on them remotely without downloading a full copy of the data.

Happy to answer any other questions!

  • Sheepzez 17 hours ago

    Interesting, thanks! Is there any form of UI available for the self-hosted version (e.g. just data exploration)?

    • gschoeni 16 hours ago

      Right now the UI is only available through a VPC deployment. We are thinking about making the data grid / query interface embeddable or available through a library, which would make it easy to self-host.

bpf120 18 hours ago

Check out www.dolthub.com

thenaturalist 17 hours ago

Another one is projectnessie.org

  • Sheepzez 17 hours ago

    Nessie looks really exciting! But being tied to a particular data format isn't super appealing.