Ask HN: How do you version your data?
I'm working at a company that processes data through multiple distinct stages, and I'm struggling to figure out what tooling to use for versioning and maintaining an auditable history of changes.
I'd be interested to hear about first-hand experiences with how you version production data, use it safely during testing or experimentation, and maintain audit trails.
I just released (1 minute ago) a blog post covering the most popular data version control options and comparing them. Hopefully this clarifies which is the best solution for you. Here's the post - https://www.oxen.ai/blog/the-best-ai-data-version-control-to...
Potentially useful threads for your consideration.
Data Version Control - https://news.ycombinator.com/item?id=41888937 - Oct 2024 (52 comments)
Data Version Control - https://news.ycombinator.com/item?id=33047634 - Oct 2022 (59 comments)
Oxen.ai: Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34831547 - Feb 2023 (63 comments)
Show HN: Oxen.ai – Fast Unstructured Data Version Control - https://news.ycombinator.com/item?id=34825056 - Feb 2023 (5 comments)
Ask HN: How do you version your data? - https://news.ycombinator.com/item?id=13683539 - Feb 2017 (55 comments)
With regard to tooling, https://github.com/pachyderm/pachyderm may satisfy this use case.
Thanks for the pointers! Definitely some interesting discussions there.
We're working on Oxen.ai, which is an open source CLI and server with Python bindings as well. It's optimized for ML/AI workloads but works with any type of data, and we see usage from game companies, bio, aerospace, etc.
Feel free to check it out here: https://github.com/Oxen-AI/oxen-release
Or a hub you can host data on (we have public and private repos, or private VPC deployments): https://oxen.ai
The CLI mirrors git, so it's easy to learn. It also has some interesting built-in tooling for diffing datasets and for working on them remotely without downloading a full copy of the data.
Happy to answer any other questions!
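For anyone wondering what "diffing datasets" means in practice, here is a rough sketch of a row-level diff between two versions of a tabular dataset. This is not Oxen's implementation or API; the function name `diff_rows` and the `id` key column are made up for illustration, and it assumes each row has a unique key and both versions fit in memory.

```python
from typing import Dict, List

Row = Dict[str, str]


def diff_rows(old: List[Row], new: List[Row], key: str) -> Dict[str, list]:
    """Compare two dataset versions row by row, matching rows on `key`.

    Hypothetical helper for illustration only; assumes `key` is unique
    within each version.
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows whose key exists only in the new version.
    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    # Rows whose key exists only in the old version.
    removed = [old_by_key[k] for k in old_by_key if k not in new_by_key]
    # Rows present in both versions whose contents differ.
    modified = [
        (old_by_key[k], new_by_key[k])
        for k in old_by_key
        if k in new_by_key and old_by_key[k] != new_by_key[k]
    ]
    return {"added": added, "removed": removed, "modified": modified}


v1 = [{"id": "1", "label": "cat"}, {"id": "2", "label": "dog"}]
v2 = [
    {"id": "1", "label": "cat"},
    {"id": "2", "label": "wolf"},
    {"id": "3", "label": "bird"},
]
result = diff_rows(v1, v2, key="id")
print(result)
```

Storing a summary like this alongside each commit is one simple way to get a human-readable audit trail of what changed between versions.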
Interesting, thanks! Is there any form of UI available for the self-hosted version (e.g. just data exploration)?
Right now the UI is only available through a VPC deployment. We are thinking about making the data grid / query interface embeddable or available through a library, which would make it easy to self-host.
Check out www.dolthub.com
Another one is projectnessie.org
Nessie looks really exciting! But being tied to a particular data format isn't super appealing.