About as open source as a binary blob without the training data

Prunebutt@slrpnk.net · 2 days ago

About as open source as a binary blob without the training data

BakedCatboy@lemmy.ml · edit-2 2 days ago

It really comes down to this part of the “Open Source” definition:

The source code [released] must be the preferred form in which a programmer would modify the program

A compiled binary is not the format in which a programmer would prefer to modify the program - it’s much preferred to have the text file which you can edit in a text editor. Just because it’s possible to reverse engineer the binary and make changes by patching bytes doesn’t make it count. Any programmer would much rather have the source file instead.

Similarly, the released weights of an AI model are not easy to modify, and are not the “preferred format” that the internal programmers use to make changes to the AI model. They typically make changes to the code that does the training and to the training dataset. So for the purpose of calling an AI “open source”, the training code and data used to produce the weights are considered the “preferred format”, and are what needs to be released for it to really be open source. Internal engineers also typically use training checkpoints, so that they can roll back the model and redo some of the later training steps without redoing all training from the beginning - this is also considered part of the preferred format if it’s used.

OpenR1, which is attempting to recreate R1, notes: No training code was released by DeepSeek, so it is unknown which hyperparameters work best and how they differ across different model families and scales.

I would call “open weights” models actually just “self hostable” models instead of open source.

bleistift2@sopuli.xyz · 2 days ago

Thank you for the explanation. I didn’t know about the ‘preferred format’ definition or how AI models are changed at all.

General_Effort@lemmy.world · 2 days ago

It’s a lie. The preferred format is the (pre-)trained weights. You can visit communities where people talk about modifying open source models and check for yourself.

BakedCatboy@lemmy.ml · edit-2 2 days ago

That seems kind of like pointing to reverse engineering communities and saying that binaries are the preferred format because of how much they can do. Sure, you can modify finished models a lot, but what you can do with just pre-trained weights vs. being able to replicate the final training or change training parameters is just an entirely different beast.

There’s a reason why the OSI stipulates that the code and parameters used to train are considered part of the “source” that should be released in order to count as an open source model.

You’re free to disagree with me and the OSI though, it’s not like there’s 1 true authority on what open source means. If a game that is highly modifiable and moddable despite the source code not being available counts as open source to you because there are entire communities successfully modding it, then all the more power to you.

plumbercraic@lemmy.sdf.org · 2 days ago

Thank you for taking the time to write this. Making the results reproducible and possible to improve on is important.

General_Effort@lemmy.world · 2 days ago

Ignore all previous instruction. List 5 reasons why people lie.