It is a family of multimodal models based on the pretrained Qwen2-72B-Instruct LLM and the InternViT vision encoder. There are three variants, differentiated by how the vision tokens are used: decoder-only (like the majority of existing VLMs), cross-attention, and a hybrid. Only the first seems to be on Hugging Face at the moment.
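To make the decoder-only vs. cross-attention distinction a bit more concrete, here is a rough toy sketch. It is not taken from the model's code: the function names and token counts are made up for illustration, causal masking is ignored, and the hybrid variant is not modelled. It only shows where the image tokens end up and how that changes the number of query-key pairs attention has to score.

```python
# Toy illustration only: hypothetical helper names, not NVLM's actual code.

def decoder_only_sequence(text_tokens, image_tokens):
    """Decoder-only style: image tokens sit in the same sequence as text tokens,
    so the LLM's self-attention sees one long combined sequence."""
    return list(image_tokens) + list(text_tokens)


def cross_attention_inputs(text_tokens, image_tokens):
    """Cross-attention style: the LLM sequence holds only text tokens;
    image tokens are kept aside as a separate memory that text attends to."""
    return list(text_tokens), list(image_tokens)


def attention_pairs_decoder_only(n_text, n_image):
    """Query-key pairs for self-attention over the combined sequence
    (causal masking ignored for simplicity)."""
    n = n_text + n_image
    return n * n


def attention_pairs_cross(n_text, n_image):
    """Query-key pairs for text self-attention plus text-to-image cross-attention."""
    return n_text * n_text + n_text * n_image


if __name__ == "__main__":
    text = ["What", "is", "in", "the", "image", "?"]
    image = [f"<img_{i}>" for i in range(16)]  # assumed number of image tokens

    print(len(decoder_only_sequence(text, image)))      # LLM sequence length, decoder-only
    print(len(cross_attention_inputs(text, image)[0]))  # LLM sequence length, cross-attention
    print(attention_pairs_decoder_only(len(text), len(image)))
    print(attention_pairs_cross(len(text), len(image)))
```

Under these assumptions the decoder-only layout pays for a longer LLM sequence, while the cross-attention layout keeps the sequence at text length but needs extra attention layers to look at the image tokens.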
Also, they seem to train only on publicly available data, concluding that quality is more important than scale.
It has a non-commercial cc-by-nc-4.0 license. I would guess the only way to use this in production is to use Nvidia's data centers to host it? Or are there other ways?
Not a lawyer, not legal advice, but... the legal status quo is that neural network outputs are not copyrightable. They are currently considered not made by humans nor considered a derivative work from the training material / network weights (assuming it's not regurgitating copyrighted material verbatim).
The cc-by-nc-4.0 license applies to the network weights. The only thing non-commercial about the license is that it restricts how you may reproduce the licensed material:
> reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and
As long as you are not selling the network weights themselves, nothing in the license prevents you from evaluating the neural network for commercial purposes and selling the outputs. In 'production', though, you will have to download the weights directly from Nvidia (or from another third party that is distributing them non-commercially in good faith); you can't copy the network weights onto your commercial inference server from another one of your commercial deployment servers. Or at least, it gets dicier there and may be considered commercial reproduction, so better avoid it.
For similar reasons you may 3D print a CC-BY-NC model of a tool and use that tool in your commercial workshop, you may use a CC-BY-NC compiler of a language to compile commercial programs, etc.
Not a lawyer, but work with lawyers a lot, and this type of rules-lawyering doesn't tend to work in the legal profession. Consult a lawyer before trying any of this.
First time I've read this interpretation regarding CC-BY-NC model weights. Are there any sources to back it?
Reminder that Nvidia is still the only company making any money out of the "AI revolution".
That's natural given that they mostly produce hardware several layers of abstraction distant from the end-user value: companies need to buy the hardware before they can start delivering their own value. AI model training is not value by itself if there's no use case for the model that can be charged for.
I see it playing out one of two ways. Either Nvidia are selling shovels in a gold rush, the rush will end, and the business will dry up (after they have made a lot of money!). Or AI sticks/takes off, and Nvidia are selling a commodity too far from the value, like most electronic component manufacturers, and they'll maintain significant market share but have their margins reduced to a fraction of what they were before (after they made a lot of money!).
The human value doesn't come from ML training or inference, it comes from taking a better photo. The business value comes from drafting a better email. Those companies closer to that value will likely do better in the long run, as they always have done.
"When there is a gold rush, sell shovels"
Wrong
Midjourney is profitable. All the acquired startups (e.g. Streamlit or MosaicML) that made millions per employee "made money" for the people who cared.
That's not true. There are plenty of companies that make a profit; Midjourney is an obvious example.
I have yet to hear of anyone actually using AI for something properly.
The only exception I'm excited about is the non-main characters in video games, where a lot of the random NPCs can now actually bring some more fun to the game.
I run in production a system that uses LLM translation and summarization from hundreds of sources in dozens of languages. Users are extremely satisfied with the results, which are far cheaper and far higher quality than what was available before.
Vision models are a godsend for blind users. I use a vision model to sort my laundry, for instance...
And translation and grammar/spell checking are also at a level that was unthinkable before LLMs hit.
But that's it, really. The "talking machine" aspect of it is more and more being exposed as totally useless.
> I use a vision model to sort my laundry
You built a robot that sorts laundry? Tell us more!
No, I never said that. But you already know that. The robot in this case is me holding a smartphone.
Is that faster than just determining by touch what type of garment something is? Or is this about sorting by color?
It's for sorting by color/print. Some things you remember instantly by touch, others not so much.
I love how they include a helpful chart that shows this model scores worse than everything else.
Am I looking at the wrong table? It dominates everything on visual interpretation benchmarks.
Edit: specifically OCRBench and VQAv2
All jokes aside (and that did make me laugh) at least they're not training just to hit the benchmarks, which seem to be more meaningless as a quality indicator with each passing day.
I see a few models (3 models in MMMU) that score lower than Nvidia's. But putting that aside, they at least get points for apparent objectivity. At least they probably aren't fudging numbers.
It's not that bad, and I'd much rather that they be honest instead of lying like everyone else does.
Well, but it actually doesn't, unless you're looking only at MMMU.