Abstract: Pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive zero-shot classification capabilities with free-form prompts and even show some generalization in specialized ...
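To make the zero-shot setup concrete, here is a minimal sketch of CLIP classification with free-form prompts, using the Hugging Face `transformers` CLIP API. The checkpoint name, image path, and class names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of CLIP zero-shot classification with free-form prompts.
# Checkpoint, image file, and class list below are assumed for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image (hypothetical path)
# Free-form prompts: each candidate class is phrased as a natural-language caption.
prompts = [f"a photo of a {c}" for c in ["cat", "dog", "bird"]]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# candidate prompts yields per-class probabilities without any fine-tuning.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```

Because the classes are defined only by the text prompts, swapping in a different prompt list changes the classification task with no retraining, which is what the abstract means by free-form prompts.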