Today Google releases SigLIP 2, a new and improved family of multilingual vision-language encoders. The authors have extended the training objective of SigLIP (sigmoid loss) with additional objectives for improved semantic understanding, localization, and dense features.
SigLIP 2 models outperform the older SigLIP ones at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs).
A cherry on top is the dynamic resolution (naflex) variant. This is useful for downstream tasks sensitive to aspect ratio and resolution.
Vision encoders are simple: they take an image, encode it into a representation, and that representation is used for downstream tasks like classification, object detection, image segmentation, and other vision tasks. Researchers are always in pursuit of visual representations that are dense, locality-aware, and semantically rich.
CLIP and ALIGN were the first prominent examples of image encoders and text encoders aligned together through joint training. This approach opened new ways to train vision models. SigLIP took it further, replacing CLIP's softmax-based contrastive loss with a sigmoid loss for even better encoders.
The takeaway? With smarter training objectives, we keep building vision encoders that are more structured, fine-grained, and powerful. SigLIP 2 is exactly that: a set of really interesting and smart training objectives applied on top of SigLIP's to provide better and stronger vision-language encoders.
We will try something new with this blog post. Rather than stating what is new and where to find it, we will go through a little exercise together. We start off with SigLIP and then brainstorm a series of questions (prefixed with 🤔) and answers (each under a new heading) to gradually cover all the updates in SigLIP 2. Sounds good?
We will begin our journey with the vision encoder, where the patch size is 16 and the image resolution is 256. We start our training with four model-size variants: Base, Large, So400m, and Giant.
🤔 Question 1: What is a (low-effort) auxiliary training objective that we can use to learn better visual representations (in terms of location awareness and sense of locality)?
Add a decoder (it’s that simple)
Let’s add a decoder to the mix. Now we have an image encoder, a text encoder, and a text decoder. The text decoder will have three objectives:
Predict a holistic image caption
Predict bounding box coordinates given captions describing specific image regions
Predict a region-specific caption given bounding box coordinates
The decoder provides an additional signal to the vision encoder, making it location-aware. This marks the first improvement to the training recipe in SigLIP 2.
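To make the three objectives concrete, here is a minimal, purely illustrative sketch with random tensors standing in for real decoder outputs and targets. It assumes (Pix2Seq-style) that bounding boxes are serialized into the decoder's token space, so all three tasks reduce to token prediction; none of the shapes or names below come from the actual implementation.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch of 2, decoder vocabulary of 32k, sequences of 64 tokens.
batch, seq_len, vocab = 2, 64, 32_000

def token_cross_entropy(logits, targets):
    # Standard next-token prediction loss over a sequence.
    return F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))

# Placeholder decoder outputs for the three prediction tasks. In the real setup
# the decoder is conditioned on the vision encoder's patch features (and on the
# caption or the box, depending on the task).
caption_logits = torch.randn(batch, seq_len, vocab)
box_logits = torch.randn(batch, seq_len, vocab)
region_caption_logits = torch.randn(batch, seq_len, vocab)

caption_targets = torch.randint(0, vocab, (batch, seq_len))         # holistic caption
box_targets = torch.randint(0, vocab, (batch, seq_len))             # serialized box coordinates
region_caption_targets = torch.randint(0, vocab, (batch, seq_len))  # caption of a given region

decoder_loss = (
    token_cross_entropy(caption_logits, caption_targets)
    + token_cross_entropy(box_logits, box_targets)
    + token_cross_entropy(region_caption_logits, region_caption_targets)
)
print(decoder_loss)
```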
🤔 Question 2: How do we improve fine-grained local semantics of the image representation?
Self-distillation with Global-Local loss and Masked Prediction
To improve fine-grained local semantics in the image representation, we introduce two key training objectives: the Global-Local loss and the Masked Prediction loss. Taking inspiration from the self-supervised learning literature, we use self-distillation: the same model plays both teacher and student, and at each iteration the teacher's parameters are a moving average of the student's.
Global-Local Loss: The student network gets a partial (local) view of the training image, and is trained to match the teacher’s representation, derived from the full image.
Masked Prediction Loss: 50% of the embedded image patches in the student network are masked with mask tokens. The student needs to match the features of the teacher at masked locations.
These objectives teach the vision encoder to be spatially aware and improve its local semantics. The authors add these losses only after 80% of training with the sigmoid and decoder losses is complete. This is done to save compute (the additional losses are fairly expensive) and to avoid negatively affecting the encoders.
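As a rough mental model, here is a minimal sketch of the self-distillation setup. The EMA teacher update and the two matching terms mirror the description above, but the encoder, the shapes, and the simple MSE matching are placeholders of our own choosing rather than the actual formulation.

```python
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Linear(768, 768)   # stand-in for the vision encoder
teacher = copy.deepcopy(student)      # teacher = moving average of the student
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)

patches = torch.randn(8, 256, 768)    # full-view patch embeddings
local_view = patches[:, :64]          # partial (local) view of the image

# Global-Local loss: the student on the local view matches the teacher on the full view.
global_local = F.mse_loss(student(local_view).mean(dim=1), teacher(patches).mean(dim=1))

# Masked Prediction loss: mask 50% of the student's patch embeddings and match
# the teacher's features at the masked locations.
mask = torch.rand(patches.shape[:2]) < 0.5
masked_patches = patches.clone()
masked_patches[mask] = 0.0            # stand-in for a learned mask token
masked_pred = F.mse_loss(student(masked_patches)[mask], teacher(patches)[mask])

(global_local + masked_pred).backward()
ema_update(teacher, student)          # teacher follows the student after each step
```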
🤔 Question 3: How do we adapt models to different resolutions?
Adapting to different resolutions
It is well known that image models can be very sensitive to varying resolutions and aspect ratios. Here we can leverage two distinct methodologies to adapt these models to different resolutions and patch sizes.
Fixed resolution variant: Taking the checkpoint from 95% of training, we can resize the positional embeddings and the patch embeddings and then continue training at the requested (potentially larger) resolution.
Dynamic resolution variant: Taking inspiration from FlexiViT, which supports inputs with different sequence lengths, and NaViT, which preserves native aspect ratios, we can create the NaFlex variants. This is interesting because a single model can then handle OCR (where minimal aspect-ratio distortion matters) and document understanding (where an appropriate resolution matters).
Models with the -naflex suffix are the dynamic resolution variants. While the fixed-resolution models can be used out of the box with the existing SiglipModel class, you would need to use Siglip2Model to use the naflex variants. We handle this automatically when you use the pipeline API!
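Here is a hedged sketch of how running the dynamic resolution variant might look. The checkpoint name and the max_num_patches argument are assumptions based on the release naming scheme; the idea is that the processor resizes the image close to its native aspect ratio so that the resulting patch sequence fits within the requested budget.

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

ckpt = "google/siglip2-base-patch16-naflex"  # assumed checkpoint name
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")

# `max_num_patches` (assumed kwarg) caps the sequence length; the image is kept
# close to its native aspect ratio while being fit into this patch budget.
inputs = processor(images=[image], max_num_patches=256, return_tensors="pt").to(model.device)

with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)
print(image_embeddings.shape)
```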
This brings us to the end of the evolution from SigLIP to SigLIP 2. In the next sections we will look at applications with SigLIP 2.
Run inference with transformers
Running inference on the models is pretty straightforward. You can copy-paste the code below and run it in a free-tier Colab notebook 🚀
To run inference on SigLIP 2, please install transformers from main or from this stable branch:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2
Zero-shot Classification
Here we use the handy pipeline API to showcase zero-shot classification capabilities for SigLIP 2.
from transformers import pipeline
ckpt = "google/siglip2-so400m-patch14-384"
pipe = pipeline(model=ckpt, task="zero-shot-image-classification")
inputs = {
    "images": [
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg",  # bear
        "https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000776.jpg",  # teddy bear
    ],
    "texts": [
        "bear looking into the camera",
        "bear looking away from the camera",
        "a bunch of teddy bears",
        "two teddy bears",
        "three teddy bears",
    ],
}
outputs = pipe(inputs["images"], candidate_labels=inputs["texts"])
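For each image, the pipeline returns a list of label-score dictionaries (the scores are per-label sigmoid probabilities), so a quick way to inspect the results is something like:

```python
for image_url, predictions in zip(inputs["images"], outputs):
    print(image_url)
    for pred in predictions:
        print(f"  {pred['label']}: {pred['score']:.4f}")
```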
Let’s visualize the outputs.
(Figure: zero-shot classification scores visualized)
Encode images for downstream tasks
You can also encode images using the following:
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image
ckpt = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)  # torch.Size([1, 1152])
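You can pair these image embeddings with text embeddings from the same model. The snippet below is a sketch; it assumes the model exposes get_text_features along with the learned logit_scale and logit_bias parameters used by the sigmoid loss, and it reuses model, processor, and image_embeddings from the block above.

```python
texts = ["a photo of a bear", "a photo of a teddy bear"]
text_inputs = processor(
    text=texts, padding="max_length", max_length=64, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

# Normalize, then apply the learned temperature and bias before the sigmoid,
# mirroring how the sigmoid loss scores image-text pairs.
image_embeds = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
text_embeds = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
logits = image_embeds @ text_embeds.T * model.logit_scale.exp() + model.logit_bias
print(torch.sigmoid(logits))  # probability that each text matches the image
```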
Comparing SigLIP 1 with SigLIP 2
Looking at the table of all the SigLIP 2 models released, we see two distinct changes from SigLIP:
SigLIP 2 has new variants (naflex) for dynamic resolution.
SigLIP 2 adds a giant (1B) series.
The evaluation table of SigLIP 2 demonstrates its superiority over SigLIP.
Here is a demo where one can compare the zero-shot classification results of SigLIP 1 and SigLIP 2.
Using the encoder for VLMs
Vision encoders aligned to textual information have become increasingly vital in the development of Vision Language Models (VLMs). A common approach to building VLMs involves combining a pretrained vision encoder with a pretrained LLM, and training them together using multimodal data across a diverse set of vision-language tasks.
One standout example of a VLM leveraging the SigLIP family of vision encoders is PaliGemma. One can dive deeper into PaliGemma's capabilities in this PaliGemma blog post. Building on this foundation, the recently introduced PaliGemma 2 takes it a step further by integrating SigLIP with the advanced Gemma 2 LLM. It would be really exciting to swap out SigLIP for SigLIP 2 in a PaliGemma-like setting and see how that model fares.
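Just to illustrate the idea (this is not the PaliGemma recipe): a VLM typically projects the encoder's visual features into the LLM's embedding space with a small connector and feeds them in alongside the text tokens. The hidden size below is an arbitrary assumption, and real VLMs usually project per-patch features rather than the pooled embedding.

```python
import torch

llm_hidden_size = 2048  # assumption; depends on the chosen LLM
projector = torch.nn.Linear(image_embeddings.shape[-1], llm_hidden_size).to(image_embeddings.device)

# Projected visual features that would sit next to the text token embeddings
# in the LLM's input sequence; the whole stack is then trained on multimodal data.
visual_tokens = projector(image_embeddings)
print(visual_tokens.shape)
```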
Acknowledgements
We would like to thank Michael Tschannen (first author of SigLIP 2), Vaibhav Srivastav, and Sayak Paul for their feedback on this blog post. A huge shout out to the Google team for releasing this amazing and open model family.
In no particular order we would like to thank Pavel, Ross, Pablo, Pedro, Lysandre and the rest of the Hugging Face team for their immense support and contribution towards this project.