Error during inference with image and text.
Running into the following error when trying inference with Image+Text:
```
File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393
```
It doesn't happen for all images, just some.
Same error with:
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="mps",
    trust_remote_code=True,
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
Error:
File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933```
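To narrow down which inputs fail, here is a quick probe sketch (it assumes `processor`, `model`, `prompt`, and `generation_config` from the snippet above are already loaded; the sizes are arbitrary examples, not a known-bad list):

```python
from PIL import Image

# Try a few arbitrary image sizes and report which ones trip the assertion.
for size in [(448, 448), (640, 480), (1024, 768), (1920, 1080)]:
    img = Image.new('RGB', size, 'white')
    inputs = processor(text=prompt, images=img, return_tensors='pt').to('mps')
    try:
        model.generate(**inputs, max_new_tokens=8, generation_config=generation_config)
        print(size, 'ok')
    except (AssertionError, RuntimeError) as err:
        print(size, 'failed:', err)
```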
@nguyenbh, I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.
`output_imgs[-1].shape[1]` comes from a concatenation: `torch.cat([sub_img, self.glb_GN, glb_img], dim=1)`, where:

- `sub_img.shape[1]` comes from a concatenation of something the size of (a) the `useful_height` and `useful_width` product (16 * 12), and (b) `temp_sub_GN` (16)
- `self.glb_GN.shape[1]` is 1
- `glb_img.shape[1]` is 16 * 16 + 16 = 272

Summing up (in my case, with a small 448x448 image) to: 208 + 1 + 272 = 481. This looks theoretically correct to me; a quick sanity check with dummy tensors is sketched below.
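To make that arithmetic concrete, here is a minimal sketch with dummy tensors (the hidden size `C` is arbitrary, and the 208/1/272 split is the one worked out above):

```python
import torch

C = 3072  # hidden size, arbitrary for this shape check
sub_img = torch.zeros(1, 16 * 12 + 16, C)  # useful_height*useful_width tokens + temp_sub_GN (16)
glb_GN = torch.zeros(1, 1, C)              # single global separator token
glb_img = torch.zeros(1, 16 * 16 + 16, C)  # 272 global-image tokens

out = torch.cat([sub_img, glb_GN, glb_img], dim=1)
print(out.shape[1])  # 481
```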
With the image attention mask on (default), `temp_len` is defined as:

```python
temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction
```

In my case, the numbers for these three terms are: 320 + 17 + 16 = 353.
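A structural sketch of that computation (the mask shape below is my assumption, chosen only to show how the strided sum feeds the first term, so it will not reproduce 320 exactly):

```python
import torch

_bs = 0
B_ = 1                            # crops besides the global image (assumed)
useful_height = 16
base_feat_height = 32
base_feat_height_reduction = 2

# Assumed (bs, B_+1, 32, 32) mask; the 0::2 strides downsample each 32x32
# grid to 16x16 before summing, so mask padding moves the first term
# independently of how output_imgs[-1] was built.
image_attention_mask = torch.ones(1, B_ + 1, 32, 32)

mask_term = int(image_attention_mask[_bs, :B_ + 1, 0::2, 0::2].sum().item())
temp_len = mask_term + (useful_height + 1) + base_feat_height // base_feat_height_reduction
print(mask_term, temp_len)  # 512, 545 with this all-ones mask; 320 and 353 in my failing run
```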
The whole code for `Phi4MMImageEmbedding` is rather impenetrable, but I don't see how those two quantities are supposed to be equal. The culprit seems to be the logic around the `temp_len` calculation.
Even when I simply comment out the assertion statement, I get an error right below when I send a large image:
File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
image_hidden_states = self.image_embed(
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]
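For what it's worth, the RuntimeError itself is plain advanced-indexing broadcasting; a minimal illustration with dummy shapes matching the message (not the model's real tensors):

```python
import torch

hidden_states = torch.zeros(1, 4096, 3072)
positions = torch.zeros(1, 4096, dtype=torch.bool)
positions[0, :1792] = True       # 1792 image-token slots selected in the sequence
values = torch.zeros(900, 3072)  # but only 900 image features were produced

# RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be
# broadcast to indexing result of shape [1792, 3072]
hidden_states.index_put((positions,), values)
```

In other words, the number of image features actually produced (900) no longer matches the number of image-placeholder positions reserved in the sequence (1792).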
Interestingly, both earlier variables that are asserted on, i.e.temp_len
(1332) and output_imgs[-1].shape[1]
(900), are not the expected shape (1792) here!