Error during inference with image and text.
Running into the following error when trying inference with Image+Text:
```
File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393
```
It doesn't happen for all images, just some.
Same error with:
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="mps",
    trust_remote_code=True,
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
Error:
File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933```
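To narrow down which inputs fail, here is a quick probe sketch (it assumes `processor`, `model`, `prompt`, and `generation_config` from the snippet above are already loaded; the sizes are arbitrary examples, not a known-bad list):

```python
from PIL import Image

# Try a few arbitrary image sizes and report which ones trip the assertion.
for size in [(448, 448), (640, 480), (1024, 768), (1920, 1080)]:
    img = Image.new('RGB', size, 'white')
    inputs = processor(text=prompt, images=img, return_tensors='pt').to('mps')
    try:
        model.generate(**inputs, max_new_tokens=8, generation_config=generation_config)
        print(size, 'ok')
    except (AssertionError, RuntimeError) as err:
        print(size, 'failed:', err)
```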
@nguyenbh, I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.
`output_imgs[-1].shape[1]` comes from a concatenation: `torch.cat([sub_img, self.glb_GN, glb_img], dim=1)`, where:

- `sub_img.shape[1]` comes from a concatenation of something the size of (a) the `useful_height` and `useful_width` product (16 * 12), and (b) `temp_sub_GN` (16)
- `self.glb_GN.shape[1]` is 1
- `glb_img.shape[1]` is 16 * 16 + 16 = 272

Summing up (in my case, with a small 448x448 image) to: 208 + 1 + 272 = 481. This looks theoretically correct to me; a quick sanity check with dummy tensors is sketched below.
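To make that arithmetic concrete, here is a minimal sketch with dummy tensors (the hidden size `C` is arbitrary, and the 208/1/272 split is the one worked out above):

```python
import torch

C = 3072  # hidden size, arbitrary for this shape check
sub_img = torch.zeros(1, 16 * 12 + 16, C)  # useful_height*useful_width tokens + temp_sub_GN (16)
glb_GN = torch.zeros(1, 1, C)              # single global separator token
glb_img = torch.zeros(1, 16 * 16 + 16, C)  # 272 global-image tokens

out = torch.cat([sub_img, glb_GN, glb_img], dim=1)
print(out.shape[1])  # 481
```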
With the image attention mask on (default), `temp_len` is defined as:

```python
temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction
```

In my case, the numbers for these three terms are: 320 + 17 + 16 = 353.
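A structural sketch of that computation (the mask shape below is my assumption, chosen only to show how the strided sum feeds the first term, so it will not reproduce 320 exactly):

```python
import torch

_bs = 0
B_ = 1                            # crops besides the global image (assumed)
useful_height = 16
base_feat_height = 32
base_feat_height_reduction = 2

# Assumed (bs, B_+1, 32, 32) mask; the 0::2 strides downsample each 32x32
# grid to 16x16 before summing, so mask padding moves the first term
# independently of how output_imgs[-1] was built.
image_attention_mask = torch.ones(1, B_ + 1, 32, 32)

mask_term = int(image_attention_mask[_bs, :B_ + 1, 0::2, 0::2].sum().item())
temp_len = mask_term + (useful_height + 1) + base_feat_height // base_feat_height_reduction
print(mask_term, temp_len)  # 512, 545 with this all-ones mask; 320 and 353 in my failing run
```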
The whole code for `Phi4MMImageEmbedding` is rather impenetrable, but I don't see how those two quantities are supposed to be equal. The culprit seems to be the logic around the `temp_len` calculation.
Even when I simply comment out the assertion statement, I get an error right below when I send a large image:
File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
image_hidden_states = self.image_embed(
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]
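For what it's worth, the RuntimeError itself is plain advanced-indexing broadcasting; a minimal illustration with dummy shapes matching the message (not the model's real tensors):

```python
import torch

hidden_states = torch.zeros(1, 4096, 3072)
positions = torch.zeros(1, 4096, dtype=torch.bool)
positions[0, :1792] = True       # 1792 image-token slots selected in the sequence
values = torch.zeros(900, 3072)  # but only 900 image features were produced

# RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be
# broadcast to indexing result of shape [1792, 3072]
hidden_states.index_put((positions,), values)
```

In other words, the number of image features actually produced (900) no longer matches the number of image-placeholder positions reserved in the sequence (1792).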
Interestingly, both earlier variables that are asserted on, i.e.temp_len
(1332) and output_imgs[-1].shape[1]
(900), are not the expected shape (1792) here!