
Error during inference with image and text.

#12
by aarbelle - opened

Running into the following error when trying inference with image + text:

File "/home/.cache/huggingface/modules/transformers_modules/microsoft/Phi-4-multimodal-instruct/879783f7b23e43c12d1c682e3458f115f3a7718d/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
AssertionError: temp_len: 5409, output_imgs[-1].shape[1]: 5393

It doesn't happen for all images, just some.

Same error with:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="mps", 
    trust_remote_code=True, 
    _attn_implementation='eager',
).to("mps")

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('mps')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    generation_config=generation_config,
)
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

Error:

File "/Users/ericbuehler/.cache/huggingface/modules/transformers_modules/phi4_multimodal/modeling_phi4mm.py", line 399, in forward
    assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: temp_len: 1381, output_imgs[-1].shape[1]: 933

@aarbelle @EricB Did you find a solution to this problem yet? I am facing the same error...

nguyenbh changed discussion status to closed

@nguyenbh , I think this needs to be re-opened. There appears to be a bug here, and it's easily reproducible.

output_imgs[-1].shape[1] comes from a concatenation: torch.cat([sub_img, self.glb_GN, glb_img], dim=1), where:

  • sub_img.shape[1] comes from a concatenation of (a) something of length useful_height * useful_width (16 * 12 = 192) and (b) temp_sub_GN (16)
  • self.glb_GN.shape[1] is 1
  • glb_img.shape[1] is 16 * 16 + 16 = 272

Summing up (in my case, with a small 448x448 image) to: 208 + 1 + 272 = 481. This looks theoretically correct to me.

With the image attention mask on (default), temp_len is defined as:

temp_len = int(image_attention_mask[_bs,:B_+1,0::2,0::2].sum().item()) + (useful_height+1) + base_feat_height//base_feat_height_reduction

In my case, the numbers for these are: 320 + 17 + 16 = 353.

The whole code for Phi4MMImageEmbedding is rather impenetrable, but I don't see how those are supposed to be equal. The culprit seems to be the logic around temp_len calculation.
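For concreteness, here is a minimal sketch of the two quantities using the numbers reported above for the 448x448 case. The grid sizes and the reduction factor are assumptions chosen to reproduce the reported values, not numbers read out of modeling_phi4mm.py:

```python
# Sketch of the mismatch, plugging in the numbers quoted above for a 448x448
# image. base_feat_height / base_feat_height_reduction are assumed values,
# picked so that base_feat_height // base_feat_height_reduction == 16.

base_feat_height = 32
base_feat_height_reduction = 2
useful_height, useful_width = 16, 12   # reported sub-image grid

# Left-hand side of the assertion: output_imgs[-1].shape[1]
sub_img_len = useful_height * useful_width + 16    # 192 feature tokens + temp_sub_GN (16) = 208
glb_img_len = 16 * 16 + 16                         # 256 global tokens + separators (16) = 272
output_len = sub_img_len + 1 + glb_img_len         # + self.glb_GN (1) -> 481

# Right-hand side: temp_len, using the reported attention-mask sum of 320
mask_sum = 320                                     # image_attention_mask[_bs, :B_+1, 0::2, 0::2].sum()
temp_len = mask_sum + (useful_height + 1) + base_feat_height // base_feat_height_reduction  # 320 + 17 + 16 = 353

print(output_len, temp_len)   # 481 vs 353 -> the assert at line 399 fires
```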

Even when I simply comment out the assertion statement, I get an error right below when I send a large image:

  File "/Users/myuser/repos/phi4model/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 769, in forward
    image_hidden_states = self.image_embed(
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/myuser/repos/a11y-models/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/myuser/.cache/huggingface/modules/transformers_modules/Phi-4-multimodal-instruct/modeling_phi4mm.py", line 448, in forward
    new_hidden_states = hidden_states.index_put(
RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be broadcast to indexing result of shape [1792, 3072]

Interestingly, neither of the two values checked by the earlier assertion, i.e. temp_len (1332) and output_imgs[-1].shape[1] (900), matches the expected length (1792) here!
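For reference, the second error is just PyTorch's index_put refusing to broadcast the value tensor into the indexed slots. A standalone illustration with the shapes from the traceback (not the model code):

```python
import torch

# Standalone illustration of the RuntimeError above (not the model code).
# The prompt reserves 1792 image-token positions in hidden_states, but the
# image embedding only produced 900 rows, so index_put cannot broadcast.
hidden_states = torch.zeros(1792, 3072)
image_embeds = torch.zeros(900, 3072)
positions = (torch.arange(1792),)          # indices of the reserved positions

hidden_states.index_put(positions, image_embeds)
# RuntimeError: shape mismatch: value tensor of shape [900, 3072] cannot be
# broadcast to indexing result of shape [1792, 3072]
```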

nguyenbh changed discussion status to open

@nguyenbh have you been able to reproduce this?
