Jerry (Jiarui) Zhang

I am a second-year Computer Science Ph.D. student at USC. Previously, I received my bachelor's degree in electrical engineering from Tsinghua University. I grew up in Dalian, a beautiful coastal city in China. My Chinese name is 张家瑞.

My ambitious dream is to develop an AI that can observe and reason over the real physical world. Part of my motivation comes from the fact that AlphaGo Zero surpassed the world's best human players in Go without learning from any human annotations, which suggests that humans' best strategies are far from optimal even in a well-defined decision space with a clear goal. In the complex real world, I believe AI will bring us many more surprises.

Email  /  Scholar  /  Twitter  /  Github  /  Wechat


I'm currently looking for a research internship for summer 2024, working on multimodal LLMs, especially studying their behavior and capabilities or proposing new architectures. If you have any openings or opportunities to refer me to, please don't hesitate to contact me!

I'm always happy to chat about research ideas and to collaborate. Please feel free to reach out if you are interested in discussing or working together!

Research

Currently, my research focuses on studying the properties of multimodal LLMs (MLLMs), such as their response to visual details, bias in object locations, and calibration, as well as their capabilities, such as VQA and nonverbal reasoning. I am also interested in developing new MLLMs by introducing new architectures and tasks.

Exploring Perceptual Limitation of Multimodal Large Language Models
Jiarui Zhang*, Jinyi Hu*, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun
arXiv, Github

We expose a limitation of several state-of-the-art multimodal LLMs in perceiving small visual objects. We then identify four factors that influence this limitation, namely object quality, size, distractors, and location. Through controlled intervention studies, we reveal the distinct impact of each factor. Our findings potentially offer insights for improving the visual processing capabilities of MLLMs.

Visual Cropping Improves Zero-Shot Question Answering of Multimodal Large Language Models
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
NeurIPS R0-FoMo Workshop, 2023
arXiv, Github

We qualitatively and quantitatively show the limitation of two state-of-the-art multimodal LLMs (MLLMs) in perceiving small visual details in zero-shot visual question answering. We then show that this limitation can be mitigated by visual cropping guided by the MLLM's internal attention.

A Study of Situational Reasoning for Traffic Understanding
Jiarui Zhang, Filip Ilievski, Kaixin Ma, Aravinda Kollaa, Jonathan Francis, Alessandro Oltramari
KDD, 2023
arXiv, Github

We formalize three novel text-based benchmarks in the traffic domain, covering decision making, causal reasoning over real and hypothetical events, and knowledge testing. We then study the abilities of diverse knowledge-enhanced language models on our benchmarks.

Knowledge-enhanced Agents for Interactive Text Games
Prateek Chhikara, Jiarui Zhang, Filip Ilievski, Jonathan Francis, Kaixin Ma
International Conference on Knowledge Capture (KCap), 2023
🏆🏆 Best Student Paper Award 🏆🏆

We introduce a knowledge-injection framework to enhance the functional grounding of agents in text-based games, addressing existing limitations in coherence, contextual awareness, and learning. The framework employs strategies like knowledge graphs and input encoding augmentations. Tested on 10 tasks in the ScienceWorld environment, our study reveals how task properties, model architectures, and domain knowledge interact in interactive contexts.

A Study of Zero-shot Adaptation with Commonsense Knowledge
Jiarui Zhang, Filip Ilievski, Kaixin Ma, Jonathan Francis, Alessandro Oltramari
AKBC, 2022
arXiv, Github

We train language models of different sizes using synthetic data from knowledge graphs and observe significant zero-shot performance improvements across different language tasks. We also study the effect of the knowledge graph training data size and find that more data does not always lead to better performance; the optimal data size grows with the model size.

Miscellanea

I enjoy weight lifting in my free time.

I have also gotten into cooking recently.

I like eating burgers.


This website is adapted from here.