Xinpeng Wang

I am a PhD student at the Munich AI & NLP (MaiNLP) research lab at LMU Munich, supervised by Prof. Barbara Plank. I am currently a visiting researcher at New York University, advised by Prof. He He.

Previously, I completed my M.Sc. degree in Robotics, Cognition, Intelligence at the Technical University of Munich, where I was a student researcher at the Visual Computing & AI Lab working on indoor scene synthesis. I was also a teaching assistant for the course Introduction to Deep Learning (IN2346).

My research currently focuses on Human-Centric AI and Alignment.

Email  /  GitHub  /  Google Scholar  /  LinkedIn  /  Twitter  /  CV


Selected Research

Selected papers that represent my research, including some I currently want to highlight.


Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort


Xinpeng Wang*, Nitish Joshi*, Barbara Plank, Rico Angell, He He
preprint, 2025
arxiv /

TRACE detects implicit reward hacking by measuring how quickly truncated reasoning suffices to obtain the reward, outperforming CoT monitoring and enabling the discovery of hidden loopholes.


Refusal Direction is Universal Across Safety-Aligned Languages


Xinpeng Wang*, Mingyang Wang*, Yihong Liu*, Hinrich Schuetze, Barbara Plank
NeurIPS, 2025
arxiv /

Refusal directions in LLMs transfer across languages, revealing shared jailbreak mechanisms and underscoring the need for stronger multilingual safety.


Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation


Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
ICLR, 2025
arxiv / code /

We propose a surgical and flexible approach to mitigating false refusal in LLMs, with minimal effect on performance and inference cost.


Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think


Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
COLM, 2024
arxiv / code /

We showed that text answers are more robust than first-token answers in instruction-tuned language models, even when the first-token answers are debiased with state-of-the-art debiasing methods.


"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models


Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank
ACL Findings, 2024
arxiv / code /

We showed that first-token probability evaluation does not match text answers in instruction-tuned language models.


How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives


Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank
ACL, 2023
arxiv / code /

We showed that initialising the student model with lower teacher layers gives a significant performance improvement compared to using higher layers. We also studied the robustness of different distillation objectives under various initialisation choices.


Sceneformer: Indoor Scene Generation with Transformers


Xinpeng Wang, Chandan Yeshwanth, Matthias Nießner
3DV, 2021
oral
arxiv / video / code /

We proposed a transformer model for indoor scene generation conditioned on room layout and text descriptions.




Projects

Coursework and practical course projects.


Domain Specific Multi-Lingually Aligned Word Embeddings


Machine Learning for Natural Language Processing Applications
2021-07
report /

Curiosity Driven Reinforcement Learning


Advanced Deep Learning in Robotics
2021-03
report /

Teaching


Introduction to Deep Learning (IN2346)


SS 2020, WS 2020/2021
Teaching Assistant
website /


Design and source code from Jon Barron's website