Selected Research
A selection of papers that represent my research, including some I currently want to highlight.
|
|
Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang*, Nitish Joshi*, Barbara Plank, Rico Angell, He He
preprint, 2025
arxiv /
TRACE detects implicit reward hacking by measuring how quickly a truncated chain of thought already suffices to obtain the reward, outperforming CoT monitoring and enabling the discovery of hidden loopholes.
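A minimal sketch of the core idea as summarised above, not the paper's implementation: truncate the chain of thought at increasing fractions and record how early the reward is already obtained. answer_fn and reward_fn are placeholder callables I introduce for illustration.

# Illustrative sketch (not the released TRACE code): a model that "cheats"
# tends to obtain the reward from a very short prefix of its reasoning.
def reasoning_effort_curve(chain_of_thought, answer_fn, reward_fn,
                           fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """For each truncation fraction, keep only that prefix of the CoT,
    re-derive an answer from it, and record whether it still earns the reward."""
    tokens = chain_of_thought.split()
    curve = []
    for frac in fractions:
        prefix = " ".join(tokens[: max(1, int(len(tokens) * frac))])
        answer = answer_fn(prefix)               # hypothetical: answer from the truncated prefix
        curve.append((frac, reward_fn(answer)))  # hypothetical reward check
    return curve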
|
|
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang*, Mingyang Wang*, Yihong Liu*, Hinrich Schuetze, Barbara Plank
NeurIPS, 2025
arxiv /
Refusal directions in LLMs transfer across languages, revealing shared jailbreak mechanisms and underscoring the need for stronger multilingual safety.
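A generic sketch of how a refusal direction is commonly extracted and compared across languages (difference of mean hidden states, then cosine similarity); the function names and shapes are my assumptions, not the paper's code.

import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """harmful_acts, harmless_acts: (n_prompts, hidden_dim) hidden states from a
    fixed layer for one language. Returns a unit-norm refusal direction."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def cross_lingual_similarity(dir_lang_a, dir_lang_b):
    """Cosine similarity between two languages' refusal directions;
    values near 1 indicate a shared (universal) direction."""
    return float(dir_lang_a @ dir_lang_b)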
|
|
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
ICLR, 2025
arxiv /
code /
We propose a surgical, cheap, and flexible approach to mitigating false refusal in LLMs, with minimal effect on performance and inference cost.
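A minimal sketch of single-vector ablation in its generic form (not the released code): remove the component of the hidden states along one direction while leaving the rest of the representation untouched.

import numpy as np

def ablate_direction(hidden, v):
    """hidden: (n_tokens, d) activations; v: (d,) direction to remove
    (e.g. a direction associated with false refusal)."""
    v = v / np.linalg.norm(v)
    # Subtract the projection of each activation onto v.
    return hidden - (hidden @ v)[:, None] * v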
|
|
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
COLM, 2024
arxiv /
code /
We showed that text answers are more robust than first-token answers in instruction-tuned language models, even when the first-token probabilities are debiased with a state-of-the-art debiasing method.
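An illustrative robustness probe under assumed interfaces (not the paper's evaluation harness): shuffle the order of the multiple-choice options and count how often the model's selected option changes. ask_model is a hypothetical callable returning the chosen option string.

import random

def selection_consistency(question, options, ask_model, n_shuffles=10, seed=0):
    """Fraction of option-order shuffles under which the model picks the
    same option as under the original ordering (1.0 = fully order-robust)."""
    rng = random.Random(seed)
    reference = ask_model(question, options)
    stable = 0
    for _ in range(n_shuffles):
        shuffled = options[:]
        rng.shuffle(shuffled)
        if ask_model(question, shuffled) == reference:
            stable += 1
    return stable / n_shuffles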
|
|
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank
ACL Findings, 2024
arxiv /
code /
We showed that first-token probabilities do not match the text answers given by instruction-tuned language models.
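A sketch of the mismatch check under assumed interfaces: compare the option implied by the first-token probabilities with the option parsed from the model's full text answer, and report how often they disagree. first_token_choice and text_choice are hypothetical callables returning an option label ('A', 'B', ...).

def mismatch_rate(examples, first_token_choice, text_choice):
    """Fraction of MCQ items on which the first-token choice and the
    parsed text answer disagree."""
    examples = list(examples)
    disagree = sum(first_token_choice(ex) != text_choice(ex) for ex in examples)
    return disagree / max(1, len(examples))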
|
|
How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang, Leonie Weissweiler, Hinrich Schütze, Barbara Plank
ACL, 2023
arxiv /
code /
We showed that pre-loading the student model with lower teacher layers gives a significant performance improvement over using higher layers.
We also studied the robustness of different distillation objectives under various initialisation choices.
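A sketch of the initialisation choice studied, assuming PyTorch models with BERT-style parameter names (the layer naming and helper are my assumptions, not the paper's code): initialise the student's layers from the lower layers of the teacher rather than from the upper ones.

# Assumed setup: student and teacher are torch.nn.Modules whose transformer
# layers are named "encoder.layer.{i}.*", as in Hugging Face BERT.
def preload_student_from_teacher(student, teacher, layer_map):
    """layer_map: dict {student_layer_idx: teacher_layer_idx},
    e.g. {i: i for i in range(6)} to copy the lower 6 teacher layers."""
    s_state, t_state = student.state_dict(), teacher.state_dict()
    for s_idx, t_idx in layer_map.items():
        s_prefix = f"encoder.layer.{s_idx}."
        for key in s_state:
            if key.startswith(s_prefix):
                t_key = key.replace(s_prefix, f"encoder.layer.{t_idx}.", 1)
                s_state[key] = t_state[t_key]
    student.load_state_dict(s_state)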
|
|
SceneFormer: Indoor Scene Generation with Transformers
Xinpeng Wang, Chandan Yeshwanth, Matthias Nießner
3DV, 2021
oral
arxiv /
video /
code /
We proposed a transformer model for scene generation conditioned on room layout and text description.
|
Projects
Coursework and practical course projects.
|
|
Domain Specific Multi-Lingually Aligned Word Embeddings
Machine Learning for Natural Language Processing Applications
2021-07
report /
|
|
Curiosity Driven Reinforcement Learning
Advanced Deep Learning in Robotics
2021-03
report /
|
|
Introduction to Deep Learning (IN2346)
SS 2020, WS 2020/2021
Teaching Assistant
website /
|
|