Peng Qi

齐鹏

(pinyin: /qí péng/; ipa: /tɕʰǐ pʰə̌ŋ/)

I am an AI researcher working on natural language processing, machine learning, and multimodal agents. I am currently leading research efforts at Orby AI.

My research is driven by the goal of bringing the world’s knowledge to the user’s assistance, which manifests itself in the following directions:

How to effectively organize and use knowledge. This involves tasks like question answering (where I have co-led the development of some benchmarks for complex reasoning: HotpotQA and BeerQA), information extraction, syntactic analysis for many languages (check out Stanza), etc.
How to effectively communicate knowledge. This mainly concerns interactive NLP systems such as conversational systems, where I am interested in theory-of-mind reasoning under information asymmetry (e.g., how to ask good questions and how to provide good answers beyond the literal answer), offline-to-online transfer, multimodal interactions, etc. On the application side, I co-led the founding research team that launched Amazon Q at Amazon.
How to leverage interactive knowledge to help users better perform tasks. This mainly concerns multimodal digital agents operating on real-world user interfaces and solving problems on behalf of users. Want to learn more? Consider joining our research team at Orby AI!

In all of these directions, I am also excited to explore data-efficient models and training techniques, model and system explainability, and self-supervised learning techniques that enable us to address these problems.

Before joining Orby, I worked for Amazon Web Services (AWS) as a senior applied scientist, and for JD.com AI Research as a senior research scientist before that. I obtained my Ph.D. in Computer Science at Stanford University, advised by Prof. Chris Manning, where I was a member of the NLP group and AI Lab. I also obtained two Master’s degrees at Stanford (CS & Statistics), and my Bachelor’s at Tsinghua University.

[CV (slightly outdated)]

latest posts

Jul 01, 2025	Why You Should Stop Using HotpotQA for AI Agents Evaluation in 2025
Mar 12, 2025	AI is the New Rocket Science
Jan 02, 2025	What do industry researchers do, anyway? Part 2 -- What do they do when they are not publishing

selected publications

arXiv

Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su

2024

PDF
EMNLP

Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

Zhengxuan Wu, Yuhao Zhang*, Peng Qi*, Yumo Xu*, Rujun Han, Yian Zhang, Jifan Chen, Bonan Min, and Zhiheng Huang

In Empirical Methods in Natural Language Processing (EMNLP), 2024

PDF
EMNLP

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Wang Jenyuan, Lan Liu, William Yang Wang, Bonan Min, and Vittorio Castelli

In Empirical Methods in Natural Language Processing (EMNLP), 2024

PDF
EMNLP Findings

Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Kaiser Sun, Peng Qi*, Yuhao Zhang*, Lan Liu, William Yang Wang, and Zhiheng Huang

In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023

PDF
ACL Findings

PragmatiCQA: A Dataset for Pragmatic Question Answering in Conversations

Peng Qi*, Nina Du*, Christopher D. Manning, and Jing Huang

In Findings of the Association for Computational Linguistics: ACL 2023, 2023

PDF Code
EMNLP

Answering Open-Domain Questions of Varying Reasoning Steps from Text

Peng Qi*, Haejun Lee*, Oghenetegiri "TG" Sido*, and Christopher D. Manning

In Empirical Methods in Natural Language Processing (EMNLP), 2021

PDF Code Poster Website
ACL (Demo)

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

Peng Qi*, Yuhao Zhang*, Yuhui Zhang, Jason Bolton, and Christopher D. Manning

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020

HTML PDF Video Code
EMNLP

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang*, Peng Qi*, Saizheng Zhang*, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning

In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

PDF Blog