Wentao (Tony) Ma

@ BosonAI
@ University of Toronto

MLLM for Video / Audio Understanding and Generation

University of Toronto
College St, Toronto, ON, CA, M7A 1A2
Email: tonyyyma [at] gmail [dot] com

Open to [PhD / Research Engineer] positions


Introduction

My research focuses on Multi-Modal LLMs (MLLMs). I enjoy exploring and improving the abilities of MLLMs on video and audio, and applying them to other fields such as robotics.

Currently, I'm a Master's student at the University of Toronto and an MLE at @BosonAI, advised by Alex Smola and Mu Li. We are developing efficient and expressive foundation models for audio understanding and generation. I also work closely with Wenhu Chen on video understanding.

Before that, I spent one fantastic year at Imperial College London, supervised by Edward Johns, where we validated and improved the multi-modal pattern-learning ability of VLMs and applied them to robotics. I received my bachelor's degree in Computer Science from ShenYuan Honors College, Beihang University.

I enjoy photography and am a member of Toronto Photo Walk (ToPW). I'm also interested in all kinds of sports, including snowboarding and tennis.

Publications

VideoScore2: Think before You Score in Generative Video Evaluation

Xuan He*, Dongfu Jiang*, Ping Nie, Minghao Liu, Wentao Ma, Junru Lin, and Others

Preprint

[paper] [website]

StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Jialin Yang*, Dongfu Jiang*, Lipeng He, Sherman Siu, Wentao Ma, Zhiheng Lyu, and Others

Preprint

[paper] [website] [benchmark]

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Wentao Ma*, Weiming Ren*, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen

Preprint

[paper] [website] [benchmark] [Leaderboard]

ProT-GFDM: A Generative Fractional Diffusion Model for Protein Generation

Xiao Liang*, Wentao Ma*, Eric Paquet, Herna Lydia Viktor, Wojtek Michalowski

Computational and Structural Biotechnology Journal (CSBJ), 2025

[paper]

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen

International Conference on Computer Vision (ICCV), 2025

[paper] [website]

Paint2Plan: Image Painting Enables Imitation Learning with VLMs

Tony Ma, Teyun Kwon, Edward Johns

Preprint, 2024

[paper] [website]

LLM Echo Chamber: personalized and automated disinformation

Tony Ma, Yves-Alexandre de Montjoye

Machine Learning and Cyber Security Symposium (MLCSS), Imperial, 2024

[paper] [code] [video]

Boosting Transferability of Adversarial Patches with Visual Relations

Tony Ma, Songze Li, Yisong Xiao, Shunchang Liu

Conference on Computer Vision and Pattern Recognition (CVPR), AdvVision Workshop, 2023

[paper]

Experience

Boson AI

Machine Learning Engineer Intern

Alignment for Audio Understanding and Generation models

May.2025 - Present [website]

Vector Institute

Machine Learning Associate

Designed a Geo-filtering RAG system with Global Spatial Technology Solutions (GSTS)

Jan.2025 - Apr.2025 [website]

SONY

Edge AI Engineer Intern

Video Object Tracking / Model Quantization / Edge Computing

Sep.2022 - Feb.2023 [website] [Project]

TikTok

Software Engineer Intern

iOS development for TikTok Pay

May.2022 - Aug.2022 [website]

Selected Certifications and Awards

AWS Certified Solutions Architect (Associate) --- 2026
Mitacs Research Funding --- 2025-2026
Distinction @ Imperial College London --- 2024
Outstanding Graduate --- 2023
Scholarship for Academic Excellence --- 2020/2021/2022
Scholarship for Discipline Competitions --- 2020/2021/2022
Excellent Student Leader --- 2020

Collaborators (in no particular order)

@ Canada: Wenhu Chen, Weiming Ren, Yiming Jia, Xiao Liang, Yuzhi Tang
@ UK: Edward Johns, Teyun Kwon, Sarthak Das, Wanru Zhao
@ China: Xianglong Liu, Aishan Liu, Shunchang Liu, Bojie Zhang, Eric Gao


© Wentao Ma | Template From Dr.YueMing Jin | Last updated: Oct 2025