BERT-Style Pretraining on Convnets? Peking U, ByteDance & Oxford U’s Sparse Masked Modelling With Hierarchy Leads the Way

By Ember Recon · March 16, 2026 · 1 min read

ai
machine learning & data science
research
ai
artificial intelligence

In the new paper Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling, a research team from Peking University, ByteDance, and the University of Oxford presents Sparse Masked Modelling with Hierarchy (SparK), the first BERT-style pretraining approach that can be used on convolutional models without any backbone modifications.