Paper Reading: DistilBERT

ArXiv: arxiv.org/abs/1910.01108
Training loss: a linear combination of three terms: the distillation loss L_ce (soft targets from the teacher, softened with a temperature in the softmax), the supervised masked language modeling loss L_mlm (hard labels), and a cosine embedding loss L_cos that aligns the directions of the student's and teacher's hidden-state vectors.
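Below is a minimal PyTorch sketch of such a triple loss. The function name, the loss weights, and the temperature are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha_ce=1.0, alpha_mlm=1.0, alpha_cos=1.0):
    """Linear combination of soft-target KL, hard-label MLM, and cosine losses.
    Weights and temperature are illustrative, not the paper's exact values."""
    vocab = student_logits.size(-1)

    # Soft-target distillation loss: KL between temperature-softened distributions.
    # Multiplying by T^2 keeps gradient magnitudes comparable across temperatures.
    ce_loss = F.kl_div(
        F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1),
        F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard masked-LM cross-entropy against hard labels
    # (positions that are not masked carry the ignore index -100).
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, vocab),
        labels.view(-1),
        ignore_index=-100,
    )

    # Cosine embedding loss aligning the directions of student and teacher hidden states.
    hidden = student_hidden.size(-1)
    target = torch.ones(student_hidden.view(-1, hidden).size(0),
                        device=student_hidden.device)
    cos_loss = F.cosine_embedding_loss(
        student_hidden.view(-1, hidden),
        teacher_hidden.view(-1, hidden),
        target,
    )

    return alpha_ce * ce_loss + alpha_mlm * mlm_loss + alpha_cos * cos_loss
```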
DistilBERT:
DistilBERT keeps the same general architecture as BERT: the token-type embeddings and the pooler are removed, and the number of layers is reduced by a factor of 2. Since most of the operations used in the Transformer architecture (linear layers and layer normalisation) are highly optimized in modern linear algebra frameworks, reducing the number of layers has a larger impact on inference efficiency than shrinking the hidden dimension. The student is initialized from the teacher by taking one layer out of two.
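A minimal PyTorch sketch of this initialization, assuming both models expose their stacked Transformer layers as an `encoder.layer` ModuleList; the attribute names and helper are assumptions, and mapping student layer i to teacher layer 2i is one reading of "one layer out of two".

```python
import torch

def init_student_from_teacher(student, teacher):
    """Initialize the student by copying every other Transformer layer of the teacher.

    Assumes both models expose their stacked layers as `encoder.layer`
    (a torch.nn.ModuleList), with the teacher having twice as many layers
    as the student.
    """
    # Copy the embeddings; strict=False because the student drops the
    # token-type embeddings, so those teacher weights have no match.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict(), strict=False)

    # Take one teacher layer out of two: student layer i <- teacher layer 2*i.
    for i, student_layer in enumerate(student.encoder.layer):
        student_layer.load_state_dict(teacher.encoder.layer[2 * i].state_dict())
```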
Training setup: very large batches (up to 4K examples per batch, via gradient accumulation), dynamic masking (see the sketch after these notes), and the next sentence prediction (NSP) objective is removed.
Training data: the same corpus as the original BERT.
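A minimal sketch of dynamic masking: the masked positions are re-sampled every time a batch is built, instead of being fixed once at preprocessing time. The 80/10/10 replacement split follows BERT's MLM recipe; the function name, mask probability, and the omission of special-token handling are simplifications.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    """Re-sample masked-LM targets on the fly for each batch (dynamic masking).

    Returns (corrupted input_ids, labels); unselected positions get label -100
    so they are ignored by the cross-entropy loss. For brevity, special tokens
    are not excluded from masking here.
    """
    labels = input_ids.clone()
    # Choose which tokens to predict this time around.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # compute loss only on masked positions

    corrupted = input_ids.clone()
    # 80% of the selected tokens become [MASK].
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    corrupted[replace] = mask_token_id
    # Half of the remainder (10% overall) become a random token; the rest stay unchanged.
    random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    random_tokens = torch.randint(vocab_size, labels.shape, dtype=corrupted.dtype)
    corrupted[random] = random_tokens[random]
    return corrupted, labels
```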