MrBananaHuman/KalBert

KalBert

Korean ALBERT (A Lite BERT for Self-supervised Learning of Language Representations) language model

Training is based on albert_zh (https://github.com/brightmart/albert_zh).

Large KalBert (sequence length 512): https://drive.google.com/drive/folders/1a_yZIidugit3TxF__f8LSRPc8gfO2CV-?usp=sharing

  • Training data: ~6GB

    • Korean Wiki
    • KAIST Book corpus
    • Sejong corpus
  • Morphological tokenization without POS tags, followed by BPE

    • (e.g. 이순신은 조선 중기의 무신이다. -> 이순신 은 조선 중기 의 무신 이 다 .)
  • Training steps: 191,000 (batch size 128)

  • KorQuAD v1.0 dev set

    • F1: 90.01, EM: 81.26
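
The tag-free morpheme tokenization listed above can be sketched as follows. This is a minimal illustration only, not the preprocessing code used for KalBert: the README does not name a morphological analyzer, so the `(morpheme, tag)` pairs below are hand-written stand-ins for an analyzer's output, the tag labels are illustrative, and `strip_tags` is a hypothetical helper name. The BPE step that would follow is not shown.

```python
from typing import List, Tuple


def strip_tags(analyzed: List[Tuple[str, str]]) -> str:
    """Drop the POS tags and join the morphemes with single spaces."""
    return " ".join(morph for morph, _tag in analyzed)


# Hand-written stand-in for a morphological analyzer's output on
# "이순신은 조선 중기의 무신이다." (assumption: tags are illustrative only).
analyzed = [
    ("이순신", "NNP"),
    ("은", "JX"),
    ("조선", "NNP"),
    ("중기", "NNG"),
    ("의", "JKG"),
    ("무신", "NNG"),
    ("이", "VCP"),
    ("다", "EF"),
    (".", "SF"),
]

# Space-separated morphemes; a BPE tokenizer would be applied to this text next.
pretokenized = strip_tags(analyzed)
print(pretokenized)
```

Text passed to the released model would presumably need the same morpheme-level pre-tokenization before BPE, since that is the form the model saw during training.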
