sswoo123 committed on
Commit d3f144a · verified · 1 Parent(s): 7392eed

Update README.md

  1. README.md +55 -7
README.md CHANGED
@@ -1,10 +1,58 @@
  ---
- title: README
- emoji: 👀
- colorFrom: green
- colorTo: blue
- sdk: static
- pinned: false
  ---

- Edit this `README.md` markdown file to author your organization card.
+ <p align="center">
+ <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" width="100" />
+ </p>
+
+ <h1 align="center">🇰🇷 KORMo Research</h1>
+ <p align="center">
+ An open-source hub for high-quality Korean data and language model research.
+ This is the home for <b>KORMo models</b> and <b>high-quality Korean pre-training datasets</b>.
+ </p>
+
+ ---
+
+ ## 🧠 Released Models
+
+ - 🧹 **Tokenizer**: Korean-specific EPK tokenizer
+ → Optimized Korean representation and improved downstream performance
+
+ - 🏋️ **PT Model (Pretraining)**: <b>KORMo-10B</b> pretrained model trained on 40B+ tokens of data
+ → Old-both deduplication + quality filtering applied
+
+ - 🧭 **Mid-train Model**: intermediate-step training checkpoints released
+ → Useful for analyzing training curves and performance
+
+ - 🧠 **SFT Model**: model fine-tuned on instruction datasets
+ → High-performance instruction-following model
+
+ > 💡 **To see all checkpoints of a model**, check the `Revisions` tab at the top of each model page.
+
  ---
+
+ ## 📚 Released Datasets
+
+ - 🧹 **KOR-Clean**: 40B+ token Korean corpus with Old-both deduplication and quality filtering
+ → Removes defective, low-information, and toxic data to improve training quality
+
+ - 🧾 **Instruction dataset**: instruction-based dataset for fine-tuning
+ → Composed of dialogue and question-answering data resembling real-world tasks
+
+ - 🧠 **Synthetic dataset**: training resource based on large-scale generated data
+ → Stable performance gains and greater diversity
+
  ---
 
+ ## 🆕 News 🗞️
+
+ - 🪄 **First public release of Korean LLM training code and data**
+ - 🚀 <b>KORMo-10B</b> released 🎉
+
+ ---
+
+ ## 🌐 Links
+ <p align="center">
+ <a href="https://github.com/kormo-project"><img src="https://img.shields.io/badge/GitHub-black?logo=github&style=for-the-badge"></a>
+ <a href="https://huggingface.co/kormo-project"><img src="https://img.shields.io/badge/HuggingFace-orange?logo=huggingface&logoColor=white&style=for-the-badge"></a>
+ <a href="https://kormo.ai"><img src="https://img.shields.io/badge/Website-blue?logo=web&style=for-the-badge"></a>
+ </p>
+
+ ---