---
license: cc-by-sa-4.0
---

# DeepSeek-Coder-1.3B – Clean DSC Model (DSCc)

This repository hosts **DSCc**, a fine-tuned version of **DeepSeek-Coder-1.3B** trained for **Python function generation** from docstrings and function signatures, using a *cleaned* subset of The Stack.

The model is part of the study:

> **Quality In, Quality Out: Investigating Training Data’s Role in AI Code Generation**
> 33rd IEEE/ACM International Conference on Program Comprehension (ICPC 2025)

DSCc is trained on a **Semgrep-filtered dataset** from which many low-quality and syntactically incorrect functions have been removed, allowing us to study how training data quality impacts code generation performance.

---

## Model description

- **Base model:** DeepSeek-Coder-1.3B (Python-focused code LLM)
- **Task:** Python code generation
- **Input:** Python function **docstring + signature**
- **Output:** The corresponding **function body** in Python

In our experiments, the model is conditioned on a prompt consisting of:

- A natural-language docstring describing the function's behavior
- The Python function signature

and is then asked to generate the rest of the function body.
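A prompt of this shape can be assembled from a signature and docstring as sketched below. The template and the `build_prompt` helper are illustrative assumptions, not necessarily the exact format used during fine-tuning:

```python
def build_prompt(signature: str, docstring: str) -> str:
    """Assemble a generation prompt from a signature and docstring.

    Note: this template is an assumption for illustration; the exact
    prompt format used during fine-tuning may differ.
    """
    return f'{signature}\n    """{docstring}"""\n'


prompt = build_prompt(
    "def add(a: int, b: int) -> int:",
    "Return the sum of a and b.",
)
```

The resulting string is fed to the model (e.g. via `transformers`' `generate`), which is expected to complete the function body.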

---

## What does the model do?

The model generates **Python functions** that implement the behavior described in the docstring and implied by the signature. Typical use cases:

- Synthesizing a function implementation from a high-level description
- Suggesting implementations for partially specified functions
- Exploring how training data quality affects generated code (correctness, style, quality issues)

### “Clean” training set (for DSCc)

The initial training set contains ~4.4M pairs. To construct the **clean dataset**:

- We run **Semgrep** (static analysis) on all training functions.
- Semgrep flags:
  - Low-quality patterns
  - Potentially problematic constructs
  - Syntactically incorrect functions
- All flagged low-quality or invalid functions are removed.

This yields:

- **`clean_training_set.json`** — ~4.2M pairs
- Derived from The Stack, with many quality issues and syntax errors filtered out
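
The syntax-validity part of this filtering can be sketched with Python's built-in `ast` module. This is a simplified stand-in for illustration only; the actual pipeline uses Semgrep rules, which also catch quality issues beyond syntax errors:

```python
import ast


def is_syntactically_valid(source: str) -> bool:
    """Return True if `source` parses as valid Python code."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


functions = [
    'def ok(x):\n    return x + 1\n',   # parses fine, kept
    'def broken(x)\n    return x\n',    # missing colon, dropped
]
clean = [f for f in functions if is_syntactically_valid(f)]
```

A filter of this shape keeps only the parseable functions; the Semgrep-based pipeline additionally removes functions that parse but match low-quality or problematic patterns.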