---
license: cc-by-sa-4.0
---

# DeepSeek-Coder-1.3B – Clean DSC Model (DSCc)

This repository hosts **DSCc**, a fine-tuned version of **DeepSeek-Coder-1.3B** trained for **Python function generation** from docstrings and function signatures, using a *cleaned* subset of The Stack.

The model is part of the study:

> **Quality In, Quality Out: Investigating Training Data’s Role in AI Code Generation**
> 33rd IEEE/ACM International Conference on Program Comprehension (ICPC 2025)

DSCc is specifically trained on a **Semgrep-filtered dataset** that removes many low-quality and syntactically incorrect functions, allowing us to study how training data quality impacts code generation performance.

---

## Model description

- **Base model:** DeepSeek-Coder-1.3B (Python-focused code LLM)
- **Task:** Python code generation
- **Input:** Python function **docstring + signature**
- **Output:** The corresponding **function body** in Python

In our experiments, the model is conditioned on a prompt consisting of:

- A natural-language docstring describing the function behavior
- The Python function signature

and is then asked to generate the rest of the function body.

---

## What does the model do?

The model generates **Python functions** that implement the behavior described in the docstring and implied by the signature.

Typical use cases:

- Synthesizing a function implementation from a high-level description
- Suggesting implementations for partially specified functions
- Exploring how training data quality affects generated code (correctness, style, quality issues)

### “Clean” training set (for DSCc)

The initial training set contains ~4.4M pairs. To construct the **clean dataset**:

- We run **Semgrep** (static analysis) on all training functions.
- Semgrep detects:
  - Low-quality patterns
  - Potentially problematic constructs
  - Syntactically incorrect functions
- All flagged low-quality or invalid functions are removed.
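The filtering step above can be sketched as follows. This is a minimal proxy, not the actual pipeline: the paper uses Semgrep with its own rule set, whereas here `ast.parse` stands in for just the syntactic-validity check, and the `filter_functions` helper is hypothetical.

```python
import ast

def is_syntactically_valid(source: str) -> bool:
    """Proxy for one Semgrep check: does the function parse as valid Python?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def filter_functions(functions: list[str]) -> list[str]:
    """Keep only functions that pass the validity check (hypothetical helper)."""
    return [f for f in functions if is_syntactically_valid(f)]

# Example: the second candidate is missing a colon and is dropped.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def broken(a, b)\n    return a + b\n",  # SyntaxError -> filtered out
]
clean = filter_functions(candidates)
```

In the actual study, Semgrep additionally flags low-quality patterns and problematic constructs beyond syntax errors, so the real clean set is stricter than this sketch suggests.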
This yields:

- **`clean_training_set.json` — ~4.2M pairs**
  - Derived from The Stack
  - With many quality issues and syntax errors filtered out

---

## Citation

If you use this model, please cite the corresponding publication.

```bibtex
@inproceedings{improta2025quality,
  title={Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation},
  author={Improta, Cristina and Tufano, Rosalia and Liguori, Pietro and Cotroneo, Domenico and Bavota, Gabriele},
  booktitle={2025 IEEE/ACM 33rd International Conference on Program Comprehension (ICPC)},
  pages={454--465},
  year={2025},
  organization={IEEE Computer Society}
}
```
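As a usage sketch, the docstring-plus-signature prompt described above can be assembled like this. The exact prompt template from the paper is not reproduced in this card, so the `build_prompt` helper below is an assumption following the common signature-then-docstring convention.

```python
def build_prompt(docstring: str, signature: str) -> str:
    """Assemble a generation prompt from a signature and a docstring.

    Hypothetical helper: the paper's exact template may differ.
    """
    return f'{signature}\n    """{docstring}"""\n'

prompt = build_prompt(
    "Return the sum of two integers.",
    "def add(a: int, b: int) -> int:",
)
# The resulting prompt would then be passed to the model (e.g. via
# Hugging Face transformers' AutoModelForCausalLM / generate), which
# completes the remaining function body.
```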