CodeT5 Syllabus Generator for Educational Content Creation

Model Description

This is a Salesforce/codet5-small model fine-tuned to generate structured markdown syllabi with component selection indices. The model takes course requirements as input and generates markdown-formatted syllabi with index-based references to pre-defined educational components.

Key Features:

  • Generates well-structured markdown syllabi
  • Selects appropriate components using index notation [0], [1], [2] (see the resolution sketch after this list)
  • Understands educational domain concepts (learning objectives, Bloom's taxonomy, difficulty progression)
  • Produces prerequisite-aware module sequences
  • Trained with pedagogical quality metrics
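
To make the index notation concrete: each [n] reference selects the n-th entry of a component library. A minimal sketch, assuming the component files referenced in the parsing section below hold flat JSON arrays (their per-entry schema is not documented in this card):

import json

# Load one component library; the path comes from the parsing section
# below, but the flat-JSON-array assumption is ours.
with open("data/components/modules.json") as f:
    modules = json.load(f)

# A generated heading such as
#   "### Module 1: Introduction to Machine Learning [0]"
# refers to modules[0].
first_module = modules[0]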

Training Data

  • Training Examples: 1300 curated course-to-code pairs
  • Epochs: 20
  • Data Quality: High-quality examples covering:
    • Multiple difficulty levels (beginner, intermediate, advanced)
    • Various domains (computer science, data science, business, arts)
    • Diverse course structures and pedagogical approaches
    • Bloom's taxonomy alignment
    • Assessment types and learning activities

Training Configuration

Model: Salesforce/codet5-small (60M parameters)
Tokenizer: RobertaTokenizer
Batch Size: 16
Gradient Accumulation: 2 (effective batch size: 32)
Learning Rate: 3e-4
Weight Decay: 0.01
Label Smoothing: 0.1
Max Input Length: 640 tokens
Max Output Length: 536 tokens
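
The training script itself is not published with this card. Purely as an illustration, the settings above map onto Hugging Face's Seq2SeqTrainingArguments roughly as follows (values are taken from this card; output_dir is a placeholder, and the actual script may differ):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./codet5-syllabus",   # placeholder path
    num_train_epochs=20,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size: 32
    learning_rate=3e-4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    gradient_checkpointing=True,      # per the Training Framework notes
)
# The 640/536 token limits are enforced at tokenization and generation
# time rather than through the trainer arguments.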

Usage

Quick Start

from transformers import RobertaTokenizer, T5ForConditionalGeneration
import json
import torch

# Load model and tokenizer
model_id = "dewyn/educraft-t5-function-call"
tokenizer = RobertaTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# Prepare input
requirements = {
    "title": "Machine Learning Fundamentals",
    "domain": "computer_science",
    "level": "intermediate",
    "duration": "semester",
    "description": "Introduction to machine learning algorithms and applications",
    "learning_objectives": [
        "Understand supervised learning algorithms",
        "Implement neural networks",
        "Evaluate model performance"
    ]
}

input_text = f"Generate course syllabus: {json.dumps(requirements)}"

# Generate the markdown syllabus
input_ids = tokenizer(
    input_text,
    return_tensors="pt",
    max_length=640,
    truncation=True,
    padding=True
).input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=536,
        num_beams=4,
        early_stopping=False,
        no_repeat_ngram_size=2,
    )

generated_markdown = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_markdown)
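
Training was done on CPU (see Training Framework below), and inference also runs on CPU. If a GPU is available, the model and inputs can be moved there after loading and before calling generate():

import torch

# Optional: use a GPU when available. Place this after loading the model
# and tokenizing the input, before model.generate().
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
input_ids = input_ids.to(device)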

Expected Output

# Machine Learning Fundamentals

**Domain:** Computer Science
**Level:** Intermediate
**Duration:** Semester

## Course Description
Introduction to machine learning algorithms and applications

## Learning Objectives
- Understand supervised learning algorithms
- Implement neural networks
- Evaluate model performance

## Modules

### Module 1: Introduction to Machine Learning [0]
**Duration:** 8 weeks

### Module 2: Supervised Learning Algorithms [1]
**Duration:** 12 weeks
**Prerequisites:** Module 1

### Module 3: Neural Networks [2]
**Duration:** 16 weeks
**Prerequisites:** Module 2

## Activities
- Hands-on ML Exercise [0] - Apply level
- Neural Network Workshop [1] - Create level

## Assessments
- Final Project [0] - Project type

Integration with Parsing Pipeline

from scripts.markdown_syllabus_parser import MarkdownSyllabusParser

# Parse generated markdown
parser = MarkdownSyllabusParser(
    modules_file="data/components/modules.json",
    activities_file="data/components/activities.json",
    assessments_file="data/components/assessments.json"
)

syllabus = parser.parse_markdown(generated_markdown)

# Result is a complete syllabus dictionary with resolved components
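
The exact shape of the returned dictionary is defined by MarkdownSyllabusParser and is not documented in this card; a quick way to inspect it:

import json

# Inspect the parsed structure; key names depend on MarkdownSyllabusParser.
print(json.dumps(syllabus, indent=2, default=str))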

Model Details

Base Model: Salesforce/codet5-small

  • Pre-trained on 8.35M code functions (Python, Java, Go, JavaScript, Ruby, PHP)
  • 60M parameters
  • Encoder-decoder transformer architecture

Why CodeT5 vs. T5:

  • CodeT5 is pre-trained primarily on code rather than general natural language
  • Understands programming syntax and structural patterns
  • Better at producing syntactically valid structured output (index references, function calls)
  • Less prone to hallucination and syntax errors in structured output

Limitations

  • Optimized specifically for educational content generation
  • Requires a structured input format (JSON with specific keys)
  • Generated output assumes the SyllabusBuilder API and the component files are available for index resolution
  • May need post-processing for edge cases or unusual course structures (see the sketch below)
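
As one example of the kind of post-processing the last point refers to (this check is not part of the released pipeline), you could verify that every [n] reference in the generated markdown falls within the corresponding component library:

import json
import re

def check_indices(markdown, components, label):
    """Flag [n] references that fall outside a component library.

    For brevity this scans the whole document; a stricter check would
    split the markdown by section (modules/activities/assessments) and
    validate each section against its own library.
    """
    for match in re.finditer(r"\[(\d+)\]", markdown):
        idx = int(match.group(1))
        if idx >= len(components):
            print(f"Out-of-range {label} index: [{idx}] "
                  f"(library has {len(components)} entries)")

# Assumption: the component file holds a flat JSON array.
with open("data/components/modules.json") as f:
    modules = json.load(f)

check_indices(generated_markdown, modules, "module")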

Citation

If you use this model, please cite:

@misc{codet5-syllabus-generator,
  author = {EduCraft MSc AI Capstone Project},
  title = {CodeT5 Function Call Generator for Educational Syllabus Creation},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/dewyn/educraft-t5-function-call}}
}

License

Apache 2.0 (same as base CodeT5 model)

Training Framework

  • PyTorch
  • Transformers (Hugging Face)
  • Trained on CPU (WSL2) with gradient checkpointing
  • Training time: ~2.5 hours (20 epochs)

Contact

For questions or issues, please open an issue on the project repository.
