Dataset Validation and the Syllabic Edge Case
Heuristics and language models clash when compiling the Gemmaiku training corpus: building a pipeline to enforce a strict 5-7-5 meter across 500 conversational turns.
English is a language of historical residue, not logical design. Standard orthography is a terrible guide to phonology: vowels cluster together, silent letters linger, and the same combination of letters shifts its weight depending on context. When training a model to speak exclusively in haikus, this orthographic friction becomes a bottleneck. Every syllable mismatch in the training data teaches the model a wrong line boundary and erodes its structural discipline. To build a corpus of 500 perfect 5-7-5 conversational turns, a validator had to be constructed to sit between the raw, messy internet text and the target dataset.
The Fragility of Heuristics
The initial validation attempt used a basic Python script built around the syllables library. The helper function strips punctuation, splits the text into words, and estimates the syllable count of each word. This works for standard English vocabulary but fails on technical terms and names. The word Django is estimated as three syllables because its silent leading D throws off the vowel-grouping heuristic, even though a native speaker pronounces it with two. The syllables package is a collection of regex patterns masquerading as a linguistic engine; it cannot understand pronunciation, only guess at spelling patterns.
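To make the failure mode concrete, here is a minimal sketch of a spelling-based counter. It does not reproduce the rules of the syllables library; naive_syllables and line_syllables are hypothetical names, and the vowel-group rule is only an assumed stand-in for that style of heuristic.

```python
import re
import string

# Assumption: a toy spelling rule, NOT the actual logic of the syllables package.
_VOWEL_GROUP = re.compile(r"[aeiouy]+")
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def naive_syllables(word: str) -> int:
    """Estimate syllables by counting runs of vowel letters."""
    word = word.lower()
    count = len(_VOWEL_GROUP.findall(word))
    # Treat a trailing "e" as silent, except in "-le" endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def line_syllables(line: str, estimate=naive_syllables) -> int:
    """Strip punctuation, split into words, and sum the per-word estimates."""
    words = line.translate(_STRIP_PUNCT).split()
    return sum(estimate(w) for w in words)


print(line_syllables("Python web tool set,"))              # 5 (correct)
print(line_syllables("Builds clean sites with database,"))  # 8 (spoken: 7)
print(naive_syllables("create"))                            # 1 (spoken: 2)
```

The second line is miscounted because the toy rule reads the "e" in "sites" as its own vowel group. A rule that only sees letters has no way to know which vowels are actually voiced, which is the same class of error described above.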
Interrogating the Corpus
The raw dataset (haikus.json) was run through a validation notebook (dataset_fixer.ipynb). The script evaluated each conversation, checking both user queries and assistant responses. Any entry whose assistant lines did not return a strict [5, 7, 5] count was flagged. The failure rate was high: dozens of entries in the raw corpus failed the check because of compound words, slang, or formatting issues. The model needs to see clean boundaries. In the ShareGPT format, each conversation is structured as a list of message dictionaries:
{
  "conversations": [
    {"from": "human", "value": "What is Django?"},
    {"from": "gpt", "value": "Python web tool set,\nBuilds clean sites with database,\nFast and secure flow."}
  ]
}
If the assistant’s value does not align with the 5-7-5 constraint, the fine-tuning loop will learn the wrong boundaries.
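The flagging step can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: haiku_pattern, failing_turns, and lexicon_count are hypothetical names, and the tiny hand-written LEXICON stands in for whatever syllable estimator the pipeline actually calls.

```python
import string
from typing import Callable, Dict, List

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

# Assumption: a hand-counted lexicon covering only the sample entry's words.
LEXICON = {
    "python": 2, "web": 1, "tool": 1, "set": 1,
    "builds": 1, "clean": 1, "sites": 1, "with": 1, "database": 3,
    "fast": 1, "and": 1, "secure": 2, "flow": 1,
}


def lexicon_count(line: str) -> int:
    """Sum hand-counted syllables for each word in one line."""
    words = line.lower().translate(_STRIP_PUNCT).split()
    return sum(LEXICON[w] for w in words)


def haiku_pattern(text: str, count_line: Callable[[str], int]) -> List[int]:
    """Return the syllable count of every non-empty line in the text."""
    return [count_line(ln) for ln in text.split("\n") if ln.strip()]


def failing_turns(entry: Dict, count_line: Callable[[str], int]) -> List[int]:
    """Indices of assistant ("gpt") messages that are not strictly 5-7-5."""
    return [
        i
        for i, msg in enumerate(entry["conversations"])
        if msg["from"] == "gpt"
        and haiku_pattern(msg["value"], count_line) != [5, 7, 5]
    ]


entry = {
    "conversations": [
        {"from": "human", "value": "What is Django?"},
        {"from": "gpt", "value": "Python web tool set,\nBuilds clean sites with database,\nFast and secure flow."},
    ]
}
print(failing_turns(entry, lexicon_count))  # [] means no flagged turns
```

Comparing against the full list [5, 7, 5] also rejects responses with the wrong number of lines, so a missing or extra line break is caught by the same check as a miscounted word.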
The Syllable Correction Loop
Instead of manually rewriting hundreds of lines, a model-in-the-loop correction pipeline was built. The bulk of the workload, roughly 90% of the dataset corrections, was handled locally by running google/gemma-3-1b-it. The local model struggled with the constraints and often required up to 30 attempts per topic. Between attempts, the script varied the generation temperature dynamically between 0.3 and 0.8 and swapped between few-shot prompt configurations to guide the formatting. For the remaining 10% of recalcitrant edge cases, where the local 1B model repeatedly failed to output a valid 5-7-5 pattern, the pipeline routed the topic to Claude 3.5 Sonnet via the OpenRouter API.
The script enforced a simple verification cycle: the model generated a response, the local syllables function evaluated the line counts, and the script updated the dataset only when a perfect 5-7-5 pattern was achieved. The resulting dataset, haikus_dataset.json, provides 500 clean, validated conversational turns ready for SFT.
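The control flow of that retry-then-escalate cycle might look like the sketch below. Everything here is illustrative: correct_topic, the generator callables, and the prompt style names are hypothetical, the linear temperature sweep is an assumption (the source only says the temperature varied between 0.3 and 0.8), and the real model calls are replaced by stubs.

```python
from typing import Callable, Optional, Sequence, Tuple

# A generator takes (topic, temperature, prompt_style) and returns candidate text.
Generator = Callable[[str, float, str], str]


def correct_topic(
    topic: str,
    local_generate: Generator,
    fallback_generate: Generator,
    is_valid: Callable[[str], bool],
    max_attempts: int = 30,
    temp_range: Tuple[float, float] = (0.3, 0.8),
    prompt_styles: Sequence[str] = ("few_shot_a", "few_shot_b"),  # hypothetical
) -> Optional[str]:
    """Retry the local model, then escalate once to the fallback model."""
    low, high = temp_range
    for attempt in range(max_attempts):
        # Assumption: sweep the temperature linearly from low to high.
        frac = attempt / max(max_attempts - 1, 1)
        temperature = low + frac * (high - low)
        style = prompt_styles[attempt % len(prompt_styles)]
        candidate = local_generate(topic, temperature, style)
        if is_valid(candidate):
            return candidate
    # Local model exhausted its attempts: route to the fallback model.
    # Assumption: the fallback runs once at the lowest temperature.
    candidate = fallback_generate(topic, low, prompt_styles[0])
    return candidate if is_valid(candidate) else None


# Stubs standing in for the local model and the hosted fallback.
def stub_local(topic: str, temperature: float, style: str) -> str:
    return "not a haiku"


def stub_fallback(topic: str, temperature: float, style: str) -> str:
    return "valid haiku"


result = correct_topic(
    "Django", stub_local, stub_fallback, lambda text: text == "valid haiku"
)
print(result)  # valid haiku
```

Passing the validator in as is_valid keeps the loop independent of the syllable counter, so a stricter checker could be swapped in without touching the retry logic. Returning None when both models fail leaves the original entry untouched instead of writing an unverified haiku into the dataset.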
What Next
- Launch the Metal-accelerated SFT training loop on the local M1 GPU using the newly compiled dataset.
- Configure the LoRA adapters to target the self-attention and MLP layers of the Gemma 3 270M and 1B models.
- Track the training and validation loss curves to identify the optimal stopping point before overfitting occurs.