Requirement
From a large block of text, extract the Front Matter metadata required for markdown.
Front Matter Parsing Logic
Logic that relies on the </think> tag or on fixed starting positions is fragile, because some large-model outputs do not include the think tag at all. A more robust “Seek and Extract” strategy is adopted instead.
New Logic Details
Cleaning and Preprocessing (Pre-clean):
- Normalize all newline characters to \n.
- Check for and remove an outer Markdown code block wrapper (such as ```markdown … ```), retaining only the inner content.
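The pre-clean step could be sketched roughly as follows. The function name `preClean`, the exact regular expression, and the initial `trim()` are illustrative assumptions, not details taken from the original implementation.

```typescript
// Hypothetical helper: the name, regex, and trim() call are assumptions.
const FENCE = "`".repeat(3); // three backticks, built this way to keep the sample readable

function preClean(raw: string): string {
  // Normalize CRLF and lone CR to LF, then drop surrounding whitespace so
  // that a wrapper fence is detected even when blank lines surround it.
  let text = raw.replace(/\r\n?/g, "\n").trim();

  // If the whole text is one fenced block (optionally tagged, e.g. "markdown"),
  // keep only the content between the opening and closing fence lines.
  const wrapper = new RegExp(
    "^" + FENCE + "[A-Za-z]*[ \\t]*\\n([\\s\\S]*?)\\n" + FENCE + "$"
  );
  const match = text.match(wrapper);
  if (match && match[1] !== undefined) {
    text = match[1];
  }
  return text;
}
```

Text that is not wrapped in a fence passes through with only its newlines normalized and its outer whitespace trimmed.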
Locate Start Marker (Scan Start):
- Scan line by line to find the first line that strictly equals --- (ignoring trailing spaces).
- Key Point: Any leading chatter from the LLM output (such as “Thinking Process…”, “Here is the file:”, or <think>...</think>) is automatically skipped and treated as noise.
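A minimal sketch of the start-marker scan, assuming the text has already been split into lines. `findStartMarker` is a hypothetical name used only for illustration.

```typescript
// Hypothetical helper: returns the index of the first "---" line, or -1.
function findStartMarker(lines: string[]): number {
  // trimEnd() ignores trailing whitespace only; a line with leading
  // whitespace (" ---") does not count as a marker.
  return lines.findIndex((line) => line.trimEnd() === "---");
}

const sample = [
  "<think>",
  "planning...",
  "</think>",
  "Here is the file:",
  "---",
  "title: Demo",
  "---",
  "Body",
];
// findStartMarker(sample) is 4: the four noise lines before it are skipped.
```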
Locate End Marker (Scan End):
- From the line after the start marker, continue searching for the next --- line.
- If no paired end marker is found, it is deemed a parsing failure (to avoid incorrect truncation).
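The end-marker search might look like the sketch below. `findEndMarker` is a hypothetical name, and returning -1 to signal a parsing failure is an assumption about how the failure is reported.

```typescript
// Hypothetical helper: finds the first "---" line strictly after startIndex.
// Returns -1 when no closing marker exists, which the caller should treat
// as a parsing failure rather than guessing where the YAML ends.
function findEndMarker(lines: string[], startIndex: number): number {
  return lines.findIndex(
    (line, index) => index > startIndex && line.trimEnd() === "---"
  );
}
```

Treating a missing end marker as a failure means an unterminated block is never silently cut off at an arbitrary line.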
Precise Extraction (Extract):
- YAML: Extract all lines between the two --- markers.
- Body: Extract all lines after the end marker and remove leading blank lines.
- Noise: All content before the start marker is discarded.
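Given the two marker indices, the extraction itself reduces to slicing. The sketch below assumes the indices were found beforehand; `splitAtMarkers` and `ExtractedParts` are hypothetical names.

```typescript
// Hypothetical result shape for the extraction step.
interface ExtractedParts {
  yaml: string; // lines strictly between the two markers
  body: string; // lines after the end marker, leading blank lines removed
}

// Hypothetical helper: start and end are the indices of the two "---" lines.
function splitAtMarkers(lines: string[], start: number, end: number): ExtractedParts {
  const yaml = lines.slice(start + 1, end).join("\n");
  const body = lines
    .slice(end + 1)
    .join("\n")
    .replace(/^(?:[ \t]*\n)+/, ""); // drop leading blank (or whitespace-only) lines
  // Everything before `start` is noise and is never read.
  return { yaml, body };
}
```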
Validation (Validate):
- Attempt to parse the extracted YAML. If parsing fails or the result is not a valid YAML object, fall back instead of crashing, since a stray --- can appear by accident.
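The validation step can be sketched as below. The source does not say which YAML parser is used, so the parser is passed in as a parameter here; `parseFrontMatter`, `YamlParser`, and the choice of `null` as the fallback signal are all assumptions.

```typescript
// Hypothetical parser signature: any function that turns YAML text into a value.
type YamlParser = (source: string) => unknown;

// Returns the parsed mapping, or null when parsing throws or the result is
// not a plain object (for example a bare string, a number, or a list).
function parseFrontMatter(
  yaml: string,
  parse: YamlParser
): Record<string, unknown> | null {
  try {
    const data = parse(yaml);
    const isObject =
      typeof data === "object" && data !== null && !Array.isArray(data);
    return isObject ? (data as Record<string, unknown>) : null;
  } catch {
    // A parse error most likely means the "---" lines were not real markers.
    return null;
  }
}
```

A `null` result lets the caller choose the fallback, such as treating the whole text as body with no metadata.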
Advantages
- Zero Dependency: No need for gray-matter, keeping the code lightweight.
- Anti-interference: Robust against LLM chain-of-thought output, opening remarks, Markdown code block wrappers, and other common noise.
- Compatibility: As long as the output contains a standard Front Matter block, it can be extracted precisely, regardless of where it sits in the text.
