Front Matter Extraction Plan V1

Requirement

From a large block of text (such as raw LLM output), extract the Front Matter metadata that a Markdown document requires.

Front Matter Parsing Logic

Logic that relies on the </think> tag or on fixed starting positions is fragile, because some large-model outputs do not include a think tag at all. A more robust “Seek and Extract” strategy is adopted instead.

New Logic Details

Cleaning and Preprocessing (Pre-clean):
- Normalize all newline characters to \n.
- Check for and remove an outer Markdown code block wrapper (such as ```markdown … ```), retaining only the inner content.
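The pre-clean step can be sketched in TypeScript as follows. The function name preClean and the exact wrapper rule (the whole output must be a single fenced block with an optional language tag) are assumptions made for illustration, not something the plan specifies.

```typescript
// Hypothetical helper: the name and the wrapper rule are assumptions.
function preClean(raw: string): string {
  // Normalize CRLF and lone CR to LF.
  const text = raw.replace(/\r\n?/g, "\n");

  // Match an output that is entirely wrapped in one fenced block:
  // three backticks, an optional language tag, the content, three backticks.
  const trimmed = text.trim();
  const wrapper = /^`{3}[A-Za-z]*[ \t]*\n([\s\S]*?)\n`{3}$/;
  const match = wrapper.exec(trimmed);

  // Keep only the inner content; otherwise return the normalized text as-is.
  return match ? (match[1] ?? "") : text;
}
```

Output that has extra text before or after the fence is returned unchanged here; the later scan for the --- markers is what skips that noise.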
Locate Start Marker (Scan Start):
- Scan line by line to find the first line that strictly equals --- (ignoring trailing spaces).
- Key Point: This means any leading filler from LLM output (such as “Thinking Process…”, “Here is the file:”, or <think>...</think>) is automatically skipped and treated as noise.
Locate End Marker (Scan End):
- From the line after the start marker, continue searching for the next --- line.
- If no paired end marker is found, it is deemed a parsing failure (to avoid incorrect truncation).
Precise Extraction (Extract):
- YAML: Extract all lines between the two ---.
- Body: Extract all lines after the end marker and remove leading extra blank lines.
- Noise: All content before the start marker is discarded.
Validation (Validate):
- Attempt to parse the extracted YAML. If parsing fails or the result is not a valid YAML object, fall back instead of crashing, since the --- lines may have been an accidental match.
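The scan, extract, and validate steps above can be sketched as one TypeScript function. All names here (FrontMatterResult, findMarker, looksLikeYamlMapping, extractFrontMatter) are hypothetical. Returning null on failure is also an assumption, since the plan does not say what the fallback should be. To stay dependency-free, validation is only a stand-in check for at least one top-level "key: value" line; a real YAML parser would replace it where one is available.

```typescript
// Result shape is an assumption made for this sketch.
interface FrontMatterResult {
  yaml: string; // raw lines between the two markers
  body: string; // text after the end marker, leading blank lines removed
}

// Index of the first marker line at or after `from`, or -1.
// A marker is exactly three hyphens, optionally followed by spaces or tabs.
function findMarker(lines: string[], from: number): number {
  for (let i = from; i < lines.length; i++) {
    if (/^---[ \t]*$/.test(lines[i])) return i;
  }
  return -1;
}

// Stand-in validation (assumption): at least one top-level "key: value" line.
function looksLikeYamlMapping(yaml: string): boolean {
  return yaml.split("\n").some((l) => /^[A-Za-z_][\w-]*:(\s|$)/.test(l));
}

function extractFrontMatter(text: string): FrontMatterResult | null {
  const lines = text.replace(/\r\n?/g, "\n").split("\n");

  // Scan start: everything before this line is noise and is discarded.
  const start = findMarker(lines, 0);
  if (start === -1) return null;

  // Scan end: an unpaired start marker counts as a parsing failure.
  const end = findMarker(lines, start + 1);
  if (end === -1) return null;

  const yaml = lines.slice(start + 1, end).join("\n");
  // Validate: reject blocks that came from accidental --- lines.
  if (!looksLikeYamlMapping(yaml)) return null;

  // Body: everything after the end marker, minus leading blank lines.
  const body = lines.slice(end + 1).join("\n").replace(/^(?:[ \t]*\n)+/, "");
  return { yaml, body };
}
```

A caller that gets null can, for example, treat the whole input as body text with no metadata.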

Advantages

Zero Dependency: No need for gray-matter, keeping the code lightweight.
Anti-interference: Resistant to chain-of-thought output, opening remarks, Markdown code block wrappers, and other common noise from LLMs.
Compatibility: As long as the output contains a standard Front Matter block, it can be extracted precisely, no matter where in the output it sits.
