Requirement

Given a large block of text (typically raw LLM output), extract the Front Matter metadata that the Markdown file requires.

Front Matter Parsing Logic

Logic that relies on the </think> tag or on fixed starting positions is fragile, because some large-model outputs do not include a think tag at all. A more robust “Seek and Extract” strategy is adopted instead.

New Logic Details

  1. Cleaning and Preprocessing (Pre-clean):

    • Normalize all newline characters to \n.
    • Check for and remove an outer Markdown code block wrapper (such as ```markdown … ```), retaining only the internal content.
  2. Locate Start Marker (Scan Start):

    • Scan line by line to find the first line that strictly equals --- (ignoring trailing spaces).
    • Key Point: Any leading filler in the LLM output (such as “Thinking Process…”, “Here is the file:”, or <think>...</think>) is automatically skipped and treated as noise.
  3. Locate End Marker (Scan End):

    • From the line after the start marker, continue searching for the next --- line.
    • If no paired end marker is found, it is deemed a parsing failure (to avoid incorrect truncation).
  4. Precise Extraction (Extract):

    • YAML: Extract all lines between the two ---.
    • Body: Extract all lines after the end marker and strip any leading blank lines.
    • Noise: All content before the start marker is discarded.
  5. Validation (Validate):

    • Attempt to parse the extracted YAML. If parsing fails, or the result is not a valid YAML object, fall back instead of crashing, since a --- line can appear by accident.
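
The five steps above can be sketched in TypeScript roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: every name in it (ParseYaml, FrontMatterResult, preClean, extractFrontMatter) is hypothetical, the wrapper is only unwrapped when it encloses the entire text, the YAML parser is passed in by the caller because this note does not say how YAML is parsed, and returning null stands in for the unspecified fallback.

```typescript
// All names below are hypothetical; none come from the original note.
type ParseYaml = (yaml: string) => unknown;

interface FrontMatterResult {
  yaml: string;  // raw lines between the two --- markers
  body: string;  // text after the end marker, leading blank lines stripped
  data: unknown; // parsed YAML, or undefined when no parser is supplied
}

// Three backticks, built at runtime so this listing stays easy to embed.
const FENCE = "`".repeat(3);

// Step 1: normalize newlines and unwrap an outer code fence, if any.
// Assumption: the wrapper must enclose the whole (trimmed) text.
function preClean(raw: string): string[] {
  const lines = raw.replace(/\r\n?/g, "\n").trim().split("\n");
  const first = lines[0] ?? "";
  const last = lines[lines.length - 1] ?? "";
  const opens =
    first.startsWith(FENCE) && /^[\w-]*\s*$/.test(first.slice(FENCE.length));
  const closes = last.trim() === FENCE;
  return lines.length >= 2 && opens && closes ? lines.slice(1, -1) : lines;
}

function extractFrontMatter(
  raw: string,
  parseYaml?: ParseYaml
): FrontMatterResult | null {
  const lines = preClean(raw);
  // A marker is a line equal to --- once trailing whitespace is ignored.
  const isMarker = (line: string): boolean =>
    line.replace(/\s+$/, "") === "---";

  // Step 2: the first marker line; everything before it is noise.
  const start = lines.findIndex(isMarker);
  if (start === -1) return null;

  // Step 3: the next marker line; an unpaired start is a parsing failure.
  const offset = lines.slice(start + 1).findIndex(isMarker);
  if (offset === -1) return null;
  const end = start + 1 + offset;

  // Step 4: YAML sits between the markers, the body after the end marker.
  const yaml = lines.slice(start + 1, end).join("\n");
  let bodyStart = end + 1;
  while (bodyStart < lines.length && (lines[bodyStart] ?? "").trim() === "") {
    bodyStart++;
  }
  const body = lines.slice(bodyStart).join("\n");

  // Step 5: validate only when a parser is supplied; null is the fallback
  // signal assumed here when the YAML is not a plain object.
  let data: unknown = undefined;
  if (parseYaml) {
    try {
      data = parseYaml(yaml);
    } catch {
      return null;
    }
    if (typeof data !== "object" || data === null || Array.isArray(data)) {
      return null;
    }
  }
  return { yaml, body, data };
}

// Usage: leading filler before the first --- line is skipped.
const noisy = [
  "Thinking Process...",
  "Here is the file:",
  "---",
  "title: Hello",
  "---",
  "",
  "# Body",
].join("\n");
const result = extractFrontMatter(noisy);
// result.yaml is "title: Hello" and result.body is "# Body"
```

Injecting the parser keeps the scanning logic free of dependencies. The caller decides what to do with a null result, for example treating the whole text as the body.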

Advantages

  • Zero Dependency: No need for gray-matter, keeping the code lightweight.
  • Anti-interference: Robust against chain-of-thought output, opening remarks, Markdown code block wrappers, and other common noise from LLMs.
  • Compatibility: As long as the output contains a standard Front Matter block, it can be extracted precisely, wherever it appears in the text.