How to escape Markdown and HTML special characters in Swift

This is the day 10 article of the Infocom Advent Calendar 2023.

This article implements escaping of Markdown and HTML special characters in Swift as extensions on the String type.

Environment

macOS: 13.5 (Ventura)
swift: 5.9

An extension that escapes Markdown special characters

This extension escapes the following special characters.

import Foundation  // provides replacingOccurrences(of:with:)

extension String {
  /// Returns a copy of the string with Markdown special characters backslash-escaped.
  func markdownEscaping() -> Self {
    // The backslash is replaced first so that the backslashes added by
    // the later replacements are not escaped a second time.
    return self
      .replacingOccurrences(of: "\\", with: "\\\\")
      .replacingOccurrences(of: "`", with: "\\`")
      .replacingOccurrences(of: "*", with: "\\*")
      .replacingOccurrences(of: "_", with: "\\_")
      .replacingOccurrences(of: "{", with: "\\{")
      .replacingOccurrences(of: "}", with: "\\}")
      .replacingOccurrences(of: "[", with: "\\[")
      .replacingOccurrences(of: "]", with: "\\]")
      .replacingOccurrences(of: "<", with: "\\<")
      .replacingOccurrences(of: ">", with: "\\>")
      .replacingOccurrences(of: "(", with: "\\(")
      .replacingOccurrences(of: ")", with: "\\)")
      .replacingOccurrences(of: "#", with: "\\#")
      .replacingOccurrences(of: "+", with: "\\+")
      .replacingOccurrences(of: "-", with: "\\-")
      .replacingOccurrences(of: ".", with: "\\.")
      .replacingOccurrences(of: "!", with: "\\!")
      .replacingOccurrences(of: "|", with: "\\|")
  }
}

The backslash (\) is processed first so that the backslashes inserted by the later replacements are not themselves replaced again.
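To make the ordering point concrete, here is a minimal sketch that compares the two orders on a single character pair. The two helper functions are hypothetical and exist only for this illustration; they are not part of the extension above.

```swift
import Foundation

// Hypothetical helper: escapes "*" first and the backslash last,
// i.e. the reverse of the order used in markdownEscaping().
func escapeAsteriskThenBackslash(_ s: String) -> String {
    return s
        .replacingOccurrences(of: "*", with: "\\*")
        .replacingOccurrences(of: "\\", with: "\\\\")
}

// Hypothetical helper: the same two replacements, backslash first.
func escapeBackslashThenAsterisk(_ s: String) -> String {
    return s
        .replacingOccurrences(of: "\\", with: "\\\\")
        .replacingOccurrences(of: "*", with: "\\*")
}

let input = "a*b"
print(escapeAsteriskThenBackslash(input)) // a\\*b  (the added backslash was escaped again)
print(escapeBackslashThenAsterisk(input)) // a\*b
```

In the first result, Markdown would read \\ as a literal backslash and leave the * unescaped, which is exactly what putting the backslash first avoids.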

An extension that escapes HTML special characters

This extension escapes the following special characters.

cf. PHP: htmlspecialchars - Manual

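import Foundation  // provides replacingOccurrences(of:with:)

extension String {
  /// Returns a copy of the string with HTML special characters replaced by entities.
  func htmlEscaping() -> Self {
    // "&" is replaced first so that the "&" of the entities added by
    // the later replacements is not escaped a second time.
    return self
      .replacingOccurrences(of: "&", with: "&amp;")
      .replacingOccurrences(of: "\"", with: "&quot;")
      .replacingOccurrences(of: "'", with: "&#39;")
      .replacingOccurrences(of: "<", with: "&lt;")
      .replacingOccurrences(of: ">", with: "&gt;")
  }
}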

The ampersand (&) is processed first so that the & in the entities inserted by the later replacements is not itself replaced again.
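As an aside, the ordering concern goes away if each character is visited exactly once. The following is only a sketch of that alternative, not the approach used in this article, and the function name htmlEscapingSinglePass is a hypothetical one chosen for the illustration.

```swift
// Hypothetical single-pass alternative to htmlEscaping().
// Each input character is examined once, so text that has already been
// appended to the result is never scanned again and the order of the
// cases does not matter.
func htmlEscapingSinglePass(_ s: String) -> String {
    var result = ""
    result.reserveCapacity(s.utf8.count)
    for ch in s {
        switch ch {
        case "&": result += "&amp;"
        case "\"": result += "&quot;"
        case "'": result += "&#39;"
        case "<": result += "&lt;"
        case ">": result += "&gt;"
        default: result.append(ch)
        }
    }
    return result
}

print(htmlEscapingSinglePass("&\"'<>")) // &amp;&quot;&#39;&lt;&gt;
```

This version also does not need Foundation, at the cost of being a little longer than the chained replacements.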

Usage example of the extensions

main.swift

// escape markdown: `*_{}[]<>()#+-.!|\
print("`*_{}[]<>()#+-.!|\\".markdownEscaping())

// escape html: &"'<>
print("&\"'<>".htmlEscaping())

I ran this and then converted the printed (escaped) strings to plain text with pandoc.

$ swift main.swift
\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\\
&amp;&quot;&#39;&lt;&gt;

$ pandoc -f markdown -t plain <<< '\`\*\_\{\}\[\]\<\>\(\)\#\+\-\.\!\|\\'
`*_{}[]<>()#+-.!|\

$ pandoc -f html -t plain <<< '&amp;&quot;&#39;&lt;&gt;'
&"'<>

The escaped strings are interpreted correctly and come back as the original strings.