md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
git clone https://noulin.net/git/md4c.git

commit a4d4f4638f2d3b9db1a0ee0a898ed0355777509a
parent 09ae86095f902c672e408ffba710138612ad4f1d
Author: Martin Mitas <mity@morous.org>
Date:   Mon, 12 Dec 2016 18:04:14 +0100

README.md: Improve wording.

Diffstat:
M README.md | 49 +++++++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
@@ -103,39 +103,44 @@ some extensions or allowing some deviations from the specification.
 
 ## Input/Output Encoding
 
 The CommonMark specification generally assumes UTF-8 input, but under closer
-inspection Unicode is actually used on very few occasions:
+inspection, Unicode plays any role in few very specific situations when parsing
+Markdown documents:
 
- * Classification of Unicode character as a Unicode whitespace or Unicode
-   punctuation. This is used for detection of word boundary when processing
-   emphasis and strong emphasis.
+ * For detection of word boundary when processing emphasis and strong emphasis,
+   some classification of Unicode character (whitespace, punctuation) is used.
 
- * Unicode case folding. This is used to perform case-independent matching
-   of link labels when resolving reference links.
+ * For (case-insensitive) matching of a link reference with corresponding link
+   reference definition, Unicode case folding is used.
 
- * Translating HTML entities and numeric character references (e.g. `&amp;`,
-   `&#35;`). However MD4C leaves the translation on the renderer/application;
-   as the renderer is supposed to really know output encoding.
+ * For translating HTML entities (e.g. `&amp;`) and numeric character
+   references (e.g. `&#35;` or `&#xcab;`) into their Unicode equivalents.
+   However MD4C leaves this translation on the renderer/application; as the
+   renderer is supposed to really know output encoding and whether it really
+   needs to perform this kind of translation. (Consider that a renderer
+   converting Markdown to HTML may leave the entities untranslated and defer
+   the work to a web browser.)
 
-MD4C uses this property of the standard and its implementation is, to a large
-degree, encoding-agnostic. Most of the code only assumes that the encoding of
-your choice is compatible with ASCII, i.e. that the codepoints below 128 have
-the same numeric values as ASCII.
+MD4C relies on this property of the CommonMark and the implementation is, to
+a large degree, encoding-agnostic. Most of MD4C code only assumes that the
+encoding of your choice is compatible with ASCII, i.e. that the codepoints
+below 128 have the same numeric values as ASCII.
 
-All input MD4C does not understand is seen as a text and sent to the callbacks
-unchanged.
+Any input MD4C does not understand is simply seen as part of the document text
+and sent to the renderer's callback functions unchanged.
 
-The behavior of MD4C in the isolated listed situations where the encoding
-really matters is determined by preprocessor macros:
+The two situations where MD4C has to understand Unicode are handled accordingly
+to the following preprocessor macros:
 
  * If preprocessor macro `MD4C_USE_UTF8` is defined, MD4C assumes UTF-8
-   in the specific situations.
+   for word boundary detection and case-folding.
 
- * On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C assumes
-   UTF-16 and uses `WCHAR` instead of `char`. (UTF-16 is what Windows
-   developers usually call just "Unicode" and what Win32API works with.)
+ * On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C uses
+   `WCHAR` instead of `char` and assumes UTF-16 encoding in those situations.
+   (UTF-16 is what Windows developers usually call just "Unicode" and what
+   Win32API works with.)
 
  * By default (when none of the macros is defined), ASCII-only mode is used
-   even in the situations listed above. This effectively means that non-ASCII
+   even in the specific situations. That effectively means that non-ASCII
    whitespace or punctuation characters won't be recognized as such and that
    case-folding is performed only on ASCII letters (i.e. `[a-zA-Z]`).
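
To make the ASCII-only default described in the README text more concrete, here is a minimal sketch of what "case-folding only on `[a-zA-Z]`" and ASCII-only whitespace/punctuation classification could look like. All function names (`ascii_is_whitespace`, `ascii_is_punct`, `ascii_fold`, `label_eq_ascii`) are hypothetical and are not taken from MD4C's source. The sketch also ignores the whitespace normalization that CommonMark applies to link labels. It only assumes, as the README does, that the encoding is ASCII-compatible.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not MD4C's actual internals. */

/* ASCII-only whitespace test: code units >= 128 are never whitespace. */
static int ascii_is_whitespace(unsigned ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\v' || ch == '\f' || ch == '\r';
}

/* ASCII-only punctuation test, covering the four ASCII punctuation ranges. */
static int ascii_is_punct(unsigned ch)
{
    return (ch >= 33 && ch <= 47) || (ch >= 58 && ch <= 64) ||
           (ch >= 91 && ch <= 96) || (ch >= 123 && ch <= 126);
}

/* Fold only [A-Z] to [a-z]; every other code unit is returned unchanged. */
static unsigned ascii_fold(unsigned ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
}

/* Compare two labels case-insensitively, folding ASCII letters only.
 * Bytes >= 128 (e.g. UTF-8 sequences) must match exactly. */
static int label_eq_ascii(const char* a, size_t a_len,
                          const char* b, size_t b_len)
{
    size_t i;

    if (a_len != b_len)
        return 0;
    for (i = 0; i < a_len; i++) {
        if (ascii_fold((unsigned char) a[i]) != ascii_fold((unsigned char) b[i]))
            return 0;
    }
    return 1;
}
```

Under these assumptions, `label_eq_ascii("Foo", 3, "fOO", 3)` matches, while the UTF-8 encodings of `Ä` and `ä` do not, and a no-break space (U+00A0) is not treated as whitespace. That is the practical consequence of building with neither `MD4C_USE_UTF8` nor `MD4C_USE_UTF16` defined.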