commit a4d4f4638f2d3b9db1a0ee0a898ed0355777509a
parent 09ae86095f902c672e408ffba710138612ad4f1d
Author: Martin Mitas <mity@morous.org>
Date: Mon, 12 Dec 2016 18:04:14 +0100
README.md: Improve wording.
Diffstat:
| M | README.md | | | 49 | +++++++++++++++++++++++++++---------------------- |
1 file changed, 27 insertions(+), 22 deletions(-)
diff --git a/README.md b/README.md
@@ -103,39 +103,44 @@ some extensions or allowing some deviations from the specification.
## Input/Output Encoding
The CommonMark specification generally assumes UTF-8 input, but under closer
-inspection Unicode is actually used on very few occasions:
+inspection, Unicode actually plays a role only in a few very specific
+situations when parsing Markdown documents:
- * Classification of Unicode character as a Unicode whitespace or Unicode
- punctuation. This is used for detection of word boundary when processing
- emphasis and strong emphasis.
+ * For detecting word boundaries when processing emphasis and strong
+   emphasis, some classification of Unicode characters (whitespace,
+   punctuation) is needed.
- * Unicode case folding. This is used to perform case-independent matching
- of link labels when resolving reference links.
+ * For (case-insensitive) matching of a link reference with the
+   corresponding link reference definition, Unicode case folding is used.
- * Translating HTML entities and numeric character references (e.g. `&`,
- `#`). However MD4C leaves the translation on the renderer/application;
- as the renderer is supposed to really know output encoding.
+ * For translating HTML entities (e.g. `&amp;`) and numeric character
+   references (e.g. `&#35;` or `&#xcab;`) into their Unicode equivalents.
+   However, MD4C leaves this translation to the renderer/application, as
+   the renderer is the one which really knows the output encoding and
+   whether it actually needs to perform this kind of translation.
+   (Consider that a renderer converting Markdown to HTML may leave the
+   entities untranslated and defer the work to a web browser.)
-MD4C uses this property of the standard and its implementation is, to a large
-degree, encoding-agnostic. Most of the code only assumes that the encoding of
-your choice is compatible with ASCII, i.e. that the codepoints below 128 have
-the same numeric values as ASCII.
+MD4C relies on this property of CommonMark, and its implementation is, to
+a large degree, encoding-agnostic. Most of MD4C's code only assumes that
+the encoding of your choice is compatible with ASCII, i.e. that the
+codepoints below 128 have the same numeric values as in ASCII.
-All input MD4C does not understand is seen as a text and sent to the callbacks
-unchanged.
+Any input MD4C does not understand is simply seen as part of the document text
+and sent to the renderer's callback functions unchanged.
-The behavior of MD4C in the isolated listed situations where the encoding
-really matters is determined by preprocessor macros:
+The two situations where MD4C has to understand Unicode are handled
+according to the following preprocessor macros:
* If preprocessor macro `MD4C_USE_UTF8` is defined, MD4C assumes UTF-8
- in the specific situations.
+ for word boundary detection and case-folding.
- * On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C assumes
- UTF-16 and uses `WCHAR` instead of `char`. (UTF-16 is what Windows
- developers usually call just "Unicode" and what Win32API works with.)
+ * On Windows, if preprocessor macro `MD4C_USE_UTF16` is defined, MD4C uses
+   `WCHAR` instead of `char` and assumes UTF-16 encoding in those situations.
+   (UTF-16 is what Windows developers usually call just "Unicode" and what
+   the Win32 API works with.)
* By default (when none of the macros is defined), ASCII-only mode is used
- even in the situations listed above. This effectively means that non-ASCII
+   even in those specific situations. This effectively means that non-ASCII
whitespace or punctuation characters won't be recognized as such and that
case-folding is performed only on ASCII letters (i.e. `[a-zA-Z]`).