## Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML)
Version | 40 |
Editors | Mark Davis (markdavis@google.com) and other CLDR committee members |
Date | 2021-10-26 |
This Version | https://www.unicode.org/reports/tr35/tr35-65/tr35.html |
Previous Version | https://www.unicode.org/reports/tr35/tr35-63/tr35.html |
Latest Version | https://www.unicode.org/reports/tr35/ |
Corrigenda | https://cldr.unicode.org/index/corrigenda-and-errata |
Latest Proposed Update | https://unicode-org.github.io/cldr/ldml/tr35.html |
Namespace | https://unicode.org/cldr/ |
DTDs | http://www.unicode.org/cldr/dtd/40/ |
Revision | 64 |
Nonterminal | EBNF | Validity / Comments |
---|---|---|
unicode_language_id | | "root" is treated as a special unicode_language_subtag |
unicode_language_subtag | = alpha{2,3} \| alpha{5,8} ; | validity latest-data |
unicode_script_subtag | = alpha{4} ; | validity latest-data |
unicode_region_subtag | = (alpha{2} \| digit{3}) ; | validity latest-data |
unicode_variant_subtag | = (alphanum{5,8} \| digit alphanum{3}) ; | validity latest-data |
sep | = [-_] ; | |
digit | = [0-9] ; | |
alpha | = [A-Z a-z] ; | |
alphanum | = [0-9 A-Z a-z] ; | |
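The subtag productions above translate almost directly into regular expressions. The following is a minimal sketch, not a conformant parser: the function name `parse_language_id` is hypothetical, and the overall composition (language, then optional script, optional region, then any number of variants) is a simplifying assumption, since the full `unicode_language_id` production is not reproduced in the table. Validity checks against CLDR data are not attempted.

```python
import re

# Subtag patterns transcribed from the EBNF table.
LANGUAGE = r"(?:[A-Za-z]{2,3}|[A-Za-z]{5,8})"
SCRIPT = r"[A-Za-z]{4}"
REGION = r"(?:[A-Za-z]{2}|[0-9]{3})"
VARIANT = r"(?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3})"
SEP = r"[-_]"

# ASSUMPTION: language, optional script, optional region, zero or more
# variants. This ordering is a simplification chosen for the sketch.
LANGUAGE_ID = re.compile(
    rf"^(?P<language>{LANGUAGE})"
    rf"(?:{SEP}(?P<script>{SCRIPT}))?"
    rf"(?:{SEP}(?P<region>{REGION}))?"
    rf"(?P<variants>(?:{SEP}{VARIANT})*)$"
)


def parse_language_id(tag):
    """Return the fields of a language identifier, or None if it is malformed."""
    if tag == "root":
        # "root" is treated as a special language subtag.
        return {"language": "root", "script": None, "region": None, "variants": []}
    match = LANGUAGE_ID.match(tag)
    if match is None:
        return None
    variants = [v for v in re.split(SEP, match.group("variants")) if v]
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": variants,
    }
```

The sketch checks syntax only; a well-formed tag such as `ja-Latn-YU-hepburn-heploc` is accepted whether or not each subtag is currently valid.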
Qaag | Zawgyi | Qaag is a special script code for identifying the non-standard use of Myanmar characters for display with the Zawgyi font. The purpose of the code is to enable migration to standard, interoperable use of Unicode by providing an identifier for Zawgyi for tagging text, applications, input methods, font tables, transformations, and other mechanisms used for migration. |
Qaai | Inherited | deprecated: the canonicalized form is Zinh |
Zinh | Inherited | |
Zsye | Emoji Style | Prefer emoji style for characters that have both text and emoji styles available. |
Zsym | Text Style | Prefer text style for characters that have both text and emoji styles available. |
Zxxx | Unwritten | Indicates spoken or otherwise unwritten content. For example: |
Sample(s) | Description | ||
---|---|---|---|
uz | either written or spoken content | ||
uz-Latn or uz-Arab | written-only content (particular script) | ||
uz-Zyyy | written-only content (unspecified script) | ||
uz-Zxxx | spoken-only content | ||
uz-Latn, uz-Zxxx | both specific written and spoken content (using a language list) | ||
Zyyy | Common | |
Zzzz | Unknown | |
key (old key name) | key description | example type (old type name) | type description |
---|---|---|---|
A Unicode Calendar Identifier defines a type of calendar. The valid values are those name attribute values in the type elements of key name="ca" in bcp47/calendar.xml. | |||
"ca" (calendar) |
Calendar algorithm (For information on the calendar algorithms associated with the data used with these, see [Calendars].) |
"buddhist" | Thai Buddhist calendar (same as Gregorian except for the year) |
"chinese" | Traditional Chinese calendar | ||
… | |||
"gregory" (gregorian) |
Gregorian calendar | ||
… | |||
"islamic" | Islamic calendar | ||
"islamic-civil" | Islamic calendar, tabular (intercalary years [2,5,7,10,13,16,18,21,24,26,29] - civil epoch) | ||
"islamic-umalqura" | Islamic calendar, Umm al-Qura | ||
… | |||
Note: Some calendar types are represented by two subtags. In such cases, the first subtag specifies a generic calendar type and the second subtag specifies a calendar algorithm variant. The CLDR uses generic calendar types (single subtag types) for tagging data when calendar algorithm variations within a generic calendar type are irrelevant. For example, type "islamic" is used for specifying Islamic calendar formatting data for all Islamic calendar types, including "islamic-civil" and "islamic-umalqura". | |||
A Unicode Currency Format Identifier defines a style for currency formatting. The valid values are those name attribute values in the type elements of key name="cf" in bcp47/currency.xml. | |||
"cf" | Currency Format style | "standard" | Negative numbers use the minusSign symbol (the default). |
"account" | Negative numbers use parentheses or equivalent. | ||
A Unicode Collation Identifier defines a type of collation (sort order). The valid values are those name attribute values in the type elements of bcp47/collation.xml. | |||
For information on each collation setting parameter, from ka to vt, see Setting Options | |||
"co" (collation) |
Collation type | "standard" | The default ordering for each language. For root it is based on the [DUCET] (Default Unicode Collation Element Table): see Root Collation. Each other locale is based on that, except for appropriate modifications to certain characters for that language. |
"search" | A special collation type dedicated for string search—it is not used to determine the relative order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between ‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa. A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric” search as described in the [UCA] section Asymmetric Search). The search collator in root supplies matching rules that are appropriate for most languages (and which are different than the root collation behavior); language-specific search collators may be provided to override the matching rules for a given language as necessary. | ||
Other keywords provide additional choices for certain locales; they only have effect in certain locales. | |||
… | |||
"phonetic" | Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use. | ||
"pinyin" | Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese) | ||
"reformed" | Reformed collation (such as in Swedish) | ||
"searchjl" | Special collation type for a modified string search in which a pattern consisting of a sequence of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This search collator is best used at SECONDARY strength with an "asymmetric" search as described in the [UCA] section Asymmetric Search and obtained, for example, using ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD; this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text (instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern will match any Hangul syllable in the searched text with the same initial consonant. | ||
… | |||
A Unicode Currency Identifier defines a type of currency. The valid values are those name attribute values in the type elements of key name="cu" in bcp47/currency.xml. | |||
"cu" (currency) |
Currency type | ISO 4217 code, plus others in common use |
Codes consisting of 3 ASCII letters that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use. The list of countries and time periods associated with each currency value is available in Supplemental Currency Data, plus the default number of decimals. The XXX code is given a broader interpretation as Unknown or Invalid Currency. |
A Unicode Dictionary Break Exclusion Identifier specifies scripts to be excluded from dictionary-based text break (for words and lines). The valid values are of one or more items of type SCRIPT_CODE as specified in the name attribute value in the type element of key name="dx" in bcp47/segmentation.xml. | |||
"dx" | Dictionary break script exclusions | unicode_script_subtag values |
One or more items of type SCRIPT_CODE, which are valid unicode_script_subtag values. The code Zyyy (Common) can be specified to exclude all scripts, in which case it should be the only SCRIPT_CODE value specified. |
A Unicode Emoji Presentation Style Identifier specifies a request for the preferred emoji presentation style. This can be used as part of the value for an HTML lang attribute, for example <html lang="sr-Latn-u-em-emoji"> . The valid values are those name attribute values in the type elements of key name="em" in bcp47/variant.xml. | |||
"em" | Emoji presentation style | "emoji" | Use an emoji presentation for emoji characters if possible. |
"text" | Use a text presentation for emoji characters if possible. | ||
"default" | Use the default presentation for emoji characters as specified in UTR #51 Section 4, Presentation Style. | ||
A Unicode First Day Identifier defines the preferred first day of the week for calendar display. Specifying "fw" in a locale identifier overrides the default value specified by supplemental week data (see Part 4 Dates, section 4.3 Week Data). The valid values are those name attribute values in the type elements of key name="fw" in bcp47/calendar.xml. | |||
"fw" | First day of week | "sun" | Sunday |
"mon" | Monday | ||
… | |||
"sat" | Saturday | ||
A Unicode Hour Cycle Identifier defines the preferred time cycle. Specifying "hc" in a locale identifier overrides the default value specified by supplemental time data (see Part 4 Dates, section 4.4 Time Data). The valid values are those name attribute values in the type elements of key name="hc" in bcp47/calendar.xml. | |||
"hc" | Hour cycle | "h12" | Hour system using 1–12; corresponds to 'h' in patterns |
"h23" | Hour system using 0–23; corresponds to 'H' in patterns | ||
"h11" | Hour system using 0–11; corresponds to 'K' in patterns | ||
"h24" | Hour system using 1–24; corresponds to 'k' in patterns | ||
A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option. Specifying "lb" in a locale identifier overrides the locale's default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml. | |||
"lb" | Line break style | "strict" | CSS level 3 line-break=strict, e.g. treat CJ as NS |
"normal" | CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh | ||
"loose" | CSS level 3 line-break=loose | ||
A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option. The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml. | |||
"lw" | Line break word handling | "normal" | CSS level 3 word-break=normal, normal script/language behavior for midword breaks |
"breakall" | CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting | ||
"keepall" | CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks | ||
A Unicode Measurement System Identifier defines a preferred measurement system. Specifying "ms" in a locale identifier overrides the default value specified by supplemental measurement system data (see Part 2 General, section 5 Measurement System Data). The valid values are those name attribute values in the type elements of key name="ms" in bcp47/measure.xml. | |||
"ms" | Measurement system | "metric" | Metric System |
"ussystem" | US System of measurement: feet, pints, etc.; pints are 16oz | ||
"uksystem" | UK System of measurement: feet, pints, etc.; pints are 20oz | ||
A Unicode Number System Identifier defines a type of number system. The valid values are those name attribute values in the type elements of bcp47/number.xml. | |||
"nu" (numbers) |
Numbering system | Unicode script subtag | Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".) For more information, see Numbering Systems. |
"arabext" | Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits) | ||
"armnlow" | Armenian lowercase numerals | ||
… | |||
"roman" | Roman numerals | ||
"romanlow" | Roman lowercase numerals | ||
"tamldec" | Modern Tamil decimal digits | ||
A Region Override specifies an alternate region to use for obtaining certain region-specific default values (those specified by the <rgScope> element), instead of using the region specified by the unicode_region_subtag in the Unicode Language Identifier (or inferred from the unicode_language_subtag). | |||
"rg" | Region Override | "uszzzz" | The value is a unicode_subdivision_id of type “unknown” or “regular”; this consists of a unicode_region_subtag for a regular region (not a macroregion), suffixed either by “zzzz” (case is not significant) to designate the region as a whole, or by a unicode_subdivision_suffix to provide more specificity. For example, “en-GB-u-rg-uszzzz” represents a locale for British English but with region-specific defaults set to US for items such as default currency, default calendar and week data, default time cycle, and default measurement system and unit preferences. |
… | |||
A Unicode Subdivision Identifier defines a regional subdivision used for locales. The valid values are based on the subdivisionContainment element as described in Section 3.6.5 Subdivision Codes. | |||
"sd" | Regional Subdivision | "gbsct" | A unicode_subdivision_id, which is a unicode_region_subtag concatenated with a unicode_subdivision_suffix. For example, gbsct is “gb”+“sct” (where sct represents the subdivision code for Scotland). Thus “en-GB-u-sd-gbsct” represents the language variant “English as used in Scotland”. And both “en-u-sd-usca” and “en-US-u-sd-usca” represent “English as used in California”. See 3.6.5 Subdivision Codes. |
… | |||
A Unicode Sentence Break Suppressions Identifier defines a set of data to be used for suppressing certain sentence breaks that would otherwise be found by UAX #29 rules. The valid values are those name attribute values in the type elements of key name="ss" in bcp47/segmentation.xml. | |||
"ss" | Sentence break suppressions | "none" | Don’t use sentence break suppressions data (the default). |
"standard" | Use sentence break suppressions data of type "standard" | ||
A Unicode Timezone Identifier defines a timezone. The valid values are those name attribute values in the type elements of bcp47/timezone.xml. | |||
"tz" (timezone) |
Time zone | Unicode short time zone IDs | Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the file common/bcp47/timezone.xml, plus a few extra values. For more information, see Section 3.6.3 Time Zone Identifiers. CLDR provides data for normalizing timezone codes. |
A Unicode Variant Identifier defines a special variant used for locales. The valid values are those name attribute values in the type elements of bcp47/variant.xml. | |||
"va" | Common variant type | "posix" | POSIX style locale variant. For handling of the "POSIX" variant, see Section 3.8.2, Legacy Variants. |
On the 24th of May, 1863, my uncle, Professor Liedenbrock, rushed into his little house, No. 19 Königstrasse, one of the oldest streets in the oldest portion of the city of Hamburg… | Le 24 mai 1863, un dimanche, mon oncle, le professeur Lidenbrock, revint précipitamment vers sa petite maison située au numéro 19 de Königstrasse, l’une des plus anciennes rues du vieux quartier de Hambourg… |
hi-t-en-h0-hybrid | Hinglish | Hindi-English hybrid locale |
ta-t-en-h0-hybrid | Tanglish | Tamil-English hybrid locale |
ba-t-en-h0-hybrid | Banglish | Bangla-English hybrid locale |
… | ||
en-t-hi-h0-hybrid | Hinglish | English-Hindi hybrid locale |
en-t-zh-h0-hybrid | Chinglish | English-Chinese hybrid locale |
… |
ru-t-en-latn-gb-h0-hybrid | Runglish | Russian with an admixture of British English in Latin script |
ru-t-en-cyrl-gb-h0-hybrid | Runglish | Russian with an admixture of British English in Cyrillic script |
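The extension subtags used in the tables above (`-u-` keywords such as `rg` or `ca`, and the `-t-` field `h0`) can be separated from the base language identifier mechanically. The sketch below is illustrative only: the function name `parse_extensions` and the shape of its result are hypothetical. It assumes that a `-u-` key is any two-character subtag, that a `-t-` field key is a letter followed by a digit, and it lowercases the whole identifier. Attributes, private use, and other extensions are ignored.

```python
import re

# A -t- field key is a letter followed by a digit (for example "h0").
TFIELD_KEY = re.compile(r"^[a-z][0-9]$")


def parse_extensions(locale_id):
    """Split a locale identifier into base, -u- keywords, and -t- parts."""
    subtags = locale_id.replace("_", "-").lower().split("-")
    base, u_keys, tlang, t_fields = [], {}, [], {}
    section, key = "base", None
    for subtag in subtags:
        if len(subtag) == 1:
            # A singleton starts a new extension; only "u" and "t" are handled.
            section, key = subtag, None
        elif section == "base":
            base.append(subtag)
        elif section == "u":
            if len(subtag) == 2:
                key = subtag
                u_keys[key] = []
            elif key is not None:
                u_keys[key].append(subtag)
        elif section == "t":
            if TFIELD_KEY.match(subtag):
                key = subtag
                t_fields[key] = []
            elif key is None:
                tlang.append(subtag)  # language identifier following "t"
            else:
                t_fields[key].append(subtag)
    return {
        "base": "-".join(base),
        "u": {k: "-".join(v) for k, v in u_keys.items()},
        "tlang": "-".join(tlang),
        "t": {k: "-".join(v) for k, v in t_fields.items()},
    }
```

For `ru-t-en-latn-gb-h0-hybrid` the sketch reports base `ru`, the language identifier `en-latn-gb` after `t`, and the field `h0` with value `hybrid`. Two-subtag types such as `islamic-civil` are rejoined with a hyphen.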
Lookup Type | Example | Comments |
---|---|---|
Resource bundle lookup |
se-FI → se → default‑locale* → root |
* The default-locale may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting that chain, resulting in:
se-FI → |
Inherited item lookup |
se-FI+key → se+key → root_alias*+key → root+key |
* If there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward. This can happen multiple times.
se-FI+key → |
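The resource bundle lookup order described above can be sketched as a list-building function. This is a simplified illustration under stated assumptions: the helper names are hypothetical, parents are derived purely by truncating subtags, and root aliases (which apply to inherited item lookup) are not modeled.

```python
def truncation_chain(locale_id):
    """Return the locale followed by each successively truncated parent."""
    parts = locale_id.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def bundle_lookup_chain(requested, default_locale):
    """Requested locale chain, then the default locale's chain, then root."""
    chain = truncation_chain(requested)
    for locale in truncation_chain(default_locale):
        if locale not in chain:  # avoid visiting a locale twice
            chain.append(locale)
    chain.append("root")
    return chain
```

With a requested locale of `se-FI` and a default locale of `en-GB`, the sketch visits `se-FI`, `se`, `en-GB`, `en`, and finally `root`.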
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
de_CH no keyword | de_CH | default collation type | finds "B" |
de_CH | collation type=B | not found | |
de | collation type=B | found | |
de no keyword | de | default collation type | not found |
root | default collation type | finds "standard" | |
de | collation type=standard | not found | |
root | collation type=standard | found | |
de_u_co_A | de | collation type=A | found |
de_u_co_standard | de | collation type=standard | not found |
root | collation type=standard | found | |
de_u_co_foobar | de | collation type=foobar | not found |
root | collation type=foobar | not found, starts looking for default | |
de | default collation type | not found | |
root | default collation type | finds "standard" | |
de | collation type=standard | not found | |
root | collation type=standard | found |
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
de_CH_u_co_search | de_CH | collation type=search | not found |
de | collation type=search | found | |
en_US_u_co_search | en_US | collation type=search | not found |
en | collation type=search | not found | |
root | collation type=search | found |
User Input | Lookup in Locale | For | Comment |
---|---|---|---|
zh_Hant no keyword | zh_Hant | default collation type | finds "stroke" |
zh_Hant | collation type=stroke | not found | |
zh | collation type=stroke | found | |
zh_Hant_HK_u_co_pinyin | zh_Hant_HK | collation type=pinyin | not found |
zh_Hant | collation type=pinyin | not found | |
zh | collation type=pinyin | found | |
zh no keyword | zh | default collation type | finds "pinyin" |
zh | collation type=pinyin | found |
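The lookup sequences in the three tables above follow one pattern: search the locale chain for the requested type, and if that fails (or no type was requested), find the default type along the chain and search again from the start. The sketch below reproduces that pattern. The `COLLATIONS` dictionary is hypothetical data modeled on the tables (including the placeholder types "A" and "B"), not actual CLDR content, and the function takes the locale and the requested type as separate arguments rather than parsing an identifier such as `de_u_co_A`.

```python
# Hypothetical data modeled on the tables above; not actual CLDR content.
COLLATIONS = {
    "root": {"default": "standard", "types": {"standard", "search"}},
    "de": {"types": {"A", "B", "search"}},
    "de_CH": {"default": "B", "types": set()},
    "zh": {"default": "pinyin", "types": {"pinyin", "stroke"}},
    "zh_Hant": {"default": "stroke", "types": set()},
}


def chain(locale):
    """Locale, its truncated parents, then root."""
    parts = locale.split("_")
    return ["_".join(parts[:n]) for n in range(len(parts), 0, -1)] + ["root"]


def find_type(locale, collation_type):
    """Return the first locale in the chain that defines the type, else None."""
    for candidate in chain(locale):
        if collation_type in COLLATIONS.get(candidate, {}).get("types", ()):
            return candidate
    return None


def find_default(locale):
    """Return the first default collation type declared along the chain."""
    for candidate in chain(locale):
        default = COLLATIONS.get(candidate, {}).get("default")
        if default:
            return default
    return None


def resolve_collation(locale, requested=None):
    """Return (locale supplying the data, collation type actually used)."""
    if requested is not None:
        found_in = find_type(locale, requested)
        if found_in is not None:
            return found_in, requested
    # Nothing requested, or the requested type is unavailable: use the default.
    default = find_default(locale)
    return find_type(locale, default), default
```

For example, `resolve_collation("de", "foobar")` fails to find "foobar" in `de` and `root`, then finds the default "standard" in `root`, matching the `de_u_co_foobar` rows.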
Inheritance | Part of the internal mechanism used by CLDR to organize and manage locale data. This is used to share common resources, ease maintenance, and provide the best fallback behavior in the absence of data. Should not be used for locale matching or likely subtags. | |
---|---|---|
Example: | parent(en_AU) ⇒ en_001; parent(en_001) ⇒ en; parent(en) ⇒ root | |
Data: | supplementalData.xml <parentLocale> | |
Spec: | Section 4.2 Inheritance and Validity | |
DefaultContent | Part of the internal mechanism used by CLDR to manage locale data. A particular sublocale is designated the defaultContent for a parent, so that the parent exhibits consistent behavior. Should not be used for locale matching or likely subtags. | |
Example: | addLikelySubtags(sr-ME) ⇒ sr-Latn-ME, minimize(de-Latn-DE) ⇒ de | |
Data: | supplementalMetadata.xml <defaultContent> | |
Spec: | Part 6: Section 9.3 Default Content | |
LikelySubtags | Provides most likely full subtag (script and region) in the absence of other information. A core component of LocaleMatching. | |
Example: | addLikelySubtags(zh) ⇒ zh-Hans-CN; addLikelySubtags(zh-TW) ⇒ zh-Hant-TW; minimize(zh-Hant, favorRegion) ⇒ zh-TW | |
Data: | likelySubtags.xml <likelySubtags> | |
Spec: | Section 4.3 Likely Subtags | |
LocaleMatching | Provides the best match for the user’s language(s) among an application’s supported languages. | |
Example: | bestLocale(userLangs=<en, fr>, appLangs=<fr-CA, ru>) ⇒ fr-CA | |
Data: | languageInfo.xml <languageMatching> | |
Spec: | Section 4.4 Language Matching |
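The LikelySubtags operations named in the table can be illustrated with a small lookup table. This is a rough sketch under several assumptions: `LIKELY` is a tiny hand-picked excerpt in the style of likelySubtags.xml, the helper names are hypothetical, and the lookup order is simplified (no handling of `und`, variants, or extensions).

```python
# Tiny excerpt in the style of likelySubtags.xml; real data is far larger.
LIKELY = {
    "zh": "zh-Hans-CN",
    "zh-TW": "zh-Hant-TW",
    "zh-Hant": "zh-Hant-TW",
    "sr": "sr-Cyrl-RS",
    "sr-ME": "sr-Latn-ME",
    "de": "de-Latn-DE",
}


def split(tag):
    """Split a tag into (language, script, region); missing fields are None."""
    parts = tag.split("-")
    script = region = None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part
    return parts[0], script, region


def join(language, script=None, region=None):
    return "-".join(p for p in (language, script, region) if p)


def add_likely_subtags(tag):
    """Fill in missing script and region from the most specific table entry."""
    language, script, region = split(tag)
    lookups = [(language, script, region), (language, None, region),
               (language, script, None), (language, None, None)]
    for lang, scr, reg in lookups:
        entry = LIKELY.get(join(lang, scr, reg))
        if entry is not None:
            _, likely_script, likely_region = split(entry)
            return join(language, script or likely_script, region or likely_region)
    return tag


def minimize(tag, favor_region=True):
    """Return the shortest tag that maximizes to the same result."""
    maximized = add_likely_subtags(tag)
    language, script, region = split(maximized)
    with_region, with_script = join(language, None, region), join(language, script)
    trials = [language] + ([with_region, with_script] if favor_region
                           else [with_script, with_region])
    for trial in trials:
        if add_likely_subtags(trial) == maximized:
            return trial
    return maximized
```

With this excerpt, `minimize("de-Latn-DE")` returns `de`, and `minimize("zh-Hant")` returns `zh-TW` when the region is favored and `zh-Hant` when the script is favored.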
Root | de | Resolved |
---|---|---|
| Symbol | Expression | Examples |
| ------ | ---------- | -------- |
| `root` | = prop \| '[-]' \| '[' [\\-\\^]? s seq+ ']' | \\p{x=y} |
| `seq` | = root (s [\\&\\-] s root)* s \| range s | [abc]-[cde], a |
| `range` | = char ('-' char)? \| '{' (s char)+ s '}' | a, a-c, \{abc} |
| `prop` | = '\\' [pP] '{' propName ([≠=] s value1+)? '}' \| '[:' '^'? propName ([≠=] s value2+)? ':]' | \\p\{x=y}, [:x=y:] |
| `propName` | = s [A-Za-z0-9] [A-Za-z0-9_\\x20]* s | General_Category |
| `value1` | = [^\\}] \| '\\' quoted | Lm |
| `value2` | = [^:] \| '\\' quoted | Lm |
| `char` | = [^\\& \\- \\[ \\[ \\] \\\\ \\} \\{ [:Pat_WS:]] \| '\\' quoted | a, b, c, \\n |
| `quoted` | = 'u' (hex{4} \| bracketedHex) \| 'x' (hex{2} \| bracketedHex) \| 'U00' ('0' hex{5} \| '10' hex{4}) \| 'N{' propName '}' \| [[\u0000-\U0010FFFF]-[uxUN]] | _**error** if lengths not exact_ |
| `charName` | = s [A-Za-z0-9] [-A-Za-z0-9_\x20]* s | TIBETAN LETTER -A |
| `bracketedHex` | = '{' s hexCodePoint (s hexCodePoint)* s '}' | \{61 2019 62} |
| `hexCodePoint` | = hex{1,5} \| '10' hex{4} | |
| `hex` | = [0-9A-Fa-f] | |
| `s` | = [:Pattern_White_Space:]* | optional whitespace |

Some constraints on UnicodeSet syntax are not captured by this EBNF. Notably, property names and values are restricted to those supported by the implementation, and have additional constraints imposed by [[UAX44](https://unicode.org/reports/tr41/#UAX44)]. In addition, quoted values that resolve to more than one code point are disallowed in ranges of the form `char '-' char`.

The syntax characters are listed in the table below:

| Char | Hex | Name | Usage |
| ---- | ------ | -------------------- | ------------------------------------------ |
| $ | U+0024 | DOLLAR SIGN | Equivalent of \\uFFFF (This is for implementations that return \\uFFFF when accessing before the first or after the last character) |
| & | U+0026 | AMPERSAND | Intersecting UnicodeSets |
| - | U+002D | HYPHEN-MINUS | Ranges of characters; also set difference. |
| : | U+003A | COLON | POSIX-style property syntax |
| [ | U+005B | LEFT SQUARE BRACKET | Grouping; POSIX property syntax |
| ] | U+005D | RIGHT SQUARE BRACKET | Grouping; POSIX property syntax |
| \\ | U+005C | REVERSE SOLIDUS | Escaping |
| ^ | U+005E | CIRCUMFLEX ACCENT | POSIX negation syntax |
| { | U+007B | LEFT CURLY BRACKET | Strings in set; Perl property syntax |
| } | U+007D | RIGHT CURLY BRACKET | Strings in set; Perl property syntax |
| | U+0020 U+0009..U+000D U+0085
ab-ad | → | ab ac ad |
ab-d | → | ab ac ad |
ab-cd | → | ab ac ad bb bc bd cb cc cd |
👦🏻-👦🏿 | → | 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿 |
👦🏻-🏿 | → | 👦🏻 👦🏼 👦🏽 👦🏾 👦🏿 |
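The expansions above are consistent with a simple rule: when the second string is shorter, it borrows the missing prefix from the first; each remaining position then ranges independently from the code point in the first string to the code point in the second, and the results are the cross product. The sketch below implements that reading. The rule is inferred from the examples, and the function name `expand_string_range` is hypothetical.

```python
from itertools import product


def expand_string_range(first, last):
    """Expand a string range such as "ab"-"cd" into the strings it covers."""
    # Python strings are sequences of code points, so len() counts code points.
    if len(last) > len(first):
        raise ValueError("the second string must not be longer than the first")
    prefix_len = len(first) - len(last)
    prefix, start = first[:prefix_len], first[prefix_len:]
    columns = []
    for low, high in zip(start, last):
        if ord(low) > ord(high):
            raise ValueError("each position must be in ascending order")
        columns.append([chr(cp) for cp in range(ord(low), ord(high) + 1)])
    # Cross product of the per-position ranges, first position varying slowest.
    return [prefix + "".join(combo) for combo in product(*columns)]
```

For instance, `expand_string_range("ab", "cd")` yields the nine strings from `ab` through `cd`, and `expand_string_range("ab", "d")` yields `ab`, `ac`, `ad`.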
Pattern | N | Result |
---|---|---|
0≤Rf|1≤Ru|1<Re | -∞, -3, -1, -0.000001 | Rf (defaulted to first string) |
0, 0.01, 0.9999 | Rf | |
1 | Ru | |
1.00001, 5, 99, ∞ | Re |
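The selection behavior in the table can be sketched as follows. This is a simplified illustration, not a full pattern parser: the function names are hypothetical, limits are assumed to appear in ascending order, `#` is accepted as an ASCII alternative to `≤` (an assumption borrowed from ChoiceFormat-style patterns), and quoting or nested patterns are not handled.

```python
import re

# One segment: a numeric limit, a relation (≤, # or <), then the result text.
SEGMENT = re.compile(r"^\s*(-?(?:\d+(?:\.\d+)?|∞))\s*([≤#<])(.*)$")


def parse_choice(pattern):
    """Parse "0≤Rf|1≤Ru|1<Re" into a list of (limit, relation, text)."""
    choices = []
    for segment in pattern.split("|"):
        match = SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"malformed segment: {segment!r}")
        limit = float(match.group(1).replace("∞", "inf"))
        choices.append((limit, match.group(2), match.group(3)))
    return choices


def select_choice(pattern, n):
    """Return the text of the last segment whose limit n satisfies."""
    choices = parse_choice(pattern)
    result = choices[0][2]  # default to the first string when n is too small
    for limit, relation, text in choices:
        satisfied = limit < n if relation == "<" else limit <= n
        if satisfied:
            result = text
    return result
```

With the pattern from the table, the value 1 selects `Ru` because `1≤` holds and `1<` does not, while any value below 0 falls back to the first string.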
Source | Language | Script | Region | Variants |
---|---|---|---|---|
en-GB | {en} | {} | {GB} | {} |
und-GB | {} | {} | {GB} | {} |
ja-Latn-YU-hepburn-heploc | {ja} | {Latn} | {YU} | {hepburn, heploc} |
{ja} ⊇ {} | success, und = {} |
{hepburn, heploc} ⊇ {hepburn} | success |
{ja} ⊇ {} | success, und = {} |
{hepburn} ⊉ {hepburn, heploc} | failure |
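The field-by-field superset test shown above can be sketched with Python sets. The helper names `fields` and `matches` are hypothetical, and subtags are classified by shape alone (four letters for a script, two letters or three digits for a region, anything else after the language as a variant), which is an assumption made for brevity. The language `und` contributes the empty set.

```python
def fields(tag):
    """Decompose a language identifier into sets of field values."""
    parts = tag.replace("_", "-").split("-")
    result = {"language": set(), "script": set(), "region": set(), "variants": set()}
    if parts[0].lower() != "und":
        result["language"].add(parts[0].lower())
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            result["script"].add(part.title())
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            result["region"].add(part.upper())
        else:
            result["variants"].add(part.lower())
    return result


def matches(source, rule_type):
    """True when every field of source is a superset of the rule's field."""
    source_fields, rule_fields = fields(source), fields(rule_type)
    return all(source_fields[name] >= rule_fields[name] for name in source_fields)
```

Under these assumptions, `ja-Latn-YU-hepburn-heploc` matches a rule written as `und-hepburn`, whereas `ja-hepburn` does not match `und-hepburn-heploc` because its variant set lacks `heploc`.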