Introduction
===============================================================

This document will, as the C++ port matures, serve as a log of how
different parts of the library work. As of today, it contains some general
information but mostly CMap-specific details.

------------------------------------------------------------------------

Font Data Tables
===========================================================================

One of the important goals in `sfntly` is thread safety, which is why
tables can only be created through their nested `Builder` class and are
immutable after creation.

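The builder-plus-immutable-table idea can be sketched as follows. This is a minimal illustration, not sfntly code: `ByteTable`, `Append`, `Build`, `Size` and `At` are hypothetical names chosen for the example. The point is only that the constructor is private, so the nested `Builder` is the sole way to obtain a table, and the resulting object exposes no mutators.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical table type (not part of sfntly). All state is fixed at
// construction, so instances can be shared between threads without locking.
class ByteTable {
 public:
  class Builder {
   public:
    Builder& Append(std::uint8_t value) {
      bytes_.push_back(value);
      return *this;
    }

    // Copies the builder's mutable state into a new immutable table.
    std::shared_ptr<const ByteTable> Build() const {
      return std::shared_ptr<const ByteTable>(new ByteTable(bytes_));
    }

   private:
    std::vector<std::uint8_t> bytes_;
  };

  std::size_t Size() const { return bytes_.size(); }
  std::uint8_t At(std::size_t index) const { return bytes_.at(index); }

 private:
  // Private constructor: only the nested Builder can create a table.
  explicit ByteTable(std::vector<std::uint8_t> bytes)
      : bytes_(std::move(bytes)) {}

  const std::vector<std::uint8_t> bytes_;
};
```
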
`CMapTable`
--------------------------------------------------------

*CMap* = character map; it converts *code points* in a *code page* to
*glyph IDs*.

The `CMapTable` is a table of CMaps (CMaps are also tables; there is one
for every encoding supported by the font). An encoding-dependent
character map is stored in one of several formats (numbered from 0 to
14), of which formats 0 and 4 are the most used; sfntly/C++ will
initially only support formats 0, 2, 4 and 12.

### `CMapFormat0` Byte encoding table

Format 0 is a basic table where a character’s glyph ID is looked up in a
256-entry `glyphIdArray`. As it only supports 256 characters, it can only
encode single-byte character sets such as ASCII and ISO 8859-x
(alphabet-based languages).

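As a rough illustration of the lookup described above, a format 0 map can be modelled as a plain 256-entry array. The `Format0Map` struct and `GlyphIdFormat0` function are hypothetical names for this sketch and do not correspond to the sfntly classes; returning glyph 0 (the missing glyph) for out-of-range characters is an assumption of the example.

```cpp
#include <array>
#include <cstdint>

// Hypothetical in-memory view of a format 0 subtable: one glyph ID per
// possible byte value.
struct Format0Map {
  std::array<std::uint8_t, 256> glyph_id_array{};
};

// Returns the glyph ID for `character`, or 0 (the missing glyph) when the
// character does not fit in a single byte.
inline int GlyphIdFormat0(const Format0Map& map, std::uint32_t character) {
  if (character > 0xFF) {
    return 0;
  }
  return map.glyph_id_array[character];
}
```

Anything above `0xFF`, such as a CJK code point, simply cannot be represented, which is why the other formats exist.
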
### `CMapFormat2` High-byte mapping through table

Chinese, Japanese and Korean (CJK) fonts need special two-byte
encodings, such as Shift-JIS, for their code points; format 2 exists for
these.

### `CMapFormat4` Segment mapping to delta values

This is the preferred format for Unicode Basic Multilingual Plane (BMP)
encodings according to the Microsoft spec. Format 4 defines segments
(contiguous ranges of characters of variable length). Finding a
character’s glyph ID starts with finding the segment it belongs to, using
a binary search (the segments are sorted). A segment has a
**`startCode`** and an **`endCode`** (the minimum and maximum code points
in the segment), an **`idDelta`** (delta for all code points in the
segment) and an **`idRangeOffset`** (offset into `glyphIdArray`, or 0).

`idDelta` and `idRangeOffset` seem to be the same thing: offsets. In
fact, `idRangeOffset` looks the glyph ID up in the glyph array, relying
on the fact that the array comes immediately after the `idRangeOffset`
table in the font file. The segment’s offset is `idRangeOffset[i]`, a
byte count measured from the location of `idRangeOffset[i]` itself; since
the `idRangeOffset` table contains 16-bit words and not bytes, the value
is divided by 2 before it is added to the address.

``` {.prettyprint}
glyphIndex = *(&idRangeOffset[i] + idRangeOffset[i] / 2 + (c - startCode[i]))
```

`idDelta[i]` is another kind of segment offset, used when
`idRangeOffset[i] = 0`, in which case it is added directly to the
character code (modulo 65536).

``` {.prettyprint}
glyphIndex = idDelta[i] + c
```

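The two formulas can be combined into one lookup routine. The sketch below assumes the subtable has already been decoded into parallel arrays; `Format4Map` and `GlyphIdFormat4` are hypothetical names, not sfntly classes. Because the arrays are separate here, the pointer arithmetic of the file layout is turned into an index: counted in words from the start of `idRangeOffset`, the glyph array begins at position `segCount`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical decoded form of a format 4 subtable: parallel arrays with
// one entry per segment, sorted by end_code.
struct Format4Map {
  std::vector<std::uint16_t> end_code;
  std::vector<std::uint16_t> start_code;
  std::vector<std::uint16_t> id_delta;
  std::vector<std::uint16_t> id_range_offset;
  std::vector<std::uint16_t> glyph_id_array;
};

// Returns the glyph ID for code point c, or 0 (missing glyph) if unmapped.
inline std::uint16_t GlyphIdFormat4(const Format4Map& map, std::uint16_t c) {
  // Binary search for the first segment whose end_code is >= c.
  auto it = std::lower_bound(map.end_code.begin(), map.end_code.end(), c);
  if (it == map.end_code.end()) return 0;
  const std::size_t i = static_cast<std::size_t>(it - map.end_code.begin());
  if (c < map.start_code[i]) return 0;  // c lies in a gap between segments

  if (map.id_range_offset[i] == 0) {
    // glyphIndex = idDelta[i] + c; the cast wraps the sum modulo 65536.
    return static_cast<std::uint16_t>(map.id_delta[i] + c);
  }

  // Word position relative to the start of the idRangeOffset array; the
  // glyph array starts seg_count words after that point in the file.
  const std::size_t seg_count = map.end_code.size();
  const std::size_t words =
      i + map.id_range_offset[i] / 2 + (c - map.start_code[i]);
  if (words < seg_count) return 0;
  const std::size_t index = words - seg_count;
  if (index >= map.glyph_id_array.size()) return 0;

  const std::uint16_t glyph = map.glyph_id_array[index];
  // A stored 0 means missing glyph; otherwise idDelta[i] is added as well.
  if (glyph == 0) return 0;
  return static_cast<std::uint16_t>(glyph + map.id_delta[i]);
}
```

The bounds checks are a defensive choice of this sketch; they guard against malformed offsets rather than reflecting anything the format requires.
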
### Class Hierarchy

`CMapTable` is the main class and the container for all other
CMap-related classes.

#### Utility classes

- `CMapTable::CMapId` describes a pair of IDs, platform ID and
  encoding ID, that together form the CMap’s ID. The ID of a CMap is
  usually a good indicator of the format the CMap uses (Unicode
  CMaps are usually either format 4 or format 12).
- `CMapTable::CMapIdComparator`
- `CMapTable::CMapIterator`: iteration through the `CMapTable` is
  supported through a Java-style iterator.
- `CMapTable::CMapFilter`: Java-style filter; `CMapIterator` supports
  filtering CMaps. By default, it accepts every CMap.
- `CMapTable::CMapIdFilter` extends `CMapFilter`; it only accepts one
  type of CMap. Used in conjunction with `CMapIterator`, this is how the
  CMap getters are implemented.
- **`CMapTable::Builder`** is the only way to create a `CMapTable`.

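How the iterator, the default filter and the ID filter fit together can be sketched as below. The types reuse the names from the list for readability, but they are simplified stand-ins written for this example (the iterator walks a vector of IDs instead of real CMaps), and the method names `Accept`, `HasNext` and `Next` are assumptions, not the sfntly signatures.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for CMapTable::CMapId.
struct CMapId {
  int platform_id;
  int encoding_id;
};

inline bool operator==(const CMapId& a, const CMapId& b) {
  return a.platform_id == b.platform_id && a.encoding_id == b.encoding_id;
}

// Default filter: accepts every CMap.
class CMapFilter {
 public:
  virtual ~CMapFilter() = default;
  virtual bool Accept(const CMapId&) const { return true; }
};

// Accepts only the CMap with one particular ID.
class CMapIdFilter : public CMapFilter {
 public:
  explicit CMapIdFilter(CMapId wanted) : wanted_(wanted) {}
  bool Accept(const CMapId& id) const override { return id == wanted_; }

 private:
  CMapId wanted_;
};

// Java-style iterator: HasNext() skips entries the filter rejects.
class CMapIterator {
 public:
  CMapIterator(const std::vector<CMapId>& ids, const CMapFilter& filter)
      : ids_(ids), filter_(filter) {}

  bool HasNext() {
    while (index_ < ids_.size() && !filter_.Accept(ids_[index_])) {
      ++index_;
    }
    return index_ < ids_.size();
  }

  // Call only after HasNext() has returned true.
  const CMapId& Next() { return ids_[index_++]; }

 private:
  const std::vector<CMapId>& ids_;
  const CMapFilter& filter_;
  std::size_t index_ = 0;
};
```

A getter for a single CMap then amounts to constructing a `CMapIdFilter`, running the iterator and returning the first element, if any.
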
#### CMaps

- **`CMapTable::CMap`** is the abstract base class from which all
  `CMapFormat*` classes derive. It defines basic functions and the
  abstract `CMapTable::CMap::CharacterIterator` class to iterate through
  the characters in the map. The basic implementation just loops through
  every character between a start and an end. This is overridden so
  that format-specific iteration is performed.
- `CMapFormat0` (mostly done?)
- `CMapFormat2` (needs builders)
- ... coming soon

`[todo: will add images soon; need to upload to svn]`

------------------------------------------------------------------------

# Table Building Pipeline

Building a data table in sfntly is done by the
`FontDataTable::Builder::build` method, which defines the general
pipeline and leaves the details to each implementing subclass
(`CMapTable::Builder`, for example). Note: **`sub*`** methods are
table-specific.

**`ReadableFontDataPtr data = internalReadData()`**
> There are 2 private fields in the `FontDataTable::Builder` class:
> `rData` and `wData`, for `ReadableFontData` and `WritableFontData`.
> This function returns `rData` if it is set, or `wData` (cast
> to readable font data) if `rData` is null. *They hold the same data!*

**`if (model_changed_)`**
> A font is essentially a binary blob when loaded inside a `FontData`
> object. A *model* is the Java/C++ collection of objects that represent
> the same data in a manipulable format. If you ask for the model (even
> if you don’t write to it), it will count as changed and the underlying
> raw data will get updated.

**`if (!subReadyToSerialize())`**
**`return NULL`**
`else`
1. **`size = subDataSizeToSerialize()`**
2. **`WritableFontDataPtr new_data = container_->getNewData(size)`**
3. **`subSerialize(new_data)`**
4. **`data = new_data`**

**`FontDataTablePtr table = subBuildTable(data)`**
> This is where the table is actually built. `subBuildTable` is
> overridden by every class of table, but a table header is always added.

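Put together, the pipeline is a template method: the base class fixes the order of the steps and subclasses fill in the `Sub*` hooks. The sketch below is a simplified model under stated assumptions, not the sfntly implementation: `Bytes` stands in for `FontData`, `Table`, `TableBuilder` and `ByteListBuilder` are hypothetical types, and the readiness check is assumed to apply only when the model has changed.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;  // stand-in for FontData

struct Table {
  Bytes data;
};

class TableBuilder {
 public:
  explicit TableBuilder(Bytes data) : data_(std::move(data)) {}
  virtual ~TableBuilder() = default;

  // General pipeline; the details live in the Sub* hooks.
  std::unique_ptr<Table> Build() {
    Bytes data = data_;  // plays the role of internalReadData()
    if (model_changed_) {
      if (!SubReadyToSerialize()) return nullptr;
      Bytes new_data(SubDataSizeToSerialize());
      SubSerialize(&new_data);
      data = new_data;
    }
    return SubBuildTable(data);
  }

 protected:
  void set_model_changed() { model_changed_ = true; }
  virtual bool SubReadyToSerialize() = 0;
  virtual std::size_t SubDataSizeToSerialize() = 0;
  virtual void SubSerialize(Bytes* out) = 0;
  virtual std::unique_ptr<Table> SubBuildTable(const Bytes& data) = 0;

 private:
  Bytes data_;
  bool model_changed_ = false;
};

// Hypothetical subclass whose model is just a list of bytes.
class ByteListBuilder : public TableBuilder {
 public:
  using TableBuilder::TableBuilder;

  void Add(std::uint8_t value) {
    model_.push_back(value);
    set_model_changed();
  }

 protected:
  bool SubReadyToSerialize() override { return !model_.empty(); }
  std::size_t SubDataSizeToSerialize() override { return model_.size(); }
  void SubSerialize(Bytes* out) override { *out = model_; }
  std::unique_ptr<Table> SubBuildTable(const Bytes& data) override {
    return std::unique_ptr<Table>(new Table{data});
  }

 private:
  Bytes model_;
};
```
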
Subtable Builders
------------------------------------------------------------------------------

Subtables are lazily built.

When creating the object view of the font and dealing with lots of
tables, it would be wasteful to create builders for every subtable there
is, since most users only do fairly high-level manipulation of the font.
Instead, **only the tables at font level are fully built**.

All other subtables have builders that contain valid `FontData`, but the
object view is not created by default. For the `CMapTable`, this means
that if you don’t go through the `getCMapBuilders()` method, the CMap
builders are not initialized. So the builder map would seem to be empty
when calling its `size()` method, even though there are CMaps in the font
when calling `numCMaps(internalReadData())`.

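The mismatch between the empty builder map and the non-zero CMap count follows directly from lazy initialization, as the small model below tries to show. `LazyCMapTableBuilder` and its methods are hypothetical; the raw data is reduced to a list of subtable format numbers, and each "builder" is just that number.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical model of a table builder with lazily created sub-builders.
class LazyCMapTableBuilder {
 public:
  explicit LazyCMapTableBuilder(std::vector<int> raw_formats)
      : raw_(std::move(raw_formats)) {}

  // Counts CMaps straight from the raw data; never touches the object view.
  std::size_t NumCMapsInData() const { return raw_.size(); }

  // Size of the builder map as it currently stands (0 until initialized).
  std::size_t BuilderMapSize() const { return builders_.size(); }

  // The first call creates the object view; later calls reuse it.
  std::map<std::size_t, int>& GetCMapBuilders() {
    if (!initialized_) {
      for (std::size_t i = 0; i < raw_.size(); ++i) {
        builders_[i] = raw_[i];
      }
      initialized_ = true;
    }
    return builders_;
  }

 private:
  std::vector<int> raw_;
  std::map<std::size_t, int> builders_;
  bool initialized_ = false;
};
```
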
------------------------------------------------------------------------

Character encoders
---------------------------------------------------------------------------------

Sfntly/Java uses a native ICU-based API for encoding characters.
Sfntly/C++ uses ICU directly. In unit tests we assume text is encoded in
UTF-16. Public APIs will use ICU classes like `UnicodeString`.