139 lines
5.6 KiB
ReStructuredText
139 lines
5.6 KiB
ReStructuredText
.. _module-pw_tokenizer-proto:
|
|
|
|
------------------------------------
|
|
Tokenized fields in protocol buffers
|
|
------------------------------------
|
|
Text may be represented in a few different ways:
|
|
|
|
- Plain ASCII or UTF-8 text (``This is plain text``)
|
|
- Base64-encoded tokenized message (``$ibafcA==``)
|
|
- Binary-encoded tokenized message (``89 b6 9f 70``)
|
|
- Little-endian 32-bit integer token (``0x709fb689``)
|
|
|
|
``pw_tokenizer`` provides tools for working with protobuf fields that may
|
|
contain tokenized text.
|
|
|
|
Tokenized field protobuf option
|
|
===============================
|
|
``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
|
|
This option may be applied to a protobuf field to indicate that it may contain a
|
|
tokenized string. A string that is optionally tokenized is represented with a
|
|
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
|
|
TOKENIZATION_OPTIONAL``.
|
|
|
|
For example, the following protobuf has one field that may contain a tokenized
|
|
string.
|
|
|
|
.. code-block:: protobuf
|
|
|
|
message MessageWithOptionallyTokenizedField {
|
|
bytes just_bytes = 1;
|
|
bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
|
|
string just_text = 3;
|
|
}
|
|
|
|
Decoding optionally tokenized strings
|
|
=====================================
|
|
The encoding used for an optionally tokenized field is not recorded in the
|
|
protobuf. Despite this, the text can reliably be decoded. This is accomplished
|
|
by attempting to decode the field as binary or Base64 tokenized data before
|
|
treating it like plain text.
|
|
|
|
The following diagram describes the decoding process for optionally tokenized
|
|
fields in detail.
|
|
|
|
.. mermaid::
|
|
|
|
flowchart TD
|
|
start([Received bytes]) --> binary
|
|
|
|
binary[Decode as<br>binary tokenized] --> binary_ok
|
|
binary_ok{Detokenizes<br>successfully?} -->|no| utf8
|
|
binary_ok -->|yes| done_binary([Display decoded binary])
|
|
|
|
utf8[Decode as UTF-8] --> utf8_ok
|
|
utf8_ok{Valid UTF-8?} -->|no| base64_encode
|
|
utf8_ok -->|yes| base64
|
|
|
|
base64_encode[Encode as<br>tokenized Base64] --> display
|
|
display([Display encoded Base64])
|
|
|
|
base64[Decode as<br>Base64 tokenized] --> base64_ok
|
|
|
|
base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
|
|
base64_ok -->|yes| base64_results
|
|
|
|
is_plain_text{Text is<br>printable?} -->|no| base64_encode
|
|
is_plain_text-->|yes| plain_text
|
|
|
|
base64_results([Display decoded Base64])
|
|
plain_text([Display text])
|
|
|
|
Potential decoding problems
|
|
---------------------------
|
|
The decoding process for optionally tokenized fields will yield correct results
|
|
in almost every situation. In rare circumstances, it is possible for it to fail,
|
|
but these can be avoided with a low-overhead mitigation if desired.
|
|
|
|
There are two ways in which the decoding process may fail.
|
|
|
|
Accidentally interpreting plain text as tokenized binary
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
If a plain-text string happens to decode as a binary tokenized message, the
|
|
incorrect message could be displayed. This is very unlikely to occur. While many
|
|
tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
|
|
that a device will happen to log one of these strings as plain text. The
|
|
overwhelming majority of these strings will be nonsense.
|
|
|
|
If an implementation wishes to guard against this extremely improbable
|
|
situation, it is possible to prevent it. This situation is prevented by
|
|
appending 0xFF (or another byte never valid in UTF-8) to binary tokenized data
|
|
that happens to be valid UTF-8 (or all binary tokenized messages, if desired).
|
|
When decoding, if there is an extra 0xFF byte, it is discarded.
|
|
|
|
Displaying undecoded binary as plain text instead of Base64
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
If a message fails to decode as binary tokenized and it is not valid UTF-8, it
|
|
is displayed as tokenized Base64. This makes it easily recognizable as a
|
|
tokenized message and makes it simple to decode later from the text output (for
|
|
example, with an updated token database).
|
|
|
|
A binary message for which the token is not known may coincidentally be valid
|
|
UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters.
|
|
When decoding with an out-of-date token database, it is possible that some
|
|
binary tokenized messages will be displayed as plain text rather than tokenized
|
|
Base64.
|
|
|
|
This situation is likely to occur, but should be infrequent. Even if it does
|
|
happen, it is not a serious issue. A very small number of strings will be
|
|
displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
|
|
string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
|
|
Updating the token database would resolve the issue, though the non-Base64 logs
|
|
would be difficult decode later from a log file.
|
|
|
|
This situation can be avoided with the same approach described in
|
|
`Accidentally interpreting plain text as tokenized binary`_. Appending
|
|
an invalid UTF-8 character prevents the undecoded binary message from being
|
|
interpreted as plain text.
|
|
|
|
Python library
|
|
==============
|
|
The ``pw_tokenizer.proto`` module defines functions that may be used to
|
|
detokenize protobuf objects in Python. The function
|
|
:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields annotated as
|
|
tokenized, replacing them with their detokenized version. For example:
|
|
|
|
.. code-block:: python
|
|
|
|
my_detokenizer = pw_tokenizer.Detokenizer(some_database)
|
|
|
|
my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
|
|
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)
|
|
|
|
assert my_message.tokenized_field == b'The detokenized string! Cool!'
|
|
|
|
pw_tokenizer.proto
|
|
------------------
|
|
.. automodule:: pw_tokenizer.proto
|
|
:members:
|