# Nom Recipes

These are short recipes for accomplishing common tasks with nom.

* [Whitespace](#whitespace)
  + [Wrapper combinators that eat whitespace before and after a parser](#wrapper-combinators-that-eat-whitespace-before-and-after-a-parser)
* [Comments](#comments)
  + [`// C++/EOL-style comments`](#-ceol-style-comments)
  + [`/* C-style comments */`](#-c-style-comments-)
* [Identifiers](#identifiers)
  + [`Rust-Style Identifiers`](#rust-style-identifiers)
* [Literal Values](#literal-values)
  + [Escaped Strings](#escaped-strings)
  + [Integers](#integers)
    - [Hexadecimal](#hexadecimal)
    - [Octal](#octal)
    - [Binary](#binary)
    - [Decimal](#decimal)
  + [Floating Point Numbers](#floating-point-numbers)

## Whitespace

### Wrapper combinators that eat whitespace before and after a parser

```rust
use nom::{
  IResult,
  error::ParseError,
  sequence::delimited,
  character::complete::multispace0,
};

/// A combinator that takes a parser `inner` and produces a parser that also consumes both leading and
/// trailing whitespace, returning the output of `inner`.
fn ws<'a, F: 'a, O, E: ParseError<&'a str>>(inner: F) -> impl FnMut(&'a str) -> IResult<&'a str, O, E>
  where
  F: Fn(&'a str) -> IResult<&'a str, O, E>,
{
  delimited(
    multispace0,
    inner,
    multispace0
  )
}
```

To eat only trailing whitespace, replace `delimited(...)` with `terminated(inner, multispace0)`.
Likewise, to eat only leading whitespace, replace `delimited(...)` with `preceded(multispace0,
inner)`. In both cases, import the combinator from `nom::sequence`. You can use your own parser
instead of `multispace0` if you want to skip a different set of lexemes.

## Comments

### `// C++/EOL-style comments`

This version uses `%` to start a comment, does not consume the newline character, and returns an
output of `()`.

```rust
use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::pair,
  bytes::complete::is_not,
  character::complete::char,
};

pub fn peol_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E>
{
  value(
    (), // Output is thrown away.
    pair(char('%'), is_not("\n\r"))
  )(i)
}
```

### `/* C-style comments */`

Inline comments surrounded with sentinel tags `(*` and `*)`. This version returns an output of `()`
and does not handle nested comments.

```rust
use nom::{
  IResult,
  error::ParseError,
  combinator::value,
  sequence::tuple,
  bytes::complete::{tag, take_until},
};

pub fn pinline_comment<'a, E: ParseError<&'a str>>(i: &'a str) -> IResult<&'a str, (), E> {
  value(
    (), // Output is thrown away.
    tuple((
      tag("(*"),
      take_until("*)"),
      tag("*)")
    ))
  )(i)
}
```

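The recipe above stops at the first `*)`, which is why nested comments are not handled. One way
nesting could be supported is to count how many `(*` are still open. The following is a minimal
sketch of that idea in plain Rust, without nom combinators; the function name `skip_nested_comment`
and its `Option` return type are assumptions made for illustration, not part of nom.

```rust
/// Returns the input remaining after one balanced `(* ... *)` comment, or `None`
/// if `input` does not start with `(*` or the comment is never closed.
/// Hypothetical helper: the name and signature are illustrative only.
fn skip_nested_comment(input: &str) -> Option<&str> {
  let mut rest = input.strip_prefix("(*")?;
  let mut depth = 1usize; // number of comments currently open

  while depth > 0 {
    if let Some(r) = rest.strip_prefix("(*") {
      depth += 1;
      rest = r;
    } else if let Some(r) = rest.strip_prefix("*)") {
      depth -= 1;
      rest = r;
    } else {
      // Skip one whole character so the slice stays on a UTF-8 boundary.
      // Running out of input here means the comment is unterminated.
      let c = rest.chars().next()?;
      rest = &rest[c.len_utf8()..];
    }
  }

  Some(rest)
}
```

For the input `(* a (* b *) c *) x`, this sketch returns the remaining text ` x`, whereas a parser
built on `take_until("*)")` would stop after `b *)`.
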
## Identifiers

### `Rust-Style Identifiers`

Identifiers that start with a letter or an underscore and contain only letters, digits and
underscores can be parsed like this:

```rust
use nom::{
  IResult,
  branch::alt,
  multi::many0,
  combinator::recognize,
  sequence::pair,
  character::complete::{alpha1, alphanumeric1},
  bytes::complete::tag,
};

pub fn identifier(input: &str) -> IResult<&str, &str> {
  recognize(
    pair(
      alt((alpha1, tag("_"))),
      many0(alt((alphanumeric1, tag("_"))))
    )
  )(input)
}
```

Let's say we apply this to the identifier `hello_world123abc`. The first `alt` parser would
recognize `hello`, because `alpha1` consumes as many letters as it can. The `pair` combinator then
passes the remaining `_world123abc` to the `many0(alt((alphanumeric1, tag("_"))))` parser, which
recognizes every remaining character. On its own, the `pair` combinator returns a tuple of the
results of its sub-parsers. Wrapping it in `recognize` instead produces a `&str` of the input text
that was parsed, which in this case is the entire `&str` `hello_world123abc`.

## Literal Values

### Escaped Strings

This is [one of the examples](https://github.com/Geal/nom/blob/master/examples/string.rs) in the
examples directory.

### Integers

The following recipes all return string slices rather than integer values. How to obtain an
integer value instead is demonstrated for hexadecimal integers. The others are similar.

The parsers allow the grouping character `_`, which allows one to group the digits by byte, for
example: `0xA4_3F_11_28`. If you prefer to exclude the `_` character, the lambda to convert from a
string slice to an integer value is slightly simpler. You can also strip the `_` from the string
slice that is returned, which is demonstrated in the second hexadecimal number parser.

If you wish to limit the number of digits in a valid integer literal, replace `many1` with
`many_m_n` in the recipes.

#### Hexadecimal

The parser outputs the string slice of the digits without the leading `0x`/`0X`.

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal(input: &str) -> IResult<&str, &str> { // <'a, E: ParseError<&'a str>>
  preceded(
    alt((tag("0x"), tag("0X"))),
    recognize(
      many1(
        terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
      )
    )
  )(input)
}
```

If you want it to return the integer value instead, use `map_res`:

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{map_res, recognize},
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn hexadecimal_value(input: &str) -> IResult<&str, i64> {
  map_res(
    preceded(
      alt((tag("0x"), tag("0X"))),
      recognize(
        many1(
          terminated(one_of("0123456789abcdefABCDEF"), many0(char('_')))
        )
      )
    ),
    // Strip the `_` grouping characters before converting the digits.
    |out: &str| i64::from_str_radix(&str::replace(&out, "_", ""), 16)
  )(input)
}
```

#### Octal

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn octal(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0o"), tag("0O"))),
    recognize(
      many1(
        terminated(one_of("01234567"), many0(char('_')))
      )
    )
  )(input)
}
```

#### Binary

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::recognize,
  sequence::{preceded, terminated},
  character::complete::{char, one_of},
  bytes::complete::tag,
};

fn binary(input: &str) -> IResult<&str, &str> {
  preceded(
    alt((tag("0b"), tag("0B"))),
    recognize(
      many1(
        terminated(one_of("01"), many0(char('_')))
      )
    )
  )(input)
}
```

#### Decimal

```rust
use nom::{
  IResult,
  multi::{many0, many1},
  combinator::recognize,
  sequence::terminated,
  character::complete::{char, one_of},
};

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}
```

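The octal, binary and decimal recipes return digit slices, and the conversion to an integer is
described above as similar to the hexadecimal case. As a rough sketch of that conversion step on
its own, using only the standard library: the helper name `digits_to_i64` is hypothetical, and the
choice of `i64` simply mirrors `hexadecimal_value`.

```rust
use std::num::ParseIntError;

/// Converts a digit slice such as `A4_3F_11_28` into an integer in the given radix.
/// Hypothetical helper for illustration; it is not part of nom.
fn digits_to_i64(digits: &str, radix: u32) -> Result<i64, ParseIntError> {
  // `from_str_radix` rejects `_`, so remove the grouping characters first.
  let cleaned = digits.replace('_', "");
  i64::from_str_radix(&cleaned, radix)
}
```

Under these assumptions, the octal recipe could be wrapped in `map_res` with a closure such as
`|out: &str| digits_to_i64(out, 8)`, and likewise with radix `2` or `10` for the other recipes.
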
### Floating Point Numbers

The following is adapted from [the Python parser by Valentin Lorentz (ProgVal)](https://github.com/ProgVal/rust-python-parser/blob/master/src/numbers.rs).

```rust
use nom::{
  IResult,
  branch::alt,
  multi::{many0, many1},
  combinator::{opt, recognize},
  sequence::{preceded, terminated, tuple},
  character::complete::{char, one_of},
};

fn float(input: &str) -> IResult<&str, &str> {
  alt((
    // Case one: .42
    recognize(
      tuple((
        char('.'),
        decimal,
        opt(tuple((
          one_of("eE"),
          opt(one_of("+-")),
          decimal
        )))
      ))
    ),
    // Case two: 42e42 and 42.42e42
    recognize(
      tuple((
        decimal,
        opt(preceded(
          char('.'),
          decimal,
        )),
        one_of("eE"),
        opt(one_of("+-")),
        decimal
      ))
    ),
    // Case three: 42. and 42.42
    recognize(
      tuple((
        decimal,
        char('.'),
        opt(decimal)
      ))
    )
  ))(input)
}

fn decimal(input: &str) -> IResult<&str, &str> {
  recognize(
    many1(
      terminated(one_of("0123456789"), many0(char('_')))
    )
  )(input)
}
```

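Like the integer recipes, `float` returns the matched string slice rather than a number. A minimal
sketch of turning that slice into an `f64` with the standard library follows; the helper name
`float_to_f64` is hypothetical, and it relies on `str::parse::<f64>` accepting forms such as `.42`
and `42.` once the `_` characters are removed.

```rust
use std::num::ParseFloatError;

/// Converts a slice recognized by `float` (for example `4_2.4_2e1`) into an `f64`.
/// Hypothetical helper for illustration; it is not part of nom.
fn float_to_f64(text: &str) -> Result<f64, ParseFloatError> {
  // The standard float parser does not accept `_`, so strip it first.
  text.replace('_', "").parse::<f64>()
}
```

This could be combined with the recipe through `map_res(float, float_to_f64)`, in the same way the
hexadecimal recipe converts its digits.
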
# Implementing FromStr

The [FromStr trait](https://doc.rust-lang.org/std/str/trait.FromStr.html) provides
a common interface to parse from a string.

```rust
use nom::{
  IResult, Finish, error::Error,
  bytes::complete::{tag, take_while},
};
use std::str::FromStr;

// will recognize the name in "Hello, name!"
fn parse_name(input: &str) -> IResult<&str, &str> {
  let (i, _) = tag("Hello, ")(input)?;
  let (i, name) = take_while(|c:char| c.is_alphabetic())(i)?;
  let (i, _) = tag("!")(i)?;

  Ok((i, name))
}

// with FromStr, the result cannot be a reference to the input, it must be owned
#[derive(Debug)]
pub struct Name(pub String);

impl FromStr for Name {
  // the error must be owned as well
  type Err = Error<String>;

  fn from_str(s: &str) -> Result<Self, Self::Err> {
    match parse_name(s).finish() {
      Ok((_remaining, name)) => Ok(Name(name.to_string())),
      Err(Error { input, code }) => Err(Error {
        input: input.to_string(),
        code,
      })
    }
  }
}

fn main() {
  // parsed: Ok(Name("nom"))
  println!("parsed: {:?}", "Hello, nom!".parse::<Name>());

  // parsed: Err(Error { input: "123!", code: Tag })
  println!("parsed: {:?}", "Hello, 123!".parse::<Name>());
}
```