regex-syntax
============
This crate provides a robust regular expression parser.

[![Build status](https://travis-ci.com/rust-lang/regex.svg?branch=master)](https://travis-ci.com/rust-lang/regex)
[![Build status](https://ci.appveyor.com/api/projects/status/github/rust-lang/regex?svg=true)](https://ci.appveyor.com/project/rust-lang-libs/regex)
[![](https://meritbadge.herokuapp.com/regex-syntax)](https://crates.io/crates/regex-syntax)
[![Rust](https://img.shields.io/badge/rust-1.28.0%2B-blue.svg?maxAge=3600)](https://github.com/rust-lang/regex)


### Documentation

https://docs.rs/regex-syntax


### Overview

There are two primary types exported by this crate: `Ast` and `Hir`. The former
is a faithful abstract syntax of a regular expression, and can convert regular
expressions back to their concrete syntax while mostly preserving their original
form. The latter type is a high-level intermediate representation of a regular
expression that is amenable to analysis and compilation into byte codes or
automata. An `Hir` achieves this by drastically simplifying the syntactic
structure of the regular expression. While an `Hir` can be converted back to
its equivalent concrete syntax, the result is unlikely to resemble the original
concrete syntax that produced the `Hir`.


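
The difference between a faithful syntax tree and a simplified intermediate
representation can be sketched with a toy model. The `ToyAst` and `ToyHir`
types and the `lower` function below are hypothetical stand-ins written for
illustration only. They are not this crate's `Ast`, `Hir`, or translator, and
they cover just literals, groups, and alternations. The sketch uses only the
standard library.

```rust
use std::fmt;

/// Toy "AST" (hypothetical): keeps syntactic detail such as redundant groups.
#[derive(Debug, Clone, PartialEq)]
enum ToyAst {
    Literal(char),
    Group(Box<ToyAst>),
    Alternation(Vec<ToyAst>),
}

/// Toy "HIR" (hypothetical): no groups, and nested alternations are flattened.
#[derive(Debug, Clone, PartialEq)]
enum ToyHir {
    Literal(char),
    Alternation(Vec<ToyHir>),
}

impl fmt::Display for ToyAst {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyAst::Literal(c) => write!(f, "{}", c),
            ToyAst::Group(inner) => write!(f, "({})", inner),
            ToyAst::Alternation(alts) => {
                for (i, alt) in alts.iter().enumerate() {
                    if i > 0 {
                        write!(f, "|")?;
                    }
                    write!(f, "{}", alt)?;
                }
                Ok(())
            }
        }
    }
}

impl fmt::Display for ToyHir {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ToyHir::Literal(c) => write!(f, "{}", c),
            ToyHir::Alternation(alts) => {
                for (i, alt) in alts.iter().enumerate() {
                    if i > 0 {
                        write!(f, "|")?;
                    }
                    write!(f, "{}", alt)?;
                }
                Ok(())
            }
        }
    }
}

/// Lowers the toy AST: groups are dropped and nested alternations are merged.
fn lower(ast: &ToyAst) -> ToyHir {
    match ast {
        ToyAst::Literal(c) => ToyHir::Literal(*c),
        ToyAst::Group(inner) => lower(inner),
        ToyAst::Alternation(alts) => {
            let mut flat = Vec::new();
            for alt in alts {
                match lower(alt) {
                    ToyHir::Alternation(nested) => flat.extend(nested),
                    other => flat.push(other),
                }
            }
            ToyHir::Alternation(flat)
        }
    }
}

/// Builds a toy AST for `(a|(b))|c` by hand and prints both representations.
fn round_trip_demo() -> (String, String) {
    let ast = ToyAst::Alternation(vec![
        ToyAst::Group(Box::new(ToyAst::Alternation(vec![
            ToyAst::Literal('a'),
            ToyAst::Group(Box::new(ToyAst::Literal('b'))),
        ]))),
        ToyAst::Literal('c'),
    ]);
    let hir = lower(&ast);
    (ast.to_string(), hir.to_string())
}

fn main() {
    let (ast_text, hir_text) = round_trip_demo();
    // The toy AST prints back exactly what was written.
    assert_eq!(ast_text, "(a|(b))|c");
    // The toy HIR prints an equivalent pattern that no longer resembles it.
    assert_eq!(hir_text, "a|b|c");
}
```

Printing the toy AST reproduces the pattern as written, redundant groups
included, while the lowered form prints an equivalent but different pattern.
This mirrors, on a much smaller scale, the trade-off described above.

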
### Example

This example shows how to parse a pattern string into its HIR:

```rust
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

let hir = Parser::new().parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```


### Safety

This crate has no `unsafe` code and sets `forbid(unsafe_code)`. While it's
possible this crate could use `unsafe` code in the future, the standard
for doing so is extremely high. In general, most code in this crate is not
performance critical, since its running time tends to be dwarfed by the time
it takes to compile a regular expression into an automaton. There is therefore
little need for extreme optimization and, by extension, for `unsafe`.

The standard for using `unsafe` in this crate is extremely high because this
crate is intended to be reasonably safe to use with user-supplied regular
expressions. Therefore, while there may be bugs in the regex parser itself,
they should _never_ result in memory unsafety unless there is a bug in either
the compiler or the standard library, since `regex-syntax` has zero
dependencies.


### Crate features

By default, this crate bundles a fairly large amount of Unicode data tables
(a source size of ~750KB). Because of their large size, one can disable some
or all of these data tables. If a regular expression attempts to use Unicode
data that is not available, then an error will occur when translating the `Ast`
to the `Hir`.

The full set of features one can disable is listed
[in the "Crate features" section of the documentation](https://docs.rs/regex-syntax/*/#crate-features).


### Testing

Simply running `cargo test` will give you very good coverage. However, because
of the large number of features exposed by this crate, a `test` script is
included in this directory which will test several feature combinations. This
is the same script that is run in CI.


### Motivation

The primary purpose of this crate is to provide the parser used by `regex`.
Specifically, this crate is treated as an implementation detail of `regex`
and is primarily developed for the needs of `regex`.

Since this crate is an implementation detail of `regex`, it may experience
breaking change releases at a different cadence from `regex`. This is only
possible because this crate is _not_ a public dependency of `regex`.

Another consequence of this decoupling is that there is no direct way to
compile a `regex::Regex` from a `regex_syntax::hir::Hir`. Instead, one must
first convert the `Hir` to a string (via its `std::fmt::Display`
implementation) and then compile that via `Regex::new`. While this does repeat
some work, compilation typically takes much longer than parsing.

Stated differently, the coupling between `regex` and `regex-syntax` exists only
at the level of the concrete syntax.