| author | Shipwreckt <me@shipwreckt.co.uk> | 2025-10-31 20:02:14 +0000 |
|---|---|---|
| committer | Shipwreckt <me@shipwreckt.co.uk> | 2025-10-31 20:02:14 +0000 |
| commit | 7a52ddeba2a68388b544f529d2d92104420f77b0 (patch) | |
| tree | 15ddd47457a2cb4a96060747437d36474e4f6b4e /node_modules/moo/README.md | |
| parent | 53d6ae2b5568437afa5e4995580a3fb679b7b91b (diff) | |
Changed from static to 11ty!
Diffstat (limited to 'node_modules/moo/README.md')
| -rw-r--r-- | node_modules/moo/README.md | 383 |
1 files changed, 383 insertions, 0 deletions
diff --git a/node_modules/moo/README.md b/node_modules/moo/README.md
new file mode 100644
index 0000000..52b985e
--- /dev/null
+++ b/node_modules/moo/README.md
@@ -0,0 +1,383 @@
+
+
+Moo!
+====
+
+Moo is a highly-optimised tokenizer/lexer generator. Use it to tokenize your strings, before parsing 'em with a parser like [nearley](https://github.com/hardmath123/nearley) or whatever else you're into.
+
+* [Fast](#is-it-fast)
+* [Convenient](#usage)
+* uses [Regular Expressions](#on-regular-expressions)
+* tracks [Line Numbers](#line-numbers)
+* handles [Keywords](#keywords)
+* supports [States](#states)
+* custom [Errors](#errors)
+* is even [Iterable](#iteration)
+* has no dependencies
+* 4KB minified + gzipped
+* Moo!
+
+Is it fast?
+-----------
+
+Yup! Flying-cows-and-singed-steak fast.
+
+Moo is the fastest JS tokenizer around. It's **~2–10x** faster than most other tokenizers; it's a **couple orders of magnitude** faster than some of the slower ones.
+
+Define your tokens **using regular expressions**. Moo will compile 'em down to a **single RegExp for performance**. It uses the new ES6 **sticky flag** where possible to make things faster; otherwise it falls back to an almost-as-efficient workaround. (For more than you ever wanted to know about this, read [adventures in the land of substrings and RegExps](http://mrale.ph/blog/2016/11/23/making-less-dart-faster.html).)
+
+You _might_ be able to go faster still by writing your lexer by hand rather than using RegExps, but that's icky.
+
+Oh, and it [avoids parsing RegExps by itself](https://hackernoon.com/the-madness-of-parsing-real-world-javascript-regexps-d9ee336df983#.2l8qu3l76). Because that would be horrible.
+
+
+Usage
+-----
+
+First, you need to do the needful: `$ npm install moo`, or whatever will ship this code to your computer. Alternatively, grab the `moo.js` file by itself and slap it into your web page via a `<script>` tag; moo is completely standalone.
+
+Then you can start roasting your very own lexer/tokenizer:
+
+```js
+    const moo = require('moo')
+
+    let lexer = moo.compile({
+      WS: /[ \t]+/,
+      comment: /\/\/.*?$/,
+      number: /0|[1-9][0-9]*/,
+      string: /"(?:\\["\\]|[^\n"\\])*"/,
+      lparen: '(',
+      rparen: ')',
+      keyword: ['while', 'if', 'else', 'moo', 'cows'],
+      NL: { match: /\n/, lineBreaks: true },
+    })
+```
+
+And now throw some text at it:
+
+```js
+    lexer.reset('while (10) cows\nmoo')
+    lexer.next() // -> { type: 'keyword', value: 'while' }
+    lexer.next() // -> { type: 'WS', value: ' ' }
+    lexer.next() // -> { type: 'lparen', value: '(' }
+    lexer.next() // -> { type: 'number', value: '10' }
+    // ...
+```
+
+When you reach the end of Moo's internal buffer, next() will return `undefined`. You can always `reset()` it and feed it more data when that happens.
+
+
+On Regular Expressions
+----------------------
+
+RegExps are nifty for making tokenizers, but they can be a bit of a pain. Here are some things to be aware of:
+
+* You often want to use **non-greedy quantifiers**: e.g. `*?` instead of `*`. Otherwise your tokens will be longer than you expect:
+
+    ```js
+    let lexer = moo.compile({
+      string: /".*"/, // greedy quantifier *
+      // ...
+    })
+
+    lexer.reset('"foo" "bar"')
+    lexer.next() // -> { type: 'string', value: '"foo" "bar"' }
+    ```
+
+    Better:
+
+    ```js
+    let lexer = moo.compile({
+      string: /".*?"/, // non-greedy quantifier *?
+      // ...
+    })
+
+    lexer.reset('"foo" "bar"')
+    lexer.next() // -> { type: 'string', value: '"foo"' }
+    lexer.next() // -> { type: 'space', value: ' ' }
+    lexer.next() // -> { type: 'string', value: '"bar"' }
+    ```
+
+* The **order of your rules** matters. Earlier ones will take precedence.
+
+    ```js
+    moo.compile({
+        identifier: /[a-z0-9]+/,
+        number: /[0-9]+/,
+    }).reset('42').next() // -> { type: 'identifier', value: '42' }
+
+    moo.compile({
+        number: /[0-9]+/,
+        identifier: /[a-z0-9]+/,
+    }).reset('42').next() // -> { type: 'number', value: '42' }
+    ```
+
+* Moo uses **multiline RegExps**. This has a few quirks: for example, the **dot `/./` doesn't include newlines**. Use `[^]` instead if you want to match newlines too.
+
+* Since excluding character ranges like `/[^ ]/` (which matches anything but a space) _will_ include newlines, you have to be careful not to include them by accident! In particular, the whitespace metacharacter `\s` includes newlines.
+
+
+Line Numbers
+------------
+
+Moo tracks detailed information about the input for you.
+
+It will track line numbers, as long as you **apply the `lineBreaks: true` option to any rules which might contain newlines**. Moo will try to warn you if you forget to do this.
+
+Note that this is `false` by default, for performance reasons: counting the number of lines in a matched token has a small cost. For optimal performance, only match newlines inside a dedicated token:
+
+```js
+    newline: {match: '\n', lineBreaks: true},
+```
+
+
+### Token Info ###
+
+Token objects (returned from `next()`) have the following attributes:
+
+* **`type`**: the name of the group, as passed to compile.
+* **`text`**: the string that was matched.
+* **`value`**: the string that was matched, transformed by your `value` function (if any).
+* **`offset`**: the index from the start of the buffer where the match starts (counted in string characters, not bytes).
+* **`lineBreaks`**: the number of line breaks found in the match. (Always zero if this rule has `lineBreaks: false`.)
+* **`line`**: the line number of the beginning of the match, starting from 1.
+* **`col`**: the column where the match begins, starting from 1.
+
+
+### Value vs. Text ###
+
+The `value` is the same as the `text`, unless you provide a [value transform](#transform).
+
+```js
+const moo = require('moo')
+
+const lexer = moo.compile({
+  ws: /[ \t]+/,
+  string: {match: /"(?:\\["\\]|[^\n"\\])*"/, value: s => s.slice(1, -1)},
+})
+
+lexer.reset('"test"')
+lexer.next() /* { value: 'test', text: '"test"', ... } */
+```
+
+
+### Reset ###
+
+Calling `reset()` on your lexer will empty its internal buffer, and set the line, column, and offset counts back to their initial value.
+
+If you don't want this, you can `save()` the state, and later pass it as the second argument to `reset()` to explicitly control the internal state of the lexer.
+
+```js
+    lexer.reset('some line\n')
+    let info = lexer.save() // -> { line: 10 }
+    lexer.next() // -> { line: 10 }
+    lexer.next() // -> { line: 11 }
+    // ...
+    lexer.reset('a different line\n', info)
+    lexer.next() // -> { line: 10 }
+```
+
+
+Keywords
+--------
+
+Moo makes it convenient to define literals.
+
+```js
+    moo.compile({
+      lparen: '(',
+      rparen: ')',
+      keyword: ['while', 'if', 'else', 'moo', 'cows'],
+    })
+```
+
+It'll automatically compile them into regular expressions, escaping them where necessary.
+
+**Keywords** should be written using the `keywords` transform.
+
+```js
+    moo.compile({
+      IDEN: {match: /[a-zA-Z]+/, type: moo.keywords({
+        KW: ['while', 'if', 'else', 'moo', 'cows'],
+      })},
+      SPACE: {match: /\s+/, lineBreaks: true},
+    })
+```
+
+
+### Why? ###
+
+You need to do this to ensure the **longest match** principle applies, even in edge cases.
+
+Imagine trying to parse the input `className` with the following rules:
+
+```js
+    keyword: ['class'],
+    identifier: /[a-zA-Z]+/,
+```
+
+You'll get _two_ tokens (`['class', 'Name']`), which is _not_ what you want! If you swap the order of the rules, you'll fix this example; but now you'll lex `class` wrong (as an `identifier`).
+
+The keywords helper checks matches against the list of keywords; if any of them match, it uses the type `'keyword'` instead of `'identifier'` (for this example).
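
The lookup step described above can be sketched without moo itself. The following is a minimal, hypothetical stand-in, not moo's implementation: the names `keywordTypes` and `lexWord` are invented for illustration and are not part of moo's API. It matches the longest run of letters first, and only then checks whether the whole match is a keyword.

```js
// Hypothetical stand-in for a keywords helper (not moo's own code).
// Builds a word -> type lookup from an object like { keyword: ['class'] }.
function keywordTypes(map) {
  const lookup = new Map()
  for (const [type, words] of Object.entries(map)) {
    for (const word of [].concat(words)) lookup.set(word, type)
  }
  // Returns undefined for non-keywords so the caller can fall back.
  return text => lookup.get(text)
}

// Match the longest run of letters, then decide the type from the whole match.
function lexWord(input, typeOf) {
  const match = /^[a-zA-Z]+/.exec(input)
  if (!match) return null
  const text = match[0]
  return { type: typeOf(text) || 'identifier', text }
}

const typeOf = keywordTypes({ keyword: ['class'] })
const asIdentifier = lexWord('className', typeOf) // one token, not 'class' + 'Name'
const asKeyword = lexWord('class', typeOf)
```

Because the keyword check runs on the complete match, `className` stays a single identifier while a bare `class` is still typed as a keyword.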
+
+
+### Keyword Types ###
+
+Keywords can also have **individual types**.
+
+```js
+    let lexer = moo.compile({
+      name: {match: /[a-zA-Z]+/, type: moo.keywords({
+        'kw-class': 'class',
+        'kw-def': 'def',
+        'kw-if': 'if',
+      })},
+      // ...
+    })
+    lexer.reset('def foo')
+    lexer.next() // -> { type: 'kw-def', value: 'def' }
+    lexer.next() // space
+    lexer.next() // -> { type: 'name', value: 'foo' }
+```
+
+You can use `Object.fromEntries` to easily construct keyword objects:
+
+```js
+Object.fromEntries(['class', 'def', 'if'].map(k => ['kw-' + k, k]))
+```
+
+
+States
+------
+
+Moo allows you to define multiple lexer **states**. Each state defines its own separate set of token rules. Your lexer will start off in the first state given to `moo.states({})`.
+
+Rules can be annotated with `next`, `push`, and `pop`, to change the current state after that token is matched. A "stack" of past states is kept, which is used by `push` and `pop`.
+
+* **`next: 'bar'`** moves to the state named `bar`. (The stack is not changed.)
+* **`push: 'bar'`** moves to the state named `bar`, and pushes the old state onto the stack.
+* **`pop: 1`** removes one state from the top of the stack, and moves to that state. (Only `1` is supported.)
+
+Only rules from the current state can be matched. You need to copy your rule into all the states you want it to be matched in.
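
One way to avoid retyping a rule in every state is to keep the shared rules in a plain object and spread it into each state definition. This is ordinary JavaScript rather than a moo feature, and the state and rule names below (`shared`, `main`, `block`) are assumptions made up for this sketch:

```js
// Rules that should be matched in every state (hypothetical names).
const shared = {
  space: {match: /\s+/, lineBreaks: true},
  ident: /\w+/,
}

// Spread the shared rules into each state. Object key order is preserved,
// so the shared rules come first and therefore take precedence.
const states = {
  main: {...shared, lbrace: {match: '{', push: 'block'}},
  block: {...shared, rbrace: {match: '}', pop: 1}},
}

// With moo installed, this object would be passed to moo.states(states).
```

Since rule order matters, where the spread appears in each state decides whether the shared rules win over the state-specific ones.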
+
+For example, to tokenize JS-style string interpolation such as `a${{c: d}}e`, you might use:
+
+```js
+    let lexer = moo.states({
+      main: {
+        strstart: {match: '`', push: 'lit'},
+        ident: /\w+/,
+        lbrace: {match: '{', push: 'main'},
+        rbrace: {match: '}', pop: 1},
+        colon: ':',
+        space: {match: /\s+/, lineBreaks: true},
+      },
+      lit: {
+        interp: {match: '${', push: 'main'},
+        escape: /\\./,
+        strend: {match: '`', pop: 1},
+        const: {match: /(?:[^$`]|\$(?!\{))+/, lineBreaks: true},
+      },
+    })
+    // <= `a${{c: d}}e`
+    // => strstart const interp lbrace ident colon space ident rbrace rbrace const strend
+```
+
+The `rbrace` rule is annotated with `pop`, so it moves from the `main` state into either `lit` or `main`, depending on the stack.
+
+
+Errors
+------
+
+If none of your rules match, Moo will throw an Error, since it doesn't know what else to do.
+
+If you prefer, you can have moo return an error token instead of throwing an exception. The error token will contain the whole of the rest of the buffer.
+
+```js
+    let lexer = moo.compile({
+      // ...
+      myError: moo.error,
+    })
+
+    lexer.reset('invalid')
+    lexer.next() // -> { type: 'myError', value: 'invalid', text: 'invalid', offset: 0, lineBreaks: 0, line: 1, col: 1 }
+    lexer.next() // -> undefined
+```
+
+You can have a token type that both matches tokens _and_ contains error values.
+
+```js
+    moo.compile({
+      // ...
+      myError: {match: /[\$?`]/, error: true},
+    })
+```
+
+### Formatting errors ###
+
+If you want to throw an error from your parser, you might find `formatError` helpful. Call it with the offending token:
+
+```js
+throw new Error(lexer.formatError(token, "invalid syntax"))
+```
+
+It returns a string with a pretty error message.
+
+```
+Error: invalid syntax at line 2 col 15:
+
+  totally valid `syntax`
+                ^
+```
+
+
+Iteration
+---------
+
+Iterators: we got 'em.
+
+```js
+    for (let here of lexer) {
+      // here = { type: 'number', value: '123', ... }
+    }
+```
+
+Create an array of tokens.
+
+```js
+    let tokens = Array.from(lexer);
+```
+
+Use [itt](https://www.npmjs.com/package/itt)'s iteration tools with Moo.
+
+```js
+    for (let [here, next] of itt(lexer).lookahead()) { // pass a number if you need more tokens
+      // enjoy!
+    }
+```
+
+
+Transform
+---------
+
+Moo doesn't allow capturing groups, but you can supply a transform function, `value()`, which will be called on the value before storing it in the Token object.
+
+```js
+    moo.compile({
+      STRING: [
+        {match: /"""[^]*?"""/, lineBreaks: true, value: x => x.slice(3, -3)},
+        {match: /"(?:\\["\\rn]|[^"\\])*?"/, lineBreaks: true, value: x => x.slice(1, -1)},
+        {match: /'(?:\\['\\rn]|[^'\\])*?'/, lineBreaks: true, value: x => x.slice(1, -1)},
+      ],
+      // ...
+    })
+```
+
+
+Contributing
+------------
+
+Do check the [FAQ](https://github.com/tjvr/moo/issues?q=label%3Aquestion).
+
+Before submitting an issue, [remember...](https://github.com/tjvr/moo/blob/master/.github/CONTRIBUTING.md)
+
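
As a companion to the "Is it fast?" section of the README in this diff, the idea of compiling all rules into a single sticky RegExp can be sketched in plain JavaScript. This is only an illustration under simplifying assumptions, not moo's actual code: the rule table and the `tokenize` function are hypothetical, and the rules are assumed to contain no capturing groups of their own and never to match the empty string.

```js
// Hypothetical rule table; earlier rules take precedence.
const rules = [
  ['ws', /[ \t]+/],
  ['number', /[0-9]+/],
  ['word', /[a-z]+/],
]

// One capturing group per rule, joined with |, compiled with the sticky flag
// so each exec() only matches exactly at lastIndex.
const combined = new RegExp(rules.map(([, re]) => `(${re.source})`).join('|'), 'y')

function tokenize(input) {
  const tokens = []
  combined.lastIndex = 0
  while (combined.lastIndex < input.length) {
    const offset = combined.lastIndex
    const match = combined.exec(input)
    if (!match) throw new Error(`no rule matches at offset ${offset}`)
    // The index of the group that captured tells us which rule matched.
    const group = match.findIndex((g, i) => i > 0 && g !== undefined)
    tokens.push({ type: rules[group - 1][0], text: match[0], offset })
  }
  return tokens
}

const tokens = tokenize('moo 42')
```

The sticky flag lets the loop advance through the input via `lastIndex` without slicing the string on every step.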
