|
|
- Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
- ========
-
- A regex that tokenizes JavaScript.
-
- ```js
- var jsTokens = require("js-tokens").default
-
- var jsString = "var foo=opts.foo;\n..."
-
- jsString.match(jsTokens)
- // ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
- ```
-
-
- Installation
- ============
-
- `npm install js-tokens`
-
- ```js
- import jsTokens from "js-tokens"
- // or:
- var jsTokens = require("js-tokens").default
- ```
-
-
- Usage
- =====
-
- ### `jsTokens` ###
-
- A regex with the `g` flag that matches JavaScript tokens.
-
- The regex _always_ matches, even invalid JavaScript and the empty string.
-
- The next match is always directly after the previous.
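-
- Because each match starts where the previous one ended, joining all matches
- gives back the input. The loop below is a minimal sketch of that idea.
- `toyTokens` is a hypothetical stand-in regex (not the real `jsTokens`), used
- only so the snippet runs without the package, and `collectMatches` is an
- illustrative helper name.
-
- ```js
- // Hypothetical stand-in sharing the two properties described above: it has
- // the `g` flag and it matches at every position (the last alternative is
- // empty, so it also matches the empty string).
- var toyTokens = /\s+|\w+|[^\s\w]|/g
-
- // Collects every match of a global regex, stopping at the empty match that
- // marks the end of the input.
- function collectMatches(regex, input) {
-   var matches = []
-   var match
-   regex.lastIndex = 0
-   while ((match = regex.exec(input))) {
-     if (match[0] === "") break // empty match: nothing left to consume
-     matches.push(match[0])
-   }
-   return matches
- }
-
- var parts = collectMatches(toyTokens, "var foo=opts.foo;\n")
- // parts.join("") === "var foo=opts.foo;\n"
- ```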
-
- ### `var token = matchToToken(match)` ###
-
- ```js
- import {matchToToken} from "js-tokens"
- // or:
- var matchToToken = require("js-tokens").matchToToken
- ```
-
- Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
- String, value: String}` object. The following types are available:
-
- - string
- - comment
- - regex
- - number
- - name
- - punctuator
- - whitespace
- - invalid
-
- Multi-line comments and strings also have a `closed` property indicating if the
- token was closed or not (see below).
-
- Comments and strings both come in several flavors. To distinguish them, check if
- the token starts with `//`, `/*`, `'`, `"` or `` ` ``.
-
- Names are ECMAScript IdentifierNames, that is, including both identifiers and
- keywords. You may use [is-keyword-js] to tell them apart.
-
- Whitespace includes both line terminators and other whitespace.
-
- [is-keyword-js]: https://github.com/crissdev/is-keyword-js
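-
- Telling the comment and string flavors apart could look roughly like this.
- `tokenFlavor` is a hypothetical helper, not part of the package; it only
- assumes a token object of the `{type, value}` shape described above.
-
- ```js
- // Hypothetical helper: names the flavor of a comment or string token by
- // looking at the characters the token starts with.
- function tokenFlavor(token) {
-   var value = token.value
-   if (token.type === "comment") {
-     return value.slice(0, 2) === "//"
-       ? "single-line comment"
-       : "multi-line comment"
-   }
-   if (token.type === "string") {
-     if (value[0] === "'") return "single-quoted string"
-     if (value[0] === '"') return "double-quoted string"
-     return "template string"
-   }
-   return token.type // the other types have a single flavor
- }
-
- tokenFlavor({type: "comment", value: "/* note */", closed: true})
- // "multi-line comment"
- tokenFlavor({type: "string", value: "`a${b}`", closed: true})
- // "template string"
- ```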
-
-
- ECMAScript support
- ==================
-
- The intention is to always support the latest ECMAScript version whose feature
- set has been finalized.
-
- If adding support for a newer version requires changes, a new version with a
- major version bump will be released.
-
- Currently, ECMAScript 2018 is supported.
-
-
- Invalid code handling
- =====================
-
- Unterminated strings are still matched as strings. JavaScript strings cannot
- contain (unescaped) newlines, so unterminated strings simply end at the end of
- the line. Unterminated template strings can contain unescaped newlines, though,
- so they go on to the end of input.
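-
- The end-of-line rule for ordinary strings can be illustrated with a much
- simplified pattern. `toyString` is a hypothetical stand-in that handles only
- double-quoted strings at the start of the input (it is not the pattern used by
- `jsTokens`), and `matchString` is an illustrative helper. The closing quote is
- an optional capture group, so its presence tells whether the string was closed.
-
- ```js
- // Simplified sketch: a double quote, then any characters that are not a
- // quote, backslash or newline (or a backslash followed by any character),
- // then an optional closing quote.
- var toyString = /^"(?:[^"\\\r\n]|\\[\s\S])*(")?/
-
- function matchString(input) {
-   var match = toyString.exec(input)
-   if (!match) return null
-   return {type: "string", value: match[0], closed: match[1] !== undefined}
- }
-
- matchString('"done" + x')
- // {type: "string", value: '"done"', closed: true}
- matchString('"oops\nvar x = 1')
- // {type: "string", value: '"oops', closed: false}
- ```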
-
- Unterminated multi-line comments are also still matched as comments. They
- simply go on to the end of the input.
-
- Unterminated regex literals are likely matched as division and whatever is
- inside the regex.
-
- Invalid ASCII characters have their own capturing group.
-
- Invalid non-ASCII characters are treated as names, to simplify the matching of
- names (except unicode spaces which are treated as whitespace). Note: See also
- the [ES2018](#es2018) section.
-
- Regex literals may contain invalid regex syntax. They are still matched as
- regex literals. They may also contain repeated regex flags, to keep the regex
- simple.
-
- Strings may contain invalid escape sequences.
-
-
- Limitations
- ===========
-
- Tokenizing JavaScript using regexes—in fact, _one single regex_—won’t be
- perfect. But that’s not the point either.
-
- You may compare jsTokens with [esprima] by using `esprima-compare.js`.
- See `npm run esprima-compare`!
-
- [esprima]: http://esprima.org/
-
- ### Template string interpolation ###
-
- Template strings are matched as single tokens, from the starting `` ` `` to the
- ending `` ` ``, including interpolations (whose tokens are not matched
- individually).
-
- Matching template string interpolations requires recursive balancing of `{` and
- `}`—something that JavaScript regexes cannot do. Only one level of nesting is
- supported.
-
- ### Division and regex literals collision ###
-
- Consider this example:
-
- ```js
- var g = 9.82
- var number = bar / 2/g
-
- var regex = / 2/g
- ```
-
- A human can easily understand that in the `number` line we’re dealing with
- division, and in the `regex` line we’re dealing with a regex literal. How come?
- Because humans can look at the whole code to put the `/` characters in context.
- A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
- look backwards. See the [ES2018](#es2018) section.)
-
- When the `jsTokens` regex scans through the above, it will see the following
- at the end of both the `number` and `regex` rows:
-
- ```js
- / 2/g
- ```
-
- It is then impossible to know if that is a regex literal, or part of an
- expression dealing with division.
-
- Here is a similar case:
-
- ```js
- foo /= 2/g
- foo(/= 2/g)
- ```
-
- The first line divides the `foo` variable by `2/g`. The second line calls the
- `foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
- sees forwards, it cannot tell the two cases apart.
-
- There are some cases where we _can_ tell division and regex literals apart,
- though.
-
- First off, we have the simple cases where there’s only one slash in the line:
-
- ```js
- var foo = 2/g
- foo /= 2
- ```
-
- Regex literals cannot contain newlines, so the above cases are correctly
- identified as division. Things are only problematic when there is more than
- one non-comment slash in a single line.
-
- Secondly, not every character is a valid regex flag.
-
- ```js
- var number = bar / 2/e
- ```
-
- The above example is also correctly identified as division, because `e` is not a
- valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
- (any letter) as flags, but it is not worth it since it increases the number of
- ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
- allowed. This means that the above example will be identified as division as
- long as you don’t rename the `e` variable to some combination of the letters
- `gmiyus` that is 1 to 6 characters long.
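-
- The flag rule on its own can be sketched as follows. `flagLike` and
- `couldBeRegexFlags` are hypothetical names, and the pattern only mirrors the
- "letters from `gmiyus`, 1 to 6 characters" rule stated above; it does not
- reproduce the other checks the actual regex performs.
-
- ```js
- // Assumption: a name can be mistaken for regex flags only if it consists of
- // 1 to 6 letters taken from `gmiyus` (repeated letters are accepted).
- var flagLike = /^[gmiyus]{1,6}$/
-
- function couldBeRegexFlags(name) {
-   return flagLike.test(name)
- }
-
- couldBeRegexFlags("g")       // true: `bar / 2/g` stays ambiguous
- couldBeRegexFlags("e")       // false: `bar / 2/e` must be division
- couldBeRegexFlags("gg")      // true: repeated flags are accepted
- couldBeRegexFlags("gmiyusg") // false: 7 characters
- ```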
-
- Lastly, we can look _forward_ for information.
-
- - If the token following what looks like a regex literal is not valid after a
- regex literal, but is valid in a division expression, then the regex literal
- is treated as division instead. For example, a flagless regex cannot be
- followed by a string, number or name, but all of those three can be the
- denominator of a division.
- - Generally, if what looks like a regex literal is followed by an operator, the
- regex literal is treated as division instead. This is because regexes are
- seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
- could likely be part of such an expression.
-
- Please consult the regex source and the test cases for precise information on
- when regex or division is matched (should you need to know). In short, you
- could sum it up as:
-
- If the end of a statement looks like a regex literal (even if it isn’t), it
- will be treated as one. Otherwise it should work as expected (if you write sane
- code).
-
- ### ES2018 ###
-
- ES2018 added some nice regex improvements to the language.
-
- - [Unicode property escapes] should allow telling names and invalid non-ASCII
- characters apart without blowing up the regex size.
- - [Lookbehind assertions] should allow telling division and regex
- literals apart in more cases.
- - [Named capture groups] might simplify some things.
-
- These things would be nice to do, but are not critical. They probably have to
- wait until the oldest maintained Node.js LTS release supports those features.
-
- [Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
- [Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
- [Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html
-
-
- License
- =======
-
- [MIT](LICENSE).
|