# Regenerate [![Build status](https://travis-ci.org/mathiasbynens/regenerate.svg?branch=master)](https://travis-ci.org/mathiasbynens/regenerate) [![Code coverage status](https://img.shields.io/codecov/c/github/mathiasbynens/regenerate.svg)](https://codecov.io/gh/mathiasbynens/regenerate) [![Dependency status](https://gemnasium.com/mathiasbynens/regenerate.svg)](https://gemnasium.com/mathiasbynens/regenerate)

_Regenerate_ is a Unicode-aware regex generator for JavaScript. It allows you to easily generate ES5-compatible regular expressions based on a given set of Unicode symbols or code points. (This is trickier than you might think, because of [how JavaScript deals with astral symbols](https://mathiasbynens.be/notes/javascript-unicode).)
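
To make that difficulty concrete, here is a minimal sketch in plain JavaScript (it does not use Regenerate) of how an astral symbol behaves in a regular expression without the ES6 `u` flag. The variable names are illustrative only.

```js
// U+1D306 lies outside the Basic Multilingual Plane, so JavaScript
// stores it as a surrogate pair of two 16-bit code units.
var symbol = '\uD834\uDF06';
var codeUnitCount = symbol.length;
// → 2

// Without the `u` flag, a character class matches one code unit at a time,
// so it sees two separate surrogates rather than one symbol.
var naiveClassMatches = /^[\uD834\uDF06]$/.test(symbol);
// → false

// Spelling the surrogate pair out as a sequence matches the whole symbol.
var sequenceMatches = /^\uD834\uDF06$/.test(symbol);
// → true
```

This is why the generated patterns place astral symbols outside character classes, as alternatives made of surrogate sequences.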

## Installation

Via [npm](https://npmjs.org/):

```bash
npm install regenerate
```

Via [Bower](http://bower.io/):

```bash
bower install regenerate
```

Via [Component](https://github.com/component/component):

```bash
component install mathiasbynens/regenerate
```

In a browser:

```html
<script src="regenerate.js"></script>
```

In [Node.js](https://nodejs.org/), [io.js](https://iojs.org/), and [RingoJS ≥ v0.8.0](http://ringojs.org/):

```js
var regenerate = require('regenerate');
```

In [Narwhal](http://narwhaljs.org/) and [RingoJS ≤ v0.7.0](http://ringojs.org/):

```js
var regenerate = require('regenerate').regenerate;
```

In [Rhino](http://www.mozilla.org/rhino/):

```js
load('regenerate.js');
```

Using an AMD loader like [RequireJS](http://requirejs.org/):

```js
require(
	{
		'paths': {
			'regenerate': 'path/to/regenerate'
		}
	},
	['regenerate'],
	function(regenerate) {
		console.log(regenerate);
	}
);
```

## API

### `regenerate(value1, value2, value3, ...)`

The main Regenerate function. Calling this function creates a new set that gets a chainable API.

```js
var set = regenerate()
	.addRange(0x60, 0x69) // add U+0060 to U+0069
	.remove(0x62, 0x64) // remove U+0062 and U+0064
	.add(0x1D306); // add U+1D306
set.valueOf();
// → [0x60, 0x61, 0x63, 0x65, 0x66, 0x67, 0x68, 0x69, 0x1D306]
set.toString();
// → '[`ace-i]|\\uD834\\uDF06'
set.toRegExp();
// → /[`ace-i]|\uD834\uDF06/
```
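
The `\uD834\uDF06` part of that output is the UTF-16 surrogate pair for U+1D306. As a rough sketch of the standard arithmetic behind it (the `toSurrogatePair` helper is hypothetical and not part of Regenerate's API):

```js
// Convert an astral code point (U+10000 to U+10FFFF) to its surrogate pair.
// Hypothetical helper for illustration; Regenerate does not expose it.
function toSurrogatePair(codePoint) {
	var offset = codePoint - 0x10000;
	var high = 0xD800 + (offset >> 10); // top 10 bits of the offset
	var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits of the offset
	return [high, low];
}

var pair = toSurrogatePair(0x1D306);
var hex = pair.map(function(unit) {
	return unit.toString(16).toUpperCase();
});
// → ['D834', 'DF06']
```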

Any arguments passed to `regenerate()` will be added to the set right away. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

### `regenerate.prototype.add(value1, value2, value3, ...)`

Any arguments passed to `add()` are added to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate().add(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'

var items = [0x1D306, 'A', '©', 0x2603];
regenerate().add(items).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

It’s also possible to pass in a Regenerate instance. Doing so adds all code points in that instance to the current set.

```js
var set = regenerate(0x1D306, 'A');
regenerate().add('©', 0x2603).add(set).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

Note that the initial call to `regenerate()` acts like `add()`. This allows you to create a new Regenerate instance and add some code points to it in one go:

```js
regenerate(0x1D306, 'A', '©', 0x2603).toString();
// → '[A\\xA9\\u2603]|\\uD834\\uDF06'
```

### `regenerate.prototype.remove(value1, value2, value3, ...)`

Any arguments passed to `remove()` are removed from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted, as well as arrays containing values of these types.

```js
regenerate(0x1D306, 'A', '©', 0x2603).remove('☃').toString();
// → '[A\\xA9]|\\uD834\\uDF06'
```

It’s also possible to pass in a Regenerate instance. Doing so removes all code points in that instance from the current set.

```js
var set = regenerate('☃');
regenerate(0x1D306, 'A', '©', 0x2603).remove(set).toString();
// → '[A\\xA9]|\\uD834\\uDF06'
```

### `regenerate.prototype.addRange(start, end)`

Adds a range of code points from `start` to `end` (inclusive) to the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
regenerate(0x1D306).addRange(0x00, 0xFF).toString();
// → '[\\0-\\xFF]|\\uD834\\uDF06'

regenerate().addRange('A', 'z').toString();
// → '[A-z]'
```

### `regenerate.prototype.removeRange(start, end)`

Removes a range of code points from `start` to `end` (inclusive) from the set. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
regenerate()
	.addRange(0x000000, 0x10FFFF) // add all Unicode code points
	.removeRange('A', 'z') // remove all symbols from `A` to `z`
	.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

regenerate()
	.addRange(0x000000, 0x10FFFF) // add all Unicode code points
	.removeRange(0x0041, 0x007A) // remove all code points from U+0041 to U+007A
	.toString();
// → '[\\0-@\\{-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
```

### `regenerate.prototype.intersection(codePoints)`

Removes any code points from the set that are not present in both the set and the given `codePoints` array. `codePoints` must be an array of numeric code point values, i.e. numbers.

```js
regenerate()
	.addRange(0x00, 0xFF) // add extended ASCII code points
	.intersection([0x61, 0x69]) // remove all code points from the set except for these
	.toString();
// → '[ai]'
```

Instead of the `codePoints` array, it’s also possible to pass in a Regenerate instance.

```js
var whitelist = regenerate(0x61, 0x69);

regenerate()
	.addRange(0x00, 0xFF) // add extended ASCII code points
	.intersection(whitelist) // remove all code points from the set except for those in the `whitelist` set
	.toString();
// → '[ai]'
```

### `regenerate.prototype.contains(value)`

Returns `true` if the given value is part of the set, and `false` otherwise. Both code points (numbers) and symbols (strings consisting of a single Unicode symbol) are accepted.

```js
var set = regenerate().addRange(0x00, 0xFF);
set.contains('A');
// → true
set.contains(0x1D306);
// → false
```

### `regenerate.prototype.clone()`

Returns a clone of the current code point set. Any actions performed on the clone won’t mutate the original set.

```js
var setA = regenerate(0x1D306);
var setB = setA.clone().add(0x1F4A9);
setA.toArray();
// → [0x1D306]
setB.toArray();
// → [0x1D306, 0x1F4A9]
```

### `regenerate.prototype.toString(options)`

Returns a string representing (part of) a regular expression that matches all the symbols mapped to the code points within the set.

```js
regenerate(0x1D306, 0x1F4A9).toString();
// → '\\uD834\\uDF06|\\uD83D\\uDCA9'
```

If the `bmpOnly` property of the optional `options` object is set to `true`, the output matches surrogates individually, regardless of whether they’re lone surrogates or just part of a surrogate pair. This simplifies the output, but use it only if you’re certain that the strings it will be applied to don’t contain any astral symbols.

```js
var highSurrogates = regenerate().addRange(0xD800, 0xDBFF);
highSurrogates.toString();
// → '[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])'
highSurrogates.toString({ 'bmpOnly': true });
// → '[\\uD800-\\uDBFF]'

var lowSurrogates = regenerate().addRange(0xDC00, 0xDFFF);
lowSurrogates.toString();
// → '(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'
lowSurrogates.toString({ 'bmpOnly': true });
// → '[\\uDC00-\\uDFFF]'
```

Note that lone low surrogates cannot be matched accurately using regular expressions in JavaScript. Regenerate’s output is a best-effort approximation, but [there can be false negatives in this regard](https://github.com/mathiasbynens/regenerate/issues/28#issuecomment-72224808).
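
As an illustration of one way such a false negative can arise (the linked issue may cover other cases), the sketch below runs the lone-low-surrogate pattern shown above against a string with two consecutive lone low surrogates. It uses plain JavaScript only.

```js
// The pattern Regenerate produces for lone low surrogates (copied from above),
// with the `g` flag added so that all matches are collected.
var loneLow = /(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/g;

// 'a' followed by two lone low surrogates.
var input = 'a\uDC00\uDC00';
var matches = input.match(loneLow);
// → ['a\uDC00']
```

The pattern has to consume the code unit in front of the surrogate to check that it is not a high surrogate. The match therefore includes the preceding `a`, and the second lone low surrogate is missed.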

If the `hasUnicodeFlag` property of the optional `options` object is set to `true`, the output makes use of Unicode code point escapes (`\u{…}`) where applicable. This simplifies the output at the cost of compatibility and portability, since it means the output can only be used as a pattern in a regular expression with [the ES6 `u` flag](https://mathiasbynens.be/notes/es6-unicode-regex) enabled.

```js
var set = regenerate().addRange(0x0, 0x10FFFF);

set.toString();
// → '[\\0-\\uD7FF\\uE000-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]'

set.toString({ 'hasUnicodeFlag': true });
// → '[\\0-\\u{10FFFF}]'
```
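
The following sketch shows why output containing `\u{…}` escapes depends on the `u` flag. It hard-codes the pattern string from the example above rather than calling Regenerate, adds `^` and `$` anchors for the test, and assumes an ES6-capable engine.

```js
// Pattern string as produced with `hasUnicodeFlag: true` (copied from above),
// anchored here so that it must match exactly one symbol.
var source = '^[\\0-\\u{10FFFF}]$';
var astral = '\uD834\uDF06'; // U+1D306

// With the `u` flag, `\u{10FFFF}` is a code point escape.
var withFlag = new RegExp(source, 'u').test(astral);
// → true

// Without it, the escape is not recognized and the pattern does not match.
var withoutFlag = new RegExp(source).test(astral);
// → false
```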

### `regenerate.prototype.toRegExp(flags = '')`

Returns a regular expression that matches all the symbols mapped to the code points within the set. Optionally, you can pass [flags](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#Parameters) to be added to the regular expression.

```js
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp();
// → /\uD834\uDF06|\uD83D\uDCA9/
regex.test('𝌆');
// → true
regex.test('A');
// → false

// With flags:
var regex = regenerate(0x1D306, 0x1F4A9).toRegExp('g');
// → /\uD834\uDF06|\uD83D\uDCA9/g
```

**Note:** This probably shouldn’t be used. Regenerate is intended as a tool that is used as part of a build process, not at runtime.
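
One possible shape for such a build step is sketched below. It is not taken from the Regenerate project: the output directory and file name are hypothetical, and the pattern string is hard-coded (copied from the `toString()` example output) so that the sketch runs without Regenerate installed.

```js
var fs = require('fs');
var os = require('os');
var path = require('path');

// In a real build script this string would come from something like
// `regenerate(0x1D306, 0x1F4A9).toString()`.
var source = '\\uD834\\uDF06|\\uD83D\\uDCA9';

// Write a tiny module that exports the pattern as a regex literal.
// A temporary directory stands in for the project's real output location.
var outputDir = fs.mkdtempSync(path.join(os.tmpdir(), 'regenerate-build-'));
var outputFile = path.join(outputDir, 'astral-regex.js');
fs.writeFileSync(outputFile, 'module.exports = /' + source + '/;\n');

// At runtime, the application loads only the generated module.
var generated = require(outputFile);
generated.test('\uD834\uDF06');
// → true
```

A pattern containing an unescaped `/` would need extra escaping before being embedded in a regex literal like this.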

### `regenerate.prototype.valueOf()` or `regenerate.prototype.toArray()`

Returns a sorted array of unique code points in the set.

```js
regenerate(0x1D306)
	.addRange(0x60, 0x65)
	.add(0x59, 0x60) // note: 0x59 is added after 0x65, and 0x60 is a duplicate
	.valueOf();
// → [0x59, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x1D306]
```

### `regenerate.version`

A string representing the semantic version number.

## Combine Regenerate with other libraries

Regenerate gets even better when combined with other libraries such as [Punycode.js](https://mths.be/punycode). Here’s an example where [Punycode.js](https://mths.be/punycode) is used to convert a string into an array of code points, which is then passed on to Regenerate:

```js
var regenerate = require('regenerate');
var punycode = require('punycode');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all code points used in the string:
var codePoints = punycode.ucs2.decode(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(codePoints).toString();
// → '[ \\.Ladeilmopr-u]'
```

In ES6 you can do something similar with [`Array.from`](https://mths.be/array-from), which uses [the string’s iterator](https://mathiasbynens.be/notes/javascript-unicode#iterating-over-symbols) to split the given string into an array of strings that each contain a single symbol. [`regenerate()`](#regenerateprototypeaddvalue1-value2-value3-) accepts both strings and code points, remember?

```js
var regenerate = require('regenerate');

var string = 'Lorem ipsum dolor sit amet.';
// Get an array of all symbols used in the string:
var symbols = Array.from(string);

// Generate a regular expression that matches any of the symbols used in the string:
regenerate(symbols).toString();
// → '[ \\.Ladeilmopr-u]'
```
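
The example string above is pure ASCII, so the benefit of iterating by symbol is not visible there. The following sketch uses plain ES6 only (it does not call Regenerate) to show the difference on a string that contains an astral symbol. The variable names are illustrative.

```js
var string = 'a\uD834\uDF06'; // 'a' followed by U+1D306

// Splitting by code unit breaks the surrogate pair apart.
var codeUnits = string.split('');
// codeUnits.length → 3

// The string iterator used by `Array.from` keeps the pair together.
var symbols = Array.from(string);
// symbols.length → 2

// Map each symbol to its numeric code point.
var codePoints = symbols.map(function(symbol) {
	return symbol.codePointAt(0);
});
// → [0x61, 0x1D306]
```

Either `symbols` or `codePoints` has the form of argument that `regenerate()` is documented to accept.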

## Support

Regenerate supports at least Chrome 27+, Firefox 3+, Safari 4+, Opera 10+, IE 6+, Node.js v0.10.0+, io.js v1.0.0+, Narwhal 0.3.2+, RingoJS 0.8+, PhantomJS 1.9.0+, and Rhino 1.7RC4+.

## Unit tests & code coverage

After cloning this repository, run `npm install` to install the dependencies needed for Regenerate development and testing. You may want to install Istanbul _globally_ using `npm install istanbul -g`.

Once that’s done, you can run the unit tests in Node using `npm test` or `node tests/tests.js`. To run the tests in Rhino, Ringo, Narwhal, and web browsers as well, use `grunt test`.

To generate the code coverage report, use `grunt cover`.

## Author

| [![twitter/mathias](https://gravatar.com/avatar/24e08a9ea84deb17ae121074d0f17125?s=70)](https://twitter.com/mathias "Follow @mathias on Twitter") |
|---|
| [Mathias Bynens](https://mathiasbynens.be/) |

## License

Regenerate is available under the [MIT](https://mths.be/mit) license.