This tip discusses some useful regular expressions you can use programmatically or in editors that support regex.
A regular expression, or regex, is one of the most powerful ways to modify text and can save hours of tedious manual editing. A regex lets you find, or find and replace, text using patterns, and a well-written pattern can match many variations in a single operation. The examples below show some of that power. They all use "captures", which are sub-patterns enclosed in parentheses () in the find expression. The text matched by each capture can be included in the replace expression using $1, $2 etc. Note the use of the \ escape character before each / in the http:// and https:// protocol text (the / would otherwise end a JavaScript regex literal), and the use of the g (global) option in the JavaScript example to apply the regex to all matches.
For a proper tutorial, go to https://regexone.com/ and to build and test your regular expressions, go to https://regex101.com/.
Convert email addresses in a text string to hyperlinks:
| Find expression: | (.*?)( |^)(\S+)@(\S+)(\s|$) |
| Replace expression: | $1$2<a href="mailto:$3@$4">$3@$4</a>$5 |
| Test string: | Email bob@alpha.vic.gov.au for further details or email john@delta.org to register. |
| Result string: | Email <a href="mailto:bob@alpha.vic.gov.au">bob@alpha.vic.gov.au</a> for further details or email <a href="mailto:john@delta.org">john@delta.org</a> to register. |
| JavaScript example: | var testString = "Email bob@alpha.vic.gov.au for further details or email john@delta.org to register."; var result = testString.replace(/(.*?)( |^)(\S+)@(\S+)(\s|$)/g,"$1$2<a href='mailto:$3@$4'>$3@$4</a>$5"); |
| How it works: | There are five "captures" in this example, each enclosed in parentheses ( ), that match the text before and after the @ character of an email address. The first capture (.*?) matches zero or more characters up to the second capture ( |^), which is either a space or the ^ token. The ^ token denotes the start of the line and is needed to find an email address at the very start of the string, in which case the first capture is empty. Note that the ? in the first capture makes the match lazy, which means it stops at the space before the first email name. If the ? is removed, the match becomes greedy and keeps going to the last email address in the test string, skipping over all other email addresses. The third capture (\S+) matches one or more non-space characters up to the @ character, which is the name portion of an email address. In the example, the first capture matches "Email", the second capture matches a space and the third capture matches "bob". The fourth capture (\S+) matches one or more non-space characters after the @, which is the domain part of the email address: "alpha.vic.gov.au". The fifth capture (\s|$) matches the whitespace following the domain or the end of the line ($ token). When the g option is applied to the regex, the whole pattern is applied repeatedly to match further email addresses in the string. |
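The difference between the lazy and greedy forms of the first capture can be checked with a short sketch. The helper name `linkEmails` and its `lazy` flag are illustrative only and are not part of the original tip; the patterns and test string are the ones shown above.

```javascript
// Hypothetical helper: wraps email addresses in mailto: links.
// lazy = true uses the (.*?) form from the tip;
// lazy = false swaps in the greedy (.*) form for comparison.
function linkEmails(text, lazy = true) {
  const pattern = lazy
    ? /(.*?)( |^)(\S+)@(\S+)(\s|$)/g
    : /(.*)( |^)(\S+)@(\S+)(\s|$)/g;
  return text.replace(pattern, '$1$2<a href="mailto:$3@$4">$3@$4</a>$5');
}

const emailTest =
  "Email bob@alpha.vic.gov.au for further details or email john@delta.org to register.";

// Lazy: both addresses become links.
const lazyEmailResult = linkEmails(emailTest);
// Greedy: the first capture swallows bob@alpha.vic.gov.au,
// so only the last address becomes a link.
const greedyEmailResult = linkEmails(emailTest, false);

console.log(lazyEmailResult);
console.log(greedyEmailResult);
```

Because \S+ accepts any run of non-space characters, this pattern is a convenience for clean text rather than a strict email validator.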
Convert web URLs in a text string to hyperlinks:
| Find expression: | (.*?)( |^)(http:\/\/|https:\/\/)(\S+)(\s|$) |
| Replace expression: | $1$2<a href="$3$4">$4</a>$5 |
| Test string: | Go to http://alpha.vic.gov.au for further details or https://www.delta.org to register. |
| Result string: | Go to <a href="http://alpha.vic.gov.au">alpha.vic.gov.au</a> for further details or <a href="https://www.delta.org">www.delta.org</a> to register. |
| How it works: | This works in a similar manner to the email example above, except that the captures centre on the protocol portion "http://" or "https://". Note that the protocol is removed from the anchor text for clarity, but it can be included by using $3$4 instead of just $4 between the anchor tags. |
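The URL conversion above, including the $3$4 variation for the anchor text, might look like this in JavaScript. The function name `linkUrls` and its `showProtocol` parameter are assumptions made for this sketch.

```javascript
// Hypothetical helper: wraps http:// and https:// URLs in anchor tags.
// showProtocol = true keeps the protocol in the visible text ($3$4);
// the default shows only the part after the protocol ($4).
function linkUrls(text, showProtocol = false) {
  const pattern = /(.*?)( |^)(http:\/\/|https:\/\/)(\S+)(\s|$)/g;
  const label = showProtocol ? "$3$4" : "$4";
  return text.replace(pattern, `$1$2<a href="$3$4">${label}</a>$5`);
}

const urlTest =
  "Go to http://alpha.vic.gov.au for further details or https://www.delta.org to register.";

const urlResult = linkUrls(urlTest);
const urlResultWithProtocol = linkUrls(urlTest, true);

console.log(urlResult);
console.log(urlResultWithProtocol);
```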
Convert web URLs in a text string to hyperlinks with just the filename visible:
| Find expression: | (.*?)( |^)(http:\/\/|https:\/\/)(\S+)(\/)(\S*?)(\/?)(\S*?)(\s|$) |
| Replace expression: | $1$2<a href="$3$4$5$6$7$8" target="_blank">$8</a>$9 |
| Test string: | Go to http://alpha.vic.gov.au/docs/help.pdf for further details or https://www.delta.org to register. |
| Result string: | Go to <a href="http://alpha.vic.gov.au/docs/help.pdf" target="_blank">help.pdf</a> for further details or https://www.delta.org to register. |
| How it works: | Additional captures are added to reliably identify and break up the web URL so that the filename portion can be separated. Note that the second URL in the example has no path, so it is left untouched in the result string. However, the previous regex can then be applied to process it, leaving the first filename link untouched. This example also includes a target="_blank" attribute to open the PDF in a new window. |
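The two-pass approach described above (filename links first, then the plain URL regex for whatever is left) can be sketched as follows. The variable names are illustrative; the patterns and replace expressions are the ones from the two examples.

```javascript
// Pattern from this example: isolates the filename in capture 8.
const filenamePattern =
  /(.*?)( |^)(http:\/\/|https:\/\/)(\S+)(\/)(\S*?)(\/?)(\S*?)(\s|$)/g;
// Pattern from the previous example: any http:// or https:// URL.
const plainUrlPattern = /(.*?)( |^)(http:\/\/|https:\/\/)(\S+)(\s|$)/g;

const fileTest =
  "Go to http://alpha.vic.gov.au/docs/help.pdf for further details or https://www.delta.org to register.";

// Pass 1: only the URL that contains a path is converted.
const filePass1 = fileTest.replace(
  filenamePattern,
  '$1$2<a href="$3$4$5$6$7$8" target="_blank">$8</a>$9'
);

// Pass 2: the remaining bare URL is converted. The link from pass 1 is
// skipped because its URL now follows a quote, not a space or line start.
const filePass2 = filePass1.replace(
  plainUrlPattern,
  '$1$2<a href="$3$4">$4</a>$5'
);

console.log(filePass1);
console.log(filePass2);
```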
Add a visible URL to an existing anchor tag that has no link text:
| Find expression: | (.*?)(")([^"]+?)(")\s*(>)\s*(<\/a>)(.*?) |
| Replace expression: | $1$2$3$4$5$3$6$7 |
| Test string: | Go to <a href="http://www.alpha.vic.gov.au/docs/help.pdf">help.pdf</a> for further details or <a href="https://john.delta.org" > </a> for other info. |
| Result string: | Go to <a href="http://www.alpha.vic.gov.au/docs/help.pdf">help.pdf</a> for further details or <a href="https://john.delta.org">https://john.delta.org</a> for other info. |
| How it works: | Similar captures are used to reliably identify and break up the tags surrounding the web URL so that the URL portion ($3) is separated. The replace expression uses $3 twice: once inside the href quotes and once between the opening and closing anchor tags, which makes the URL visible. As a variation, the match is made a little more robust by adding a \s* token after the closing quote of the href string and between the anchor tags. This detects any additional whitespace in these positions in the test string. The whitespace is not captured, so it does not appear in the result string. |
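A minimal JavaScript sketch of this replacement is shown below. The function name `fillEmptyAnchors` is hypothetical; the find and replace expressions are taken from the table above.

```javascript
// Hypothetical helper: copies the quoted value that sits directly before
// an empty <a ...></a> pair into the anchor text.
function fillEmptyAnchors(html) {
  const pattern = /(.*?)(")([^"]+?)(")\s*(>)\s*(<\/a>)(.*?)/g;
  // $3 appears twice: inside the quotes and between the anchor tags.
  return html.replace(pattern, "$1$2$3$4$5$3$6$7");
}

const anchorTest =
  'Go to <a href="http://www.alpha.vic.gov.au/docs/help.pdf">help.pdf</a> ' +
  'for further details or <a href="https://john.delta.org" > </a> for other info.';

const anchorResult = fillEmptyAnchors(anchorTest);
console.log(anchorResult);
```

One caveat of this pattern: $3 is whichever quoted value comes immediately before the closing >, so if another attribute such as target="_blank" follows the href, that attribute's value would be inserted instead of the URL.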
Using lookaheads and lookbehinds:
| zero-width positive lookahead: | a(?=b) matches an a that is followed by b (the b is not part of the match) |
| zero-width positive lookbehind: | (?<=a)b matches a b that is preceded by a (the a is not part of the match) |
| non-zero-width: | [abc] matches, and consumes, a single character that is a, b or c |
| zero-width negative lookahead: | a(?!b) matches an a that is not followed by b |
| zero-width negative lookbehind: | (?<!a)b matches a b that is not preceded by a |
| non-zero-width: | [^abc] matches, and consumes, a single character that is not a, b or c |
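The six patterns in the table can be compared side by side with a small sketch. The sample string and variable names are made up for illustration. Each match is replaced with a marker so it is easy to see which characters were matched; note that lookbehind needs a JavaScript engine that supports ES2018 or later.

```javascript
const lookSample = "ab ac cb";

// Lookarounds are zero width: only the a or b itself is replaced,
// the neighbouring character that was checked is left alone.
const posLookahead = lookSample.replace(/a(?=b)/g, "A");   // a followed by b
const negLookahead = lookSample.replace(/a(?!b)/g, "A");   // a not followed by b
const posLookbehind = lookSample.replace(/(?<=a)b/g, "B"); // b preceded by a
const negLookbehind = lookSample.replace(/(?<!a)b/g, "B"); // b not preceded by a

// Character classes consume the character they match.
const inClass = lookSample.replace(/[abc]/g, "x");     // each a, b or c
const notInClass = lookSample.replace(/[^abc]/g, "_"); // everything else

console.log(posLookahead);  // Ab ac cb
console.log(negLookahead);  // ab Ac cb
console.log(posLookbehind); // aB ac cb
console.log(negLookbehind); // ab ac cB
console.log(inClass);       // xx xx xx
console.log(notInClass);    // ab_ac_cb
```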