Grep and regular expressions to analyze text

Learn how to use grep and regular expressions to analyze text

Introduction

This course shows how to use grep and regular expressions to analyze text, two tools that every scientific programmer should know. Grep can be used to match literal patterns within a text file: if you pass grep a word to search for, it prints every line in the file containing that word. A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Literal search

In a command such as grep "GNU" file, the first argument, "GNU", is the pattern we are searching for, while the second argument, "file", is the input file we wish to search.
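A minimal sketch of a literal search. The sample text and the variable names (sample, literal_matches) are made up for illustration and stand in for the course's own input file:

```shell
# Hypothetical sample file; the course's real input file is not shown here.
sample=$(mktemp)
printf '%s\n' \
  'GNU GENERAL PUBLIC LICENSE' \
  'Version 3, 29 June 2007' \
  'The GNU General Public License is a free, copyleft license.' > "$sample"

# First argument: the pattern. Second argument: the file to search.
literal_matches=$(grep "GNU" "$sample")

# Prints the first and third lines, the only ones containing "GNU".
printf '%s\n' "$literal_matches"
rm -f "$sample"
```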

Case insensitive search

By default, grep searches for the exact specified pattern within the input file and returns the lines it finds. We can make this behavior more useful by adding optional flags. If we want grep to ignore the case of our search pattern and match upper-case, lower-case, and mixed-case variations alike, we can specify the "-i" or "--ignore-case" option. For example, grep -i "license" file finds each instance of the word "license" (with upper, lower, or mixed case) in the same file as before.
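To compare the default behavior with a case-insensitive search, the sketch below runs both against a small made-up file. The sample text and variable names are assumptions for illustration only:

```shell
# Hypothetical sample file with "license" in several capitalizations.
sample=$(mktemp)
printf '%s\n' \
  'GNU GENERAL PUBLIC LICENSE' \
  'Version 3, 29 June 2007' \
  'Everyone is permitted to copy this license document.' \
  'The License applies to most software.' > "$sample"

# Default: only the exact lower-case spelling matches (one line).
case_sensitive=$(grep "license" "$sample")

# With -i: LICENSE, license, and License all match (three lines).
case_insensitive=$(grep -i "license" "$sample")

printf '%s\n' "$case_insensitive"
rm -f "$sample"
```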

Invert search

If we want to find all lines that do not contain a specified pattern, we can use the "-v" or "--invert-match" option. For example, searching the BSD license for the pattern "the" with this option prints every line that does not contain the string "the".
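A short sketch of an inverted search. The lines below are invented to resemble license text and are not the actual BSD license; the variable names are likewise hypothetical. Note that the pattern is matched as a substring and case-sensitively, so "other" excludes a line while "THE" does not:

```shell
# Hypothetical sample file loosely modelled on license wording.
sample=$(mktemp)
printf '%s\n' \
  'Redistribution and use in source and binary forms are permitted' \
  'provided that the following conditions are met:' \
  'Neither name may be used to endorse other products.' \
  'THE SOFTWARE IS PROVIDED AS IS.' > "$sample"

# -v keeps only the lines that do NOT contain "the".
# Line 2 contains "the"; line 3 contains it inside "Neither" and "other".
inverted_matches=$(grep -v "the" "$sample")

# Prints the first and last lines.
printf '%s\n' "$inverted_matches"
rm -f "$sample"
```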

Anchor Matches

Anchors are special characters that specify where in the line a match must occur to be valid. For instance, using anchors, we can specify that we only want to know about the lines that match "the" at the very beginning of the line. To do this, we place the "^" anchor before the literal string. The pattern "^the" will only match "the" if it occurs at the very beginning of a line.
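The start-of-line anchor can be tried out as follows. The sample lines and variable names are made up for illustration:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'the quick brown fox' \
  'jumps over the lazy dog' \
  'then it rests' \
  'The end' > "$sample"

# "^the" matches only where "the" starts the line.
# "then" also qualifies, because it begins with the same three letters;
# "The end" does not, because the search is case-sensitive.
start_matches=$(grep "^the" "$sample")

# Prints "the quick brown fox" and "then it rests".
printf '%s\n' "$start_matches"
rm -f "$sample"
```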

Similarly, the "$" anchor can be used after a string to indicate that the match is only valid if it occurs at the very end of a line. For example, the regular expression "\.$" matches every line ending with a period; the period is escaped as "\." so that it is treated as a literal character.
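A sketch of the end-of-line anchor, using an invented sample file and hypothetical variable names. Single quotes are used here so that the shell passes the backslash and dollar sign to grep unchanged:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'This line ends with a period.' \
  'This one does not' \
  'Version 3.0 released' \
  'A question?' > "$sample"

# \. is a literal period; $ anchors it to the end of the line.
# "Version 3.0 released" contains a period, but not at the end.
end_matches=$(grep '\.$' "$sample")

# Prints only the first line.
printf '%s\n' "$end_matches"
rm -f "$sample"
```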

Matching Any Character

The period character (.) is used in regular expressions to mean that any single character can exist at the specified location. For example, if we want to match anything that has two characters followed by the string "cept", we could use the pattern "..cept".
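The following sketch applies a two-wildcard pattern to a made-up sample file (the text and variable names are assumptions). Each period must consume exactly one character, so a line where "cept" has nothing in front of it does not match:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'We accept the terms.' \
  'There is one exception.' \
  'A new concept' \
  'cept alone' \
  'Nothing here' > "$sample"

# Two arbitrary characters, then the literal string "cept".
# Matches inside "accept", "exception", and "concept".
any_char_matches=$(grep "..cept" "$sample")

# Prints the first three lines.
printf '%s\n' "$any_char_matches"
rm -f "$sample"
```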

Bracket Expressions

By placing a group of characters within brackets ("[" and "]"), we can specify that the character at that position can be any one character found within the bracket group. This means that if we wanted to find the lines that contain "too" or "two", we could specify those variations succinctly with the pattern "t[wo]o".
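A minimal sketch of a bracket expression, run against an invented sample file with hypothetical variable names:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'There are two options.' \
  'That is too many.' \
  'Go to the store.' > "$sample"

# "t", then either "w" or "o", then "o": matches "two" and "too".
# The plain word "to" has only two letters and does not match.
bracket_matches=$(grep "t[wo]o" "$sample")

# Prints the first two lines.
printf '%s\n' "$bracket_matches"
rm -f "$sample"
```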

Bracket notation also allows us some interesting options. We can have the pattern match anything except the characters within a bracket by beginning the list of characters within the brackets with a "^" character. For example, the pattern "[^c]ode" is like the pattern ".ode", but will not match the string "code".
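A negated bracket expression can be sketched like this, again with made-up sample text and variable names. The negated bracket still has to match one character, so "ode" at the very start of a line is not a match:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'the code compiles' \
  'switch the mode' \
  'decode the message' \
  'a new node' \
  'ode to joy' > "$sample"

# Any single character except "c", followed by "ode".
# "code" and "decode" are rejected because "c" precedes "ode".
negated_matches=$(grep "[^c]ode" "$sample")

# Prints "switch the mode" and "a new node".
printf '%s\n' "$negated_matches"
rm -f "$sample"
```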

Another helpful feature of brackets is that you can specify a range of characters instead of individually typing every available character. This means that if we want to find every line that begins with a capital letter, we can use the pattern "^[A-Z]".
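A sketch of a character range combined with the start-of-line anchor. The sample text and variable names are hypothetical. Setting LC_ALL=C is an addition not mentioned in the course: the meaning of a range such as A-Z can depend on the locale, and the C locale is assumed here to keep it limited to the ASCII capital letters:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'GNU General Public License' \
  'version 3' \
  'Preamble' \
  '  Indented line' \
  '3 clauses' > "$sample"

# ^ anchors to the line start; [A-Z] is any one capital letter.
# LC_ALL=C (an assumption of this sketch) fixes the range to ASCII A-Z.
capital_lines=$(LC_ALL=C grep "^[A-Z]" "$sample")

# Prints "GNU General Public License" and "Preamble".
printf '%s\n' "$capital_lines"
rm -f "$sample"
```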

Repeat Pattern Zero or More Times

The asterisk (*) meta-character repeats the preceding character or expression zero or more times. If we wanted to find each line that contains an opening and a closing parenthesis with only letters and spaces in between, we could use the expression "([A-Za-z ]*)".
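The sketch below shows zero-or-more repetition on an invented sample file (text and variable names are assumptions, as is the use of LC_ALL=C to pin the letter ranges to ASCII). In grep's default basic regular expressions, parentheses are literal characters, so they need no escaping here:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'Each licensee (including you) must comply.' \
  'An empty pair () also matches.' \
  'Version (3.0) does not.' \
  'No parentheses here.' > "$sample"

# "(", then zero or more letters or spaces, then ")".
# "()" matches with zero repetitions; "(3.0)" fails on the digit.
paren_matches=$(LC_ALL=C grep "([A-Za-z ]*)" "$sample")

# Prints the first two lines.
printf '%s\n' "$paren_matches"
rm -f "$sample"
```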

Escaping Meta-Characters

We can escape characters by using the backslash character (\) before the character that would normally have a special meaning. For instance, if we want to find any line that begins with a capital letter and ends with a period, we could use the expression "^[A-Z].*\.$". The ending period is escaped so that it represents a literal period instead of the usual "any character" meaning.
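A sketch that combines anchors, a range, repetition, and an escaped period. The sample lines and variable names are made up, and LC_ALL=C is an assumption of this sketch to keep the A-Z range limited to ASCII:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf '%s\n' \
  'This sentence is complete.' \
  'this one starts lowercase.' \
  'This one has no period' \
  'Version 3.0' > "$sample"

# ^[A-Z]  a capital letter at the start of the line
# .*      any characters in between (unescaped period = any character)
# \.$     a literal period at the end of the line
sentence_matches=$(LC_ALL=C grep '^[A-Z].*\.$' "$sample")

# Prints only "This sentence is complete."
printf '%s\n' "$sentence_matches"
rm -f "$sample"
```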

There are many character classes that are outside the scope of this guide; please refer to our dedicated REGEX courses on this platform for more details.