In Stata, both interactively and in do-files, I use the -if- qualifier to restrict a command to a subset of the data while keeping the full dataset in memory. Take the auto dataset (loaded with -sysuse auto-) as an example. You may want to list all observations with mpg above 30 in the Results window for inspection. You can do this using:
list * if mpg > 30
Now suppose you would like to narrow the selection further using the values of foreign and rep78. You could end up with something like:
list * if mpg > 30 & foreign == 1 & rep78 == 5
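The same strategy using -if- can also be used to run a command on a subset of your data. For example, regressing mpg on weight for foreign-made vehicles with a 1978 repair record (rep78) of 3 or higher, where the !missing(rep78) term is needed because Stata treats numeric missing values as larger than any number:
reg mpg weight if foreign == 1 & rep78 >= 3 & !missing(rep78)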
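A quick way to see how missing values affect a condition like this is to count the observations it selects with and without an explicit missing check. This is only a sketch on the auto dataset; any difference between the two counts is the number of observations where rep78 is missing.

```stata
sysuse auto, clear

* Numeric missing values sort above every number, so this also counts
* observations where rep78 is missing
count if rep78 >= 3

* Excluding missing values explicitly
count if rep78 >= 3 & !missing(rep78)
```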
Regular expressions can help when, for example, you want to regress mpg on weight for all Datsuns, Pontiacs and Toyotas. Using the -if- strategy above, you would have to type:
reg mpg weight if make == "Datsun 200" | make == "Datsun 210" | make == "Datsun 510" .........etc.
You can see it starts to get quite long. One could use -regexs- and -regexm- to create two new variables, manufacturer and model, out of make, but then you’d still have to list the three manufacturers in the -if- qualifier.
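For comparison, here is a minimal sketch of that two-step approach. It assumes the manufacturer is simply the first word of make; the variable name manufacturer and the pattern are illustrative choices. Note that the auto dataset abbreviates Pontiac as "Pont.".

```stata
sysuse auto, clear

* Capture everything up to the first space in make (assumed to be the
* manufacturer); regexs(1) returns the first parenthesized subexpression
generate str18 manufacturer = regexs(1) if regexm(make, "^([^ ]+)")

* Inspect the extracted values before relying on them
tabulate manufacturer

* The three manufacturers still have to be spelled out in the condition
reg mpg weight if inlist(manufacturer, "Datsun", "Pont.", "Toyota")
```

This is worthwhile if you need the manufacturer variable elsewhere, but for a one-off subset it is more setup than the task needs.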
The regex hack I propose here is to use -regexm- instead of a long chain of equality conditions. Instead of the code above, one could write:
reg mpg weight if regexm(make, "(Datsun|Pont|Toyota)")
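which will match any make containing one of the three terms inside the parentheses, separated by the | symbol. The pattern uses Pont rather than Pontiac because the auto dataset abbreviates that manufacturer as "Pont.".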
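Because -regexm- matches the pattern anywhere in the string, and is case sensitive, it can be worth listing the matched observations before running the model. The sketch below also adds a ^ anchor so that each term must appear at the start of make; the anchor is my suggestion rather than a requirement, and in the auto dataset it should select the same cars.

```stata
sysuse auto, clear

* Check which makes the pattern actually selects
list make if regexm(make, "^(Datsun|Pont|Toyota)")

* Same regression, with the terms anchored to the start of make
reg mpg weight if regexm(make, "^(Datsun|Pont|Toyota)")
```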
I have been using -regexm- quite a bit lately, and I’ve found that besides saving some typing, it makes do-files shorter and easier to follow.