A regex hack to simplify subsetting using the -if- statement

In Stata, both interactively and in do-files, I use the -if- statement to work on a subset of a dataset while keeping the full dataset in memory. Take the auto dataset as an example. You may want to list all observations with mpg above 30 in the terminal for inspection. You can do this using:

list * if mpg > 30
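
The examples in this post assume the auto dataset that ships with Stata is already in memory. A minimal sketch of loading it and narrowing the listing to a handful of variables (the choice of variables is mine, purely for illustration):

```stata
* Load the example dataset shipped with Stata, replacing any data in memory
sysuse auto, clear

* Count the matching observations before listing them
count if mpg > 30

* List only a few variables for the matching observations
list make mpg foreign rep78 if mpg > 30
```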

Now suppose you would like to subset your data further using the values of foreign and rep78. You could end up with something like:

list * if mpg > 30 & foreign == 1 & rep78 == 5

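The same strategy using -if- can also be used to run a command on a subset of your data. For example, regressing mpg on weight for foreign-made vehicles with a 1978 repair record (rep78) of 3 or higher. Because Stata treats missing values as larger than any number, rep78 >= 3 on its own would also pick up cars with a missing repair record, so those are excluded explicitly:

reg mpg weight if foreign == 1 & rep78 >= 3 & !missing(rep78)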

Where regular expressions can help is with conditions on string variables. Suppose, for example, that one wanted to regress mpg on weight for all Datsuns, Pontiacs and Toyotas. Using the -if- statement strategy above, one would have to type:

reg mpg weight if make == "Datsun 200" | make == "Datsun 210" | make == "Datsun 510" .........etc.

You can see it starts to get quite long. One could use -regexm- and -regexs- to create two new variables, manufacturer and model, out of make, but then you’d still have to list the three manufacturers in the -if- statement.
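
For reference, a rough sketch of that two-variable approach. The patterns are my own assumption about how make is laid out (manufacturer first, then a space, then the model), and I am assuming the auto dataset's "Pont." abbreviation for Pontiac:

```stata
sysuse auto, clear

* First word of make: everything up to the first space
generate manufacturer = regexs(1) if regexm(make, "^([^ ]+)")

* Everything after the first space; left empty for one-word makes
generate model = regexs(2) if regexm(make, "^([^ ]+) (.*)$")

* The three manufacturers still have to be spelled out in the condition
regress mpg weight if inlist(manufacturer, "Datsun", "Pont.", "Toyota")
```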

The regex hack I propose here is to use -regexm- instead of a long chain of conditions in the -if- statement. Instead of the code above, one could write:

reg mpg weight if regexm(make, "(Datsun|Pont|Toyota)")

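which matches any value of make containing one of the three terms inside the parentheses, with the terms separated by the | symbol. The pattern uses Pont rather than Pontiac because the auto dataset abbreviates that manufacturer as “Pont.”.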
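
Since the pattern is unanchored, the terms can match anywhere in make. A small sketch of checking what the pattern selects before running the regression, plus an anchored variant (the ^ anchor is my addition, not part of the one-liner above):

```stata
sysuse auto, clear

* See which makes the pattern picks up
list make if regexm(make, "(Datsun|Pont|Toyota)")

* Anchor the pattern so a term must appear at the start of make
count if regexm(make, "^(Datsun|Pont|Toyota)")

* Negate the condition to run the regression on every other make
regress mpg weight if !regexm(make, "^(Datsun|Pont|Toyota)")
```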

I have been using -regexm- quite a bit lately and I’ve found that besides saving some typing time, do-files are shorter and easier to follow.

Good luck!


4 Responses to A regex hack to simplify subsetting using the -if- statement

  1. Nick Cox says:

    Your example works fine. An alternative in this case is

    . regress mpg weight if inlist(word(make, 1), "Datsun", "Pontiac", "Honda")

    As in your code using -regexm()-, negation is also often useful

    . regress mpg weight if !inlist(word(make, 1), "Datsun", "Pontiac", "Honda")

  2. andrew says:

    Thanks Nick! The advantage to using -inlist()- over -regexm()- would be that it’s a little easier to use, especially if one doesn’t know regex very well.

    One disadvantage might be that -inlist()- expects an exact match to the arguments supplied. Using inlist(word(make, 1), "Pontiac") will match records with the following strings stored in `make`: “Pontiac”, “Pontiac Sunbird”, etc. But this command will not match “Sunbird, Pontiac” or “Pont. Sunbird”. A regex of the form regexm(make, "Pont") will match all of the examples I mentioned above, but has the downside of also matching records like “Pontius Pilate”, if his name were in the dataset by mistake.

    I guess the key is to always take an extra few moments to inspect what your matches return no matter the method used.
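
    One way to do that inspection, as a rough sketch: flag the matches from each method in an indicator variable and compare the two. The variable names via_inlist and via_regex are made up for this illustration.

    ```stata
    sysuse auto, clear

    * 1 if the first word of make is exactly "Pontiac", 0 otherwise
    generate byte via_inlist = inlist(word(make, 1), "Pontiac")

    * 1 if make contains "Pont" anywhere, 0 otherwise
    generate byte via_regex = regexm(make, "Pont")

    * Cross-tabulate the two flags, then list the records where they disagree
    tabulate via_inlist via_regex
    list make if via_inlist != via_regex
    ```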

  3. Nick Cox says:

    I agree. I did say “in this case”. As you say, the regular expression doesn’t have the implication that any of the strings specified must be the first word.

  4. Pingback: Regex, sql files, and panel data recoding on Statabytes | AndrewDyck.com
