Rain Data: Using Stata to automate the creation and labelling of each variable through looping
Data work often requires repeating the same steps again and again. Repeated steps are an ideal reason to use the programming capabilities of statistical software. If the computer does the work that can be automated, one saves time and avoids errors. In short, repetitive tasks are best left to software: this is faster and less prone to mistakes.
A common task is to rename or create new variables, often a large number of variables. In this exercise, we use a fictional dataset to provide a simple example of how to use the programming capabilities of Stata to create several variables.
The data set is called “rain data” and includes precipitation figures (in inches) for 10 cities from 2000 to 2003.
First, suppose you would like to create a variable that lists the highest precipitation record across the 4 years for each city. One could look at each row and copy the highest number, but this is time consuming and prone to mistakes (just imagine the data set includes 1000 cities!). I use the egen command in Stata with the rowmax function to have Stata search across each row (so each city), return the maximum number and place it in the new variable. The egen command is very powerful and offers a large number of useful functions.
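The row-wise maximum described above can be sketched as follows. The variable names (city, rain2000 to rain2003, maxrain) and the three rows of toy values are assumptions made for illustration; the actual “rain data” file may use different names and contains 10 cities.

```stata
* Hypothetical toy data: names and values are invented for illustration only
clear
input str10 city rain2000 rain2001 rain2002 rain2003
"CityA" 30.5 36.5  28 40.25
"CityB" 22.5 19.75 25 21.5
"CityC" 35   35.5  33 37
end

* For each row (city), store the largest value found across the four year variables
egen maxrain = rowmax(rain2000 rain2001 rain2002 rain2003)
label variable maxrain "Highest precipitation, 2000-2003 (inches)"

list city maxrain
```

Because rowmax() takes a variable list, the same line works unchanged whether the data set holds 10 cities or 1000.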
Assume now that you would like to create a variable for each year and each city that is 1 if the precipitation is higher than 35 inches (and 0 otherwise). Because the data set includes 4 years of data (2000, 2001, 2002 and 2003), I will have to generate 4 new variables. This can be done with the generate command and the if qualifier, which takes 4 lines of code. Labelling each new variable takes another 4 lines. Now imagine the data set included 40 years of data: this becomes cumbersome and a source of potential errors, with 80 lines of code! I use the forvalues loop to have Stata run through the years 2000 to 2003, generate 4 new variables with the condition of precipitation above 35 inches, and label each variable. The forvalues command sets a local macro to each element of a range in turn and executes the commands enclosed in braces. In this case, the enclosed commands generate and label the new variables. The loop accomplishes the same task in 3 lines of code plus a closing brace (instead of 8), with much less risk of human error in repeated tasks.
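A minimal sketch of such a loop is shown here. The variable names (rain2000 to rain2003, high2000 to high2003), the macro name y and the toy values are assumptions for illustration. The !missing() guard is an addition that the text does not mention: Stata treats missing values as larger than any number, so without the guard a missing value would be coded as 1.

```stata
* Hypothetical toy data: names and values are invented for illustration only
clear
input str10 city rain2000 rain2001 rain2002 rain2003
"CityA" 30.5 36.5  28 40.25
"CityB" 22.5 19.75 25 21.5
"CityC" 35   35.5  33 37
end

* The local macro y takes the values 2000, 2001, 2002, 2003 in turn.
* Each pass creates one 0/1 indicator and labels it; missing input stays missing.
forvalues y = 2000/2003 {
    generate byte high`y' = (rain`y' > 35) if !missing(rain`y')
    label variable high`y' "Precipitation above 35 inches in `y'"
}

list city high2000-high2003
```

Extending the range to 40 years only requires changing the bounds in the forvalues line; the body of the loop stays the same.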