SOCY7709: Quantitative Data Management
Instructor: Natasha Sarkisian
String Variables, Dates, and Formats for Numeric Variables
String Variable Formats
A string is a sequence of characters and is typically enclosed in double quotes. When considering
whether a string constitutes a distinct value (i.e., different from another string), capitalization
matters, and leading and trailing spaces – spaces before or after the text – matter as well. As we
already saw, numbers can be treated as strings.
String variables can be up to 244 characters long in Stata 12; the corresponding storage types are
str1 … str244. In Stata 13 and 14, string variables can contain up to two billion characters:
str1 … str2045 define fixed-length strings of up to 2045 characters, and strL defines a very long
string. If you display a string that contains a lot of text with the list command, Stata shows only
as many characters as fit the width of your screen and cuts off the rest. To see more (up to 2045
characters), use the notrim option:
. list stringvar, notrim
To see all the text for a given observation, however long it is (it can run to many pages given the
limits in recent Stata versions):
. display _asis stringvar[5]
When we refer to string values, we usually use quotes to delimit the beginning and the end. If
there are leading spaces in a quote, we’d need to include them to get the exact match, e.g., “
text” is a different string from “text” so we need to include spaces to refer to that first value.
When we enter data in Stata, we can enter them without quotes, but in that case, any leading or
trailing spaces will be automatically removed. If a string is entered in quotes, it is accepted as is.
In addition to regular double quotes "" for enclosing strings, Stata also allows compound double
quotes: `" and "'. That is, instead of typing "text", you can type `"text"' – the second version is
used in programming because it allows for that quoted string to itself contain double quotes within
it (without compound quotes, that is not possible because Stata would think we are ending the
string whenever the quotation mark appeared).
Missing values for strings are coded using null string – "" – not using either a period "." or a
blank space " ".
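As a minimal sketch of the points above (the variable names name, exact, and isempty and the toy values are made up for illustration and are not part of our dataset), the following shows that capitalization and leading spaces produce distinct values and that a missing string is the empty string:

```stata
* Toy data: four string values
clear
set obs 4
gen str10 name = "text"
replace name = " text" in 2
replace name = "Text" in 3
replace name = "" in 4
* Only the first observation is an exact match for "text"
gen byte exact = (name == "text")
* A missing string is the empty string, not "." or " "
gen byte isempty = (name == "")
list
```

Only observation 1 should have exact equal to 1, and only observation 4 should have isempty equal to 1.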
String variables can be formatted for display in different ways:
%fmt      Description               Example
-------------------------------------------------------
%#s       right-justified string    %15s
%-#s      left-justified string     %-20s
%~#s      centered string           %~12s
-------------------------------------------------------
The centered format is for use with display only.
Basic Operations with Strings
Converting strings to numbers and vice versa:
We already learned to use tostring and destring to convert between strings and numbers. Those
commands are useful when a variable that actually contains numbers needs to move between string
and numeric storage. If you want to convert every string variable in your dataset that contains
numbers saved as strings (e.g., because many variables in the dataset provided to you were created
that way even though they are truly numeric), you can also use destring without specifying
variables:
. destring, replace
Sometimes, most of the variable values are numeric, but some values are text-based – e.g., if
missing values are coded as X. In such cases, we can use ignore option:
. destring, replace ignore(X)
If a string variable contains nonnumeric characters that are not listed in the ignore option, it
is left unchanged (unless the force option is also specified, in which case the nonnumeric values
become missing). Note that if a cell contains both a number and a character we asked to ignore,
the character is dropped and the number is kept; e.g., if a string variable holds percentage
values coded as “58%”, we can use:
. destring varname, gen(varname_v2) ignore(%)
or
. destring varname, gen(varname_v2) percent
The latter option also divides values by 100, turning them into proportions.
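To see the difference between the two options on concrete values, here is a small self-contained sketch; the variable names pct, pct_num, and pct_prop and the three toy values are assumptions made for illustration only:

```stata
* Toy data: percentages stored as strings
clear
set obs 3
gen str5 pct = "58%"
replace pct = "7%" in 2
replace pct = "100%" in 3
* ignore(): drop the percent sign and keep the number (58, 7, 100)
destring pct, gen(pct_num) ignore("%")
* percent: drop the percent sign and divide by 100 (.58, .07, 1)
destring pct, gen(pct_prop) percent
list
```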
Another way to convert between string and numeric formats is to use the functions string(n,s) and real(s):
. sum id
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
id | 2765 1383 798.3311 1 2765
. gen stringid=string(id, "%05.0f")
. list stringid in 1/10
+----------+
| stringid |
|----------|
1. | 00001 |
2. | 00002 |
3. | 00003 |
4. | 00004 |
5. | 00005 |
|----------|
6. | 00006 |
7. | 00007 |
8. | 00008 |
9. | 00009 |
10. | 00010 |
+----------+
. gen idreal=real(stringid)
. sum idreal id
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
idreal | 2765 1383 798.3311 1 2765
id | 2765 1383 798.3311 1 2765
Or for date:
. gen datestring=string(date, "%td")
. list datestring in 1/10
+-----------+
| datestr~g |
|-----------|
1. | 30may2002 |
2. | 03jun2002 |
3. | 23feb2002 |
4. | 26mar2002 |
5. | 04may2002 |
|-----------|
6. | 02jun2002 |
7. | 21feb2002 |
8. | 09may2002 |
9. | 07may2002 |
10. | 13may2002 |
+-----------+
Formats (such as %05.0f and %td in the examples above) can be stored separately as a separate
string variable if different formats are desired for different observations.
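A minimal sketch of that idea, assuming two made-up variables (value holds the numbers and fmt holds the display format to apply to each observation):

```stata
* Toy data: each observation carries its own display format
clear
set obs 2
gen double value = 3.14159
replace value = 42 in 2
gen str8 fmt = "%4.2f"
replace fmt = "%05.0f" in 2
* The second argument of string() is taken from the fmt variable
gen shown = string(value, fmt)
list
```

Here shown should be "3.14" in the first observation and "00042" in the second.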
But if we are dealing with variables where there are actual words, these commands are less
useful. Instead, we could use encode and decode.
. tab marital, nol
marital |
status | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,269 45.90 45.90
2 | 247 8.93 54.83
3 | 445 16.09 70.92
4 | 96 3.47 74.39
5 | 708 25.61 100.00
------------+-----------------------------------
Total | 2,765 100.00
. decode marital, gen(marstring)
. tab marstring
marital |
status | Freq. Percent Cum.
--------------+-----------------------------------
divorced | 445 16.09 16.09
married | 1,269 45.90 61.99
never married | 708 25.61 87.59
separated | 96 3.47 91.07
widowed | 247 8.93 100.00
--------------+-----------------------------------
Total | 2,765 100.00
. tab marstring, nol
marital |
status | Freq. Percent Cum.
--------------+-----------------------------------
divorced | 445 16.09 16.09
married | 1,269 45.90 61.99
never married | 708 25.61 87.59
separated | 96 3.47 91.07
widowed | 247 8.93 100.00
--------------+-----------------------------------
Total | 2,765 100.00
. des marstring
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
marstring str13 %13s marital status
. sum marstring
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
marstring | 0
And going back:
. encode marstring, gen(marnumeric)
. tab marnumeric
marital |
status | Freq. Percent Cum.
--------------+-----------------------------------
divorced | 445 16.09 16.09
married | 1,269 45.90 61.99
never married | 708 25.61 87.59
separated | 96 3.47 91.07
widowed | 247 8.93 100.00
--------------+-----------------------------------
Total | 2,765 100.00
. tab marnumeric, nol
marital |
status | Freq. Percent Cum.
------------+-----------------------------------
1 | 445 16.09 16.09
2 | 1,269 45.90 61.99
3 | 708 25.61 87.59
4 | 96 3.47 91.07
5 | 247 8.93 100.00
------------+-----------------------------------
Total | 2,765 100.00
In contrast, if we tried to apply destring:
. destring marstring, gen(test)
marstring contains nonnumeric characters; no generate
. destring marstring, gen(test) force
marstring contains nonnumeric characters; test generated as byte
(2765 missing values generated)
. tab test
no observations
Combining strings:
The following function of egen concatenates multiple variables to produce one string variable:
concat(varlist) [, format(%fmt) decode maxlength(#) punct(pchars)]
The values of string variables are unchanged. Values of numeric variables are converted to
string, as is, or are converted using a numeric format under the format(%fmt) option or decoded
under the decode option, in which case maxlength() may also be used to control the maximum
label length used. By default, variables are added end to end: punct(pchars) may be used to
specify punctuation, such as a space, punct(" "), or a comma, punct(,).
. egen text=concat(age hrs1 marstring), format(%2.1f) decode punct(,)
. list text in 1/10
+-------------------+
| text |
|-------------------|
1. | 27.0,20.0,married |
2. | 37.0,49.0,married |
3. | 33.0,44.0,married |
4. | 62.0,20.0,married |
5. | 70.0,3.0,married |
|-------------------|
6. | 29.0,43.0,married |
7. | 25.0,49.0,married |
8. | 58.0,.,married |
9. | 51.0,37.0,married |
10. | 43.0,.,married |
+-------------------+
Cleaning strings:
itrim(s) – collapses multiple consecutive blank spaces inside a string into a single space; trim(s)
– removes all leading spaces (spaces before the actual text) and trailing spaces (those after the
text); ltrim(s) – removes leading spaces only; rtrim(s) – removes trailing spaces only. E.g.:
. gen testvar=" We love Stata"
. gen testtrim=ltrim(testvar)
. tab testtrim
testtrim | Freq. Percent Cum.
--------------------------+-----------------------------------
We love Stata | 2,765 100.00 100.00
--------------------------+-----------------------------------
Total | 2,765 100.00
. replace testtrim=itrim(testvar)
(2765 real changes made)
. tab testtrim
testtrim | Freq. Percent Cum.
--------------------------+-----------------------------------
We love Stata | 2,765 100.00 100.00
--------------------------+-----------------------------------
Total | 2,765 100.00
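Because extra spaces are hard to see in tabulated output, it can help to check string lengths. The sketch below uses a made-up value with three leading spaces, three spaces between words, and two trailing spaces; all variable names are hypothetical:

```stata
clear
set obs 1
gen raw = "   We   love   Stata  "
* Remove leading spaces only, trailing spaces only, or both
gen lead_off  = ltrim(raw)
gen trail_off = rtrim(raw)
gen both_off  = trim(raw)
* Trim both ends, then collapse runs of internal spaces to one
gen tidy = itrim(trim(raw))
* Lengths make the removed spaces visible
gen len_both = length(both_off)
gen len_tidy = length(tidy)
list tidy len_both len_tidy
```

With this value, tidy should be "We love Stata" (13 characters), while both_off still contains the internal runs of spaces (17 characters).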
Changing case:
lower(s) converts strings into lowercase, and upper(s) converts strings into uppercase. proper(s)
makes the first letter capitalized, and capitalizes any other letters immediately following any
characters that are not letters (e.g. space, period, etc.). E.g.:
. gen marstring_up=upper(marstring)
. tab marstring_up
marstring_up | Freq. Percent Cum.
--------------+-----------------------------------
DIVORCED | 445 16.09 16.09
MARRIED | 1,269 45.90 61.99
NEVER MARRIED | 708 25.61 87.59
SEPARATED | 96 3.47 91.07
WIDOWED | 247 8.93 100.00
--------------+-----------------------------------
Total | 2,765 100.00
. gen city="st.louis" in 1/100
(2665 missing values generated)
. replace city="new york" in 101/200
(100 real changes made)
. replace city="boston" in 201/300
(100 real changes made)
. replace city=proper(city)
(300 real changes made)
. tab city
city | Freq. Percent Cum.
------------+-----------------------------------
Boston | 100 33.33 33.33
New York | 100 33.33 66.67
St.Louis | 100 33.33 100.00
------------+-----------------------------------
Total | 300 100.00
Measuring strings:
length(s) function evaluates the length of a string. wordcount(s) evaluates the number of words
in the string (a word is defined as a set of characters that start and terminate with spaces, start
with the beginning of the string, or terminate with the end of the string). E.g.:
. gen marlength=length(marstring)
. tab marlength
marlength | Freq. Percent Cum.
------------+-----------------------------------
7 | 1,516 54.83 54.83
8 | 445 16.09 70.92
9 | 96 3.47 74.39
13 | 708 25.61 100.00
------------+-----------------------------------
Total | 2,765 100.00
. gen marcount=wordcount(marstring)
. tab marcount
marcount | Freq. Percent Cum.
------------+-----------------------------------
1 | 2,057 74.39 74.39
2 | 708 25.61 100.00
------------+-----------------------------------
Total | 2,765 100.00
Advanced Operations with Strings
Changing strings:
We already learned that you can add two strings using “+”. You can also multiply strings by a
number to duplicate the same text:
. replace city=city*3
city was str8 now str24
(300 real changes made)
. tab city
city | Freq. Percent Cum.
-------------------------+-----------------------------------
BostonBostonBoston | 100 33.33 33.33
New YorkNew YorkNew York | 100 33.33 66.67
St.LouisSt.LouisSt.Louis | 100 33.33 100.00
-------------------------+-----------------------------------
Total | 300 100.00
abbrev(s,n) -- abbreviates s to n characters, where n must be between 5 and 32:
. gen marabb=abbrev(marstring, 6)
. tab marabb
marabb | Freq. Percent Cum.
------------+-----------------------------------
divo~d | 445 16.09 16.09
marr~d | 1,269 45.90 61.99
neve~d | 708 25.61 87.59
sepa~d | 96 3.47 91.07
wido~d | 247 8.93 100.00
------------+-----------------------------------
Total | 2,765 100.00
reverse(s) – reverses the text:
. gen marrev=reverse(marstring)
. tab marrev
marrev | Freq. Percent Cum.
--------------+-----------------------------------
decrovid | 445 16.09 16.09
deirram | 1,269 45.90 61.99
deirram reven | 708 25.61 87.59
detarapes | 96 3.47 91.07
dewodiw | 247 8.93 100.00
--------------+-----------------------------------
Total | 2,765 100.00
Searching in strings:
strmatch(s1,s2) -- returns 1 if s1 matches the pattern s2; otherwise, it returns 0. In s2, "?" means
that one character goes here, and "*" means that any number of characters (including possibly
zero characters) go here.
. gen match=strmatch(marstring, "married")
. tab marstring match
marital | match
status | 0 1 | Total
--------------+----------------------+----------
divorced | 445 0 | 445
married | 0 1,269 | 1,269
never married | 708 0 | 708
separated | 96 0 | 96
widowed | 247 0 | 247
--------------+----------------------+----------
Total | 1,496 1,269 | 2,765
. gen match2=strmatch(marstring, "*married")
. tab marstring match2
marital | match2
status | 0 1 | Total
--------------+----------------------+----------
divorced | 445 0 | 445
married | 0 1,269 | 1,269
never married | 0 708 | 708
separated | 96 0 | 96
widowed | 247 0 | 247
--------------+----------------------+----------
Total | 788 1,977 | 2,765
strpos(s1,s2) -- returns the number indicating the position in s1 at which s2 is first found; if s2 is
not found, it returns 0.
. gen marpos=strpos(marstring, "married")
. tab marpos
marpos | Freq. Percent Cum.
------------+-----------------------------------
0 | 788 28.50 28.50
1 | 1,269 45.90 74.39
7 | 708 25.61 100.00
------------+-----------------------------------
Total | 2,765 100.00
. gen marpos2=strpos(marstring, "ed")
. tab marpos2
marpos2 | Freq. Percent Cum.
------------+-----------------------------------
6 | 1,516 54.83 54.83
7 | 445 16.09 70.92
8 | 96 3.47 74.39
12 | 708 25.61 100.00
------------+-----------------------------------
Total | 2,765 100.00
subinstr(s1,s2,s3,n) – takes s1 and replaces the first n occurrences of s2 within s1 with s3. If n is
missing (.), all occurrences are replaced.
. gen marsub=subinstr(marstring, "ed", "ing", .)
. tab marsub
marsub | Freq. Percent Cum.
---------------+-----------------------------------
divorcing | 445 16.09 16.09
marriing | 1,269 45.90 61.99
never marriing | 708 25.61 87.59
separating | 96 3.47 91.07
widowing | 247 8.93 100.00
---------------+-----------------------------------
Total | 2,765 100.00
subinword(s1,s2,s3,n) – takes string s1 and replaces the first n occurrences of s2 as a word (i.e.,
space separated entity) within s1 with s3. If n is missing (.), all occurrences are replaced.
. gen marchanged=subinword(marstring, "married", "wedded",.)
. tab marchanged
marchanged | Freq. Percent Cum.
-------------+-----------------------------------
divorced | 445 16.09 16.09
never wedded | 708 25.61 41.70
separated | 96 3.47 45.17
wedded | 1,269 45.90 91.07
widowed | 247 8.93 100.00
-------------+-----------------------------------
Total | 2,765 100.00
For a way to do more complex search and substitution operations, also see functions regexm,
regexr, and regexs (need to know how to work with so-called regular expressions).
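As a brief illustration of these three functions, here is a sketch that finds, extracts, and replaces the first run of digits in a string. The variable addr and its values are made up for the example, and the pattern [0-9]+ means "one or more digits":

```stata
clear
set obs 3
gen str20 addr = "12 Main St"
replace addr = "Apt 7" in 2
replace addr = "none" in 3
* regexm() returns 1 if the pattern is found and 0 otherwise
gen byte hasnum = regexm(addr, "[0-9]+")
* regexs(0) returns the text matched by the preceding regexm()
gen num = regexs(0) if regexm(addr, "[0-9]+")
* regexr() replaces the first match with the given text
gen masked = regexr(addr, "[0-9]+", "#")
list
```

For these values, num should be "12", "7", and an empty string, and masked should be "# Main St", "Apt #", and "none".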
Extracting portion of strings:
substr(s,n1,n2) – we already used this one when working with dates; it returns the substring of s,
starting at position n1, for a length of n2. If n1 is negative, then the position is counted from the
end of the string; n2 cannot be negative. If n2 is . (missing), the entire remaining portion of the
string starting at position n1 is returned.
word(s, n) - returns the nth word in the string s. Positive numbers count words from the
beginning of the string, and negative numbers count words from the end of string. E.g.:
. gen marstring2=word(marstring, 2)
(2057 missing values generated)
. tab marstring2
marstring2 | Freq. Percent Cum.
------------+-----------------------------------
married | 708 100.00 100.00
------------+-----------------------------------
Total | 708 100.00
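The rules for substr and word, including negative arguments, can be checked quickly with display; this is only a sketch on a literal string rather than on our dataset:

```stata
* substr(s, n1, n2): start at position n1, take n2 characters
display substr("never married", 1, 5)
* A negative n1 counts the starting position from the end of the string
display substr("never married", -7, 3)
* n2 = . returns everything from position n1 to the end
display substr("never married", 7, .)
* word(s, n): a negative n counts words from the end
display word("never married", -1)
```

These four commands should display never, mar, married, and married, respectively.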
The egen function ends gives the first word of a string with the “head” option, the last word with
the “last” option, or everything EXCEPT the first word with the “tail” option (if nothing follows
the first space or the first punctuation sign of choice, the result is an empty string). How the
words are separated is determined by the punct option: the default is a space, but it can be a
comma, etc. For example, if a string variable held values such as “rock, paper, scissors”, we
could extract the last item as follows (marstring is used here only as a stand-in variable name):
. egen marstring3=ends(marstring), last punct(,) trim
Note that the trim option removes any leading or trailing spaces.
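Since marstring contains no commas, a toy variable shows the three options more clearly. This is a sketch under the assumption of a made-up variable game; the names head_part, last_part, and tail_part are also hypothetical:

```stata
clear
set obs 1
gen str30 game = "rock, paper, scissors"
* Everything before the first comma
egen head_part = ends(game), head punct(,) trim
* Everything after the last comma
egen last_part = ends(game), last punct(,) trim
* Everything after the first comma
egen tail_part = ends(game), tail punct(,) trim
list
```

We would expect head_part to be "rock", last_part to be "scissors", and tail_part to be "paper, scissors".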
Nesting Operations
You can nest operations in one another. For example:
. gen maroperate=proper(abbrev(reverse(marstring*2), 15))
. tab maroperate
maroperate | Freq. Percent Cum.
----------------+-----------------------------------
Decroviddecro~D | 445 16.09 16.09
Deirram Reven~N | 708 25.61 41.70
Deirramdeirram | 1,269 45.90 87.59
Detarapesdeta~S | 96 3.47 91.07
Dewodiwdewodiw | 247 8.93 100.00
----------------+-----------------------------------
Total | 2,765 100.00
Or we can convert ID into string with some leading zeros, and then select the portion that omits
the last two digits:
. gen newid = substr(string(id, "%05.0f"), 1, length(string(id, "%05.0f")) - 2)
. list newid in 1/10
+-------+
| newid |
|-------|
1. | 000 |
2. | 000 |
3. | 000 |
4. | 000 |
5. | 000 |
|-------|
6. | 000 |
7. | 000 |
8. | 000 |
9. | 000 |
10. | 000 |
+-------+
. tab newid
newid | Freq. Percent Cum.
------------+-----------------------------------
000 | 99 3.58 3.58
001 | 100 3.62 7.20
002 | 100 3.62 10.81
003 | 100 3.62 14.43
004 | 100 3.62 18.05
005 | 100 3.62 21.66
006 | 100 3.62 25.28
007 | 100 3.62 28.90
008 | 100 3.62 32.51
009 | 100 3.62 36.13
010 | 100 3.62 39.75
011 | 100 3.62 43.36
012 | 100 3.62 46.98
013 | 100 3.62 50.60
014 | 100 3.62 54.21
015 | 100 3.62 57.83
016 | 100 3.62 61.45
017 | 100 3.62 65.06
018 | 100 3.62 68.68
019 | 100 3.62 72.30
020 | 100 3.62 75.91
021 | 100 3.62 79.53
022 | 100 3.62 83.15
023 | 100 3.62 86.76
024 | 100 3.62 90.38
025 | 100 3.62 94.00
026 | 100 3.62 97.61
027 | 66 2.39 100.00
------------+-----------------------------------
Total | 2,765 100.00
Dealing with Date Variables
Stata stores dates as the number of units elapsed since January 1, 1960; the units can be
milliseconds, days, weeks, months, quarters, or half-years. So if we want to be able to use date
procedures in Stata (e.g., calculate the number of months between some events), we should store
date variables in Stata format. Coding and interpretation of date and time values in Stata are as
follows:
+---------------------------------------------------------------------
| | | ----- Numerical value & interpretation ------
| Format | Meaning | Value = -1 | Value = 0 | Value = 1
|--------+------------+---------------+---------------+---------------
| %tc | clock | 31dec1959 | 01jan1960 | 01jan1960
| | | 23:59:59.999 | 00:00:00.000 | 00:00:00.001
| | | | |
| %td | days | 31dec1959 | 01jan1960 | 02jan1960
| | | | |
| %tw | weeks | 1959w52 | 1960w1 | 1960w2
| | | | |
| %tm | months | 1959m12 | 1960m1 | 1960m2
| | | | |
| %tq | quarters | 1959q4 | 1960q1 | 1960q2
| | | | |
| %th | half-years | 1959h2 | 1960h1 | 1960h2
| | | | |
| %tg | generic | -1 | 0 | 1
| | | | |
| %ty | year | value is the calendar year itself (e.g., 2002), not an offset from 1960
| | | | |
| %tC | clock | 31dec1959 | 01jan1960 | 01jan1960
| | | 23:59:59.999 | 00:00:00.000 | 00:00:00.001
+---------------------------------------------------------------------
(Note: %tC with capital C includes leap seconds).
We will work with the interview date variable in GSS 2002 as an example.
DATEINTV
Date of interview
Survey Question: Date of interview.
Range of Valid Numeric Responses
Minimum value= 1 Maximum value= 9998
Response Categories
Category Label Frequency
0 Not applicable 0
9999 Not available 18
Column: 1276 Width: 4 Type: numeric
Text: REMARKS: This variable consists of the month (Cols. 5734-5735) and date (Cols. 5736-
5737) on which the interview was conducted. Collapsed information by month is listed above for
convenience of display only.
. sum dateintv
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
dateintv | 2747 383.1736 120.219 206 626
One way to manage this would be to split the original variable into day and month and use a
numeric importing function. To split it, it might be easier to treat it as a string, so we convert the
original variable into a string using the tostring command:
. tostring dateintv, gen(datestr2)
datestr2 generated as str3
. gen month=substr(datestr2, 1, 1)
. gen day=substr(datestr2, 2, 2)
(18 missing values generated)
. tab month
month | Freq. Percent Cum.
------------+-----------------------------------
. | 18 0.65 0.65
2 | 557 20.14 20.80
3 | 745 26.94 47.74
4 | 703 25.42 73.16
5 | 526 19.02 92.19
6 | 216 7.81 100.00
------------+-----------------------------------
Total | 2,765 100.00
. tab day, m
day | Freq. Percent Cum.
------------+-----------------------------------
| 18 0.65 0.65
01 | 79 2.86 3.51
02 | 77 2.78 6.29
03 | 60 2.17 8.46
04 | 79 2.86 11.32
05 | 55 1.99 13.31
06 | 95 3.44 16.75
07 | 87 3.15 19.89
08 | 90 3.25 23.15
09 | 80 2.89 26.04
10 | 83 3.00 29.04
11 | 122 4.41 33.45
12 | 101 3.65 37.11
13 | 134 4.85 41.95
14 | 86 3.11 45.06
15 | 103 3.73 48.79
16 | 94 3.40 52.19
17 | 65 2.35 54.54
18 | 104 3.76 58.30
19 | 97 3.51 61.81
20 | 101 3.65 65.46
21 | 99 3.58 69.04
22 | 120 4.34 73.38
23 | 110 3.98 77.36
24 | 83 3.00 80.36
25 | 119 4.30 84.67
26 | 92 3.33 87.99
27 | 84 3.04 91.03
28 | 110 3.98 95.01
29 | 72 2.60 97.61
30 | 54 1.95 99.57
31 | 12 0.43 100.00
------------+-----------------------------------
Total | 2,765 100.00
Now we convert these back into numbers:
. destring month, replace
month has all characters numeric; replaced as byte
(18 missing values generated)
. destring day, replace
day has all characters numeric; replaced as byte
(18 missing values generated)
. tab month, m
month | Freq. Percent Cum.
------------+-----------------------------------
2 | 557 20.14 20.14
3 | 745 26.94 47.09
4 | 703 25.42 72.51
5 | 526 19.02 91.54
6 | 216 7.81 99.35
. | 18 0.65 100.00
------------+-----------------------------------
Total | 2,765 100.00
. sum day
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
day | 2747 15.97306 8.259537 1 31
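The substr split above relies on every month in these data being a single digit (February through June). As a hedged alternative that also handles two-digit months, the month and day can be separated arithmetically. The sketch below uses a made-up variable mmdd coded like dateintv (month followed by a two-digit day) rather than the GSS data:

```stata
* Toy data: 206 = Feb 6, 530 = May 30, 1013 = Oct 13, plus one missing value
clear
set obs 4
gen int mmdd = 206
replace mmdd = 530 in 2
replace mmdd = 1013 in 3
replace mmdd = . in 4
* The month is everything above the last two digits
gen byte mon = floor(mmdd/100)
* The day is the last two digits
gen byte dd = mod(mmdd, 100)
* Combine with a fixed year into a Stata daily date
gen d = mdy(mon, dd, 2002)
format d %td
list
```

The missing value carries through every step, so no separate handling is needed for it.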
We need to add the year, but such a variable already exists:
. tab year
gss year |
for this |
respondent | Freq. Percent Cum.
------------+-----------------------------------
2002 | 2,765 100.00 100.00
------------+-----------------------------------
Total | 2,765 100.00
Now we need to import date information from these three numeric variables into a single
numeric variable that is coded in the way that Stata understands; here are various possibilities of
importing date from numeric variables – the structure of the command would be, for example:
gen varname= mdyhms(M, D, Y, h, m, s)
where mdyhms is the function you use and M, D, Y, h, m, s in parentheses are replaced with
names of variables where information on each component is stored. Here are all possible
functions:
%tc | mdyhms(M, D, Y, h, m, s)
%tc | dhms(td, h, m, s)
%tc | hms(h, m, s)
|
%tC | Cmdyhms(M, D, Y, h, m, s)
%tC | Cdhms(td, h, m, s)
%tC | Chms(h, m, s)
|
%td | mdy(M, D, Y)
|
%tw | yw(Y, W)
%tm | ym(Y, M)
%tq | yq(Y, Q)
%th | yh(Y, H)
%ty | Y
So for our example:
. gen intervdate=mdy(month, day, year)
(18 missing values generated)
. sum intervdate
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
intervdate | 2747 15436.14 35.72223 15377 15517
. list intervdate in 1/10
+----------+
| interv~e |
|----------|
1. | 15490 |
2. | 15494 |
3. | 15394 |
4. | 15425 |
5. | 15464 |
|----------|
6. | 15493 |
7. | 15392 |
8. | 15469 |
9. | 15467 |
10. | 15473 |
+----------+
. format intervdate %td
. list intervdate in 1/10
+-----------+
| intervd~e |
|-----------|
1. | 30may2002 |
2. | 03jun2002 |
3. | 23feb2002 |
4. | 26mar2002 |
5. | 04may2002 |
|-----------|
6. | 02jun2002 |
7. | 21feb2002 |
8. | 09may2002 |
9. | 07may2002 |
10. | 13may2002 |
+-----------+
Here is an alternative example from HRS where the precision is only in months, not in days:
. use C:\hrs2008\stata\H08A_R.dta, clear
. gen date2008=ym(LA501, LA500)
. tab date2008
date2008 | Freq. Percent Cum.
------------+-----------------------------------
577 | 24 0.14 0.14
578 | 2,781 16.15 16.29
579 | 3,031 17.60 33.90
580 | 2,674 15.53 49.43
581 | 2,233 12.97 62.40
582 | 1,968 11.43 73.83
583 | 1,526 8.86 82.69
584 | 961 5.58 88.27
585 | 842 4.89 93.16
586 | 505 2.93 96.10
587 | 322 1.87 97.97
588 | 325 1.89 99.85
589 | 25 0.15 100.00
------------+-----------------------------------
Total | 17,217 100.00
. format date2008 %tm
. tab date2008
date2008 | Freq. Percent Cum.
------------+-----------------------------------
2008m2 | 24 0.14 0.14
2008m3 | 2,781 16.15 16.29
2008m4 | 3,031 17.60 33.90
2008m5 | 2,674 15.53 49.43
2008m6 | 2,233 12.97 62.40
2008m7 | 1,968 11.43 73.83
2008m8 | 1,526 8.86 82.69
2008m9 | 961 5.58 88.27
2008m10 | 842 4.89 93.16
2008m11 | 505 2.93 96.10
2008m12 | 322 1.87 97.97
2009m1 | 325 1.89 99.85
2009m2 | 25 0.15 100.00
------------+-----------------------------------
Total | 17,217 100.00
Sometimes, however, date information is stored as string variables, or we could convert it into
strings; such strings could contain words, not only numbers. For our example, we could convert
our original variable into a string, then prepend a string containing the year (and the leading zero
for the month):
. tostring dateintv, gen(datestr)
datestr generated as str3
. replace datestr="20020" + datestr
datestr was str3 now str8
(2765 real changes made)
Now we will generate an actual date variable in a kind of format that Stata recognizes using a so-
called mask. The conversion from string works as follows:
Format | String-to-numeric conversion function
-------+-----------------------------------------
%tc | clock(string, mask)
%tC | Clock(string, mask)
%td | date(string, mask)
%tw | weekly(string, mask)
%tm | monthly(string, mask)
%tq | quarterly(string, mask)
%th | halfyearly(string, mask)
%ty | yearly(string, mask)
%tg | no function necessary; read as numeric
-------------------------------------------------
The mask specifies the order in which the elements appear; e.g., for a string "June 3, 2010" or
"6-3-2010" we would use "MDY" -- month, day, and year; for a string "3jun2010 16:30:26", the
mask would be "DMYhms". The full commands for these examples (not in our dataset!) would
be:
. gen date2008= date(intervdate, "MDY")
. gen double time2008 = clock(time, "DMYhms")
Then we would format these:
. format date2008 %td
. format time2008 %tc
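The same masks can be tried on literal strings with display, without creating any variables. This is only a sketch; the last two lines rebuild the same values from numeric components for comparison:

```stata
* Two differently written strings, same "MDY" mask, same daily date
display %td date("June 3, 2010", "MDY")
display %td date("6-3-2010", "MDY")
* A date-and-time string read with the "DMYhms" mask
display %tc clock("3jun2010 16:30:26", "DMYhms")
* The same values built from numeric components
display %td mdy(6, 3, 2010)
display %tc dhms(mdy(6, 3, 2010), 16, 30, 26)
```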
Now back to our dataset:
. gen date= date(datestr, "YMD")
(18 missing values generated)
. sum date
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
date | 2747 15436.14 35.72223 15377 15517
Finally, we format this variable:
. format date %td
. list date in 1/10
+-----------+
| date |
|-----------|
1. | 30may2002 |
2. | 03jun2002 |
3. | 23feb2002 |
4. | 26mar2002 |
5. | 04may2002 |
|-----------|
6. | 02jun2002 |
7. | 21feb2002 |
8. | 09may2002 |
9. | 07may2002 |
10. | 13may2002 |
+-----------+
A few additional functions to work with dates include those that allow us to extract specific
components (month, week, year, etc.) from a date:
. sum date
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
date | 2747 15436.14 35.72223 15377 15517
. gen month=month(date)
(18 missing values generated)
. tab month
month | Freq. Percent Cum.
------------+-----------------------------------
2 | 557 20.28 20.28
3 | 745 27.12 47.40
4 | 703 25.59 72.99
5 | 526 19.15 92.14
6 | 216 7.86 100.00
------------+-----------------------------------
Total | 2,747 100.00
. gen yr=year(date)
(18 missing values generated)
. tab yr
yr | Freq. Percent Cum.
------------+-----------------------------------
2002 | 2,747 100.00 100.00
------------+-----------------------------------
Total | 2,747 100.00
. gen daynum=day(date)
(18 missing values generated)
. sum daynum
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
daynum | 2747 15.97306 8.259537 1 31
. gen weekday=dow(date)
(18 missing values generated)
. tab weekday
weekday | Freq. Percent Cum.
------------+-----------------------------------
0 | 213 7.75 7.75
1 | 466 16.96 24.72
2 | 430 15.65 40.37
3 | 416 15.14 55.52
4 | 439 15.98 71.50
5 | 384 13.98 85.48
6 | 399 14.52 100.00
------------+-----------------------------------
Total | 2,747 100.00
. gen week=week(date)
(18 missing values generated)
. sum week
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
week | 2747 14.18092 5.116454 6 26
Note that the first week of a year is the first 7-day period of the year.
When the dates are coded in terms of days (%td format), the following functions are available:
month(…), week(…), year(…), day(…), dow(…). When the time measure is more precise and
dates are coded in terms of milliseconds (%tc format), we can extract other components, e.g., with
hh(tc), mm(tc), and ss(tc) for the hour, minute, and second; the related functions hours(ms),
minutes(ms), and seconds(ms) convert a number of milliseconds into hours, minutes, or seconds.
There are additional functions for different levels of precision; you can look them up:
. help datetime_functions
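As a sketch of the kind of date arithmetic mentioned earlier (such as the number of months between two events), the lines below work on literal dates. The function mofd(), which converts a daily date to a monthly date, and hh() and mm(), which return the hour and minute of a %tc value, are standard Stata functions that are not otherwise used in these notes:

```stata
* Days between two daily dates: simple subtraction
display mdy(5, 30, 2002) - mdy(2, 6, 2002)
* Calendar months between them: convert each %td value to %tm first
display mofd(mdy(5, 30, 2002)) - mofd(mdy(2, 6, 2002))
* Hour and minute components of a %tc value
display hh(clock("3jun2010 16:30:26", "DMYhms"))
display mm(clock("3jun2010 16:30:26", "DMYhms"))
```

These should display 113, 3, 16, and 30.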
Numeric Variable Formats
Numbers in Stata can be stored in 5 different types of variables.
There are three integer formats:
byte – for integers from -127 to 100, ideal for categorical variables
int – integers up to about 32,000 (-32,767 to 32,740)
long – up to about 2 billion
And two formats for numbers with fractions:
float (the default) -- about 7 digits of accuracy (2^24 = 16,777,216 is the largest number
up to which all integers can be precisely stored)
double – about 16 digits of accuracy
When you create a new numeric variable and do not specify the storage type for it, the new
variable is made a float, unless you have previously used “set type” command. For example:
. gen hrs40=(hrs1>=40) if hrs1<.
. des hrs40
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------
hrs40 float %9.0g
. set type double
. drop hrs40
. gen hrs40=(hrs1>=40) if hrs1<.
(1036 missing values generated)
. des hrs40
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------
hrs40 double %10.0g
To set the default back:
. set type float
To create a specific variable whose storage type differs from the default (float), specify the type in
the gen or egen command:
. gen byte hrs40=(hrs1>=40) if hrs1<.
If you declare a variable as an integer (byte, int or long), but make it equal to something that in
fact contains fractions, the fractional part will be truncated (not rounded but just cut off!). For
example:
. sum hrs1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hrs1 | 1729 41.77675 14.62304 1 89
. gen hrs1d10=hrs1/10
(1036 missing values generated)
. sum hrs1d10
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hrs1d10 | 1729 4.177675 1.462304 .1 8.9
. gen byte hrs1d10_b=hrs1d10
(1036 missing values generated)
. sum hrs1d10_
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hrs1d10_b | 1729 3.95026 1.485213 0 8
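If rounding rather than truncation is what you want, one option is to round explicitly before storing the result in an integer type. The following is a minimal sketch; the variable names hrs1d10_r and hrs1d10_t are made up for illustration.

```stata
* round() with a single argument rounds to the nearest integer
gen byte hrs1d10_r = round(hrs1d10)

* floor() makes the truncation explicit (equivalent to truncation for nonnegative values)
gen byte hrs1d10_t = floor(hrs1d10)

* compare the original with the rounded and truncated versions
sum hrs1d10 hrs1d10_r hrs1d10_t
```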
In most cases, it doesn’t make sense to worry too much about setting the storage type – except in
those cases where the default (float) causes an undesirable loss of precision. For example, if your
IDs are very large numbers (more than 7 digits) and you store them as the default (float), they can
be rounded and therefore no longer uniquely identify individuals. Store such IDs using long or
double; saving them as a string variable is another safe option.
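The ID problem can be reproduced with a few made-up observations. This sketch assumes hypothetical 9-digit IDs and uses preserve/restore so that the data currently in memory are put back afterwards; id_f and id_l are illustrative names.

```stata
preserve
clear
set obs 3

* hypothetical 9-digit IDs that differ only in the last digit
gen float id_f = 123456789 + _n
gen long  id_l = 123456789 + _n

* show all digits instead of the default general format
format id_f id_l %12.0f
list

* check whether each version still identifies observations uniquely
duplicates report id_f
duplicates report id_l
restore
```

The long version should keep all three IDs distinct, while the float version is rounded to the nearest value a float can hold.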
Float and double can also cause problems if we want to use exact comparisons with fractions:
because of the way they are stored (in binary format), they might be a tiny bit off from, say,
the 1.3 that is displayed to us. So do your comparisons based on intervals rather than exact values
when dealing with fractions. For example:
. tab hrs1d10 if hrs1d10>6 & hrs1d10<7
hrs1d10 | Freq. Percent Cum.
------------+-----------------------------------
6.1 | 2 5.00 5.00
6.2 | 6 15.00 20.00
6.3 | 3 7.50 27.50
6.4 | 4 10.00 37.50
6.5 | 18 45.00 82.50
6.6 | 4 10.00 92.50
6.8 | 3 7.50 100.00
------------+-----------------------------------
Total | 40 100.00
. list id if hrs1d10==6.1
. list id if hrs1d10==6.2
. list id if hrs1d10==6.3
. list id if hrs1d10==6.4
. list id if hrs1d10==6.5
+------+
| id |
|------|
33. | 33 |
408. | 408 |
453. | 453 |
758. | 758 |
1105. | 1105 |
|------|
1264. | 1264 |
1340. | 1340 |
1414. | 1414 |
1520. | 1520 |
1702. | 1702 |
|------|
1947. | 1947 |
1957. | 1957 |
2096. | 2096 |
2156. | 2156 |
2269. | 2269 |
|------|
2277. | 2277 |
2327. | 2327 |
2743. | 2743 |
+------+
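In the output above, only 6.5 produces matches because it happens to have an exact binary
representation; the comparisons with 6.1 through 6.4 return nothing. This problem never occurs
with byte, int, long, or string variables, or with integer values stored as float (up to
16,777,216), so if you want to use exact conditions, multiply your variable by, say, 10, 100, or
1000 to get rid of decimals.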
. gen hrs1dm=hrs1d10*10
(1036 missing values generated)
. des hrs1dm
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------
hrs1dm float %9.0g
. list id if hrs1dm==61
+------+
| id |
|------|
865. | 865 |
2000. | 2000 |
+------+
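Besides rescaling, there are a few other ways to make such comparisons work on the original variable. The sketch below uses Stata's float(), abs(), and inrange() functions; the tolerance and the interval bounds are arbitrary choices for illustration.

```stata
* Option 1: round the constant to float precision before comparing
list id if hrs1d10 == float(6.1)

* Option 2: compare within a small tolerance (0.001 is an arbitrary choice)
list id if abs(hrs1d10 - 6.1) < 0.001

* Option 3: use inrange() as an interval check; missing values are excluded
count if inrange(hrs1d10, 6.05, 6.15)
```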
If your dataset is large, using small variable types like byte can save a lot of memory, but that
can be accomplished after all the variables are created, before saving the dataset, using
the compress command. It automatically stores variables in smaller types if it is possible to
do that without losing precision. It also checks whether strings can be stored as shorter strings.
. compress
emailhr was int now byte
chathr was int now byte
artshr was int now byte
emhrh was int now byte
emhrw was int now byte
wwwhrw was int now byte
emhro was int now byte
wwwhro was int now byte
chldprb was int now byte
chldhlp was int now byte
hrs40 was double now byte
(47,005 bytes saved)
You can also change the storage type of a specific variable using the recast command:
recast type varlist [, force]
where type is byte, int, long, float, double, str1, str2, ..., str2045, or strL. For example:
. recast byte hrs1dm
. des hrs1dm
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------
hrs1dm byte %9.0g
. recast byte hrs1d10
hrs1d10: 786 values would be changed; not changed
. sum hrs1d10
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hrs1d10 | 1729 4.177675 1.462304 .1 8.9
. recast byte hrs1d10, force
hrs1d10: 786 values changed
. sum hrs1d10
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
hrs1d10 | 1729 3.95026 1.485213 0 8
Note that force makes recast unsafe -- variables can get the new storage type even if that will
cause a loss of precision, introduction of missing values, or, for string variables, the truncation
of strings.
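Before resorting to force, it may help to check how many values would actually be affected, and to keep the original variable intact. A minimal sketch, to be run before any forced recast; hrs1d10_int is a hypothetical variable name.

```stata
* count nonmissing values that have a fractional part
count if hrs1d10 != floor(hrs1d10) & !missing(hrs1d10)

* keep the original and store the truncated version in a new variable
gen byte hrs1d10_int = floor(hrs1d10)
```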
Display Formats for Numeric Variables
We already saw that formatting date variables helps Stata understand that we specified dates and
display them correctly. You can also modify the display format of numeric variables, also
using the format command:
format varlist %fmt
Here are the variable formats for numeric variables (from help format):
Numerical
  %fmt        Description      Example
-------------------------------------------------------
right-justified
  %#.#g       general          %9.0g
  %#.#f       fixed            %9.2f
  %#.#e       exponential      %10.7e
  %21x        hexadecimal      %21x
  %16H        binary, hilo     %16H
  %16L        binary, lohi     %16L
  %8H         binary, hilo     %8H
  %8L         binary, lohi     %8L

right-justified with commas
  %#.#gc      general          %9.0gc
  %#.#fc      fixed            %9.2fc

right-justified with leading zeros
  %0#.#f      fixed            %09.2f

left-justified
  %-#.#g      general          %-9.0g
  %-#.#f      fixed            %-9.2f
  %-#.#e      exponential      %-10.7e

left-justified with commas
  %-#.#gc     general          %-9.0gc
  %-#.#fc     fixed            %-9.2fc
-------------------------------------------------------
You may substitute comma (,) for period (.) in any of
the above formats to make comma the decimal point. In
%9,2fc, 1000.03 is 1.000,03. Or you can use “set dp comma.”
The %g format is usually used as %width.0g, with 0 digits after the decimal point specified, but
what that actually means is that Stata decides how many digits to display to the right of the
decimal point depending on how many digits there are in total, while in %f, the number of digits
after the decimal point is fixed by the format. Also, %g will switch to a %e display format
(exponential) if the number is too large or too small to fit, while %f does not do that.
. des spsei
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------
spsei float %3.2f spsei r's spouse's socioeconomic index
. list spsei in 7/8
+-------+
| spsei |
|-------|
7. | 64.1 |
8. | 29.2 |
+-------+
. format spsei %3.2f
. list spsei in 7/8
+-------+
| spsei |
|-------|
7. | 64.10 |
8. | 29.20 |
+-------+
. format spsei %09.2f
. list spsei in 7/8
+-----------+
| spsei |
|-----------|
7. | 000064.10 |
8. | 000029.20 |
+-----------+
. format spsei %3.2e
. list spsei in 7/8
+---------+
| spsei |
|---------|
7. | 6.4e+01 |
8. | 2.9e+01 |
+---------+
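The same formats can be tried out without changing any variable by passing them to the display command. This is a small sketch with made-up numbers, meant only to show how %g, %f, and the comma and leading-zero variants behave.

```stata
* general format: Stata chooses the number of decimals that fits the width
display %9.0g 1234.5678

* fixed format: always two digits after the decimal point
display %9.2f 1234.5678

* fixed format with commas as thousands separators
display %12.2fc 1234567.891

* fixed format padded with leading zeros
display %09.2f 64.1

* a number too wide for %9.0g is shown in exponential notation
display %9.0g 123456789012
```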
The default formats are:
byte    %8.0g
int     %8.0g
long    %12.0g
float   %9.0g
double  %10.0g
You can also change the default format for displaying all coefficients using the set cformat
command – e.g., to show only 2 decimal places, we can use the following command prior to
running our regression models:
. set cformat %9.2f
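As a sketch of how this might be used, the lines below set the coefficient format, run a model, and then restore the default. The regression of hrs1 on spsei is a hypothetical model chosen only because both variables appear in this handout.

```stata
* show coefficients, standard errors, and confidence limits with 2 decimals
set cformat %9.2f
regress hrs1 spsei

* typing set cformat without a format restores the default display
set cformat

* alternatively, apply the format to a single command through its display option
regress hrs1 spsei, cformat(%9.2f)
```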