SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.3 Numeric variables

5.3.1 Data concepts

5.3.1.1 Vectors/sequences

The columns of data frames in the tidyverse and pandas are vectors and sequences respectively. Vectors and sequences are ordered one dimensional structures in which the elements are all the same type.

The order of vectors and sequence is important for data frames. This allows the observations of a data frame to be represented by the rows.

5.3.1.2 Lists

Lists are similar to vectors and sequences in many regards. They are one dimensional and ordered. The difference is that the types of the elements can be different.

One powerful feature of lists is that lists can be nested within lists. As you learn more you may see these structures. This book does not cover them.

List are used by both the tidyverse and pandas. They will be used more frequently going forward in this book.

5.3.1.3 Dictionary

A dictionary is a Python object that is one dimensional with no order. The only method to access data in a dictionary is with a name. There is no numeric index for a dictionary (unless the names are given as numbers.) This structure is used to associate names with objects.

R has no user class that is similar to dictionaries in Python. The environment objects can be used as dictionaries in R. This is an advanced use that will not be covered in this book.

5.3.2 Programming skills - Variables not in a data frame.

Both R and Python allow variables to be copied from a data frame. This can be convenient to do since there are many functions in both R and Python that operate on vectors and lists.

Variables can be manipulated outside of the data frame. The variable can also be put back in the data frame. It can seem convenient to manipulate variables independently of the data frame, but this must be done with care to insure that the observations line up properly when a variable is added back to the data frame. Some of the most difficult problems to debug occur when a variable is manipulated outside of its data frame and the order of its or the data frame's rows are changed before it is put back in the data frame. Manipulating the variables of a data frame without saving them external to the data frame is the best way to avoid these kinds of problems.

5.3.3 Examples - R

These examples use the Forbes2000.csv data set.

  1. We begin by loading the tidyverse, importing the csv file, and naming the variables.

    library(tidyverse)
    forbes_path <- file.path("..", "datasets", "Forbes2000.csv")
    forbes_in <- read_csv(forbes_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    forbes_in <- rename(forbes_in, market_value = marketvalue)
    
    forbes <- 
      forbes_in %>%
      select(-X1)
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 8
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & g...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
  2. Converting a variable to numeric

    Numeric variables are sometimes not imported as numeric due to missing value symbols. In other cases numeric values are not imported as numeric due to the inclusion of non-digit charters, such as $, ,, %. When the text that prevented the variable from being parsed as a numeric variable are corrected, such as missing value symbols have been changed to NA and non-numeric characters taken care of, the variable can then be converted to numeric using parse_number().

    All of the forbes numeric variables were read as numeric. We will start this example by converting the sales variable to character and putting a $ in front using concatenation, the str_c() function.

    forbes <- mutate(forbes, sales = str_c("$", sales))
    
    forbes %>%
      select(name, country, sales) %>%
      glimpse()
    Observations: 2,000
    Variables: 3
    $ name    <chr> "Citigroup", "General Electric", "American Intl Group"...
    $ country <chr> "United States", "United States", "United States", "Un...
    $ sales   <chr> "$94.71", "$134.19", "$76.66", "$222.88", "$232.57", "...

    Now we can convert the the sales column to a numeric variable.

    The parse_number() function converts the sales values to numbers.

    forbes <- 
      forbes %>%
      mutate(
        sales = parse_number(sales)
        )
    
    forbes %>%
      select(sales) %>%
      head()
    # A tibble: 6 x 1
      sales
      <dbl>
    1  94.7
    2 134. 
    3  76.7
    4 223. 
    5 233. 
    6  49.0

    The base R as.numeric() function requires that all characters are numbers. The tidyverse parse_number() is a much more flexible approach to converting characters to numeric values.

    This can be seen by using as.numeric() to covert sales with the $ added.

    sales <- as.numeric(str_c("$", forbes$sales))
    Warning: NAs introduced by coercion
    head(sales)
    [1] NA NA NA NA NA NA

    This produces all NAs.

  3. Variables that are not included as part of a data frame.

    In this example we create a sales variable that is not part of the forbes data frame.

    Recall that the tidyverse data frame is a tibble. Subsetting a column of a tibble returns a data frame (tibble) with one column. If the column is to be used as a vector, one does a pull() on the column. Base R data frames will return a vector if a single column is subset from a data frame. A variable that has been removed from a data frame or tibble is a vector.

    sales <- forbes %>% pull(sales)
    
    class(sales)
    [1] "numeric"

    Pulling vectors from data frames can be useful when exploring data. For example, if you wanted a summary of the values of only the sales variable, one could do the following.

    forbes %>% pull(sales) %>% summary()
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.010   2.018   4.365   9.697   9.547 256.330 

    Note, a list can be added as a column to a data frame. Many data frame methods and functions will not operate correctly with a list-based column. There are times when a list-based variable allows for some very powerful programming. This use of lists as columns will not be covered in this book.

    The sales vector can be converted to a list.

    sales <- as.list(sales)
    head(sales, 4)
    [[1]]
    [1] 94.71
    
    [[2]]
    [1] 134.19
    
    [[3]]
    [1] 76.66
    
    [[4]]
    [1] 222.88
    class(sales)
    [1] "list"
  4. Creating numeric variables.

    Numeric variables result from any expresion made up of functions/methods and operators that return a numeric value for each row.

    Here we calculate the price-earnings (pe) ratio for the companies three different ways. The first uses only the divide operator. The second uses if_else() to add NA for non-positive profits. The third uses the base R if_else() to return no earnings when profits are non-positive. The result of the third pe calculation is of type character.

    forbes <-
      forbes %>%
      mutate(
        pe = market_value / profits,
        pe2 = if_else(profits > 0, market_value / profits, NA_real_),
        pe3 = if_else(profits > 0, as.character(market_value / profits), "no earnings")
      )
    
    filter(forbes, profits <= 0) %>%
      select(name, profits, pe, pe2, pe3) %>%
      head()
    # A tibble: 6 x 5
      name                profits     pe   pe2 pe3        
      <chr>                 <dbl>  <dbl> <dbl> <chr>      
    1 Allianz Worldwide     -1.23 -39.1     NA no earnings
    2 Vodafone             -15.5  -11.3     NA no earnings
    3 Deutsche Telekom     -25.8   -3.26    NA no earnings
    4 Credit Suisse Group   -2.4  -18.4     NA no earnings
    5 France Telecom       -21.8   -2.96    NA no earnings
    6 Generali Group        -0.79 -44.4     NA no earnings

5.3.4 Examples - Python

These examples use the Forbes2000.csv data set.

  1. We begin by loading the packages, importing the csv file, and naming the variables.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv'
    forbes_in = pd.read_csv(forbes_path)
    forbes_in = (
        forbes_in
            .rename(columns={'marketvalue': 'market_value'}))
    forbes =  forbes_in.copy(deep=True)
    
    print(forbes.dtypes)
    Unnamed: 0        int64
    rank              int64
    name             object
    country          object
    category         object
    sales           float64
    profits         float64
    assets          float64
    market_value    float64
    dtype: object
  2. Converting a variable to numeric

    Numeric variables are sometimes not imported as numeric due to missing value symbols. In other cases numeric values are not imported as numeric due to the inclusion of non-digit charters, such as $, ,, %. When the missing value symbols have been changed to np.NaN and the other non-numeric characters taken care of, the variable can then be converted to numeric.

    All of the forbes numeric variables were read as numeric. We will start this example by converting the sales variable to character and putting a $ in front using concatenation.

    forbes.sales = '$' + forbes.sales.astype(str)
    
    (forbes
        .loc[:, ['name', 'country', 'sales']]
        .head(5)
        .pipe(print))
                      name         country    sales
    0            Citigroup   United States   $94.71
    1     General Electric   United States  $134.19
    2  American Intl Group   United States   $76.66
    3           ExxonMobil   United States  $222.88
    4                   BP  United Kingdom  $232.57

    Now we can convert the the sales column to a numeric variable.

    The to_numeric() pandas function parses string variables and produces a numeric variable.

    forbes = (
        forbes
            .assign(
                sales=lambda df: df.sales.str.extract('\$(.+)', expand=False))
            .assign(
                sales=lambda df: pd.to_numeric(df.sales, errors='coerce')))
    
    (forbes
        .loc[:, ['name', 'country', 'sales']]
        .head(5)
        .pipe(print))
                      name         country   sales
    0            Citigroup   United States   94.71
    1     General Electric   United States  134.19
    2  American Intl Group   United States   76.66
    3           ExxonMobil   United States  222.88
    4                   BP  United Kingdom  232.57
    print(forbes.sales.dtype)
    float64

    The pandas methods astype('float64') and astype('int64') are not as flexible as to_numeric() in converting from strings to numeric, but you might see them in exmaples on the web.

  3. Variables that are not included as part of a data frame.

    In this example we create a sales variable that is not part of the forbes data frame. The type of the stand-alone variable is a series.

    sales = forbes['sales']
    
    print(type(sales))
    <class 'pandas.core.series.Series'>

    The sales series can be converted to a list. The list is created using the [] operator. The list() function could also be used.

    sales = [forbes['sales']]
    
    print(type(sales))
    <class 'list'>

    The sales series can be converted to an np array object using the values attribute.

    sales = forbes['sales'].values
    
    print(type(sales))
    <class 'numpy.ndarray'>
  4. Creating numeric variables.

    Numeric variables result from any expresion made up of functions/methods and operators that return a numeric value for each row.

    Here we calculate the price-earnings (pe) ratio for the companies three different ways. The first uses only the divide operator. The second uses np.where() to add np.NaN for non-positive profits. The third uses np.where() to add 'no earnings' for non-positive profits. The result of the third pe calculation is of type string.

    forbes = (
        forbes
            .assign(
                pe=lambda df: df['market_value'] / df['profits'])
            .assign(
                pe2=lambda df: np.where(df['pe'] < 0, np.NAN, df['pe']),
                pe3=lambda df: np.where(
                    df['pe'] < 0,
                    'no earnings',
                    df['pe'])))
    
    (forbes
        .loc[:, ['name', 'profits', 'pe', 'pe2', 'pe3']]
        .query('profits <= 0')
        .head(5)
        .pipe(print))
                        name  profits         pe  pe2          pe3
    349    Allianz Worldwide    -1.23 -39.081301  NaN  no earnings
    353             Vodafone   -15.51 -11.257898  NaN  no earnings
    363     Deutsche Telekom   -25.83  -3.259001  NaN  no earnings
    371  Credit Suisse Group    -2.40 -18.387500  NaN  no earnings
    373       France Telecom   -21.78  -2.955005  NaN  no earnings

    The calculation for the pe variable, df['market_value'] / df['profits'], is the result of a mathematical operation on series that have been subset from the data frame. These kinds of mathematical operations preserve the order of the series.

    This example could also have been done using a series outside of the data frame and subsetting.

    pe = forbes['market_value'] / forbes['profits']
    forbes['pe'] = pe
    forbes['pe2'] = np.where(pe < 0, np.NAN, pe)
    forbes['pe3'] = np.where(pe < 0, 'no earnings', pe)
    
    (forbes
        .loc[:, ['name', 'profits', 'pe', 'pe2', 'pe3']]
        .query('profits <= 0')
        .head(5)
        .pipe(print))
                        name  profits         pe  pe2          pe3
    349    Allianz Worldwide    -1.23 -39.081301  NaN  no earnings
    353             Vodafone   -15.51 -11.257898  NaN  no earnings
    363     Deutsche Telekom   -25.83  -3.259001  NaN  no earnings
    371  Credit Suisse Group    -2.40 -18.387500  NaN  no earnings
    373       France Telecom   -21.78  -2.955005  NaN  no earnings

    The results are the same as with method chaining. This approach is a common coding stlye and results in code that executes quickly and is still readable. Here the pe series, which is outside of the data frame, is created and then inserted into the data frame almost immediately. There is no opportunity for the order of either the series or the data frame to change. If the creation of the series and the insertion into the data frame do not occur right next to each other as in this example, the data frame that is being worked on may be manipulated in some way that changes the order. Then the series, that is outside the data frame, is not aligned with the observations of the data frame. These kinds of error can go undetected for a long time and are difficult to find. The assign() and lambda approach protects the relationship of the observations, at the cost of a little extra typing.

5.3.5 Exercises

These exercises use the mtcars.csv data set.

  1. Import the mtcars.csv data set.

  2. The wt variable is measured in thousands of pounds. Change this variable to a character variable that has a comma separating the thousands digit from the hundreds digit, e.g. 2.14 becomes 2,140.

    Hint, for one of the possible solution you may find it useful to look for a string function/method that will pad. Padding adds characters to a fixed width.

  3. Convert the character variable you created in the prior exercise to a new numeric variable. Make the units of measure for this new variable 1,000 of pounds.