Description¶

This tool is designed to keep track of all changes made to your models, and to track any metrics you want to run on your models. For example if you make many changes to your model's formula, just run

track(model)

after every iteration to store all the models you tested and it will track them, store them in a dataframe, and sort them by time. It can also keep track of metrics, functions you want to run on your model, and custom metrics.

Usage¶

In [ ]:

track(metrics, metrics=NULL, customMetrics=NULL, ...)

Arguments¶

metrics A list of strings that exist as attributes on the model.

customFunctions A list of strings that exists as functions you have defined.

... Additional arguments to be entered as new columns in the MODELS dataframe.

Source code¶

In [2]:

library(plyr)
library(dplyr)

track <- function(model, metrics=NULL, customFunctions=NULL, ...){
    
    kwargs <- list(...);
    
    environmentVars <- list(
        datetime = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
        username = Sys.info()[["user"]],
        className = class(model)[[1]]
    )
    
    form <- data.frame(formula=paste(deparse(formula(model)), collapse=''));
    
    for (env in names(environmentVars)){
        form[[env]] <-  environmentVars[[env]];
    }

    if (!is.null(metrics)){
        for (metric in metrics){
            form[[metric]]  <- summary(model)[[metric]];
        }
    }

    if (!is.null(customFunctions)){
        for (cmetric in customFunctions){
            form[[cmetric]] <- get(cmetric)(model);
        }
    }

    for (arg in names(kwargs)){
        form[[arg]] <- kwargs[[arg]];
    }
    
    DIRTY_METRICS <- FALSE;
    
    if (exists('METRICS_LIST')){
        
        DIRTY_METRICS <- FALSE %in% 
            unique(
                append(
                    names(kwargs), unlist(
                        append(
                            metrics, customFunctions
                        )
                    )
                ) == METRICS_LIST
            )
        
        if (DIRTY_METRICS){
            MMODELS <<- form;
        }
    }
    
    METRICS_LIST <<- append(
        names(kwargs), unlist(
            append(
                metrics, customFunctions
            )
        )
    )

    if (exists('MODELS')){
        COUNTER <<- COUNTER + 1;
    } else {
        MODELS <<- form;
        COUNTER <<- 1;
    }
    
    if (exists('METRICS_LIST') & DIRTY_METRICS){

        for (env in names(environmentVars)){
            MMODELS[[env]] <<-  environmentVars[[env]];
        }

        if (!is.null(metrics)){
            for (metric in metrics){
                MMODELS[[metric]]  <<- summary(model)[[metric]];
            }
        }

        if (!is.null(customFunctions)){
            for (cmetric in customFunctions){
                MMODELS[[cmetric]] <<- get(cmetric)(model);
            }
        }

        for (arg in names(kwargs)){
            MMODELS[[arg]] <<- kwargs[[arg]];
        }
    }
    
    if (COUNTER > 1){
        if (DIRTY_METRICS){
            combinedDf <- rbind.fill(MODELS, MMODELS);
            MODELS <<- combinedDf[with(combinedDf, order(datetime, decreasing=TRUE)),];
        } else {
            combinedDf <- rbind.fill(MODELS, form);
            MODELS <<- combinedDf[with(combinedDf, order(datetime, decreasing=TRUE)),];
        }
    }
    return(form);
}

Read in data¶

In [3]:

df <- read.csv('middle_tn_schools.csv')

In [4]:

head(df)

name	school_rating	size	reduced_lunch	state_percentile_16	state_percentile_15	stu_teach_ratio	school_type	avg_score_15	avg_score_16	full_time_teachers	percent_black	percent_white	percent_asian	percent_hispanic
Allendale Elementary School	5	851	10	90.2	95.8	15.7	Public	89.4	85.2	54	2.9	85.5	1.6	5.6
Anderson Elementary	2	412	71	32.8	37.3	12.8	Public	43.0	38.3	32	3.9	86.7	1.0	4.9
Avoca Elementary	4	482	43	78.4	83.6	16.6	Public	75.7	73.0	29	1.0	91.5	1.2	4.4
Bailey Middle	0	394	91	1.6	1.0	13.1	Public Magnet	2.1	4.4	30	80.7	11.7	2.3	4.3
Barfield Elementary	4	948	26	85.3	89.2	14.8	Public	81.3	79.6	64	11.8	71.2	7.1	6.0
Barkers Mill Elementary School	4	893	48	78.1	76.4	13.9	Public	69.4	72.3	64	28.6	39.9	2.2	17.8

ModelTracker default options¶

By default, ModelTracker records your model's formula, the execution timestamp, the username, and the model's class name.

In [5]:

model.lm <- lm(avg_score_16 ~ stu_teach_ratio + school_type, data=df)

summary(model.lm)

track(model.lm)

Call:
lm(formula = avg_score_16 ~ stu_teach_ratio + school_type, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-64.817 -18.583   1.753  21.543  59.687 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  5.3802     8.6094   0.625 0.532438    
stu_teach_ratio              3.5258     0.5578   6.321 8.10e-10 ***
school_typePublic Charter    5.7192     9.3476   0.612 0.541051    
school_typePublic Magnet   -15.5177     4.1378  -3.750 0.000207 ***
school_typePublic Virtual -318.9387    59.4852  -5.362 1.52e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.08 on 342 degrees of freedom
Multiple R-squared:  0.1403,	Adjusted R-squared:  0.1303 
F-statistic: 13.96 on 4 and 342 DF,  p-value: 1.47e-10

formula	datetime	username	className
avg_score_16 ~ stu_teach_ratio + school_type	2017-06-29 21:34:18	tdobbins	lm

Details about the models you have tested are stored in MODELS.

In [6]:

MODELS

formula	datetime	username	className
avg_score_16 ~ stu_teach_ratio + school_type	2017-06-29 21:34:18	tdobbins	lm

Metrics¶

track accepts an argument called metrics. This has to be an attribute that exists on your model's summary object. To extract it, put the attribute name in a list of strings like

list('r.squared', 'sigma')

We'll also add another variable (state_percentile_15) to our formula to display how ModelTracker keeps track of the formulas.

In [7]:

model.lm <- lm(avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15, data=df)

metrics <- list('r.squared', 'sigma')

summary(model.lm)
track(model.lm, metrics=metrics)

Call:
lm(formula = avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.660  -5.444   0.069   5.711  41.670 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 8.5299     3.1113   2.742  0.00644 ** 
stu_teach_ratio             0.1617     0.2132   0.759  0.44864    
school_typePublic Charter  -3.1549     3.7291  -0.846  0.39814    
school_typePublic Magnet   -2.0058     1.4565  -1.377  0.16939    
school_typePublic Virtual   1.3703    22.3127   0.061  0.95107    
state_percentile_15         0.7992     0.0161  49.651  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.005 on 335 degrees of freedom
  (6 observations deleted due to missingness)
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 584.5 on 5 and 335 DF,  p-value: < 2.2e-16

formula	datetime	username	className	r.squared	sigma
avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15	2017-06-29 21:34:23	tdobbins	lm	0.8971659	9.00481

Notice our MODELS object now contains two additional metrics, r.squared and sigma. The first model, since it wasn't tracking those metrics, shows NA for those values.

In [8]:

MODELS

	formula	datetime	username	className	r.squared	sigma
2	avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15	2017-06-29 21:34:23	tdobbins	lm	0.8971659	9.00481
1	avg_score_16 ~ stu_teach_ratio + school_type	2017-06-29 21:34:18	tdobbins	lm	NA	NA

Custom functions¶

ModelTracker also accepts an argument called customFunctions. This is for defining custom functions that accept the model as input and return a single value. Be sure that the functions only accept one argument, your model.

For example, I define two functions, getr2 and getFstat.

In [9]:

getr2 <- function(model){
    return(summary(model)$r.squared);
}

getFstat <- function(model){
    return(summary(model)$fstatistic[['value']])
}

Now I will create a list and put my custom functions in them then pass them to track. Note, you must specify them as strings.

In [10]:

model.lm <- lm(avg_score_16 ~ reduced_lunch + poly(size, 2), data=df)

summary(model.lm)

customFunctions <- list('getr2', 'getFstat')
track(model.lm, customFunctions=customFunctions)

Call:
lm(formula = avg_score_16 ~ reduced_lunch + poly(size, 2), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.168 -10.779  -1.083  10.450  58.887 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    103.22970    1.97690  52.218  < 2e-16 ***
reduced_lunch   -0.91846    0.03541 -25.935  < 2e-16 ***
poly(size, 2)1 -46.38692   16.66492  -2.784  0.00568 ** 
poly(size, 2)2 -19.69939   16.12389  -1.222  0.22264    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16 on 343 degrees of freedom
Multiple R-squared:  0.6756,	Adjusted R-squared:  0.6728 
F-statistic: 238.1 on 3 and 343 DF,  p-value: < 2.2e-16

formula	datetime	username	className	getr2	getFstat
avg_score_16 ~ reduced_lunch + poly(size, 2)	2017-06-29 21:38:37	tdobbins	lm	0.6756111	238.1242

Notice that the MODELS object sorts your models by time in descending order.

In [11]:

MODELS

	formula	datetime	username	className	r.squared	sigma	getr2	getFstat
3	avg_score_16 ~ reduced_lunch + poly(size, 2)	2017-06-29 21:38:37	tdobbins	lm	NA	NA	0.6756111	238.1242
1	avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15	2017-06-29 21:34:23	tdobbins	lm	0.8971659	9.00481	NA	NA
2	avg_score_16 ~ stu_teach_ratio + school_type	2017-06-29 21:34:18	tdobbins	lm	NA	NA	NA	NA

Custom metrics¶

Finally, my favorite. This feature lets you track any metric by calculating it before running track and then just inputting the value into our track call.

For instance, here I calculate pseudoR2 and retrieve the coefficient for reduced_lunch. Then I add them to track using whatever names I want; here I use pseudoR2 and reducedLunchCoef.

In [12]:

model.glm <- glm(avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type),
                 data=df)

summary(model.glm)

pseudoR2 <- 1-(model.glm$deviance/model.glm$null.deviance)
reducedLunchCoef <- summary(model.glm)$coefficients[2,1]

track(model.glm, pseudoR2=pseudoR2, reducedLunchCoef=reducedLunchCoef)

Call:
glm(formula = avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + 
    factor(school_type), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-45.657   -9.101    0.439    9.698   45.115  

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       92.5334745  3.6667864  25.236  < 2e-16 ***
I(reduced_lunch^2)                -0.0093298  0.0003373 -27.662  < 2e-16 ***
sqrt(size)                        -0.2344695  0.1221849  -1.919   0.0558 .  
factor(school_type)Public Charter 33.3718677  5.5360610   6.028 4.31e-09 ***
factor(school_type)Public Magnet  -4.7844382  2.4292237  -1.970   0.0497 *  
factor(school_type)Public Virtual -4.4385146 15.2299952  -0.291   0.7709    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 228.2756)

    Null deviance: 270663  on 346  degrees of freedom
Residual deviance:  77842  on 341  degrees of freedom
AIC: 2877.1

Number of Fisher Scoring iterations: 2

formula	datetime	username	className	pseudoR2	reducedLunchCoef
avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type)	2017-06-29 21:43:12	tdobbins	glm	0.7124028	-0.009329825

Whatever name you pass to track will show up as the column name.

In [13]:

MODELS

	formula	datetime	username	className	r.squared	sigma	getr2	getFstat	pseudoR2	reducedLunchCoef
4	avg_score_16 ~ I(reduced_lunch^2) + sqrt(size) + factor(school_type)	2017-06-29 21:43:12	tdobbins	glm	NA	NA	NA	NA	0.7124028	-0.009329825
1	avg_score_16 ~ reduced_lunch + poly(size, 2)	2017-06-29 21:38:37	tdobbins	lm	NA	NA	0.6756111	238.1242	NA	NA
2	avg_score_16 ~ stu_teach_ratio + school_type + state_percentile_15	2017-06-29 21:34:23	tdobbins	lm	0.8971659	9.00481	NA	NA	NA	NA
3	avg_score_16 ~ stu_teach_ratio + school_type	2017-06-29 21:34:18	tdobbins	lm	NA	NA	NA	NA	NA	NA

In [ ]:

ModelTrackeR

Description¶

Usage¶

Arguments¶

Source code¶

Read in data¶

ModelTracker default options¶

Metrics¶

Custom functions¶

Custom metrics¶

Share on

You May Also Enjoy

SQLCell 2.0: Redesigning SQLCell for JupyterLab

So you want to write a book? A conversation with Manning author John Berryman

Concept Frequency-Inverse Concept Document Frquency: Analyzing Concepts in Text

Think Twice Before You Accept That Fancy Data Science Job