This article reviews two possible scenarios for “writing back” predictions to the database, and without importing importing the data into memory first. Both scenarios share a common model preparation method.
To keep the example reproducible, a SQLite database will be used to simulate a larger scale deployment.
First the data is prepared in memory. The article will use the
nycflights13::flights data, with a couple of modifications
A new database is created using
A sample is downloaded from the database for modeling. This example already selects the needed variables.
A linear model is fitted using
It is highly recommendable to always run
tidypredict_test() to make sure that the predictions are in line with what the
predict() command returns.
glm models contain the data that they were fitted with, so it’s easy to run a test.
tidypredict_test() uses the model’s internal data set by default
In cases where the model re-fitting is automated, the
tidypredict_test() function returns an
alert in case the threshold is exceeded, which can be used to fail the automated script.
In this scenario, The table that supplies the term values is also the recipient of the new score. This is done by updating a specific field in the table. For this example, the field is called
current_score. The following SQL UPDATE statement should work in most databases:
This statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used:
Here is a sample of the newly populated field:
There may be the need to retain not only the new score, but when it was determined and its history. This is usually possible because the source of record possesses a unique key identifier per entity, such as transaction ID or customer ID. In the example,
flights_id is the unique identifier.
In this example, the new scores will be stored in a new table called
daily_scores. The following code is part of the example preparation, it creates the table and seeds it with a single row. This is not the best way to create an empty table, but it’ll do for the purposes of this example.
The plan is to use a SQL statement that most vendors support, is called INSERT INTO SELECT. The idea is to use
dplyr laziness to prepare the data transformation and predictions, but it’s not going to be executed until is parsed into SQL and sent as part of a another statement. The INSERT INTO SELECT statement allows for the results of a query to be saved in a table, and without leaving the database.
In this example, predictions are going to be executed for just the records in the month of December. The data is filtered, and then
tidypredict_to_column() is used to create the new fit field. The results are then transformed to match to the structure of the new
insert_scores <- build_sql("INSERT INTO daily_scores ", sql_render(new_predictions, con = con), con = con) insert_scores #> <SQL> INSERT INTO daily_scores SELECT `flight_id`, `score`, '01/01/2018' AS `date` #> FROM (SELECT `flight_id`, -3.5984422918702 + (`hour` * 1.38710560882252) + (`distance` * -0.00307606912118567) AS `score` #> FROM (SELECT * #> FROM `flights_table` #> WHERE (`month` = 12.0)))
As in the first scenario, this statement can be then passed on to the database team, via documentation or an automated process. In cases where the analyst has the responsibility to run the new SQL statement, or if R is being used to automate the scoring, the next line can be used:
dbSendQuery(con, insert_scores) #> <SQLiteResult> #> SQL INSERT INTO daily_scores SELECT `flight_id`, `score`, '01/01/2018' AS `date` #> FROM (SELECT `flight_id`, -3.5984422918702 + (`hour` * 1.38710560882252) + (`distance` * -0.00307606912118567) AS `score` #> FROM (SELECT * #> FROM `flights_table` #> WHERE (`month` = 12.0))) #> ROWS Fetched: 0 [complete] #> Changed: 28135
A simple table join can be used to confirm that the new update worked. For real life scenarios, a more sophisticated query should be performed in order to only get the latest score. For this example, we simple filter on the same date we inserted
tbl(con, "daily_scores") %>% inner_join(tbl(con, "flights_table"), by = "flight_id") %>% filter(date == "01/01/2018") %>% select(dep_delay, hour, distance, score, date) #> Warning: Closing open result set, pending rows #> # Source: lazy query [?? x 5] #> # Database: sqlite 3.30.1  #> dep_delay hour distance score date #> <dbl> <dbl> <dbl> <dbl> <chr> #> 1 14 23 1617 23.3 01/01/2018 #> 2 18 23 1576 23.5 01/01/2018 #> 3 -7 5 529 1.71 01/01/2018 #> 4 5 5 1400 -0.969 01/01/2018 #> 5 -4 5 1089 -0.0128 01/01/2018 #> 6 -10 5 1576 -1.51 01/01/2018 #> 7 -4 5 569 1.59 01/01/2018 #> 8 1 5 1416 -1.02 01/01/2018 #> 9 -11 6 214 4.07 01/01/2018 #> 10 -10 6 1065 1.45 01/01/2018 #> # … with more rows