Skip to content

Latest commit

 

History

History
133 lines (80 loc) · 2.08 KB

File metadata and controls

133 lines (80 loc) · 2.08 KB

Lecture 8

Data warehousing


Anything you'd like to share?


Data loading

  • Append load
  • Trunc(ate) and load
  • Incremental load

Tracking loads

It can be helpful to track the load information for each row. These might be things like:

  • _LOADED_FROM: filename, if applicable
  • _LOADED_AT: timestamp

"Give me the data from the most recent load"


DuckDB example

Create

CREATE TABLE my_table AS
SELECT
   *,
   current_localtimestamp() AS _LOADED_AT,
   filename AS _LOADED_FROM
FROM read_csv('[path]', filename=true);

Subsequent loads

INSERT INTO my_table
SELECT
   *,
   current_localtimestamp() AS _LOADED_AT,
   filename AS _LOADED_FROM
FROM read_csv('[path]', filename=true);

You can imagine the equivalent in a DataFrame.


Retrieving the latest load

SELECT *
FROM my_table
WHERE _LOADED_AT = (SELECT MAX(_LOADED_AT) FROM my_table);

Let's say you were given access to a random table that uses one of the three data loading methods above. How would you tell which it was?


Incremental load

The trick is avoiding duplicates. Your script might then need to say something like:

  1. What's the latest timestamp in the database?
  2. Pull data from the API that's more recent than that.

BEGIN TRANSACTION;
INSERT INTO ...;
COMMIT;

Revisiting the Project goals


Why use a database?


Data engineering