
Last updated on March 12, 2021

Data Science Overview

This note walks through the steps of the data science process from start to finish.

Everyone’s data is different, and companies store their data in different ways, even when the underlying data is extremely similar.
To solve their problems, we most likely need only a subset of the data they have available, and that subset is often scattered across several storage systems.

Spark
SQL
Data lake vs data warehouse
AWS (S3, Glue, Redshift)
- Use Glue to piece together your various data from S3. Redshift can be used if you have structured data stores that you want to query quickly. Glue can also write to a Redshift data warehouse.

Python/R
Jupyter Notebook/Databricks
Tableau/Excel
Communicate with ETL throughout the rest of the stages about new features needed, errors, etc.

$Y_{i} = f(X_{i}) + \epsilon_{i}$

Most important thing to be aware of: what assumptions am I making?
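
One way to make those assumptions concrete is to simulate from the model above. The sketch below is illustrative only: the linear `f`, the noise level `SIGMA`, and the sample size are made-up choices, and it assumes the noise is additive, zero-mean, and independent of X. Under those assumptions, even the true `f` cannot get the mean squared error below the noise variance.

```python
import random

def f(x):
    # Hypothetical "true" systematic relationship; in practice f is unknown.
    return 2.0 * x + 1.0

SIGMA = 0.5  # assumed standard deviation of the noise term epsilon
rng = random.Random(42)  # fixed seed so the run is repeatable

# Draw X_i, then build Y_i = f(X_i) + epsilon_i with Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(10_000)]
ys = [f(x) + rng.gauss(0, SIGMA) for x in xs]

# Error of the true f on its own data: this is the irreducible part.
irreducible_mse = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"MSE of the true f: {irreducible_mse:.3f}")
print(f"Noise variance:    {SIGMA ** 2:.3f}")
```

The two printed values should be close to each other. If the noise were not zero-mean or depended on X, this comparison would no longer hold, which is the kind of assumption worth writing down before modeling.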

Train/Validation/Test Split
- Fit on the training set, select the model/hyperparameters on the validation set, and evaluate on the test set (which should be a true holdout that is never used for fitting or selection)
Supervised/Unsupervised
- Supervised: regression/classification
- Unsupervised: clustering, generative modeling
Prediction/Inference
Regression/Classification
- Regression: MSE
- Classification: Accuracy, ROC Curve
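
The train/validation/test workflow and the MSE metric above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the synthetic data (sin(x) plus noise), the 60/20/20 split, the k-nearest-neighbors regressor, and the candidate values of `k`.

```python
import math
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def evaluate(train, data, k):
    """MSE of a k-NN model fit on `train`, scored on `data`."""
    y_true = [y for _, y in data]
    y_pred = [knn_predict(train, x, k) for x, _ in data]
    return mse(y_true, y_pred)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical regression data: y = sin(x) + Gaussian noise.
data = []
for _ in range(300):
    x = rng.uniform(0, 6)
    data.append((x, math.sin(x) + rng.gauss(0, 0.3)))

# Shuffle once, then cut into 60% train, 20% validation, 20% test.
rng.shuffle(data)
train, val, test = data[:180], data[180:240], data[240:]

# Select the hyperparameter k using the validation set only.
candidate_ks = [1, 5, 25]
val_scores = {k: evaluate(train, val, k) for k in candidate_ks}
best_k = min(val_scores, key=val_scores.get)

# Touch the test set once, after the choice is made.
test_mse = evaluate(train, test, best_k)
print(f"validation MSE per k: {val_scores}")
print(f"chosen k: {best_k}, test MSE: {test_mse:.3f}")
```

The test set is scored only once, after `best_k` is fixed, so the reported test MSE is not biased by the selection step. For classification, the same structure applies with accuracy or a ROC curve in place of MSE.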

Tools for Testing and Integration

Services to Train/Host Model

AWS SageMaker (preconfigured environments) or AWS EC2 (raw compute power, so you need to set up your own environment)

Infrastructure Stuff
