Fix links to references #210

Merged 1 commit on May 18, 2024
13 changes: 13 additions & 0 deletions contents/data_engineering/data_engineering.bib
@@ -79,6 +79,19 @@ @article{gebru2021datasheets
month = nov,
}

@inproceedings{Data_Cascades_2021,
author = {Sambasivan, Nithya and Kapania, Shivani and Highfill, Hannah and Akrong, Diana and Paritosh, Praveen and Aroyo, Lora M},
title = {{{\textquotedblleft}Everyone} wants to do the model work, not the data work{\textquotedblright}: {Data} Cascades in High-Stakes {AI}},
booktitle = {Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems},
pages = {1--15},
year = {2021},
doi = {10.1145/3411764.3445518},
source = {Crossref},
url = {https://doi.org/10.1145/3411764.3445518},
publisher = {ACM},
month = may,
}

@misc{googleinformation,
author = {Google},
bdsk-url-1 = {https://blog.google/documents/83/},
2 changes: 1 addition & 1 deletion contents/data_engineering/data_engineering.qmd
@@ -46,7 +46,7 @@ We begin by discussing data collection: Where do we source data, and how do we g

## Problem Definition

-In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) (see @fig-cascades)—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. In @fig-cascades, we have an illustration of potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. Any lapses in this stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early.
+In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) by (see @fig-cascades)—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. In @fig-cascades, we have an illustration of potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. Any lapses in this stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early.

![Data cascades: compounded costs. Credit: @Data_Cascades_2021.](images/png/data_engineering_cascades.png){#fig-cascades}
