Merge pull request #210 from harvard-edge/155-article-references-consistency

Fix links to references
profvjreddi authored May 18, 2024
2 parents 7803fe9 + dba0233 commit 9b98ea5
Showing 2 changed files with 14 additions and 1 deletion.
13 changes: 13 additions & 0 deletions contents/data_engineering/data_engineering.bib
@@ -79,6 +79,19 @@ @article{gebru2021datasheets
month = nov,
}

@inproceedings{Data_Cascades_2021,
author = {Sambasivan, Nithya and Kapania, Shivani and Highfill, Hannah and Akrong, Diana and Paritosh, Praveen and Aroyo, Lora M},
title = {{{\textquotedblleft}Everyone} wants to do the model work, not the data work{\textquotedblright}: {Data} Cascades in High-Stakes {AI}},
booktitle = {Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems},
pages = {1--15},
year = {2021},
doi = {10.1145/3411764.3445518},
source = {Crossref},
url = {https://doi.org/10.1145/3411764.3445518},
publisher = {ACM},
month = may,
}

@misc{googleinformation,
author = {Google},
bdsk-url-1 = {https://blog.google/documents/83/},
2 changes: 1 addition & 1 deletion contents/data_engineering/data_engineering.qmd
@@ -46,7 +46,7 @@ We begin by discussing data collection: Where do we source data, and how do we g

## Problem Definition

In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) (see @fig-cascades)—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. In @fig-cascades, we have an illustration of potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. Any lapses in this stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early.
In many machine learning domains, sophisticated algorithms take center stage, while the fundamental importance of data quality is often overlooked. This neglect gives rise to ["Data Cascades"](https://research.google/pubs/pub49953/) by (see @fig-cascades)—events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities. In @fig-cascades, we have an illustration of potential data pitfalls at every stage and how they influence the entire process down the line. The influence of data collection errors is especially pronounced. Any lapses in this stage will become apparent at later stages (in model evaluation and deployment) and might lead to costly consequences, such as abandoning the entire model and restarting anew. Therefore, investing in data engineering techniques from the onset will help us detect errors early.

![Data cascades: compounded costs. Credit: @Data_Cascades_2021.](images/png/data_engineering_cascades.png){#fig-cascades}
