Skip to content

Latest commit

 

History

History
25 lines (19 loc) · 1.29 KB

README.md

File metadata and controls

25 lines (19 loc) · 1.29 KB

elon_musk_dataset

Preparing datasets with Elon Musk phrases.

Preparing

  1. Install poetry
  2. Install dependencies: poetry install
  3. Activate virtual environment: poetry shell

Scrapping

To scrap data from rev.com use command like that:
PYTHONPATH=. python scrap/rev.py --base_url https://www.rev.com/blog/transcripts?s=Elon+Musk --save_path rev-interviews.jsonlines

From vox.com:
PYTHONPATH=. python scrap/vox.py --interview_url https://www.vox.com/2018/11/2/18053428/recode-decode-full-podcast-transcript-elon-musk-tesla-spacex-boring-company-kara-swisher --save_path vox-interview.jsonlines

Collecting data

By script

PYTHONPATH=. python collect_data/run.py collect-all-dialogs --interview-paths=rev-interviews.jsonlines,vox-interview.jsonlines --save-path=all_dialog_dataset.csv

PYTHONPATH=. python collect_data/run.py collect-short-answer-dialogs --interview-paths=rev-interviews.jsonlines,vox-interview.jsonlines --save-path=short_answer_dialog_dataset.csv

PYTHONPATH=. python collect_data/run.py collect-all-phrases --interview-paths=rev-interviews.jsonlines,vox-interview.jsonlines --save-path=all_phrases_dataset.csv

By ipynb

  • notebooks/create_twitter_dataset.ipynb
  • notebooks/create_oneliners_dataset.ipynb