Build a system that automatically answers students’ questions posted on Piazza.
We will use GPT to answer students’ questions. To make GPT’s answers more accurate, we will send along “context” that helps GPT answer each question. This context is one or more chunks of the transcriptions of Prof. Kaiser’s lectures.
Data preparation
- Transcription
- Transcribe Prof. Kaiser’s lectures using Whisper from OpenAI. Result = a lot of text files
- Prepare a database of vectors (vectors = embeddings of text chunks)
- Split all of the text into chunks
- Embed all those chunks into vectors using SBERT (in the code, you will see HuggingFaceEmbeddings, but it is just a wrapper around SBERT)
- Save those vectors in the Chroma database (a minimal sketch of this whole pipeline follows this list)
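A minimal sketch of the pipeline above, using the classic LangChain imports that match the docs linked below. The file names, Whisper model size, chunk parameters, and SBERT model name are illustrative assumptions, not fixed project choices:

```python
import whisper
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# 1. Transcribe a lecture recording with Whisper ("base" is an assumed model size).
model = whisper.load_model("base")
transcript = model.transcribe("lecture_01.mp3")["text"]  # hypothetical file name

# 2. Split the transcript into overlapping chunks (sizes are assumptions).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = splitter.create_documents(
    [transcript],
    metadatas=[{"source": "lecture_01"}],  # "with sources" chains need this field
)

# 3. Embed the chunks with SBERT (HuggingFaceEmbeddings wraps sentence-transformers)
#    and persist the vectors in a Chroma database on disk.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
db.persist()
```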
Real-time response
- Fetch a student’s question from Piazza
- Embed the question text into a vector (call it vector Q) using the same SBERT model
- Perform a similarity search: find the k nearest neighbors to vector Q among the vectors saved in Chroma (by default, k = 4)
- Send the question, along with the text associated with the closest vector(s) (the context), to GPT
- Post GPT’s answer back to Piazza
- Sometimes the chunk that contains the answer is not the nearest neighbor of vector Q, e.g., the answer is in the second-nearest chunk. -> We can mitigate this by sending more than one context chunk to GPT.
- It is hard to know how to split the transcriptions (what chunk size/overlap to use). -> We can split the texts with several different parameter settings, then embed and store all of the resulting chunks in the database. (Certain sentences will then appear in multiple vectors.) Both mitigations are sketched after this list.
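A sketch of both mitigations, assuming the same setup as the data-preparation sketch: the same transcript is split under several (chunk_size, chunk_overlap) settings and indexed into one Chroma collection, and at query time we fetch the k = 4 nearest chunks instead of one. The parameter values and the example question are assumptions:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

transcript = open("lecture_01.txt").read()  # transcript from the data-preparation step
embeddings = HuggingFaceEmbeddings()

# Split the same text several ways; some sentences will land in multiple
# chunks, which is intended redundancy.
split_params = [(500, 50), (1000, 100), (2000, 200)]  # assumed (size, overlap) pairs
all_docs = []
for chunk_size, overlap in split_params:
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap)
    all_docs.extend(
        splitter.create_documents([transcript], metadatas=[{"source": "lecture_01"}])
    )

db = Chroma.from_documents(all_docs, embeddings, persist_directory="./chroma_db")

# Mitigation for the nearest-neighbor miss: retrieve the 4 nearest chunks
# instead of only the single closest one, and send all of them to GPT.
contexts = db.similarity_search("What did Prof. Kaiser say about deadlocks?", k=4)
```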
Transcription - https://github.com/openai/whisper/blob/main/whisper/transcribe.py
Question Answering - https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
- The current code uses VectorDBQAWithSourcesChain, which is deprecated. Please change it to RetrievalQAWithSourcesChain, as sketched below.
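A sketch of the requested change, based on the classic LangChain API from the linked page. The LLM choice, model name, and k are assumptions:

```python
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

db = Chroma(persist_directory="./chroma_db", embedding_function=HuggingFaceEmbeddings())
llm = ChatOpenAI(model_name="gpt-3.5-turbo")  # assumed model

# Deprecated: VectorDBQAWithSourcesChain.from_chain_type(llm, chain_type="stuff", vectorstore=db)
# Replacement: wrap the vector store in a retriever; search_kwargs["k"] controls
# how many context chunks are stuffed into the prompt sent to GPT.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 4}),
)
result = chain({"question": "What did Prof. Kaiser say about deadlocks?"})
# result["answer"] holds GPT's answer; result["sources"] lists the chunks' sources.
```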
- Piazza has no official API, so we will use an unofficial one (https://github.com/hfaran/piazza-api/tree/develop) to fetch questions from Piazza and post answers back to it; a sketch follows.
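A sketch of the Piazza round-trip with the unofficial piazza-api library, reusing the `chain` built in the previous sketch. The credentials, network ID, and post-filtering logic are assumptions; the method names (user_login, network, iter_all_posts, create_instructor_answer) come from the library's README, so verify them against the installed version:

```python
from piazza_api import Piazza

p = Piazza()
p.user_login(email="bot@example.com", password="...")  # hypothetical credentials
course = p.network("abc123")  # network ID taken from the course's Piazza URL

# Fetch recent posts and answer the ones that are student questions.
for post in course.iter_all_posts(limit=20):
    if post.get("type") != "question":
        continue
    question_text = post["history"][0]["content"]  # most recent revision of the question
    answer = chain({"question": question_text})["answer"]  # QA chain from the sketch above
    # Post GPT's answer back to Piazza as an instructor answer.
    course.create_instructor_answer(post, answer, revision=0)
```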