1. Insert your data into the TReviewsUsefulInfo.py file, filling in the fields that the other files read from it; these values supply the data needed by most of the functions in the scripts (a purely illustrative example follows the listing below).
import re
import multiprocessing
def getCompanyNameFromURL(url):  # Get the company name by reading the last part of the input URL
    url = url.split("/")[-1]
    regex = r'(?:https?://)?(?:www\.)?([a-zA-Z0-9-]+)\.'
    match = re.search(regex, url)
    if match:
        return match.group(1)
    return None
tLinksList = {"TRUSTPILOT LINK": [10, 20, 30, 40, 50, ...]}  # Trustpilot link and the corresponding page intervals at which to stop and save
idx = 0
TrustpilotLink = list(tLinksList.keys())[idx]
TrustpilotLinkIntervals = list(tLinksList.values())[idx]
TrustpilotCompanyName = getCompanyNameFromURL(TrustpilotLink)
CPUsNumber = multiprocessing.cpu_count()
seleniumExecutablePath = "YOUR PATH/chromedriver.exe"
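# Below is a purely hypothetical example of how these fields might be filled in; the URL,
# intervals and resulting names are illustrative only and do not come from the project.
tLinksList = {"https://www.trustpilot.com/review/example.com": [10, 20, 30, 40, 50]}
idx = 0
TrustpilotLink = list(tLinksList.keys())[idx]                   # "https://www.trustpilot.com/review/example.com"
TrustpilotLinkIntervals = list(tLinksList.values())[idx]        # [10, 20, 30, 40, 50]
TrustpilotCompanyName = getCompanyNameFromURL(TrustpilotLink)   # "example"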
4. Set the level of multiprocessing you want by adjusting the CPULevel parameter of the readHTMLAndDBInsert() function.
This determines how many logical processors your CPU will use, i.e. how many HTML files are parsed at a time. Be aware that this parameter must be between 0 and 1; a worked example follows below.
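For example, assuming a machine with 8 logical processors (a hypothetical figure), the default CPULevel of 0.50 gives ceil(8 * 0.50) = 4, so 4 HTML files would be parsed in parallel per group.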
1. The processing power and memory required to run this code are directly proportional to the amount of data you'll be scraping, parsing and analyzing afterwards.
2. With thousands of pages the process can be a bit slow, so don't worry if the BeautifulSoup or Selenium functions don't run very fast: they may be dealing with gigabytes of data.
3. IMPORTANT: the HTML tags of Trustpilot's pages can change over time, so if the code suddenly fails to parse the HTML, that is the most likely cause.
4. Keep an eye on the countryCodesISO3166.csv file containing the ISO 3166 country codes. There are many versions of this file around; I took one from the web and added some missing countries. Make sure none are missing, otherwise the insertion of the data into the db won't be possible.
6. Do not delete the intervals.txt files, otherwise you might not be able to parse all of the HTML files.
7. Do not run the db insertion more than once without clearing the database first, otherwise you'll be trying to insert duplicate rows.
8. The roBERTa model can sometimes raise errors because the text is too long, so do not edit the try-except construct there (a rough illustration of this kind of guard is sketched below).
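For reference only, here is a minimal, hypothetical sketch of the kind of guard note 8 refers to; it is not the project's actual code, and the model name is an assumption (a commonly used RoBERTa sentiment model), not necessarily the one used here.

from transformers import pipeline

# model name is an assumption, not necessarily the one used by the project
robertaPipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

def safeRobertaScore(reviewText):
    """Return the roBERTa sentiment result, or None if the text cannot be processed."""
    try:
        # truncation=True keeps the input within the model's 512-token limit
        return robertaPipeline(reviewText, truncation=True)[0]
    except Exception as error:
        # very long or otherwise problematic reviews are skipped instead of crashing the run
        print(f"roBERTa failed on a review: {error}")
        return None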
This code was designed to extract and analyze any Trustpilot page and to export useful data together with a complete Sentiment Analysis based on two different approaches: Bag of Words (VADER) and Transformers (roBERTa).
A Trustpilot user (which can represent a company, a shop, etc.) has a specific part of its page dedicated to showing information about itself.
A Trustpilot user may have one or more pages containing reviews, each with details about the person who left the review.
A person can leave more than one review for the same user; in that case the older ones are hidden behind a "Read n More Reviews About COMPANY_NAME" link.
First of all, set in TReviewsUsefulInfo.py the link you want to extract data from and the intervals that define how many chunks the whole mass of reviews will be split into when it is saved.
Keep in mind that the program will automatically append the last page number to the intervals list, so it knows where to stop scraping.
NOTE: the intervals may vary in case of errors or other events that prevent Selenium from working. These errors are caught by a try-except, and the last page scraped before the error was raised is also added as an interval.
The intervals mechanism is used simply to reduce the size of each single HTML file and to let multiprocessing parse more than one of them at a time, speeding up the process; a rough sketch of the idea follows.
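As a purely illustrative sketch (not the project's scraper), this is roughly how an interval list can split the scraped pages into smaller HTML files; the URL, page count and file names below are assumptions.

from selenium import webdriver

TrustpilotLink = "https://www.trustpilot.com/review/example.com"  # hypothetical link
intervals = [10, 20, 30]
lastPageNumber = 37
intervals.append(lastPageNumber)  # the last page is appended so the loop knows where to stop

driver = webdriver.Chrome()
htmlBuffer = ""
for pageNumber in range(1, lastPageNumber + 1):
    driver.get(f"{TrustpilotLink}?page={pageNumber}")
    htmlBuffer += driver.page_source
    if pageNumber in intervals:
        # flush the pages accumulated so far into one smaller HTML file per interval
        with open(f"example_{pageNumber}.html", "w", encoding="utf-8") as chunkFile:
            chunkFile.write(htmlBuffer)
        htmlBuffer = ""
driver.quit()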
Once you have gathered all of the HTML files in a folder named after the Trustpilot user, you can start parsing them with BeautifulSoup4 and saving the data into a DB.
Specifically, the number of processes is equal to the number of HTML files. Each process will parse a specific HTML file.
If the number of files is greater than a threshold, the program gathers them into groups and runs one group at a time.
The threshold is derived from a fraction you can set (0.50 by default): the number of logical processors is multiplied by it and rounded up with a ceiling, and the result is the number of processes in a group (see the sketch after the snippet below).
import math
import multiprocessing

CPUsNumber = multiprocessing.cpu_count()

def readHTMLAndDBInsert(self, CPULevel=0.50):
    optimalProcessesNumber = math.ceil(CPUsNumber * CPULevel)  # processes per group
    ...
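# The following is a simplified, hypothetical illustration of the grouping described above --
# it is not the project's implementation. ceil(CPUs * CPULevel) processes run per batch, one
# per HTML file; parseHTMLFile is a made-up stand-in for the real parsing and DB insert.
import math
import multiprocessing

def parseHTMLFile(filePath):
    # stand-in for the real per-file work (BeautifulSoup parsing followed by the DB insert)
    print(f"parsing {filePath}")

if __name__ == "__main__":
    htmlFiles = [f"chunk_{i}.html" for i in range(1, 11)]  # e.g. 10 HTML files to parse
    groupSize = math.ceil(multiprocessing.cpu_count() * 0.50)

    for start in range(0, len(htmlFiles), groupSize):
        group = htmlFiles[start:start + groupSize]
        processes = [multiprocessing.Process(target=parseHTMLFile, args=(path,)) for path in group]
        for process in processes:
            process.start()
        for process in processes:
            process.join()  # wait for the whole group before launching the next one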