Implement Tor middleware for anonymous data scraping
I'm not responsible for any illegal usage of this knowledge. Please use it responsibly.
Before starting this tutorial, go ahead and read the Scrapy documentation: https://doc.scrapy.org/en/latest/
This isn't a Scrapy tutorial.
- You must install nc (netcat) and the Tor service on your GNU/Linux operating system, then edit /etc/tor/torrc to add a control port and a password.
Install Privoxy so that HTTP and HTTPS traffic can go through Tor's SOCKS proxy.
UBUNTU GUIDE:
You need to add two directives to the /etc/tor/torrc file:
ControlPort 9051
HashedControlPassword **hash
You can use this command to get the hash from your password-string:
tor --hash-password yourpassword
You will see a string like this: 16:C27F9535B4F417F96064BD3762593271BA9B883AAC42888CFB86B0EBA7
Copy this string and put it in place of **hash, right after the HashedControlPassword directive.
Restart the Tor service.
Now the Tor service is waiting for your command, Sir! :D
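If you want a quick sanity check before restarting Tor, something like the sketch below can tell you which of the two directives are still missing. It is only an illustration: the function name `missing_control_directives` is made up for this example, and it assumes you read the torrc text yourself (for example from /etc/tor/torrc).

```python
def missing_control_directives(torrc_text: str) -> list[str]:
    """Return the required directives that are absent from the torrc text.

    Hypothetical helper for this tutorial, not part of Tor or Scrapy.
    """
    required = ["ControlPort", "HashedControlPassword"]
    present = set()
    for raw_line in torrc_text.splitlines():
        # Drop everything after '#', so commented-out directives do not count.
        line = raw_line.split("#", 1)[0].strip()
        if line:
            # The first word of an active line is the directive name.
            present.add(line.split()[0].lower())
    return [name for name in required if name.lower() not in present]
```

An empty list means both directives are active; anything else is the list of names you still need to add or uncomment.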
- Then open Tor.py in the middlewares directory and find the line containing this command:
os.system("""(echo authenticate '"yourpassword"'; echo signal newnym; echo quit) | nc localhost 9051""")
Replace yourpassword with your password :D.
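The nc one-liner above just writes three control-port commands to localhost:9051. If you would rather not shell out, the same exchange can probably be done with a plain socket, roughly like this. This is a sketch, not the code that ships in Tor.py: the function names `build_newnym_commands` and `request_new_identity` are made up here, and the host and port defaults simply mirror the command above.

```python
import socket


def build_newnym_commands(password: str) -> bytes:
    """Build the same three commands the nc one-liner sends."""
    # Escape backslashes and double quotes so the password stays one quoted string.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    lines = [f'AUTHENTICATE "{escaped}"', "SIGNAL NEWNYM", "QUIT"]
    # Control-port commands are line based; terminate each one with CRLF.
    return "".join(line + "\r\n" for line in lines).encode("utf-8")


def request_new_identity(password: str, host: str = "127.0.0.1", port: int = 9051) -> str:
    """Send the commands to the Tor control port and return the raw reply."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_newnym_commands(password))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # Tor closes the connection after QUIT
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling `request_new_identity("yourpassword")` should return one reply line per command; when everything works, each of them starts with 250.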
- After installing and preparing the tools, create the Scrapy project. I changed the project structure for more efficiency and ease of training. Changes:
mkdir middlewares
mv middlewares.py middlewares/
mkdir pipelines
mv pipelines.py pipelines/
Final structure:
└── tutorial
    ├── scrapy.cfg
    └── tutorial
        ├── __init__.py
        ├── items.py
        ├── middlewares
        │   ├── middlewares.py
        │   └── Tor.py
        ├── pipelines
        │   └── pipelines.py
        ├── __pycache__
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── __pycache__
- Change settings.py: uncomment the DOWNLOADER_MIDDLEWARES section and add the Tor middleware path like this:
DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.Tor.TorMiddleware': 100,
}
- We need one more thing in settings.py, which is setting the HTTP_PROXY variable (8118 is Privoxy's default listening port):
HTTP_PROXY = 'http://127.0.0.1:8118'
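To give an idea of how a downloader middleware could use the HTTP_PROXY setting, here is a minimal sketch. It is an assumption about the general shape, not the actual contents of Tor.py: it only points each request at Privoxy by setting `request.meta["proxy"]`, which is the key Scrapy reads the proxy from, and it leaves out the identity change entirely.

```python
class TorMiddleware:
    """Illustrative downloader middleware that routes requests through Privoxy.

    A sketch only; the real Tor.py in this project may look different.
    """

    def __init__(self, http_proxy):
        self.http_proxy = http_proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler (when defined) to build the middleware;
        # crawler.settings exposes the values from settings.py.
        return cls(crawler.settings.get("HTTP_PROXY", "http://127.0.0.1:8118"))

    def process_request(self, request, spider):
        # Point this request at the local Privoxy instance.
        request.meta["proxy"] = self.http_proxy
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

No Scrapy import is needed here because downloader middlewares are plain classes; Scrapy finds them through the dotted path in DOWNLOADER_MIDDLEWARES.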
- Run the test_ip spider to see how your IP changes.
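If you want to check the Privoxy and Tor chain on its own, outside Scrapy, a standard-library sketch like the one below may help. It is not the test_ip spider. The function names are made up for this example, and the https://httpbin.org/ip endpoint (which answers with a small JSON body containing an "origin" field) is just one assumed choice of IP echo service.

```python
import json
import urllib.request


def parse_origin(body: bytes) -> str:
    """Extract the caller's IP from an httpbin-style JSON body."""
    return json.loads(body.decode("utf-8"))["origin"]


def fetch_public_ip(proxy: str = "http://127.0.0.1:8118",
                    url: str = "https://httpbin.org/ip") -> str:
    """Fetch the public IP as seen through the given HTTP proxy."""
    # Send both http and https traffic through the same local proxy.
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=30) as response:
        return parse_origin(response.read())
```

Calling `fetch_public_ip()` before and after a newnym signal should, most of the time, print two different addresses, neither of them your real one.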
Good Luck