
Implement Tor middleware for anonymous data scraping

Mohammad Reza Tavakoli Kousha edited this page Jun 30, 2019 · 10 revisions

I'm not responsible for any illegal usage of this knowledge. Please use it responsibly.

NOTE THAT:

Before starting this tutorial, read the Scrapy documentation.

Scrapy documentation: https://doc.scrapy.org/en/latest/

This isn't a scrapy tutorial.

  1. Install nc (netcat) and the Tor service on your GNU/Linux operating system, then edit /etc/tor/torrc to add a control port and a password.

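Install Privoxy to get an HTTP and HTTPS proxy in front of Tor's SOCKS port. Privoxy listens on 127.0.0.1:8118 by default and usually needs a forwarding line such as "forward-socks5t / 127.0.0.1:9050 ." in /etc/privoxy/config so that traffic is sent on to Tor.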

UBUNTU GUIDE:

You need to add two directives to the /etc/tor/torrc file:

	ControlPort 9051
	HashedControlPassword **hash

Use this command to generate the hash for your password:

	tor --hash-password yourpassword

You will see a string like this: 16:C27F9535B4F417F96064BD3762593271BA9B883AAC42888CFB86B0EBA7

Copy this string and put it in place of **hash after the HashedControlPassword directive.

Restart the Tor service.

Now the Tor service is waiting for your command, Sir! :D

  1. Then open Tor.py in the middlewares directory and find the line containing this command:

    os.system("""(echo authenticate '"yourpassword"'; echo signal newnym; echo quit) | nc localhost 9051""")

and replace yourpassword with your password :D.
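
Shelling out to nc works, but the same three control-port commands can also be sent from Python directly. The following is a minimal sketch of what a middleware like Tor.py might contain, not the project's actual file: the class layout, the helper names, and the TOR_CONTROL_PASSWORD and TOR_CONTROL_PORT setting names are assumptions made for illustration. It relies only on duck typing (crawler.settings.get and request.meta), so it needs no Scrapy import.

```python
import socket


def build_newnym_commands(password):
    """Return the control-port bytes equivalent to the nc one-liner."""
    # Escape backslashes and double quotes so the password stays a valid quoted string.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    lines = ['AUTHENTICATE "%s"' % escaped, "SIGNAL NEWNYM", "QUIT"]
    return "".join(line + "\r\n" for line in lines).encode("utf-8")


def renew_tor_identity(password, host="127.0.0.1", port=9051, timeout=10):
    """Ask Tor for a new circuit and return the raw replies as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_newnym_commands(password))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # Tor closes the connection after QUIT
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


class TorMiddleware:
    """Hypothetical downloader middleware: renew the identity, then proxy the request."""

    def __init__(self, proxy, password, control_port=9051):
        self.proxy = proxy
        self.password = password
        self.control_port = control_port

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy=settings.get("HTTP_PROXY"),
            password=settings.get("TOR_CONTROL_PASSWORD"),  # assumed setting name
            control_port=int(settings.get("TOR_CONTROL_PORT", 9051)),  # assumed setting name
        )

    def process_request(self, request, spider):
        # Request a fresh circuit, then send the request through the HTTP proxy.
        renew_tor_identity(self.password, port=self.control_port)
        request.meta["proxy"] = self.proxy
        return None
```

Tor rate-limits NEWNYM signals, so a real middleware would probably renew the identity every N requests or every few seconds rather than on every request as this sketch does.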

  1. After installing and preparing the tools, create the Scrapy project. I changed the project structure to keep it better organized and easier to learn from. Changes:
		mkdir middlewares
		mv middlewares.py middlewares/
		
		mkdir pipelines
		mv pipelines.py pipelines/
	final structure:
		└── tutorial
		    ├── scrapy.cfg
		    └── tutorial
			├── __init__.py
			├── items.py
			├── middlewares
			│   ├── middlewares.py
			│   └── Tor.py
			├── pipelines
			│   └── pipelines.py
			├── __pycache__
			├── settings.py
			└── spiders
			    ├── __init__.py
			    └── __pycache__
  1. Change settings.py: uncomment the DOWNLOADER_MIDDLEWARES section and add the import path of the Tor middleware, like this:
	DOWNLOADER_MIDDLEWARES = {
	   'tutorial.middlewares.Tor.TorMiddleware': 100,
	}
  1. One more setting is needed in settings.py: the HTTP_PROXY variable, which points at the local Privoxy instance.
	HTTP_PROXY = 'http://127.0.0.1:8118'
  1. Run the test_ip spider to see how your IP changes.
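
The test_ip spider itself is not reproduced on this page. As a rough stand-in, the sketch below fetches your apparent IP through the Privoxy proxy using only the standard library. The lookup URL (https://check.torproject.org/api/ip, assumed to return JSON with an "IP" field) and the function names are assumptions for illustration; any "what is my IP" service that returns JSON could be swapped in.

```python
import json
import urllib.request

PROXY = "http://127.0.0.1:8118"  # same value as HTTP_PROXY in settings.py
IP_URL = "https://check.torproject.org/api/ip"  # assumed lookup service


def make_proxy_opener(proxy=PROXY):
    """Build a urllib opener that sends HTTP and HTTPS traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def parse_ip(body):
    """Extract the address from a JSON body such as {"IsTor": true, "IP": "1.2.3.4"}."""
    return json.loads(body).get("IP")


def current_ip(opener=None, url=IP_URL, timeout=30):
    """Return the IP address the remote service sees for this machine."""
    opener = opener or make_proxy_opener()
    with opener.open(url, timeout=timeout) as response:
        return parse_ip(response.read().decode("utf-8"))
```

Calling current_ip() once, requesting a new identity on the control port, and calling it again should normally print two different addresses, although Tor may occasionally reuse the same exit node.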

Good Luck
