Onion Harvester: First step to TOR Search Engines

Knowing every reachable web address is the first step in building a search engine (SE). With an SE, one can search the web for whatever material one is interested in. In the normal Domain Name System, each TLD (Top Level Domain) provider can sell or release a list of all its domains. For example, the .com TLD provider can sell or release the list of all domains that end with “.com”. The problem is more complicated in TOR (or other hidden service providers). In this post I will talk about my tool, Onion Harvester, and about how to find the starting points from which hidden services can be crawled.

TOR Network

I have investigated how to find all onion addresses. My question on the TOR Stack Exchange site can be found at this link. In conclusion, there are two ways to find all the onion addresses, which are the starting points for crawling and building a search engine.

  1. Run a Hidden Service Directory (HSDir). A hidden service publishes its address through 6 of the 9 HSDirs so that users who try to connect can find it.
  2. Brute force the whole address space, which takes time that grows exponentially with the address length.

The first method fails because the 9 HSDirs are controlled by the TOR network itself. You can check the status of the 9 HSDirs at this link. Detailed information about them can be found at this link. Therefore you cannot add yourself as an HSDir without verification by the TOR developers. In addition, as defalt answered my question on the TOR Stack Exchange site:

“Harvesting onion addresses has been fixed in Next Generation Tor Onion Services so you can’t fetch list of running onion services by hosting your own HSDir anymore.”

From another point of view, TOR is open source, so you can run your own TOR network and add your own HSDirs to it. But then users would have to use your network instead of TOR!

My Own Tor Network

See this link to find out how to run your own TOR network.

In general, though, you can harvest the whole onion address space on specific ports to check whether a service is open or not. I have developed a small multithreaded Java application named “Onion Harvester” to harvest onion addresses. It generates onion addresses from “aaaaaaaaaaaaaaaa.onion” to “7777777777777777.onion”. The address is base-32: it uses the letters a to z and the digits 2 to 7 (every digit except 0, 1, 8 and 9). In total there are (26 + 6)^{16} = 32^{16} = 1208925819614629174706176 addresses!
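
The enumeration described above can be sketched as a base-32 counter over 16-character labels. This is a minimal illustration, not the code from the Onion Harvester repository: the class name `OnionAddressSpace` and the methods `total` and `next` are hypothetical names chosen for this example, and the alphabet is assumed to be ordered a to z followed by 2 to 7.

```java
import java.math.BigInteger;

class OnionAddressSpace {
    // Assumed ordering: lowercase base-32 alphabet, a-z followed by 2-7.
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz234567";
    static final int LENGTH = 16;

    /** Number of candidate labels: 32^16. */
    public static BigInteger total() {
        return BigInteger.valueOf(ALPHABET.length()).pow(LENGTH);
    }

    /** Returns the label after the given one, or null once the last label is passed. */
    public static String next(String label) {
        char[] chars = label.toCharArray();
        // Increment like an odometer, starting from the rightmost character.
        for (int i = chars.length - 1; i >= 0; i--) {
            int digit = ALPHABET.indexOf(chars[i]);
            if (digit < 0) {
                throw new IllegalArgumentException("not a base-32 character: " + chars[i]);
            }
            if (digit < ALPHABET.length() - 1) {
                chars[i] = ALPHABET.charAt(digit + 1);
                return new String(chars);
            }
            chars[i] = ALPHABET.charAt(0); // '7' rolls over to 'a'; carry to the left
        }
        return null; // every position rolled over, so the input was the last label
    }

    public static void main(String[] args) {
        System.out.println("total addresses: " + total());
        String label = "aaaaaaaaaaaaaaaa";
        for (int i = 0; i < 3 && label != null; i++) {
            System.out.println(label + ".onion");
            label = next(label);
        }
    }
}
```

Because each label is derived only from the previous one, a scan can be split across threads by giving each thread its own starting label, and a single stored label is enough to resume later.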

Although it is a small program, I’ve added some flexibility to it:

  1. If the program receives an exit signal, it stores the next onion address, from which the program should resume.
  2. I’ve added a switch (–start) to resume the scanning.
  3. I’ve added a configurable local TOR socks5 address. By default, the Tor Browser Bundle uses 127.0.0.1:9150 as its socks5 proxy, and the tor binary on Linux uses 127.0.0.1:9050.
  4. The project is open source, and you can find it in the repository on Mr Tajbakhsh’s GitHub account.
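
The first three points can be sketched with the standard `java.net` and `java.nio.file` classes. This is only an illustration under stated assumptions, not the actual Onion Harvester source: the class `OnionProbe`, its method names, the checkpoint file name `onion-harvester.next`, and the exact spelling `--start` of the switch are all hypothetical choices made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicReference;

class OnionProbe {
    /** Parses "host:port", for example "127.0.0.1:9050", into a SOCKS proxy description. */
    public static Proxy socksProxy(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
    }

    /** Returns true if a TCP connection to label.onion:port succeeds through the proxy. */
    public static boolean isOpen(Proxy proxy, String label, int port, int timeoutMillis) {
        try (Socket socket = new Socket(proxy)) {
            // The address is left unresolved so that the proxy (Tor), not local DNS, resolves it.
            socket.connect(InetSocketAddress.createUnresolved(label + ".onion", port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // refused, timed out, or the proxy itself is unreachable
        }
    }

    /** Returns the value after "--start" (assumed spelling), or fallback if it is absent. */
    public static String startLabel(String[] args, String fallback) {
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("--start")) return args[i + 1];
        }
        return fallback;
    }

    /** Writes the label to resume from into the checkpoint file. */
    public static void saveCheckpoint(Path file, String nextLabel) {
        try {
            Files.write(file, nextLabel.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the label stored by saveCheckpoint. */
    public static String loadCheckpoint(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path checkpoint = Paths.get("onion-harvester.next"); // hypothetical file name
        String saved = Files.exists(checkpoint) ? loadCheckpoint(checkpoint) : "aaaaaaaaaaaaaaaa";
        AtomicReference<String> next = new AtomicReference<>(startLabel(args, saved));
        // On JVM shutdown (for example Ctrl+C), persist the label to resume from.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> saveCheckpoint(checkpoint, next.get())));
        Proxy proxy = socksProxy("127.0.0.1:9050"); // use 9150 for the Tor Browser Bundle
        System.out.println(next.get() + ".onion:80 open: " + isOpen(proxy, next.get(), 80, 30_000));
    }
}
```

A real scanner would call `isOpen` from a pool of worker threads and update `next` as labels are consumed; the single probe in `main` only shows how the pieces fit together.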

Onion Harvester in Action:

Onion Harvester in Action

You may fork or use the project to help me create the onion database. If you want to contribute, contact me at saman [@] mstajbakhsh [.] ir

2 Comments

  1. Diego

    Sorry for my bad English. Your project seems very interesting but I would like to understand 2 things: 1) why did not you include some numbers? 2) but now, with the new version of Tor, the string size is increased by many characters, do you think to support it or is the keyword spectrum too wide to waste time?

    • Hello
      Thanks for your interest in OH. For no. 1, it is because onion addresses are base32, and the address space has 32 characters (the 26 letters from a to z and the 6 digits from 2 to 7).
      For no. 2, if the addresses grow in length, the next version of OH will be handy: in it, I start a custom number of Tor clients on a single machine, and each of them works with a custom number of threads, which increases the speed. Actually, it will take some years to scan the whole address space, and it is useful for persistent addresses, not temporary ones.
