A crawler for the Gemini network. It can easily be extended into a "wayback machine" for Gemini.
- Save image/* and text/* files
- Concurrent downloading with configurable number of workers
- Connection limit per host
- URL Blacklist
- Follow robots.txt, see gemini://geminiprotocol.net/docs/companion/robots.gmi
- Configuration via environment variables
- Storing capsule snapshots in PostgreSQL
- Validation of response headers and bodies (UTF-8 encoding and format)
- Proper URL normalization
- Handle redirects (3X status codes)
- Crawl Gopher holes
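As an illustration of what URL normalization can involve, here is a minimal sketch. It is not the crawler's own code: `normalizeGeminiURL` is a hypothetical helper, and the chosen rules (lowercase the host, drop the default Gemini port 1965, use `/` for an empty path, strip the fragment) are assumptions about what a reasonable normalizer might do.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeGeminiURL is a hypothetical helper (not taken from the crawler).
// It returns a canonical form of an absolute URL so that equivalent
// URLs compare equal and are stored only once.
func normalizeGeminiURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("not an absolute URL: %q", raw)
	}
	u.Scheme = strings.ToLower(u.Scheme)

	// Hostname() strips the port and any IPv6 brackets.
	host := strings.ToLower(u.Hostname())
	if strings.Contains(host, ":") {
		host = "[" + host + "]" // restore brackets for IPv6 literals
	}
	port := u.Port()
	// Assumption: only the Gemini default port (1965) is dropped.
	if port == "" || (u.Scheme == "gemini" && port == "1965") {
		u.Host = host
	} else {
		u.Host = host + ":" + port
	}

	if u.Path == "" {
		u.Path = "/"
	}
	// Fragments are never sent to the server, so they are discarded.
	u.Fragment = ""
	u.RawFragment = ""
	return u.String(), nil
}

func main() {
	for _, raw := range []string{
		"GEMINI://Example.ORG:1965",
		"gemini://example.org:1966/a/b?q=1#frag",
	} {
		n, err := normalizeGeminiURL(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", n)
	}
}
```

Gopher URLs pass through the same function, but this sketch does not strip their default port.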
Spin up a PostgreSQL instance, use db/sql/initdb.sql
to create the tables, and start the crawler.
All configuration is done via environment variables.
Boolean values can be given as true/false or 1/0.
LogLevel string // Logging level (debug, info, warn, error)
MaxResponseSize int // Maximum size of response in bytes
NumOfWorkers int // Number of concurrent workers
ResponseTimeout int // Timeout for responses in seconds
WorkerBatchSize int // Batch size for worker processing
PanicOnUnexpectedError bool // Panic on unexpected errors when visiting a URL
BlacklistPath string // File that has blacklisted strings of "host:port"
DryRun bool // If true, don't write to disk
PrintWorkerStatus bool // If false, print logs instead of the worker status table
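The following is a minimal sketch of how fields like these could be filled from environment variables. It is not the crawler's actual loader. `LOG_LEVEL` and `DRY_RUN` appear in this README, while `NUM_OF_WORKERS`, the default values, and all helper names are assumptions made for illustration. Booleans go through `strconv.ParseBool`, which accepts `true`/`false` and `1`/`0`, among other spellings.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors a subset of the fields listed above.
type Config struct {
	LogLevel     string
	NumOfWorkers int
	DryRun       bool
}

// lookupFunc has the same shape as os.LookupEnv, so a map-backed
// stub can be substituted in tests.
type lookupFunc func(key string) (string, bool)

// envString returns the variable's value, or def when unset or empty.
func envString(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// envInt parses a decimal integer, or returns def when unset or empty.
func envInt(lookup lookupFunc, key string, def int) (int, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return def, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: invalid integer %q", key, v)
	}
	return n, nil
}

// envBool parses a boolean, or returns def when unset or empty.
func envBool(lookup lookupFunc, key string, def bool) (bool, error) {
	v, ok := lookup(key)
	if !ok || v == "" {
		return def, nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("%s: invalid boolean %q", key, v)
	}
	return b, nil
}

// loadConfig is a hypothetical loader. NUM_OF_WORKERS and the
// defaults ("info", 1, false) are assumed, not taken from the crawler.
func loadConfig(lookup lookupFunc) (Config, error) {
	var cfg Config
	var err error
	cfg.LogLevel = envString(lookup, "LOG_LEVEL", "info")
	if cfg.NumOfWorkers, err = envInt(lookup, "NUM_OF_WORKERS", 1); err != nil {
		return Config{}, err
	}
	if cfg.DryRun, err = envBool(lookup, "DRY_RUN", false); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Rejecting malformed values at startup, rather than silently falling back to a default, makes a typo such as `DRY_RUN=flase` fail fast instead of unexpectedly writing to disk.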
# BLACKLIST_PATH: one URL per line; the file can be empty
LOG_LEVEL=info \
BLACKLIST_PATH="./blacklist.txt" \
PG_PORT=5434 \
PG_USER=test \
DRY_RUN=false \
Install linters. Check the versions first.
go install mvdan.cc/gofumpt@v0.7.0
go install github.com/golangci/golangci-lint/cmd/golangci-lint@v1.63.4
- Add snapshot history
- Add a web interface
- Provide a TLS client certificate to servers that require one, such as Astrobotany
- Use pledge/unveil on OpenBSD hosts
- More protocols? http://dbohdan.sdf.org/smolnet/
Good starting points:
gemini://warmedal.se/~antenna/
gemini://tlgs.one/
gopher://i-logout.cz:70/1/bongusta/
gopher://gopher.quux.org:70/