Features
- Concurrent
- Well suited to vertical (domain-specific) crawling
- Flexible, Modular
- Native Go implementation
- Easily extended into a customized crawler
Requirements
- Go 1.2 or higher
Documentation
Installation
go get github.com/hu17889/go_spider
This project depends on simplejson and goquery.
In China, you can download the packages from http://gopm.io/.
Use example
Here is an example that crawls GitHub content; try it to see the crawl process.
go install github.com/hu17889/go_spider/example/github_repo_page_processor
./bin/github_repo_page_processor
More examples here: examples.
Make your spider
A Spider takes a PageProcesser and a task name (used by Pipelines for records) as its input, and is then configured with start URLs and optional modules before running, as in the sketch below.
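A minimal sketch, assuming the constructor names spider.NewSpider and pipeline.NewPipelineConsole and the import paths shown; AddUrl, AddPipeline, SetThreadnum and Run are the Spider functions documented below, MyPageProcesser is a hypothetical PageProcesser (see the PageProcesser module), and the Finish hook is an assumption.

```go
package main

import (
	"github.com/hu17889/go_spider/core/common/page"
	"github.com/hu17889/go_spider/core/pipeline"
	"github.com/hu17889/go_spider/core/spider"
)

// MyPageProcesser is a hypothetical PageProcesser: it receives every
// downloaded Page and extracts results (see the PageProcesser module below).
type MyPageProcesser struct{}

func (mp *MyPageProcesser) Process(p *page.Page) {
	// GetHtmlParser is assumed to return a *goquery.Document.
	// Save one key-value pair into PageItems; richer parsing is shown later.
	p.AddField("title", p.GetHtmlParser().Find("title").Text())
}

// Finish is assumed to be part of the PageProcesser interface and to be
// called once when the crawl ends.
func (mp *MyPageProcesser) Finish() {}

func main() {
	// Spider input: a PageProcesser and a task name used by Pipelines for records.
	spider.NewSpider(&MyPageProcesser{}, "TaskName").
		AddUrl("https://github.com/hu17889?tab=repositories", "html"). // start URL; "html" is the response type
		AddPipeline(pipeline.NewPipelineConsole()).                    // print results to stdout
		SetThreadnum(3).                                               // crawl with three concurrent workers
		Run()
}
```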
Use default modules
Downloader: HttpDownloader
Scheduler: QueueScheduler
Pipeline: PipelineConsole, PipelineFile
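These defaults are used automatically, but they can also be wired up explicitly. A hedged sketch; only the module names and the SetDownloader/SetScheduler/AddPipeline setters come from this document, while the constructor names and their arguments are assumptions.

```go
// sp is the *spider.Spider created with spider.NewSpider(...) above.
// Constructor names and arguments are assumptions -- check core/downloader,
// core/scheduler and core/pipeline for the authoritative ones.
sp.SetDownloader(downloader.NewHttpDownloader()).
	SetScheduler(scheduler.NewQueueScheduler(false)).               // assumed bool: whether to deduplicate requests
	AddPipeline(pipeline.NewPipelineConsole()).                     // default: print results to stdout
	AddPipeline(pipeline.NewPipelineFile("/tmp/spider_result.txt")) // default: write results to a file (path assumed)
```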
Use your modules
Just copy a default module and modify it!
If you make a Downloader module, you can use it with Spider.SetDownloader(your_downloader).
If you make a Pipeline module, you can use it with Spider.AddPipeline(your_pipeline).
If you make a Scheduler module, you can use it with Spider.SetScheduler(your_scheduler).
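For example, a custom Pipeline only needs to implement Process. A hypothetical sketch; the parameter types (PageItems, the Task argument), the package paths and the GetAll accessor are assumptions, so copy the default PipelineConsole for the authoritative signature.

```go
// MyPipeline is a hypothetical Pipeline that prints results as TSV.
// The Process signature is an assumption -- copy core/pipeline's
// PipelineConsole to get the real interface.
type MyPipeline struct{}

func (pl *MyPipeline) Process(items *page_items.PageItems, t com_interfaces.Task) {
	for key, value := range items.GetAll() { // GetAll is an assumed accessor on PageItems
		fmt.Printf("%s\t%s\n", key, value)
	}
}

// Register it: sp.AddPipeline(&MyPipeline{})
```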
Extensions
The extensions folder contains modules and other tools shared by contributors. You are welcome to push your own code as long as it is bug-free.
Modules
Spider
Summary: Crawler initialization, concurrency management, default modules, module management, and config settings.
Functions:
- Crawler startup functions: Get, GetAll, Run
- Add request: AddUrl, AddUrls, AddRequest, AddRequests
- Set main modules: AddPipeline (several pipeline modules can be added), SetScheduler, SetDownloader
- Set config: SetExitWhenComplete, SetThreadnum (number of concurrent crawl workers), SetSleepTime (sleep time after each crawl)
- Monitor: OpenFileLog, OpenFileLogDefault (enable file logging via the mlog package), CloseFileLog, OpenStrace (print tracing info to stderr), CloseStrace
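A short configuration sketch using the setters and monitor functions above; the argument forms (the bool, the path) are assumptions.

```go
// sp is the *spider.Spider built earlier; setter chaining follows the example above.
sp.SetThreadnum(5).           // five concurrent crawl workers
	SetExitWhenComplete(true) // assumed bool: exit once the request queue is drained
sp.OpenFileLog("./log")       // log to a file via mlog (path argument assumed)
sp.OpenStrace()               // print tracing info to stderr
sp.Run()
```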
Downloader
Summary: The Spider pops a Request (containing the URL to crawl) from the Scheduler; the Downloader then downloads the result (html, json, jsonp, or text) of that Request. The result is saved in a Page for parsing by the PageProcesser.
HTML parsing is based on the goquery package and JSON parsing on the simplejson package. JSONP is converted to JSON. The text form is plain text content with no parser attached.
Functions:
- Download: download the content of the crawl target. The result contains the data body, header, cookies, and request info.
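A custom Downloader therefore only needs a Download method that turns a Request into a Page. A hedged sketch that wraps the default HttpDownloader; the signature, the constructor and the import paths (core/common/request, core/common/page) are assumptions.

```go
// MyDownloader is a hypothetical Downloader that delegates to HttpDownloader,
// leaving room for custom headers, proxies or throttling before the fetch.
type MyDownloader struct {
	inner *downloader.HttpDownloader
}

func (d *MyDownloader) Download(req *request.Request) *page.Page {
	// e.g. rewrite req here (headers, cookies) before delegating
	return d.inner.Download(req)
}

// Register it: sp.SetDownloader(&MyDownloader{inner: downloader.NewHttpDownloader()})
```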
PageProcesser
Summary: The PageProcesser module only parses results. It extracts results (key-value pairs) and the URLs to crawl in the next step.
The key-value pairs are saved in PageItems and the URLs are pushed into the Scheduler.
Functions:
- Process: parse the crawled target.
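A sketch of a typical Process implementation, assuming GetHtmlParser returns a *goquery.Document and AddTargetRequest takes a URL plus a response type (the method names come from the Page module below).

```go
func (mp *MyPageProcesser) Process(p *page.Page) {
	doc := p.GetHtmlParser() // assumed to be a *goquery.Document (github.com/PuerkitoBio/goquery)

	// Save key-value pairs into PageItems; the Pipelines will output them later.
	p.AddField("title", doc.Find("title").Text())

	// Push links to be crawled in the next stage into the Scheduler.
	doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			p.AddTargetRequest(href, "html")
		}
	})
}
```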
Page
Summary: saves the downloaded result and the information of its Request.
Functions:
- Get result: GetJson, GetHtmlParser, GetBodyStr (plain text)
- Get information of the crawl target: GetRequest, GetCookies, GetHeader
- Get status of the crawl process: IsSucc (whether the download succeeded), Errormsg (get error info from the Downloader)
- Set config: SetSkip, GetSkip (if skip is true, the result is not output in the Pipeline), AddTargetRequest, AddTargetRequests (save URLs to crawl in the next stage), AddTargetRequestWithParams, AddTargetRequestsWithParams, AddField (save key-value pairs after parsing)
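A sketch of how these fit together inside Process for a JSON response; MyJsonProcesser is a hypothetical PageProcesser, and GetJson is assumed to return a *simplejson.Json (github.com/bitly/go-simplejson).

```go
func (mp *MyJsonProcesser) Process(p *page.Page) {
	if !p.IsSucc() { // the Downloader failed
		fmt.Println(p.Errormsg()) // error info from the Downloader
		p.SetSkip(true)           // skip: the Pipelines will not output this page
		return
	}
	// Get(...).MustString() is the simplejson accessor for a string field.
	name := p.GetJson().Get("name").MustString()
	p.AddField("name", name)
}
```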
Scheduler
Summary: The Scheduler module is a Request queue. URLs parsed in the PageProcesser are pushed into the queue.
Functions:
- Push
- Poll
- Count
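A custom Scheduler implements these three functions. A hypothetical, non-thread-safe sketch; the exact signatures and the request package path are assumptions, and the default QueueScheduler should be preferred in practice.

```go
// MyScheduler is a hypothetical FIFO scheduler backed by a slice.
// Note: the spider crawls concurrently, so a real implementation needs locking.
type MyScheduler struct {
	queue []*request.Request
}

func (s *MyScheduler) Push(req *request.Request) {
	s.queue = append(s.queue, req)
}

func (s *MyScheduler) Poll() *request.Request {
	if len(s.queue) == 0 {
		return nil
	}
	req := s.queue[0]
	s.queue = s.queue[1:]
	return req
}

func (s *MyScheduler) Count() int {
	return len(s.queue)
}

// Register it: sp.SetScheduler(&MyScheduler{})
```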
Pipeline
Summary: The Pipeline module outputs the result and saves it wherever you want. The default modules are PipelineConsole (output to stdout) and PipelineFile (output to a file).
Functions:
- Process
Request
Summary: The Request module holds the configuration of an HTTP request, such as the URL, header, and cookies.
Functions:
- Process