entrepreneur-interet-general / OpenScraper

Licence: MIT license
An open source webapp for scraping: towards a public service for webscraping

Programming Languages

python
139335 projects - #7 most used programming language
HTML
75241 projects
CSS
56736 projects
javascript
184084 projects - #8 most used programming language

Projects that are alternatives to or similar to OpenScraper

scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-72.5%)
Mutual labels:  scraper, spider, scrapy
Python Spider
Douban top 250 movies; scraping JSON data from Douyu and scraping photo galleries; Taobao; Youyuan; CrawlSpider scraping basic profile info from the Hongniang dating site, plus distributed crawling of Hongniang with Redis storage; small crawler demos; Selenium; scraping Duodian; building APIs with Django; scraping Youyuan data; simulated logins for Zhihu, GitHub and Tuchong; scraping the entire Duodian mall site; scraping the article history of WeChat official accounts; scraping articles shared in WeChat groups or by WeChat friends; using itchat to monitor articles shared by specified WeChat official accounts.
Stars: ✭ 615 (+668.75%)
Mutual labels:  spider, xpath, scrapy
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+1180%)
Mutual labels:  scraper, spider, scrapy
Fp Server
Free proxy server, continuously crawling and providing proxies, based on Tornado and Scrapy; build your own local proxy pool.
Stars: ✭ 154 (+92.5%)
Mutual labels:  spider, tornado, scrapy
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (+570%)
Mutual labels:  scraper, spider, scrapy
Mailinglistscraper
A python web scraper for public email lists.
Stars: ✭ 19 (-76.25%)
Mutual labels:  scraper, spider, scrapy
Goribot
[Crawler/Scraper for Golang] 🕷 A lightweight, distributed-friendly Golang crawler framework.
Stars: ✭ 190 (+137.5%)
Mutual labels:  scraper, spider, scrapy
blinkist-m4a-downloader
Grabs all of the audio files from all of the Blinkist books
Stars: ✭ 100 (+25%)
Mutual labels:  scraper, spider
scraper
An image crawling and download tool: rapidly crawls and downloads images/photos/illustrations uploaded by designers/users on Zcool (https://www.zcool.com.cn/) and CNU (http://www.cnu.cc/).
Stars: ✭ 64 (-20%)
Mutual labels:  scraper, spider
factory
Docker microservice & Crawler by scrapy
Stars: ✭ 56 (-30%)
Mutual labels:  tornado, scrapy
OLX Scraper
📻 An OLX scraper using Scrapy + MongoDB. It scrapes recently posted ads for the requested product and dumps them into MongoDB (NoSQL).
Stars: ✭ 15 (-81.25%)
Mutual labels:  scraper, scrapy
ant
A web crawler for Go
Stars: ✭ 264 (+230%)
Mutual labels:  scraper, spider
scrapy-LBC
A LeBonCoin spider using Scrapy and Elasticsearch.
Stars: ✭ 14 (-82.5%)
Mutual labels:  scraper, scrapy
small-spider-project
Everyday crawler scripts.
Stars: ✭ 14 (-82.5%)
Mutual labels:  spider, scrapy
crawler-chrome-extensions
Chrome extensions commonly used by crawler developers.
Stars: ✭ 53 (-33.75%)
Mutual labels:  scraper, spider
robotstxt
robots.txt file parsing and checking for R
Stars: ✭ 65 (-18.75%)
Mutual labels:  scraper, spider
Web-Iota
Iota is a web scraper which can find all of the images and links/suburls on a webpage
Stars: ✭ 60 (-25%)
Mutual labels:  spider, scrapy
TikTokDownloader PyWebIO
🚀 Douyin_TikTok_Download_API is an out-of-the-box, high-performance, asynchronous Douyin/TikTok data scraping tool that supports API calls, online batch parsing, and downloading.
Stars: ✭ 919 (+1048.75%)
Mutual labels:  scraper, spider
python-crawler
A crawler learning repository, suitable for complete beginners and friendly to newcomers.
Stars: ✭ 37 (-53.75%)
Mutual labels:  xpath, scrapy
163Music
163music spider by scrapy.
Stars: ✭ 60 (-25%)
Mutual labels:  spider, scrapy

OpenScraper


part 1/3 of the TADATA! software suite (ApiViz / Solidata_backend / Solidata_frontend / OpenScraper)


a public service for webscraping

OpenScraper is a minimalistic, open source web scraper with a simple interface, so that almost anyone with very little technical knowledge can scrape public data and install/adapt it for their own purposes... for free.

... anyway, that's the goal, folks! ...
(it's still in a development phase for now)

OpenScraper is a project by SocialConnect

#python #tornado #scrapy #selenium #mongodb #bulma


WHAT IS NEW?

  • v1.4 - 07/02/2019 : added infinite scroll support for reactive websites to scrape
  • v1.3 - 20/10/2018 : added a first CSV converter and downloader for every spider's dataset, and first routes to the documentation
  • v1.2 - 18/10/2018 : a spider can adapt to follow pages whether they are API or HTML
  • v1.1 - 15/10/2018 : parser adapts to REST APIs; configuration is based on a "/"-separated description of the path inside the JSON (see the sketch just below)
  • v1.0 - 10/06/2018 : parser adapts to reactive websites (SPA, Vue, etc.)
  • vBeta : Scrapy parser based on spider configuration with XPaths
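
As an aside on the v1.1 entry above, a "/"-separated path simply describes how to walk down into the JSON returned by an API. A minimal sketch of such a lookup is shown here (the helper name and exact behaviour are assumptions for illustration, not OpenScraper's actual code):

     # hypothetical helper: walk a nested JSON payload with a "results/0/title"-style path
     def get_by_path(data, path, sep="/"):
         for key in path.split(sep):
             if isinstance(data, list):
                 data = data[int(key)]   # numeric segments index into lists
             else:
                 data = data[key]        # other segments are dict keys
         return data

     payload = {"results": [{"title": "first project", "tags": ["social", "innovation"]}]}
     print(get_by_path(payload, "results/0/title"))   # -> first project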

ARGUMENT

Which needs does this project aim to answer?

Scraping can quickly become a mess, especially if you need to scrape several websites in order to eventually get one structured dataset. Usually you need to set up a separate scraper for every website, configure the spiders one by one, get the data from every website, and clean up the mess to turn that raw material into the one structured dataset you know is in there...

Yes, similar solutions already exist... but...

So you have mainly three options when it comes to scraping the web:

  • use a proprietary and quite expensive service (like Apify or import.io) and depend on an external provider;
  • ask a friend if you are lucky, or pay a developer or a company to do it for you if you have the money for that;
  • or, if you have the know-how, write your own code (for instance based on BeautifulSoup or Scrapy), adapt it for your own purposes, and usually be the only one (i.e. the only developer around) able to use/adapt it.

A theoretical use case

So let's say you are a researcher, a journalist, a public servant in an administration, or a member of an association who wants to track some changes in society... Let's say you need data that isn't easy to get, and you can't afford to spend thousands of euros on a private webscraping service.

You'd have a list of different websites you want to scrape similar information from, each website having some URLs where that data is listed (in our first use case, social innovation projects). You know that every piece of information could be described in the same way: a title, an abstract, an image, a list of tags, a URL, the name and URL of the source website, and so on...

So to use OpenScraper you would have to:

  • specify the data structure you expect ("title", "abstract", etc.);
  • add a new contributor (a source website): at least its name and the start_url from which you'll do the scraping;
  • configure the spider for every contributor, i.e. specify the XPaths for every field (an XPath for "title", an XPath for "abstract", etc.);
  • save the contributor's spider configuration, and click on the "run spider" button...
  • the data will be stored in the OpenScraper database (MongoDB), so you can later retrieve the structured data (through an API endpoint or in a tabular format like a .csv file) - see the illustrative configuration just below.
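
For illustration only, here is what such a data structure and spider configuration could look like once put together (the field names and keys below are hypothetical examples, not OpenScraper's actual schema):

     # hypothetical configuration for one contributor (illustrative, not the real OpenScraper schema)
     contributor_config = {
         "name": "Example social innovation portal",
         "start_url": "https://example.org/projects",
         "item_xpaths": {
             "title":    "//h2[@class='project-title']/text()",
             "abstract": "//p[@class='project-abstract']/text()",
             "image":    "//img[@class='project-image']/@src",
             "tags":     "//span[@class='tag']/text()",
             "link":     "//a[@class='project-link']/@href",
         },
     }

Once such a configuration is saved and the spider is run, each scraped item should end up in MongoDB with one value per field of your datamodel.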

An open scraper for more digital commons

To make that job a bit easier (and far cheaper), OpenScraper aims to provide an online GUI (a webapp on the client side): you'll just have to set the field names (the data structure you expect), enter a list of websites to scrape, set up the XPath to scrape for each field on each website, and finally click a button to run the scraper configured for each website...

... and tadaaaa, you'll have your data: you will be able to import it, share it, and visualize it (at least we're working on it as quickly as we can)...

OpenScraper is developed as open source, and will provide documentation as well as a legal framework (licence and terms of use) aiming to make the core system of OpenScraper comply with the GDPR, in letter and in spirit.


INSTALLATION WALKTHROUGH

LOCALLY

  1. clone or download the repo

  2. install MongoDB locally or get the URI of the MongoDB you're using

  3. install chromedriver

    • on MacOS :
     $ brew tap caskroom/cask
     $ brew cask install chromedriver
    
    • on Ubuntu :
     $ sudo apt-get install chromium-chromedriver
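    
    • optional check: verify that chromedriver and Selenium can talk to each other (a quick sketch; assumes selenium is already installed, e.g. from requirements.txt):
    
     $ python
     >>> from selenium import webdriver
     >>> driver = webdriver.Chrome()         # fails here if chromedriver is not on your PATH
     >>> driver.get("https://example.org")
     >>> driver.title                        # should display the page title, e.g. u'Example Domain'
     >>> driver.quit()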
    
  4. go to your openscraper folder

  5. create a virtual environment for Python 2.7

     $ virtualenv venv
     $ source venv/bin/activate
    
  6. install the libraries

     $ pip install -r requirements.txt
    
  7. optional: notes for installing Python libs on Linux servers

     $ sudo apt-get install build-essential libssl-dev libffi-dev python-dev python-psycopg2 python-mysqldb python-setuptools libgnutls-dev libcurl4-gnutls-dev
     $ sudo apt install libcurl4-openssl-dev libssl-dev
     $ sudo apt-get install python-pip 
     $ sudo pip install --upgrade pip 
     $ sudo pip install --upgrade virtualenv 
     $ sudo pip install --upgrade setuptools
    
  8. optional: create a config/settings_secret.py file based on config/settings_example.py with your MongoDB URI (if you're not using the default MongoDB connection):
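
     For instance, it could look something like this (the variable name below is just an illustration; mirror whatever keys config/settings_example.py actually uses):

     # config/settings_secret.py -- illustrative sketch, keep the same keys as config/settings_example.py
     MONGODB_URI = "mongodb://user:password@my-mongo-host:27017/openscraper"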

  9. run app

     $ cd openscraper
     $ python main.py
    
  10. you can also choose options when running main.py

  • -p or --port: your port number (default: 8000)

  • -m or --mode: the mode (default: default) - choices: default (uses settings_example.py in the openscraper/config folder) | production (uses settings_secret.py in the config folder)

    example :

      $ python main.py -p 8100 --mode=production
    
  11. check in your browser at localhost:8000 (or whichever port you entered)

  12. create/update your datamodel at localhost:8000/datamodel/edit

  13. create/update your spiders at localhost:8000/contributors

  14. run the test spider from the browser by clicking on it at localhost:8000/contributors

PRODUCTION

  1. get a server - check DigitalOcean, OVH, ...
  2. optional: get a domain name - check OVH, Namecheap, GoDaddy, ...
  3. follow (most of) these instructions
  4. pray for all that to work...

TECHNICAL POINTS

Tech stack

  • Language: Python... because let's be honest, I don't master that many languages for this kind of project
  • Backend: Tornado... one of the few async/non-blocking Python frameworks
  • Scraping: Scrapy, with Selenium for Python inside specific instances of the generic spider, or Splash for JavaScript follow-up...
  • Frontend: Bulma (to make it nice) and then Vue.js (to make it even nicer and bi-directional)

Tech goals for a MVP

  • web interface to edit the data structure
  • Python asynchronous interface (Tornado) for Scrapy
  • store a list of URL sources + corresponding XPaths in a DB (Mongo) - see the sketch after this list
  • web interface to edit each source's XPath list
  • display the list of sources + for each add a button to run the scraper
  • store/extract results in the DB
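
As an illustration of how stored XPaths could drive a spider, a stripped-down sketch might look like the following (class and argument names are hypothetical, not OpenScraper's actual implementation):

     # minimal sketch of a configurable spider (hypothetical names, Python 2.7-compatible)
     import scrapy

     class GenericSpider(scrapy.Spider):
         name = "generic_spider"

         def __init__(self, start_url=None, item_xpaths=None, *args, **kwargs):
             super(GenericSpider, self).__init__(*args, **kwargs)
             self.start_urls = [start_url]
             self.item_xpaths = item_xpaths or {}   # e.g. {"title": "//h2/text()", ...}

         def parse(self, response):
             # build one raw item from the XPaths stored in the DB for this contributor
             item = {}
             for field, xpath in self.item_xpaths.items():
                 item[field] = response.xpath(xpath).extract()
             yield item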

ROADMAP TO A MVP

To do list :

  1. DONE - understand basics of Tornado (reuse some tutorial material)
  2. DONE - basic Tornado + MongoDB setup
  3. DONE - understand basics of Scrapy
  4. DONE - UI to create user (register), create/update a datamodel, create/update a spider configuration
  5. DONE - add a GUI to configure the data structure you expect from the scraping
  6. DONE - create a generic spider (class) + generic item to fill, both callable from handlers
  7. DONE - integrate generic spider + tests + run
  8. DONE - make Tornado and a basic scrapy spider work together (non-blocking)
  9. DONE - make a nice front in Bulma
  10. DONE - add Selenium to mimic navigation by clicks on reactive websites
  11. DONE - add API endpoints for JSON feeds
  12. DONE - add an "export csv" button and function to download the dataset
  13. deploy a demo at http://www.cis-openscraper.com/
  14. ... nicer front in Vue.js
  15. integrate JWT and hash private infos for users and API

Currently :

  • adding documentation ...
  • ...

Notes for later / issues :

  • must migrate/copy data to Elasticsearch (not only MongoDB) - see the sketch below
  • containerize the app for simpler deployment (locally or in production)
  • ...
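
For the Elasticsearch note above, a minimal sketch of such a copy could look like this (assuming pymongo and the official elasticsearch Python client; database, collection and index names are hypothetical):

     # minimal sketch: copy MongoDB documents into an Elasticsearch index (hypothetical names)
     from pymongo import MongoClient
     from elasticsearch import Elasticsearch, helpers

     mongo = MongoClient("mongodb://localhost:27017")
     es = Elasticsearch(["http://localhost:9200"])

     def actions():
         for doc in mongo["openscraper"]["items"].find():
             doc_id = str(doc.pop("_id"))    # ObjectId is not JSON-serializable
             yield {"_index": "openscraper-items", "_id": doc_id, "_source": doc}

     helpers.bulk(es, actions())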

CREDITS

OpenScraper's team thanks:

Contacts:


SCREENSHOTS (development)

index

[screenshot]


edit your datamodel
(only for admin and staff of openscraper)

[screenshot]


add a field to your datamodel
(only for admin and staff of openscraper)

[screenshot]


list of websites you want to crawl
(for admin, staff and users of openscraper)

[screenshot]


add a new website to scrape
(for admin, staff and users of openscraper)

[screenshot]


the resulting dataset
(data shown depends on the user's auth level: admin, staff, user, visitor)

[screenshot]


overview of the API response
(data shown depends on your token)

[screenshot]
