Retail Shake Academy

What is scraping and why spiders?

(and Crawling, JavaScript, Python and the like)

 

Visiting a competitor’s site, noting all the prices, and reading all their product features and customer reviews is time-consuming. It takes more than a day to survey the market, copy the data, and then sort it. Unless, of course, you’re a robot that knows how to read the code behind websites. Those robots are called spiders, and the way they search is called scraping.

I’m (first name and job title), and I’m delighted to deliver this tutorial about the application and tell you about our scraping skills.

 

What is scraping?

Web scraping, data scraping or harvesting is when a program extracts data from a site. It is a structured and systematic exploration that extracts complete and reliable data. Data collected this way can be reused.

These tools literally scrape the site to extract content. The large quantities of data extracted in a very short period then need to be classified for rapid and efficient use. The quality of our software and scraping algorithms ensures that the right content is copied and accurately classified in our databases, so that the information is accessible to you – both online and in real time.

 

What are the advantages of scraping?

• Quick
Every day we scrape the whole market you are targeting to provide you with the most recent information.

• Low cost
Our scraping tools are shared across all our customers, which means we can offer the service from €99 per month.

• Multipurpose
Scraping scripts can be defined to decode any site’s languages and structures, regardless of the format and size of the web page, and in any browser.

 

What’s a spider?

A spider is a robot – or bot, index robot, or smart software – designed to scroll through the pages of websites, follow links from one page to another, and extract data. This is called crawling. These are the same tools Google uses to browse your site and identify keywords for search indexing.

The typical method:
• We indicate a list of web pages to be scraped.
• The spiders extract pertinent information from the source code.
• A loop is created to continuously repeat the operation.
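As a hedged illustration (not our production code), the three steps above can be sketched in a few lines of Python using only the standard library; the page contents and the `fetch` function here are hypothetical stand-ins for real HTTP requests:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_pages, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, extract its links,
    queue the new ones, and repeat until the queue is empty."""
    queue = list(start_pages)      # step 1: the list of pages to scrape
    seen = set(queue)
    results = {}
    while queue and len(results) < max_pages:
        url = queue.pop(0)
        html = fetch(url)          # in real life, an HTTP GET
        parser = LinkExtractor()
        parser.feed(html)          # step 2: extract data from the source code
        results[url] = parser.links
        for link in parser.links:  # step 3: loop to repeat the operation
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Here `fetch` is any function that returns a page’s HTML; a real spider would perform an HTTP request and respect the site’s `robots.txt`.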

Spiders speak several languages

To be more exact, spiders can understand several programming languages, making them capable of decoding a site’s pages so they can find the information they’ve been programmed to track down.

• HTML5: markup language used to structure web pages.
• CSS: stylesheet language that describes the presentation of HTML and XML documents.
• JS or JavaScript: scripting language mainly used to make web pages interactive – an essential part of web applications.
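To make this concrete, here is a minimal sketch of what “understanding HTML” means for a spider: Python’s standard-library parser pulling a product name and price out of a snippet. The tag and class names are invented for the example, not taken from any real site:

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Records the text inside tags whose class is 'name' or 'price'."""
    def __init__(self):
        super().__init__()
        self.current = None   # field we are currently inside, if any
        self.data = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if css_class in ("name", "price"):
            self.current = css_class

    def handle_data(self, data):
        if self.current:
            self.data[self.current] = data.strip()
            self.current = None

# Hypothetical snippet of a product page.
snippet = '<div><h2 class="name">Blue kettle</h2><span class="price">€24.99</span></div>'
parser = ProductParser()
parser.feed(snippet)
print(parser.data)  # {'name': 'Blue kettle', 'price': '€24.99'}
```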

 

Which tool do we use?

We use Scrapy, a Python framework for crawling websites and extracting structured data. It has a wide range of functions, including exploring data, processing information and archiving it.

Now you know more about our scraping skills. Our next tutorial is about 

By Clémentine