You can install the extension from the Chrome Web Store. After installing it, restart Chrome to make sure the extension is fully loaded. If you don't want to restart Chrome, use the extension only in tabs that are created after installing it.
The extension requires Chrome 49+. There are no OS limitations.
Web Scraper is integrated into Chrome Developer tools. Figure 1 shows how you can open it. You can also use keyboard shortcuts to open Developer tools. After opening Developer tools, open the Web Scraper tab.
Tools / Developer tools
Open the site that you want to scrape.
The first thing you need to do when creating a sitemap is to specify the start URL. This is the URL from which the scraping will start. You can also specify multiple start URLs if the scraping should start from multiple places. For example, if you want to scrape multiple search results, you could create a separate start URL for each search result.
In cases where a site uses numbering in its page URLs, it is much simpler to create a range start URL than to create Link selectors that would navigate the site. To specify a range URL, replace the numeric part of the start URL with a range definition: [1-100]. If the site uses zero padding in URLs, add zero padding to the range definition: [001-100]. If you want to skip some URLs, you can also specify an increment: [0-100:10].

For example, the range URL http://example.com/page/[1-3] covers these links:
http://example.com/page/1
http://example.com/page/2
http://example.com/page/3

A range URL with zero padding such as http://example.com/page/[001-003] covers links like http://example.com/page/001, and a range URL with an increment such as http://example.com/page/[0-100:10] covers http://example.com/page/0, http://example.com/page/10 and so on up to http://example.com/page/100.
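As a rough sketch of how such range definitions expand (the `[start-end:step]` increment form and zero-padding behavior are assumed from the examples above), a small Python helper could generate the concrete URL list:

```python
import re

def expand_range_url(url):
    """Expand a range start URL like http://example.com/page/[1-3],
    [001-100] (zero padded) or [0-100:10] (with increment) into the
    list of concrete URLs it covers."""
    m = re.search(r"\[(\d+)-(\d+)(?::(\d+))?\]", url)
    if m is None:
        return [url]  # no range definition present
    start, end, step = m.group(1), m.group(2), m.group(3)
    # A leading zero in the start value signals zero padding.
    width = len(start) if start.startswith("0") and len(start) > 1 else 0
    inc = int(step) if step else 1
    urls = []
    for n in range(int(start), int(end) + 1, inc):
        number = str(n).zfill(width) if width else str(n)
        urls.append(url[:m.start()] + number + url[m.end():])
    return urls

print(expand_range_url("http://example.com/page/[1-3]"))
# ['http://example.com/page/1', 'http://example.com/page/2', 'http://example.com/page/3']
```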
After you have created the sitemap you can add selectors to it. In the Selectors panel you can add new selectors, modify them and navigate the selector tree. Selectors are organized in a tree structure, and the web scraper will execute them in the order in which they appear in the tree. For example, say there is a news site and you want to scrape all articles whose links are available on the first page. Image 1 shows this example site.
To scrape this site you can create a Link selector which will extract all article links on the first page. Then, as a child selector, you can add a Text selector that will extract articles from the article pages the Link selector found links to. The image below illustrates how the sitemap should be built for the news site.
Note that when creating selectors use Element preview and Data preview features to ensure that you have selected the correct elements with the correct data.
More information about selector tree building is available in the selector documentation. You should at least read about these core selectors:
After you have created selectors for the sitemap you can inspect the tree structure of selectors in the Selector graph panel. Image below shows an example selector graph.
After you have created selectors for the sitemap you can start scraping. Open the Scrape panel and start scraping. A new popup window will open in which the scraper will load pages and extract data from them. After the scraping is done the popup window will close and you will be notified with a popup message. You can view the scraped data by opening the Browse panel and export it by opening the Export data as CSV panel.
Web Scraper has multiple selectors that can be used for different types of data extraction and for different ways of interacting with a website. The selectors can be divided into three groups:
Data extraction selectors simply return data from the selected element. For example, the Text selector extracts text from the selected element. These selectors can be used as data extraction selectors:
Link selectors extract URLs from links that can later be opened for data extraction. For example, if a sitemap tree contains a Link selector with 3 child Text selectors, Web Scraper will extract all URLs with the Link selector, open each link, and use the child data extraction selectors to extract data. A Link selector can also have Link selectors as child selectors; these child Link selectors will then be used for further page navigation. These are the currently available Link selectors:
Element selectors select elements that contain multiple data elements. For example, an Element selector might be used to select a list of items in an e-commerce site. The selector will return each selected element as a parent element to its child selectors. The Element selector's child selectors will extract data only within the element that the Element selector gave them. These are the currently available Element selectors:
Each selector has configuration options. Here you can see the most common ones. Configuration options that are specific to a selector are described in that selector's documentation.
Note! A common mistake when using the multiple configuration option is to create two sibling selectors with multiple checked and expect the scraper to join the selector values in pairs. For example, if you selected pagination links and navigation links, these links couldn't be logically joined in pairs. The correct way is to select a wrapper element with an Element selector and add the data selectors as child selectors of the Element selector with the multiple option unchecked.
The Text selector is used for text selection. It will extract text from the selected element and from all its child elements. HTML will be stripped and only text will be returned. The selector will ignore text within <style> tags, and newline <br> tags will be replaced with newline characters. You can additionally apply a regular expression to the extracted text.
The regular expression attribute can be used to extract a substring of the text that the selector extracts. When a regular expression is used, the whole match (group 0) will be returned as the result. www.regexr.com is a great site where you can learn about regular expressions and try them out.
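The group-0 behavior can be illustrated in Python, which uses the same regular expression semantics (the sample text and pattern here are made up):

```python
import re

# The selector first extracts the element's full text, then applies the
# regex; the whole match (group 0) becomes the result.
extracted_text = "Price: $19.90 (incl. VAT)"
match = re.search(r"\$\d+\.\d+", extracted_text)
result = match.group(0) if match else ""
print(result)  # $19.90
```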
Here are some examples that you might find useful:
Extract one record per page with multiple text selectors
For example, you are scraping a news site that has one article per page. The page might contain the article, its title, publication date and the author. A Link selector can navigate the scraper to each of these article pages, and multiple Text selectors can extract the title, date, author and article. The multiple option should be left unchecked for the Text selectors because each page yields only one record.
Extract multiple items with multiple text selectors per page
E-commerce sites usually have multiple items per page. If you want to scrape these items you will need an Element selector that selects item wrapper elements and multiple text selectors that select data within each item wrapper element.
Extract multiple text records per page
For example, you want to extract the comments for an article. There are multiple comments on a single page and you only need the comment text (if you need other comment attributes, see the example above). You can use a Text selector to extract these comments. The Text selector's multiple attribute should be checked because you will be extracting multiple records.
The Link selector is used for link selection and website navigation. If you use a Link selector without any child selectors, it will extract the link and the href attribute of the link. If you add child selectors to a Link selector, these child selectors will be used on the page that the link leads to. If you are selecting multiple links, check the multiple property.
Note! The Link selector works only with <a> tags that have an href attribute. If the Link selector is not working for you, you can try these workarounds:
Some sites use JavaScript and window.location to change the URL; Web Scraper cannot handle this kind of navigation right now.
Navigate through multiple levels of navigation
For example, an e-commerce site has multi-level navigation: categories -> subcategories. To scrape data from all categories and subcategories you can create two Link selectors. One selector would select category links and the other would select subcategory links that are available on the category pages. The subcategory Link selector should be made a child of the category Link selector. The selectors for data extraction from subcategory pages should be made child selectors of the subcategory Link selector.
For example, an e-commerce site has multiple categories. Each category has a list of items and pagination links. Some pages are not directly reachable from the category page but only from pagination pages (you can see pagination links 1-5, but not 6-8). You can start by building a sitemap that visits each category and extracts items from the category page. This sitemap will extract items only from the first pagination page. To extract items from all of the pagination pages, including the ones that are not visible at the beginning, you need to create another Link selector that selects the pagination links. Figure 2 shows how the link selector should be created in the sitemap. When the scraper opens a category link it will extract the items that are available on the page. After that it will find the pagination links and visit those as well. If the pagination link selector is made a child of itself, it will recursively discover all pagination pages. Figure 3 shows a selector graph where you can see how pagination links discover more pagination links and more data.
The Sitemap.xml link selector can be used similarly to the Link selector to get to target pages (for example, product pages).
By using this selector, the whole site can be traversed without setting up selectors for pagination or other site navigation.
The Sitemap.xml link selector extracts URLs from sitemap.xml files, which websites publish so that search engine crawlers can navigate them more easily. In most cases they contain all of the site's relevant page URLs.
Web Scraper supports standard sitemap.xml format.
The sitemap.xml file can also be compressed (gzipped).
If a sitemap.xml contains URLs to other sitemap.xml files, the selector will work recursively to find all URLs in the sub-sitemaps.
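The two sitemap.xml shapes involved here can be sketched in Python using the standard sitemap namespace: a `<sitemapindex>` root lists further sitemap files to fetch recursively, while a `<urlset>` root lists actual page URLs (the example URLs are made up):

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Split one sitemap.xml document into (page_urls, sub_sitemap_urls)."""
    root = ET.fromstring(xml_text)
    locs = [loc.text.strip() for loc in root.iter(NS + "loc")]
    if root.tag == NS + "sitemapindex":
        return [], locs   # only sub-sitemaps; recurse into these next
    return locs, []       # a plain urlset: every entry is a page URL

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>"""
pages, subs = parse_sitemap(index_xml)
print(subs)  # ['https://example.com/sitemap-products.xml']
```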
Note! Web Scraper has a download size limit. If multiple sitemap.xml URLs are used, the scraping job might fail due to exceeding the limit. To work around this, try splitting the sitemap into multiple sitemaps, each with only one sitemap.xml URL.
Note! Sites that have sitemap.xml files are sometimes quite large. We recommend using Cloud Web Scraper for large volume scraping.
Sitemap.xml URLs - the URLs of the site's sitemap.xml files. Multiple URLs can be added. By clicking on "Add from robots.txt" Web Scraper will automatically add all sitemap.xml URLs that can be found in the site's https://example.com/robots.txt file. If no URLs are found, it is worth checking the https://example.com/sitemap.xml URL, which might contain a sitemap.xml file that isn't listed in the robots.txt file.
URL RegEx (optional) - only URLs from the sitemap.xml that match the RegEx will be scraped.
Minimum priority (optional) - check the site's sitemap.xml file to decide if this value should be filled.
Sitemap.xml files are usually used by sites that want to be indexed by search engines, so sitemaps can be found for most:
- e-commerce sites;
- travel sites;
- news sites;
- yellow pages.
The best way to scrape a whole site is by using the Sitemap.xml link selector. It removes the need to deal with pagination, categories and search forms/queries. Some sites don't display the category tree (breadcrumbs) if a page is opened directly. In these cases the site has to be traversed through category pages to scrape the category tree.
Making sure that only specific pages are scraped
Since in most cases a sitemap.xml contains all pages of the site, it is possible to limit the scraper so that it scrapes only the pages that contain the required data. For example, an e-commerce site's sitemap.xml will contain product pages, category pages and contact/about/etc. pages. To limit the scraper so that it scrapes only product pages, one or more of these methods can be used:
Using a URL RegEx - for example, set the RegEx to match /product/. This will prevent the scraper from traversing and scraping unnecessary pages.
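The effect of such a RegEx is simply a filter over the sitemap's URL list. A rough illustration in Python (the URLs here are made up):

```python
import re

# Hypothetical URLs as they might appear in an e-commerce sitemap.xml.
sitemap_urls = [
    "https://example.com/product/red-shirt",
    "https://example.com/product/blue-hat",
    "https://example.com/category/shirts",
    "https://example.com/about",
]

# Keep only URLs whose path contains /product/; everything else is skipped.
product_urls = [u for u in sitemap_urls if re.search(r"/product/", u)]
print(product_urls)
```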
Using a wrapper Element selector - if none of the previously mentioned methods are possible, a wrapper Element selector can be set up. This method works for all sites and doesn't return empty records in the result file if an invalid or unnecessary page is traversed. To set it up, create an Element selector, check multiple, and set its selector so that it matches only the target pages. The key part of this method is that a unique element has to be found and included in the selector. If data from meta tags has to be scraped, the html tag can be used instead of the body tag. The scraper will extract data only from the pages that have this unique element.
When using Sitemap.xml selector, set the main page of the site as a start URL.
The Link popup selector works similarly to the Link selector. It can be used for URL extraction and site navigation. The only difference is that the Link popup selector should be used when clicking a link opens a new window (popup) instead of loading the URL in the same tab or a new tab. This selector will catch the popup creation event and extract the URL. If the site creates a visual popup but not a real window, you should try the Element click selector instead.
Note! When selecting these link elements you can move the mouse over the element and press "S" to select it without opening the popup.
See Link selector use cases.
The Image selector can extract the src attribute (URL) of an image.
Note! When selecting the CSS selector for the Image selector, all the images within the site are moved to the top. If this feature somehow breaks the site's layout, please report it as a bug.
See Text selector use cases.
The image downloader script finds image URLs scraped by the Image selector in a CSV file and downloads them. The downloaded images are renamed by the script. To run it you will need a Terminal application; you should have one preinstalled. Run the script from your Downloads directory by typing:
python image-downloader scraped_data.csv
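A minimal sketch of what such a script does, assuming the image URLs sit in a CSV column named after the Image selector (the column name "image-src" and file naming scheme below are made up for illustration):

```python
import csv
import io
import os
from urllib.request import urlretrieve

def image_urls_from_csv(csv_text, column="image-src"):
    """Collect the non-empty image URLs from the named CSV column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row.get(column)]

def download_all(csv_path, out_dir="images"):
    """Download every image URL found in the CSV (requires network access)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, encoding="utf-8") as f:
        for i, url in enumerate(image_urls_from_csv(f.read())):
            # The naming scheme here is hypothetical; the real script
            # applies its own renaming.
            urlretrieve(url, os.path.join(out_dir, "image_%d.jpg" % i))

# Example: download_all("scraped_data.csv")
```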
The Table selector can extract data from tables. It has 3 configurable CSS selectors. The first selector is for table selection. After you have selected the table, the Table selector will try to guess the selectors for the header row and the data rows. You can click Element preview on those selectors to see whether the Table selector found the table header and data rows correctly. The header row selector is used to identify table columns when data is extracted from multiple pages. You can also rename the table columns. Figure 1 shows what you should select when extracting data from a table.
See Text selector use cases.
The Element attribute selector can extract an attribute's value from an HTML element.
For example, you could use this selector to extract the title attribute from
<a href="#" title="my title">link</a>.
See Text selector use cases.
The HTML selector can extract HTML and text within the selected element. Only the inner HTML of the element will be extracted.
See Text selector use cases.
Grouped selector can group text data from multiple elements into one record. The extracted data will be stored as JSON.
For example, you are extracting a news article that might have multiple reference links. If you select these links with a Link selector with multiple checked, you would get duplicate articles in the result set, where each record contains one reference link. Using the Grouped selector you can serialize all these reference links into one record. To do that, select all reference links and set the attribute name to href to also extract the links to these sites.
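As a sketch of the resulting record shape (the field names and URLs here are made up), the grouped values end up JSON-encoded inside a single cell instead of producing one row per link:

```python
import json

# Hypothetical reference links found in one article page.
reference_links = [
    {"href": "https://example.com/source-1"},
    {"href": "https://example.com/source-2"},
]

# The Grouped selector serializes all matched values into one record cell,
# so the article is not duplicated once per link.
record = {
    "article": "Example article",
    "references": json.dumps(reference_links),
}
print(record["references"])
```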
The Element selector selects elements that contain multiple data elements. For example, an Element selector might be used to select a list of items in an e-commerce site. The selector will return each selected element as a parent element to its child selectors. The Element selector's child selectors will extract data only within the element that the Element selector gave them.
Note! If the page dynamically loads new items after scrolling down or clicking on a button then you should try these selectors:
For example an e-commerce site has a page with a list of items. With element selector you can select the elements that wrap these items and then add multiple child selectors to it to extract data within the items wrapper element. Figure 1 shows how an element selector could be used in this situation.
Similarly to e-commerce item selection, you can also select table rows and add child selectors for data extraction from table cells, though the Table selector might be a much better solution.
This is another Element selector that works similarly to the Element selector, but it additionally scrolls the page down multiple times to find elements that are added when the page is scrolled to the bottom. Use the delay attribute to configure the waiting interval between scrolling and element search. Scrolling stops after no new elements are found. If the page can scroll infinitely, this selector will be stuck in an infinite loop.
See Element selector use cases.
Note! When selecting clickable elements you should select them by moving the mouse over the element and pressing "S". This kind of selection avoids triggering events bound to the button.
The Click Once type will click the buttons only once. If a new clickable button appears, it will also be clicked. For example, pagination links might show pages 1 to 5 while pages 6 to 10 appear some time later; the selector will also click those buttons.
The Click More type makes the selector click the given buttons multiple times until no new elements appear. A new element is one that has unique text content.
When using Click Once, only unique buttons will be clicked; when using Click More, this helps to ignore buttons that don't generate more elements.
Unique CSS Selector - buttons with an identical CSS selector are considered equal.
Never discard - scrapes data before and after the click action.
For example, there is a site that displays a list of items with pagination buttons that reload these items dynamically (after clicking a button the URL doesn't change; a change after the hash tag # doesn't count). Using the Element click selector you can select these items and the buttons that need to be clicked. During the scraping phase the scraper will click these buttons to extract all elements. You also need to add child selectors to the Element click selector that select data within each element. In figure 1 you can see how to configure the Element click selector to extract data from the described site.
This example is similar to the one above. The only difference is that in this site items are loaded by clicking a single button multiple times. In this case the Element click selector should be configured to use "Click more" click type. In figure 2 you can see how to configure the Element click selector to extract data from this site.
The Discard when click element exists option is used when, for example, the product pages of an e-commerce website have an almost identical structure, with the only differentiator being a variation option (size, color, quantity, etc.) that has to be iterated through on the product page.
Because the scraper collects data both before and after the click, leaving this option unselected when scraping a page with multiple variations means the scraper will extract information before initiating the click and during the clicking process, resulting in duplicate or unusable rows of data.
Web Scraper uses CSS selectors to find HTML elements in web pages and to extract data from them. When you are selecting an element, Web Scraper will try to make its best guess at what the CSS selector for the selected elements might be. You can also write it yourself and test it by clicking "Element preview". You can use CSS selectors available in CSS versions 1-3 as well as the pseudo selectors that are additionally available in jQuery. Here are some documentation links that might help you:
It is possible to add new pseudo CSS selectors to Web Scraper. Right now only one has been added.
_parent_ allows a child selector of an Element selector to select the element that was returned by the Element selector. For example, this CSS selector could be used when you need to extract an attribute from the element that the Element selector returned.