How to Scrape Data from Combined List and Detail Pages?

In the previous chapters, we have learned how to extract data from pages.

In this tutorial, we will learn how to scrape data from combined pages: a list page together with the detail pages it links to.

Open our practice page now: Best Seller Books

As you can see, we need to go from the bestseller list to the details page, and then scrape data from it.

data-marker

As marked in the figure, suppose we want to extract the following 3 fields:

  1. Title
  2. Author
  3. Link

Let's take a look at what the successfully collected data looks like! 👇

data-preview

Task Analysis

Before scraping data, we need to decompose the task. In this case, the steps are as follows:

  1. Open link Best Seller Books
  2. Click the 1st book
  3. Open the details page
  4. Extract the required data
  5. Close the details page
  6. Click the 2nd book
  7. Open the details page
  8. ...

Repeat steps 2 ~ 5 until every item in the list has been collected.

We convert these steps into a flowchart, like this:

flow-chart
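
The recipe builder performs these steps visually, but the same list-then-detail pattern can be sketched in plain Python. The sketch below uses BeautifulSoup on tiny invented HTML samples (the markup, URLs, and the `open_page` stub are assumptions for illustration; only the CSS selectors match the ones configured later in this tutorial):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the list page and two detail pages.
LIST_HTML = """
<div class="book-list">
  <div class="book-item"><h3><a href="/books/1">Book One</a></h3></div>
  <div class="book-item"><h3><a href="/books/2">Book Two</a></h3></div>
</div>
"""
DETAIL_HTML = {
    "/books/1": '<h1>Book One</h1><div class="book-details_author">Alice</div>',
    "/books/2": '<h1>Book Two</h1><div class="book-details_author">Bob</div>',
}

def open_page(url):
    """Stand-in for fetching a page; a real script would use requests/urllib."""
    return BeautifulSoup(DETAIL_HTML[url], "html.parser")

# Steps 1-2: collect the link of every item in the list (the "Links" table).
list_page = BeautifulSoup(LIST_HTML, "html.parser")
links = [row.select_one("h3 > a")["href"]
         for row in list_page.select("div.book-list > div.book-item")]

# Steps 3-8: loop over the links, open each detail page, extract the fields.
books = []
for url in links:
    detail = open_page(url)
    books.append({
        "Title": detail.select_one("h1").get_text(strip=True),
        "Author": detail.select_one("div.book-details_author").get_text(strip=True),
        "Link": url,
    })

print(books)
```

The outer pass builds the link table once; the loop then revisits each link, which is exactly the structure the recipe's nodes will mirror.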

Create Recipe

The previous analysis has laid a good foundation for us to create a recipe, so let's get started!

For ease of understanding, we will create the recipe from scratch, so choose the "Custom" template.

custom-template

Step 1. Edit the "Open Page" node and fill in the current link.

node-open-page

Step 2. Add the "Scroll Page" node and keep its default settings.

node-scroll-page

Step 3. Edit the "Extract Data" node to extract the links of the books. The configuration is as follows:

  1. Change the name of the table to: Links (optional, just for easier identification)
  2. Uncheck: Dump (no storage required, only for reference by other nodes)
  3. Set the row selector to: div.book-list > div.book-item (you can generate it via Advanced Finder)
  4. Add a column named: Value with selector: h3 > a (you can generate it via Advanced Finder)

node-extract-list
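
To see what this node produces, here is a rough BeautifulSoup equivalent: the row selector picks out each book item, and the Value column reads the link inside it. The sample HTML is invented for the sketch; only the two selectors come from the configuration above:

```python
from bs4 import BeautifulSoup

# Invented sample of the bestseller list markup.
html = """
<div class="book-list">
  <div class="book-item"><h3><a href="/books/1">Book One</a></h3></div>
  <div class="book-item"><h3><a href="/books/2">Book Two</a></h3></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Row selector:    div.book-list > div.book-item  (one row per book)
# Column "Value":  h3 > a                         (the link inside each row)
values = [row.select_one("h3 > a")["href"]
          for row in soup.select("div.book-list > div.book-item")]
print(values)  # the "Links" table that the loop node will iterate over
```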

Step 4. Add the "Enter Loop" node to open each book's details page in a loop. The configuration is as follows:

  1. Change the name of the loop to: Loop Links (optional, just for easier identification)
  2. Data source, select "From Table Input"
  3. For the reference field, select "Links.Value" (this is the data table we created in the previous step)
  4. Fill in the link of a book's detail page

node-loop-links

Step 5. Open the details page of a book; take Book 1 as an example.

Step 6. Edit the "Open Page" node in the loop, and quickly fill in the current page link.

node-open-detail-page

Step 7. Add the "Extract Data" node inside the loop; this is the table where the data will finally be stored. The configuration is as follows:

  1. Change the name of the table to: Books (optional, just for easier identification)
  2. Check: Dump (need to store data)
  3. Set the row selector to: body (when extracting a details page, body is usually sufficient)
  4. Add column 1, named: Title, selector: h1 (you can generate it via Advanced Finder)
  5. Add column 2, named: Author, selector: div.book-details_author (you can use Advanced Finder to generate)
  6. Add column 3, named: Link, column type: Extract from system, Extract Page Link field

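The detail-page extraction above can likewise be sketched in Python. The sample HTML and the page URL are invented for illustration; only the selectors match the configuration:

```python
from bs4 import BeautifulSoup

# Invented sample of a book's details page.
html = """
<body>
  <h1>Book One</h1>
  <div class="book-details_author">Alice</div>
</body>
"""
page_link = "https://example.com/books/1"  # hypothetical current page URL

soup = BeautifulSoup(html, "html.parser")
row = soup.select_one("body")  # row selector: the whole page is one row
book = {
    "Title": row.select_one("h1").get_text(strip=True),
    "Author": row.select_one("div.book-details_author").get_text(strip=True),
    "Link": page_link,  # stands in for the "Extract Page Link" system field
}
print(book)
```

Because a details page describes a single book, the whole page acts as one row, which is why body is enough as the row selector.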
So far, all nodes are configured.

Step 8. Fill in the recipe's basic information

  1. Fill in a suitable recipe name
  2. Briefly describe the function of the recipe
  3. Check "Open In New Tab" (when extracting data from combined pages, you usually need to check this option)
  4. Check "Active Tab" (some websites only load data while the tab is visible)

fill-basic-info

Summary

Well, the recipe for extracting data from combined pages is ready. Let's see what it looks like! 👇

recipe-preview

The key to collecting data from combined pages is opening each details page from the list and continuing to collect there.

There are 2 ways to achieve this:

  1. Collect the links of all items in the list, and then open them one by one through a loop
  2. Collect a clickable element from each item in the list, and then click them in turn within a loop to open the details pages

All right! That's the end of this tutorial. Thank you for reading, now go and try it! 👉