How to scrape data from list and detail pages
In the previous chapters, we have learned how to extract data from pages.
In this tutorial, we will learn how to scrape data from combined pages, that is, a list page together with its detail pages.
Open our practice page now: Best Seller Books
As you can see, we need to go from the bestseller list to the details page, and then scrape data from it.
As marked in the figure, suppose we want to extract the following 3 fields:
- Title
- Author
- Link
Let's take a look at what the collected data looks like!
Task Analysis
Before scraping data, we need to decompose the task. In this case, the steps are as follows:
- Open the Best Seller Books link
- Click the 1st book
- Open the details page
- Extract the required data
- Close the details page
- Click the 2nd book
- Open the details page
- ...
Repeat steps 2 to 5 for each book until the whole list has been collected.
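The steps above can be sketched in Python as a simple loop. This is only an illustration of the control flow, not what the recipe runs internally: FAKE_SITE, open_page and scrape_list_and_details are hypothetical names, and the pages are plain in-memory values rather than real web pages.

```python
# Hypothetical in-memory "site" used only for illustration.
# The list page maps to the detail links; each detail page maps to its fields.
FAKE_SITE = {
    "/best-sellers": ["/books/1", "/books/2"],
    "/books/1": {"Title": "Book 1", "Author": "Author 1"},
    "/books/2": {"Title": "Book 2", "Author": "Author 2"},
}


def open_page(url):
    """Stand-in for opening a page in the browser (assumption)."""
    return FAKE_SITE[url]


def scrape_list_and_details(list_url):
    books = []
    detail_links = open_page(list_url)      # step 1: open the list page
    for link in detail_links:               # steps 2 to 5, once per book
        detail = open_page(link)            # open the details page
        books.append({                      # extract the required data
            "Title": detail["Title"],
            "Author": detail["Author"],
            "Link": link,
        })
        # Closing the details page is implicit in this sketch.
    return books


books = scrape_list_and_details("/best-sellers")
for book in books:
    print(book)
```

Each pass through the loop corresponds to one book in the list, so the result contains one row per book.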
We convert these steps into a flowchart, like this:
Create Recipe
The previous analysis has laid a good foundation for us to create a recipe, so let's get started!
For ease of understanding, we will create the recipe from scratch, so choose "Custom" as the template.
Step 1. Edit the "Open Page" node and fill in the current link.
Step 2. Add the "Scroll Page" node and keep its default values.
Step 3. Edit the "Extract Data" node to extract the book links. The configuration is as follows:
- Change the name of the table to: Links (optional, just for better visibility)
- Uncheck: Dump (no storage required, only for reference by other nodes)
- Set the row selector to: div.book-list > div.book-item (you can generate it via Advanced Finder)
- Add a column named: Value with selector: h3 > a (you can generate it via Advanced Finder)
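To see how a row selector and a column selector work together, here is a minimal Python sketch. It rests on several assumptions: the markup in SAMPLE_LIST_PAGE is a simplified, hypothetical version of the practice page, the CSS selectors are rewritten as the equivalent XPath expressions supported by the standard library, and the column is assumed to store the link's href attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real practice page may differ.
SAMPLE_LIST_PAGE = """
<html><body>
  <div class="book-list">
    <div class="book-item"><h3><a href="/books/1">Book 1</a></h3></div>
    <div class="book-item"><h3><a href="/books/2">Book 2</a></h3></div>
  </div>
</body></html>
""".strip()


def extract_links(page_source):
    root = ET.fromstring(page_source)
    # Row selector: div.book-list > div.book-item (one row per book)
    rows = root.findall(".//div[@class='book-list']/div[@class='book-item']")
    table = []
    for row in rows:
        # Column selector, evaluated relative to the row: h3 > a
        anchor = row.find("h3/a")
        if anchor is not None:
            # Assumption: the "Value" column holds the href attribute.
            table.append({"Value": anchor.get("href")})
    return table


links_table = extract_links(SAMPLE_LIST_PAGE)
print(links_table)
```

The row selector decides how many rows the table gets, and each column selector is then applied inside a single row. Note that ElementTree only parses well-formed markup, so this approach suits small samples rather than arbitrary real-world HTML.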
Step 4. Add the "Enter Loop" node to open the book details page in a loop. The configuration is as follows:
- Change the name of the loop to: Loop Links (optional, just for better recognition)
- Data source, select "From Table Input"
- For the reference field, select "Links.Value" (this is the data table we created in the previous step)
- Fill in the link of a book's detail page
Step 5. Open the details page of a book, taking Book 1 as an example.
Step 6. Edit the "Open Page" node in the loop, and quickly fill in the current page link.
Step 7. Add the "Extract Data" node inside the loop. This is the table where the final data will be stored. The configuration is as follows:
- Change the name of the table to: Books (optional, just for better recognition)
- Check: Dump (need to store data)
- The rows selector is: body (to extract the details page, it is usually enough to fill in body)
- Add column 1, named: Title, selector: h1 (you can generate it via Advanced Finder)
- Add column 2, named: Author, selector: div.book-details_author (you can use Advanced Finder to generate)
- Add column 3, named: Link, column type: Extract from system, ExtractPage Link field
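The detail-page configuration can be sketched the same way. In this hedged Python example, SAMPLE_DETAIL_PAGE is hypothetical markup, extract_book and text_of are illustrative helper names, and the Link column is filled from the page URL passed in by the caller, standing in for the system-provided page link.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified detail page; the real page may differ.
SAMPLE_DETAIL_PAGE = """
<html><body>
  <h1>Book 1</h1>
  <div class="book-details_author">Author 1</div>
</body></html>
""".strip()


def text_of(element):
    """Return the stripped text of an element, or None if it is missing."""
    if element is None or element.text is None:
        return None
    return element.text.strip()


def extract_book(page_source, page_link):
    root = ET.fromstring(page_source)
    # Row selector: body, so the whole page produces exactly one row.
    body = root.find("body")
    return {
        "Title": text_of(body.find(".//h1")),
        "Author": text_of(body.find(".//div[@class='book-details_author']")),
        # Not read from the markup: taken from the URL of the opened page.
        "Link": page_link,
    }


book = extract_book(SAMPLE_DETAIL_PAGE, "/books/1")
print(book)
```

Because the row selector matches the single body element, each opened details page adds one row to the Books table.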
So far, all nodes are configured.
Step 8. Fill in the recipe's basic information
- Fill in a suitable recipe name
- Briefly describe the function of the recipe
- Check "Open In New Tab" (to extract data from combined pages, you usually need to check this option)
- Check "Active Tab" (some websites need to be displayed before loading data)
Summary
Well, the recipe for extracting data from combined pages is ready. Let's see what it looks like!
The key point of data collection on combined pages is how to open the details page from the list and continue collecting.
There are 2 ways to achieve this:
- Collect the links of all items in the list, and then open them one by one through a loop
- Collect an element from each item in the list, and then click those elements one by one in a loop to open the details pages
All right! This is the end of this tutorial. Thank you for reading, and go try it now!