Tuesday, November 21, 2023

How To Crawl A Large Site And Extract Data Using Screaming Frog’s SEO Spider


We’re assisting a number of clients right now with Marketo migrations. As large companies use enterprise solutions like this, it’s like a spider web that weaves itself into processes and platforms over years… to the point that companies aren’t even aware of every touchpoint.

With an enterprise marketing automation platform like Marketo, forms are the entry point of data throughout sites and landing pages. Companies often have thousands of pages and hundreds of forms throughout their sites that need to be identified for updating.

A great tool for this is Screaming Frog’s SEO Spider… perhaps the most popular platform in the SEO market for crawling, auditing, and extracting data from a site. The platform is feature-rich and offers hundreds of options for almost every task you require. The features extend far beyond optimization for search, though, with one incredibly handy feature for extracting data from your site as it’s being crawled.

Screaming Frog SEO Spider: Crawl And Extract

A key feature of Screaming Frog SEO Spider is that you can perform custom extractions based on Regex, XPath, or CSSPath specifications. This comes in extremely handy, as we want to crawl the client’s sites and audit and capture the MunchkinID and FormId values from pages.

With the tool, open Configuration > Custom > Extraction to identify the elements you want to extract.

The extraction screen allows for virtually limitless data collection.

Regex, XPath, and CSSPath Extraction

For the MunchkinID, the identifier is located within the form script that’s inside the page:

<script type='text/javascript' id='marketo-fat-js-extra'>
    /* <![CDATA[ */
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https://yoursite.com/wp-admin/admin-ajax.php",
        "popout": {
            "enabled": false
        }
    };
    /* ]]> */
</script>

We then apply a Regex rule to capture the id from within the script tag that’s inserted in the page:

Regex: ["']id["']: *["'](.*?)["']
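
Before running a full crawl, it can help to check the rule against a saved page source. The following is a minimal Python sketch, assuming the page contains a script block like the one above; the sample markup and the function name extract_munchkin_ids are illustrative, not part of Screaming Frog.

```python
import re

# Pattern from the article: capture the value of an "id" key in the inline script.
MUNCHKIN_RE = re.compile(r"""["']id["']: *["'](.*?)["']""")

# Hypothetical page fragment mirroring the Marketo snippet shown earlier.
html = """
<script type='text/javascript' id='marketo-fat-js-extra'>
    var marketoFat = {
        "id": "123-ABC-456",
        "prepopulate": "",
        "ajaxurl": "https://yoursite.com/wp-admin/admin-ajax.php"
    };
</script>
"""


def extract_munchkin_ids(page_source: str) -> list[str]:
    """Return every value captured by the pattern, in document order."""
    return MUNCHKIN_RE.findall(page_source)


munchkin_ids = extract_munchkin_ids(html)
print(munchkin_ids)
```

Note that the rule is loose: any other quoted "id" key in the page source would also be captured, so it is worth reviewing the extracted values after the crawl.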

For the Form ID, the data is in an input tag within the Marketo form:

<input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234">

We apply an XPath rule to capture the id from within the form that’s inserted in the page. The XPath query looks for a form containing an input with a name of formid, then the extraction saves the value:

XPath: //form/input[@name="formid"]/@value
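
The same lookup can be approximated outside the tool with Python's standard library. This is only a sketch under a few assumptions: the fragment below is hypothetical and written as well-formed XML (real pages usually are not, so a lenient HTML parser would be needed in practice), and the form id mktoForm_1234 is made up for illustration. ElementTree supports a subset of XPath and cannot select attributes with /@value, so the attribute is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment; the input is self-closed so it parses as XML.
fragment = """
<div>
  <form id="mktoForm_1234">
    <input type="hidden" name="formid" class="mktoField mktoFieldDescriptor" value="1234" />
  </form>
</div>
"""

root = ET.fromstring(fragment)

# Find inputs named "formid" that are direct children of a form,
# then read the value attribute from each matched element.
form_ids = [
    node.get("value")
    for node in root.findall(".//form/input[@name='formid']")
]
print(form_ids)
```

Like the XPath rule, this only matches an input that is a direct child of the form element.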

Extract Inline Style Tags

We’re helping a client right now clean up a site where they used inline styles in the Elementor plugin to customize virtually every element within a page. To identify where inline styles were used, we scraped the site with a number of RegEx rules for custom extraction:

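  • Span Tag Inline Style:
<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Anchor Tag Inline Style:
<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Div Tag Inline Style:
<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"
  • Heading Tag Inline Style:
<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"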
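
To sanity-check rules like these before a crawl, a short Python sketch can run them over a sample of markup. The patterns below are written with their regex backslashes in place (\s for whitespace) and use h[1-6] for headings; the sample markup, the STYLE_RULES dictionary, and the find_inline_styles helper are illustrative assumptions rather than anything provided by Screaming Frog.

```python
import re

# One rule per tag; each captures the contents of a double-quoted style attribute.
STYLE_RULES = {
    "span": r'<span\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "a": r'<a\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "div": r'<div\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
    "heading": r'<h[1-6]\s+(?:[^>]*?\s+)?style\s*=\s*"([^"]*)"',
}

# Hypothetical markup with one inline style per tag type and one clean element.
sample = (
    '<div class="box" style="margin:0">'
    '<h2 style="color:red">Title</h2>'
    '<span style="font-weight:bold">Hi</span>'
    '<a href="/x" style="text-decoration:none">Link</a>'
    '<p>No inline style here</p>'
    '</div>'
)


def find_inline_styles(html: str) -> dict[str, list[str]]:
    """Map each rule name to the inline style values it captures."""
    return {name: re.findall(pattern, html) for name, pattern in STYLE_RULES.items()}


inline_styles = find_inline_styles(sample)
for tag, styles in inline_styles.items():
    print(tag, styles)
```

These patterns only capture style attributes wrapped in double quotes, so single-quoted attributes would need a separate rule.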

Exclude Subdomains In Your Crawl

At Martech Zone, we serve the site in multiple languages on different subdomains. Crawling these translations isn’t necessary since all the assets and data are based on the core site. Because of this, we enabled the Exclude List Configuration and added the following rule:

.*\.martech\.zone

You can also use this to skip crawling unnecessary paths like tags by adding:

martech\.zone/tag/.*
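
As a rough way to reason about what such rules would skip, the sketch below checks a few URLs against both patterns in Python, with the dots escaped so they match literally. It assumes the rules are applied as partial regex matches (re.search); how the crawler itself applies them may differ, so treat this as an approximation. The sample URLs, including the es subdomain, are hypothetical.

```python
import re

# Exclude rules with dots escaped so "." is not treated as "any character".
EXCLUDE_RULES = [
    r".*\.martech\.zone",    # any subdomain of martech.zone
    r"martech\.zone/tag/.*", # tag archive paths
]


def is_excluded(url: str, rules: list[str] = EXCLUDE_RULES) -> bool:
    """Return True if any rule matches somewhere in the URL (assumed partial match)."""
    return any(re.search(rule, url) for rule in rules)


# Hypothetical URLs to check before crawling.
urls = [
    "https://martech.zone/",
    "https://es.martech.zone/",
    "https://martech.zone/tag/seo/",
    "https://martech.zone/screaming-frog/",
]

results = {url: is_excluded(url) for url in urls}
for url, excluded in results.items():
    print(f"{'skip ' if excluded else 'crawl'} {url}")
```

With an unescaped dot, the first rule would also match https://martech.zone/ itself under partial matching, because the "/" before the hostname satisfies the wildcard.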

The platform even has a nice way to test some URLs against the rules to ensure they work properly before you crawl your site.

Screaming Frog SEO Spider JavaScript Rendering

Another great option in Screaming Frog is that you aren’t limited to the HTML in the page; you can render any JavaScript that’s going to insert forms within your site. Within Configuration > Spider, you can go to the Rendering tab and enable this.

This does take a little longer to crawl the site, of course, but you’ll get forms that are rendered client-side by JavaScript as well as forms that are inserted server-side.

While this is a very specific application, it’s an incredibly useful one when you’re working with large sites. You’ll absolutely want to audit where your forms are embedded throughout the site.

Download Screaming Frog SEO Spider

Disclosure: Martech Zone is using its affiliate links in this article.
