A Breakdown of HTML Usage Across ~8 Million Pages (& What It Means for Modern SEO)

Not long ago, my colleagues and I at Advanced Web Ranking came up with an HTML study based on about 8 million index pages gathered from the top twenty Google results for more than 30 million keywords.

We wrote about the markup results and how the top twenty Google results pages implement them, then went even further and obtained HTML usage insights on them.

What does this have to do with SEO?

The way HTML is written dictates what users see and how search engines interpret web pages. A valid, well-formatted HTML page also reduces possible misinterpretation (of structured data, metadata, language, or encoding) by search engines.

This is intended to be a technical SEO audit, something we wanted to do from the beginning: a breakdown of HTML usage and how the results relate to modern SEO techniques and best practices.

In this article, we're going to address things like meta tags that Google understands, JSON-LD structured data, language detection, headings usage, social links & meta distribution, AMP, and more.

Meta tags that Google understands

When talking about the main search engines as traffic sources, sadly it's just Google and the rest, with DuckDuckGo gaining traction lately and Bing almost nonexistent.

Thus, in this section we'll be focusing solely on the meta tags that Google lists in the Search Console Help Center.

Pie chart showing the total numbers for the meta tags that Google understands, described in detail in the sections below.

The meta description is a ~150 character snippet that summarizes a page's content. Search engines show the meta description in the search results when the searched phrase is contained in the description.






At the extremes, we found 685,341 meta elements with content shorter than 30 characters and 1,293,842 elements with content longer than 160 characters.
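As a rough illustration of how a single page could be bucketed this way, here is a minimal Python sketch using only the standard library. The class and function names are hypothetical and this is not the tooling used for the study; the 30 and 160 character thresholds are the ones quoted above.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Collects the content of every <meta name="description"> element."""

    def __init__(self):
        super().__init__()
        self.descriptions = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute values can be None (e.g. <meta name>), hence the "or".
        if (attr.get("name") or "").lower() == "description":
            self.descriptions.append((attr.get("content") or "").strip())


def classify_description(html, short=30, long=160):
    """Return 'missing', 'short', 'long' or 'ok' for the first description."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    if not parser.descriptions:
        return "missing"
    length = len(parser.descriptions[0])
    if length < short:
        return "short"
    if length > long:
        return "long"
    return "ok"
```

Only the first description on the page is classified; that is a simplifying assumption for the sketch.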

The title is technically not a meta tag, but it's used in conjunction with meta name="description".

This is one of the two most important HTML tags when it comes to SEO. It's also a must according to the W3C, meaning no page is valid with a missing title tag.

Research suggests that if you keep your titles under a reasonable 60 characters, you can expect them to be rendered properly in the SERPs. In the past, there were signs that Google's search results title length was extended, but it wasn't a permanent change.

Considering all the above, from the full 6,263,396 titles we found, 1,846,642 title tags appear to be too long (more than 60 characters) and 1,985,020 titles had lengths considered too short (under 30 characters).

Pie chart showing the title tag length distribution, with a length less than 30 chars being 31.7% and a length greater than 60 chars being about 29.5%.

A title being too short shouldn't be a problem; after all, it's a subjective thing depending on the website's business. Meaning can be expressed with fewer words, but it's definitely a sign of wasted optimization opportunity.

SELECTOR              COUNT
<title>*</title>      6,263,396
missing title tag     1,285,738

Another interesting thing is that, among the sites ranking on page 1–2 of Google, 351,516 (~5% of the total 7.5M) are using the same text for the title and h1 on their index pages.

Also, did you know that with HTML5 you only need to specify the HTML5 doctype and a title in order to have a perfectly valid page?

<!DOCTYPE html>
<title>pink</title>

"These meta tags can control the behavior of search engine crawling and indexing. The robots meta tag applies to all search engines, while the "googlebot" meta tag is specific to Google."
– Meta tags that Google understands

HTML snippet with a meta robots and its content parameters.

So the robots meta directives provide instructions to search engines on how to crawl and index a page's content. Leaving aside the googlebot meta count, which is kind of low, we were curious to see the most frequent robots parameters, considering that a huge misconception is that you have to add a robots meta tag in your HTML's head. Here's the top 5:
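For readers who want to reproduce this kind of tally on their own pages, the following is a minimal Python sketch. The names are hypothetical, it is not the study's code, and it assumes directives are comma-separated inside the content attribute of meta name="robots" or meta name="googlebot".

```python
from collections import Counter
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Tallies directives found in robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() not in ("robots", "googlebot"):
            return
        # Directives are comma-separated and case-insensitive.
        for token in (attr.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives[token] += 1


def robots_directives(html):
    """Return a Counter mapping each directive to its number of occurrences."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives
```

Summing the counters returned for many pages and calling most_common(5) on the result would give a top five list.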








"When users search for your site, Google Search results sometimes display a search box specific to your site, along with other direct links to your site. This meta tag tells Google not to show the sitelinks search box."
– Meta tags that Google understands

Unsurprisingly, not many websites choose to explicitly tell Google not to show a sitelinks search box when their site appears in the search results.

"This meta tag tells Google that you don't want us to provide a translation for this page."
– Meta tags that Google understands

There may be situations where providing your content to a much larger group of users is not desired. Just as it says in the Google help answer above, this meta tag tells Google that you don't want them to provide a translation for this page.




"You can use this tag on the top-level page of your site to verify ownership for Search Console."
– Meta tags that Google understands

While we're on the subject, did you know that if you're a verified owner of a Google Analytics property, Google will now automatically verify that same website in Search Console?

"This defines the page's content type and character set."
– Meta tags that Google understands

This is basically one of the good meta tags. It defines the page's content type and character set. Considering the table below, we noticed that almost half of the index pages we analyzed define a meta charset.




"This meta tag sends the user to a new URL after a certain amount of time and is sometimes used as a simple form of redirection."
– Meta tags that Google understands

It's preferable to redirect your site using a 301 redirect rather than a meta refresh, especially when we assume that 30x redirects don't lose PageRank and the W3C recommends that this tag not be used. Google is not a fan either, recommending you use a server-side 301 redirect instead.

From the total 7.5M index pages we parsed, we found 7,167 pages that are using the above redirect method. Authors do not always have control over server-side technologies, and apparently they use this technique in order to enable redirects on the client side.

Also, using Workers is a cutting-edge alternative for overcoming issues when working with legacy tech stacks and platform limitations.
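To make the meta refresh pattern concrete, here is a small Python sketch that finds such tags and splits the content value into a delay and an optional target URL. The names and the regular expression are assumptions made for illustration; real-world content values can be messier than this pattern allows.

```python
import re
from html.parser import HTMLParser

# Matches values such as "30" or "0; url=https://example.com/new".
REFRESH_RE = re.compile(r"^\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*?))?\s*$", re.I)


class MetaRefreshParser(HTMLParser):
    """Collects (delay, target) pairs from <meta http-equiv="refresh"> tags."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        match = REFRESH_RE.match(attr.get("content") or "")
        if not match:
            return
        target = (match.group(2) or "").strip("'\"")
        # A missing or empty target means the page simply reloads itself.
        self.refreshes.append((int(match.group(1)), target or None))


def find_meta_refresh(html):
    """Return a list of (delay_seconds, target_url_or_None) tuples."""
    parser = MetaRefreshParser()
    parser.feed(html)
    return parser.refreshes
```

A result with a target URL and a delay of 0 is the typical client-side redirect that a server-side 301 would replace.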

"This tag tells the browser how to render a page on a mobile device. Presence of this tag indicates to Google that the page is mobile-friendly."
– Meta tags that Google understands

Starting July 1, 2019, Google enabled mobile-first indexing by default for all new sites. Lighthouse checks whether there's a meta name="viewport" tag in the head of the document, so this meta should be on every webpage, no matter what framework or CMS you're using.

Considering the above, we would have expected more websites than the 4,992,791 out of 7.5 million index pages analyzed to use a valid meta name="viewport" in their head sections.

Designing mobile-friendly sites ensures that your pages perform well on all devices, so it is worth testing that your web page is mobile-friendly.
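A quick way to spot pages without a viewport declaration is sketched below in Python. The names are hypothetical, and treating a tag as usable when its content mentions width= is an assumption made for this example, not the validity rule applied in the study.

```python
from html.parser import HTMLParser


class ViewportParser(HTMLParser):
    """Collects the content of every <meta name="viewport"> element."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "viewport":
            self.contents.append((attr.get("content") or "").lower())


def has_usable_viewport(html):
    """True when some viewport meta declares a width (assumed criterion)."""
    parser = ViewportParser()
    parser.feed(html)
    # Spaces are removed so that "width = device-width" is also accepted.
    return any("width=" in content.replace(" ", "") for content in parser.contents)
```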

"Labels a page as containing adult content, to signal that it be filtered by SafeSearch results."
– Meta tags that Google understands

This tag is used to denote the maturity rating of content. It was not added to the list of meta tags that Google understands until recently. See the article by Kate Morris on how to tag adult content.

JSON-LD structured data

Structured data is a standardized format for providing information about a page and classifying the page content. The format of structured data can be Microdata, RDFa, or JSON-LD; all of these help Google understand the content of your site and trigger special search result features for your pages.

While having a conversation with the awesome Dan Shure, he came up with a good idea to look for structured data, such as the organization's logo, in search results and in the Knowledge Graph.

In this section, we'll be using JSON-LD (JavaScript Object Notation for Linked Data) only in order to gather structured data info. This is what Google recommends anyway for providing clues about the meaning of a web page.

Some useful bits on this:

At Google I/O 2019, it was announced that the structured data testing tool will be superseded by the rich results testing tool.

Now Googlebot indexes web pages using the latest Chromium rather than the old Chrome 41, meaning you can mitigate the SEO issues you may have had in the past, with structured data support as well.

Jason Barnard had an interesting talk at SMX London 2019 on how Google Search ranking works and, according to his theory, there are seven ranking factors we can count on; structured data is definitely one of them.

Builtvisible's guide on Microdata, JSON-LD, & Schema.org contains everything you need to know about using structured data on your website.

Here's an awesome guide to JSON-LD for beginners by Alexis Sanders.

Last but not least, there are lots of articles, presentations, and posts to dive into on the official JSON for Linking Data website.

Advanced Web Ranking's HTML study relies on analyzing index pages only. What's interesting is that even though it's not stated in the guidelines, Google doesn't seem to care about structured data on index pages, as stated in a Stack Overflow answer by Gary Illyes several years ago. Yet, on JSON-LD structured data types that Google understands, we found a total of 2,727,045 features:

Pie chart showing the structured data types that Google understands, with Sitelinks searchbox being 49.7% (the highest value).







Book
Corporate contact
Critic review
Employer aggregate rating
Fact check
FAQ page
Job posting
Local business
Q&A page
Review snippet
Sitelinks searchbox
Social profile
Software app
Subscription and paywalled content
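To give an idea of how JSON-LD features like these can be spotted on a page, here is a minimal Python sketch that reads every script type="application/ld+json" block and counts the @type values it declares. The names are hypothetical and this is not the study's extraction code; mapping schema.org types to Google's feature names (for example, a WebSite with a SearchAction for the sitelinks searchbox) is left out.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the raw text of every JSON-LD script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def _collect_types(node, counter):
    """Walk nested dicts and lists, counting every "@type" value found."""
    if isinstance(node, dict):
        types = node.get("@type")
        for name in ([types] if isinstance(types, str) else types or []):
            if isinstance(name, str):
                counter[name] += 1
        for value in node.values():
            _collect_types(value, counter)
    elif isinstance(node, list):
        for item in node:
            _collect_types(item, counter)


def jsonld_types(html):
    """Return a Counter of @type values; invalid JSON blocks are skipped."""
    parser = JsonLdParser()
    parser.feed(html)
    counter = Counter()
    for raw in parser.blocks:
        try:
            _collect_types(json.loads(raw), counter)
        except json.JSONDecodeError:
            continue
    return counter
```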





The rel=canonical element, often called the "canonical link," is an HTML element that helps webmasters prevent duplicate content issues. It does this by specifying the "canonical URL," the "preferred" version of a web page.




meta name="keywords"

It's not news that meta name="keywords" is obsolete and Google doesn't use it anymore. It also appears as though meta name="keywords" is a spam signal for most of the search engines.

"While the main search engines don't use meta keywords for ranking, they're very useful for onsite search engines like Solr."
– JP Sherman on why this obsolete meta might still be useful nowadays.







Within 7.5 million pages, h1 (59.6%) and h2 (58.9%) are among the twenty-eight elements used on the most pages. Still, after gathering all the headings, we found that h3 is the heading with the largest number of appearances: 29,565,562 h3s out of 70,428,376 total headings found.

Random facts:

The h1–h6 elements represent the six levels of section headings. Here are the full stats on headings usage, but we found 23,116 h7s and 7,276 h8s too. That's a funny thing because plenty of people don't even use h6s very often.

There are 3,046,879 pages with missing h1 tags and within the rest of the 4,502,255 pages, the h1 usage frequency is 2.6, with a total of 11,675,565 h1 elements.

While there are 6,263,396 pages with a valid title, as seen above, only 4,502,255 of them are using an h1 within the body of their content.
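The per-page side of these heading stats can be sketched in a few lines of Python. The names are hypothetical and the approach (matching any tag named h followed by a number) is an assumption chosen so that oddities such as h7 and h8 show up too.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Matches h1, h2, ... and also non-standard names such as h7 or h8.
HEADING_RE = re.compile(r"^h[1-9]\d*$")


class HeadingCounter(HTMLParser):
    """Counts heading start tags by tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if HEADING_RE.match(tag):
            self.counts[tag] += 1


def heading_stats(html):
    """Summarize heading usage for one page."""
    parser = HeadingCounter()
    parser.feed(html)
    return {
        "counts": dict(parser.counts),
        "missing_h1": parser.counts["h1"] == 0,
        # Levels above 6 are not defined by the HTML specification.
        "non_standard": sorted(t for t in parser.counts if int(t[1:]) > 6),
    }
```

Dividing the total number of h1 elements by the number of pages that have at least one gives the usage frequency quoted above (11,675,565 / 4,502,255 ≈ 2.6).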

Missing alt tags

This eternal SEO and accessibility issue still seems to be common after analyzing this set of data. From the total of 669,591,743 images, almost 90% are missing the alt attribute or use it with a blank value.

Pie chart showing the img tag alt attribute distribution, with missing alt being predominant: 81.7% of a total of about 670 million images we found.





img alt="*"
img alt=""
img w/ missing alt
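The three buckets above translate directly into a per-page check. Here is a minimal Python sketch with hypothetical names; treating a whitespace-only alt as empty is an assumption made for this example.

```python
from html.parser import HTMLParser


class ImgAltParser(HTMLParser):
    """Sorts <img> elements into with_text, empty and missing alt buckets."""

    def __init__(self):
        super().__init__()
        self.counts = {"with_text": 0, "empty": 0, "missing": 0}

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        if "alt" not in attr:
            self.counts["missing"] += 1
        elif (attr["alt"] or "").strip():
            self.counts["with_text"] += 1
        else:
            # Covers alt="" as well as a bare alt attribute without a value.
            self.counts["empty"] += 1


def img_alt_stats(html):
    """Return the number of images in each alt bucket for one page."""
    parser = ImgAltParser()
    parser.feed(html)
    return parser.counts
```

An empty alt is legitimate for purely decorative images, so the "missing" bucket is the one that most clearly signals a problem.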


Language detection

According to the specs, the language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways.

The part we're interested in here is about "assisting search engines."

"The HTML lang attribute is used to identify the language of text content on the web. This information helps search engines return language specific results, and it is also used by screen readers that switch language profiles to provide the correct accent and pronunciation."
– Léonie Watson

A while ago, John Mueller said Google ignores the HTML lang attribute and recommended the use of link hreflang instead. The Google Search Console documentation states that Google uses hreflang tags to match the user's language preference to the right variation of your pages.

Bar chart showing that 65% of the 7.5 million index pages use the lang attribute on the html element, while 21.6% use at least one link hreflang.

Of the 7.5 million index pages that we were able to look into, 4,903,665 use the lang attribute on the html element. That's about 65%!

When it comes to the hreflang attribute, suggesting the existence of a multilingual website, we found about 1,631,602 pages; that means around 21.6% of index pages use at least one link rel="alternate" href="*" hreflang="*" element.
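Both signals discussed in this section can be read from a page with a short parser. The Python sketch below uses hypothetical names and only looks at the lang attribute on the html element and at link rel="alternate" elements that carry both href and hreflang.

```python
from html.parser import HTMLParser


class LanguageSignalParser(HTMLParser):
    """Reads <html lang> and <link rel="alternate" hreflang> declarations."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.hreflang = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html" and attr.get("lang"):
            self.html_lang = attr["lang"].strip()
        elif tag == "link":
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "alternate" in rel_tokens and attr.get("hreflang") and attr.get("href"):
                self.hreflang.append((attr["hreflang"].strip(), attr["href"].strip()))


def language_signals(html):
    """Return the declared page language and all (hreflang, href) pairs."""
    parser = LanguageSignalParser()
    parser.feed(html)
    return {"html_lang": parser.html_lang, "hreflang": parser.hreflang}
```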

Google Tag Manager

From the beginning, Google Analytics' main task was to generate reports and statistics about your website. But if you want to group certain pages together to see how people are navigating through that funnel, you need a unique Google Analytics tag. This is where things get complicated.

Google Tag Manager makes it easier to:

Manage this mess of tags by letting you define custom rules for when and on what user actions your tags should fire

Change your tags whenever you want without actually changing the source code of your website, which sometimes can be a headache due to slow release cycles

Use other analytics/marketing tools with GTM, again without touching the website's source code

We searched for *googletagmanager.com/gtm.js references and saw that about 345,979 pages are using Google Tag Manager.


"Nofollow" provides a way for webmasters to tell search engines "don't follow links on this page" or "don't follow this specific link."

Google does not follow these links and likewise does not transfer equity. Considering this, we were curious about rel="nofollow" numbers. We found a total of 12,828,286 rel="nofollow" links within 7.5 million index pages, with a computed average of 1.69 rel="nofollow" per page.
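Counting such links on a single page is straightforward. The Python sketch below uses hypothetical names; it treats rel as a space-separated list of tokens and counts each anchor at most once per token. Other rel tokens can be tracked by passing them in the tokens argument.

```python
from html.parser import HTMLParser


class RelTokenCounter(HTMLParser):
    """Counts <a href> elements whose rel attribute contains given tokens."""

    def __init__(self, tokens):
        super().__init__()
        self.counts = {token: 0 for token in tokens}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if "href" not in attr:
            return
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = set((attr.get("rel") or "").lower().split())
        for token in self.counts:
            if token in rel_tokens:
                self.counts[token] += 1


def count_rel_links(html, tokens=("nofollow",)):
    """Return a dict mapping each tracked rel token to its link count."""
    parser = RelTokenCounter(tokens)
    parser.feed(html)
    return parser.counts
```

Summing the nofollow counts over all pages and dividing by the number of pages gives a per-page average like the one quoted above.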

Last month, Google announced two new link attribute values that should be used in order to mark the nofollow property of a link: rel="sponsored" and rel="ugc". I'd recommend you go read Cyrus Shepard's article on how Google's nofollow, sponsored, & ugc links impact SEO to learn why Google changed nofollow, the ranking impact of nofollow links, and more.

A table showing how Google's nofollow, sponsored, and UGC link attributes impact SEO, from Cyrus Shepard's article.

We went a bit further and looked up these new link attribute values, finding 278 rel="sponsored" and 123 rel="ugc". To make sure we had the relevant data for these queries, we updated the index pages data set specifically two weeks after the Google announcement on this matter. Then, using Moz authority metrics, we sorted out the top URLs we found that use at least one of the rel="sponsored" or rel="ugc" pair:



Accelerated Mobile Pages (AMP) are a Google initiative which aims to speed up the mobile web. Many publishers are making their content available in the AMP format in parallel.

To let Google and other platforms know about it, you need to link AMP and non-AMP pages together.

Within the millions of pages we looked at, we found only 24,807 non-AMP pages referencing their AMP version using rel=amphtml.
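The linking described here goes both ways: the non-AMP page points to its AMP version with link rel="amphtml", and the AMP page points back with link rel="canonical". A minimal Python sketch of such a check follows; the names are hypothetical and comparing URLs as exact strings is a simplifying assumption.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collects href values of <link> elements, grouped by rel token."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        if not href:
            return
        for token in (attr.get("rel") or "").lower().split():
            self.links.setdefault(token, []).append(href)


def link_targets(html, rel):
    """Return every href declared with the given rel token."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links.get(rel, [])


def is_linked_amp_pair(page_html, page_url, amp_html, amp_url):
    """True when the page references its AMP version and the AMP page points back."""
    return (amp_url in link_targets(page_html, "amphtml")
            and page_url in link_targets(amp_html, "canonical"))
```

The same link_targets helper can be used on its own to read a page's rel="canonical" URL.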


We wanted to know how shareable or social a website is nowadays, so knowing that Josh Buchea made an awesome list with everything that could go in the head of your webpage, we extracted the social sections from there and got the following numbers:

Facebook Open Graph

Bar chart showing the Facebook Open Graph meta tags distribution, described in detail in the table below.



meta property="fb:app_id" content="*"
meta property="og:url" content="*"
meta property="og:type" content="*"
meta property="og:title" content="*"
meta property="og:image" content="*"
meta property="og:image:alt" content="*"
meta property="og:description" content="*"
meta property="og:site_name" content="*"
meta property="og:locale" content="*"
meta property="article:author" content="*"


Twitter card

Bar chart showing the Twitter Card meta tags distribution, described in detail in the table below.



meta name="twitter:card" content="*"
meta name="twitter:site" content="*"
meta name="twitter:creator" content="*"
meta name="twitter:url" content="*"
meta name="twitter:title" content="*"
meta name="twitter:description" content="*"
meta name="twitter:image" content="*"
meta name="twitter:image:alt" content="*"


And speaking of links, we grabbed all of those pointing to the most popular social networks.

Pie chart showing the external social links distribution, described in detail in the table below.

Apparently there are lots of websites that still link to their Google+ profiles, which is probably an oversight considering the not-so-recent Google+ shutdown.


According to Google, using rel=prev/next is not an indexing signal anymore, as announced earlier this year:

"As we evaluated our indexing signals, we decided to retire rel=prev/next. Studies show that users love single-page content, aim for that when possible, but multi-part is also fine for Google Search."
– Tweeted by Google Webmasters

However, in case it matters for you, Bing says it uses them as hints for page discovery and site structure understanding.

"We're using these (like most markup) as hints for page discovery and site structure understanding. At this point, we're not merging pages together in the index based on these and we're not using prev/next in the ranking model."
– Frédéric Dubut from Bing

Nevertheless, here are the usage stats we found while looking at millions of index pages:




That's pretty much it!

Knowing how the average web page looks, using data from about 8 million index pages, can give us a clearer idea of trends and help us visualize common usage of HTML when it comes to modern and emerging SEO techniques. But this can be a never-ending saga; while having lots of numbers and stats to explore, there are still lots of questions that need answering:

We know how structured data is used in the wild now. How will it evolve and how much structured data will be considered enough?

Should we expect AMP usage to increase somewhere in the future?

How will rel="sponsored" and rel="ugc" change the way we write HTML on a daily basis? When coding external links, besides the target="_blank" and rel="noopener" combo, we now have to consider the rel="sponsored" and rel="ugc" combinations as well.

Will we ever learn to always add alt attribute values for images that have a purpose beyond decoration?

How many more additional meta tags or attributes will we have to add to a web page to please the search engines? Did we really need the newly announced data-nosnippet HTML attribute? What's next, data-allowsnippet?

There are other things we would have liked to address as well, like "time-to-first-byte" (TTFB) values, which correlate highly with ranking; I'd highly recommend HTTP Archive for that. They periodically crawl the top sites on the web and record detailed information about almost everything. According to the latest info, they've analyzed 4,565,694 unique websites, with complete Lighthouse scores and with particular technologies like jQuery or WordPress recorded for the whole data set. Huge props to Rick Viscomi, who does an amazing job as its "steward," as he likes to call himself.

Performing this large-scale study was a fun ride. We learned a lot and we hope you found the above numbers as interesting as we did. If there is a tag or attribute in particular you would like to see the numbers for, please let me know in the comments below.

Once again, check out the full HTML study results and let me know what you think!
