Google: Content Stitching Or Quilting Is Not Near Duplicate Content

Jun 21, 2017 - 8:11 am


Dawn Anderson followed up with Google's Gary Illyes on the question of what counts as near-duplicate content, asking whether it is the same thing as content stitching or quilting. As Dawn suspected, Gary said it is not. On Twitter, Dawn asked "'Content stitching / quilting'... this is not the same as near-duplicate as defined in ur prev tweet?" and Gary responded that she is correct.


Dawn then sent me some more technical information on this. She said that Marc Najork, who is now at Google, wrote a paper on the subject while at Microsoft, titled "Detecting Quilted Web Pages at Scale." Here is the abstract:

Web-based advertising and electronic commerce, combined with the key role of search engines in driving visitors to ad-monetized and e-commerce web sites, has given rise to the phenomenon of web spam: web pages that are of little value to visitors, but that are created mainly to mislead search engines into driving traffic to target web sites. A large fraction of spam web pages is automatically generated, and some portion of these pages is generated by stitching together parts (sentences or paragraphs) of other web pages. This paper presents a scalable algorithm for detecting such “quilted” web pages. Previous work by the author and his collaborators introduced a sampling-based algorithm that was capable of detecting some, but by far not all quilted web pages in a collection. By contrast, the algorithm presented in this work identifies all quilted web pages, and it is scalable to very large corpora. We tested the algorithm on the half-billion page English-language subset of the ClueWeb09 collection, and evaluated its effectiveness in detecting web spam by manually inspecting small samples of the detected quilted pages. This manual inspection guided us in iteratively refining the algorithm to be more efficient in detecting real-world spam.
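
To make the distinction between a quilted page and a near-duplicate a bit more concrete, here is a toy Python sketch. It is not the algorithm from the paper, which is not reproduced here. It simply assumes that a "quilted" page is one where most sentences also appear on other pages while no single other page accounts for most of them, whereas a near-duplicate is mostly covered by one source. The function names, the naive sentence splitter, and both thresholds are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def sentences(text):
    """Split text into lowercased, whitespace-normalized sentences (naive splitter)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def build_index(corpus):
    """Map each sentence to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in corpus.items():
        for sentence in sentences(text):
            index[sentence].add(page_id)
    return index


def coverage(page_id, corpus, index):
    """Return (share of sentences found on other pages,
    largest share of sentences found on any single other page)."""
    sents = sentences(corpus[page_id])
    if not sents:
        return 0.0, 0.0
    shared = 0
    per_source = Counter()
    for sentence in sents:
        others = index.get(sentence, set()) - {page_id}
        if others:
            shared += 1
            per_source.update(others)
    top = max(per_source.values()) if per_source else 0
    return shared / len(sents), top / len(sents)


def looks_quilted(page_id, corpus, index, min_shared=0.8, max_single=0.6):
    """Hypothetical rule: mostly borrowed text, but not mostly from one page."""
    shared, single = coverage(page_id, corpus, index)
    return shared >= min_shared and single <= max_single


def looks_near_duplicate(page_id, corpus, index, min_single=0.7):
    """Hypothetical rule: one other page covers most of the sentences."""
    _, single = coverage(page_id, corpus, index)
    return single >= min_single


# Made-up example pages.
corpus = {
    "cats": "Cats sleep a lot. Cats like fish. Cats purr loudly.",
    "dogs": "Dogs bark at night. Dogs chase balls. Dogs love walks.",
    "quilt": "Cats like fish. Dogs chase balls. Cats purr loudly. Dogs love walks.",
    "cats_copy": "Cats sleep a lot. Cats like fish. Cats purr loudly. Updated today.",
}
index = build_index(corpus)
for page in corpus:
    print(page, looks_quilted(page, corpus, index), looks_near_duplicate(page, corpus, index))
```

In this toy data, the "quilt" page borrows every sentence but splits them across sources, while "cats_copy" takes nearly everything from one page. Real systems would need far more robust text normalization and a way to do this over hundreds of millions of pages, which is the scalability problem the abstract refers to.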

There is no doubt Google and other search engines are on to this type of behavior, but it is always nice to point to research papers when we can. Thanks, Dawn.

Forum discussion at Twitter.

 
