Google To Newspapers: Here, Let Me Introduce You To Robots.txt

from the snappy dept

With the silly introduction last week of the AP’s attempt to create a weird and totally unnecessary new data feed to keep out aggregators and search engines, it seems that Google has gotten fed up. Google execs and employees have made the same point on various panels and in discussions, but now Senior Business Product Manager Josh Cohen has put up a blog post directed at newspapers that can be summarized as: Dear newspapers: let me introduce you to a tool that’s been around forever. It’s called robots.txt. If you don’t like us indexing you, use it. Otherwise, shut up. In only slightly nicer language.

Companies: google


Comments on “Google To Newspapers: Here, Let Me Introduce You To Robots.txt”

27 Comments
Petréa Mitchell says:

The response I expect to hear...

…is that robots.txt is only a request. A polite aggregator will respect it, but the wicked, devious, pirating bloggers and scrapers that the AP is fighting an urgent battle against won’t.

And you have to acknowledge that there are occasional unscrupulous people who don’t pay attention to robots.txt. So then the AP goes, “Ha, we were right!”

So I think Google allowing itself to be drawn into an argument about technical details is a mistake here. Better to keep the focus on the disproportionate harm to positive, legitimate use caused by an attempt to guard against a small number of true pirates.

technomage (profile) says:

Re: The response I expect to hear...

While I agree with your overall assumption, I know that those “unscrupulous” people will find ways around the paywalls as well. The newspapers are trying to make a blanket rule against everyone, but the problem is any blanket always has holes. The fact that these newspaper corps refuse to use the tools already at hand, but would rather force the government to get involved to make new rules and tools, shows just how out of touch they really are concerning technology. Creating new standards to benefit only one thing does not solve the underlying issue: People want news, they want it now, and they don’t want to jump through hoops to get it.

Yakko Warner says:

Re: The response I expect to hear...

That may be true, but their argument will be with the aggregators that do not respect robots.txt, not with Google (which, according to the point of the blog, is a “polite aggregator” and respects robots.txt).

Not that you can trust the newspapers not to confuse “any other random (misbehaving, non-robots.txt-respecting) search engine” with “Google”.

I more expect to see one or more of the following:

* Newspapers write an invalid robots.txt file that ends up allowing Google to index their site, and they blame Google for their own technical ineptitude.

* Newspapers complain that they have to write a robots.txt file or META tag on their page, and demand Google adopt an “opt-in” policy for indexing (rather than this “opt-out” policy that is robots.txt).

* Some newspapers do everything correctly, stop their content from being indexed, and then blame Google when their traffic goes down (especially compared to the sites that aren’t blocking Google, which end up getting more traffic)

* Newspapers block their content from being indexed, but other papers or sites take the same stories and republish them and allow them to be indexed, and the original papers blame Google for indexing the republishing sites.

* Newspapers ignore this blog post completely and simply continue to blame Google for indexing content they don’t want indexed.
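The first scenario in the list above, a malformed robots.txt that silently permits crawling, is easy to reproduce. The sketch below uses Python's standard `urllib.robotparser`. The misspelled directive, the `allowed` helper, and the example.com URL are invented for illustration, and Googlebot's own parser may treat malformed files differently.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "http://example.com/news/story.html"  # hypothetical URL

valid = "User-agent: *\nDisallow: /\n"
typo = "User-agent: *\nDissallow: /\n"  # misspelled directive (hypothetical mistake)

# The parser skips directives it does not recognize, so the typo
# leaves the file with no effective rules at all.
print("valid file allows crawl:", allowed(valid, "Googlebot", URL))
print("typo file allows crawl: ", allowed(typo, "Googlebot", URL))
```

With this parser, a single misspelling is enough to turn "block everything" into "allow everything", and nothing warns the site owner.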

Anonymous Coward says:

The reason the robots.txt argument is flawed is that the newspapers don’t want Google to remove them from search. They want to show up in Google. They just want Google to pay them as well. It’s nothing more than a money grab. Google says, “If you have good, useful content, we will rate you high in our index. You will get traffic and we will make a little money from ads on the search page.” Newspapers say, “Sounds like a good deal. Except, we want you to pay us as well.” Google says, “No, we don’t think we should have to pay you. If you don’t like the deal you can opt out.” Newspapers say, “No, we like the deal, but you still have to pay us.” Google says, “WTF.”

Hulser (profile) says:

Re: Re: Re:

Well, that doesn’t make the argument flawed, does it? It just means the AP doesn’t have a leg to stand on

Exactly. Google knows that the AP knows about robots.txt. So the purpose of the Google blog post is not to let the AP know about robots.txt, it’s to let everyone else know that the AP knows about robots.txt, which will result in undermining the AP’s arguments for a legislative “solution”.

Anonymous Coward says:

Re: Re:

I completely agree with google here, these newspaper companies are being evil and selfish. If they don’t want Google linking to them then MAKE A ROBOTS.TXT file or TELL google to remove them from their index. I’m sure Google will be more than glad to remove them from their index. But you can’t force someone to use your product and then force them to pay for it. That’s like when the RIAA tried to force people to buy their music (and not boycott it) and then they tried to force them to pay for it ( http://www.techdirt.com/articles/20090616/1527385253.shtml ). Nonsense.

Ryan (profile) says:

yeah but

the problem with robots.txt is that nothing in the specification involves people paying the AP.

The AP doesn’t want people to stop using their content – they want to change the way the web works so that they can be paid whenever they think they should be.

They know that blocking google would be devastating to their industry, so instead they bitch and whine hoping that somebody will pay them to shut up.

Anonymous Coward says:

ACAP "protocol"

have you looked through the ACAP document? it’s like 40 pages long, and all they do is explain how to use robots.txt for the first 35 pages or so. then they introduce a few new tags for inline meta types and markup for robots.txt… while disallowing *.

i encourage these guys to disallow * just so they can die faster and get replaced by better news outlets (newscientist, courthousenews, etc).

besides, nothing prevents someone from using a spider that sets the user agent as one of the standard IE/FF user agent strings. then you’re stuck taking a javascript or IP address route which are also both unreliable.
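A minimal sketch of why the user-agent point holds: robots.txt rules are matched against whatever name the client reports about itself. The example uses Python's standard `urllib.robotparser`; the `ExampleScraper` name, the rules, and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ban one named scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/news/story.html"  # hypothetical URL

# The same program, identifying itself honestly and then with a
# browser-style user agent string.
honest = parser.can_fetch("ExampleScraper", url)
disguised = parser.can_fetch(
    "Mozilla/5.0 (Windows NT 10.0; rv:115.0) Firefox/115.0", url
)

print("honest name allowed:   ", honest)
print("disguised name allowed:", disguised)
```

The rules only bind a client that reports its real name and chooses to consult the file, which is why robots.txt works for polite aggregators and not for the scrapers the AP says it is worried about.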

Hulser (profile) says:

Re: Re:

Google also does not respect robots.txt 100% of the time

You tell me which is a more compelling argument…

A) We’re really pissed off that Google is linking to our web site but we can’t be bothered to implement a simple technical solution that would stop this.

B) We don’t want Google linking to our web site, but they’re ignoring our configuration and linking to it anyway.

Because the AP is choosing option A, it’s all but irrelevant whether Google respects robots.txt 100% of the time. Right now, the ball is in the AP’s court.

Ryan says:

Re: Re: Re:

Yeah, I don’t see this…Google is going to code their bots the same way, so it’ll treat every site the same way. Unless they added in exceptions to specific sites, although I don’t know why they’d do that. Do they have a shit list of webmasters they don’t like that they periodically update in their scrapers? Seems to me like an exception would be an improperly used robots.txt file.

Ryan (profile) says:

google DOES follow robots.txt

“Google also does not respect robots.txt 100% of the time either”

I think you misunderstand crawling vs indexing. Robots.txt says don’t crawl my site. It doesn’t mean Google can’t index it – it just means they won’t cache it, or visit it, or anything.

They will still show it in the search results, but only as a URL – with no snippet or text under it.

You’re thinking of the noindex tag if you don’t want to be listed.
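To make the crawl-versus-index distinction concrete, here is a rough sketch of the second half: checking a fetched page for a robots meta tag that carries `noindex`. It uses only Python's standard `html.parser`. The `RobotsMetaParser` class, the `may_index` helper, and the sample page are illustrative assumptions, not how any particular search engine implements this.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def may_index(html: str) -> bool:
    """Return False when the page asks not to be listed via noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical pages: one opts out of being listed, one does not.
opted_out = ('<html><head><meta name="robots" content="noindex, nofollow">'
             "</head><body>Story</body></html>")
plain = "<html><head></head><body>Story</body></html>"

print("opted-out page may be indexed:", may_index(opted_out))
print("plain page may be indexed:    ", may_index(plain))
```

Note that a crawler can only see this tag if it is allowed to fetch the page in the first place, so the two mechanisms answer different questions.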

Anonymous Coward says:

Re: google DOES follow robots.txt

Robots.txt has many commands that can say many different things INCLUDING don’t index my site. See the post on Google’s blog.

“Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:

User-agent: *
Disallow: /”

http://googlepublicpolicy.blogspot.com/2009/07/working-with-news-publishers.html

They can have their website not INDEXED on google if they so choose just by a simple robots.txt file.
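As a quick check of what those two quoted lines mean to a compliant crawler, the sketch below feeds them to Python's standard `urllib.robotparser` and asks about several crawler names and paths. The crawler names and the example.com host are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /"])

# Hypothetical crawler names and paths.
agents = ["Googlebot", "ExampleNewsBot", "AnotherCrawler"]
paths = ["/", "/news/story.html", "/archive/2009/07/"]

# "*" matches every crawler name and "/" is a prefix of every path,
# so each combination should come back as not fetchable.
results = {
    (agent, path): parser.can_fetch(agent, "http://example.com" + path)
    for agent in agents
    for path in paths
}
any_allowed = any(results.values())

print("any crawler allowed anywhere:", any_allowed)
```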

william says:

Okay, let’s put it this way. problem = opportunity = money.

The Internet and search engines are fine the way they are with REP…etc. However, newspapers want a share of THAT “internet money” without having to do any work or use their brains to come up with something new and novel.

What do they do? Create an artificial problem by pretending they know nothing about the current Internet technology. Create another standard that’s inferior to what we have right now. Whine to create pressure to make people use them.

Then everyone will have to PAY THEM to NOT USE that sh*t standard.

Business model or extortion? You tell me.

MattP says:

Opportunity Lost

“Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read — Google delivers more than a billion consumer visits to newspaper web sites each month.”

I’m sure there are plenty of sources wanting a share of 1 billion visitors a month. Let AP die and move on.
