Petréa Mitchell

The response I expect to hear...

…is that robots.txt is only a request. A polite aggregator will respect it, but the wicked, devious, pirating bloggers and scrapers that the AP is fighting an urgent battle against won’t.

And you have to acknowledge that there are occasional unscrupulous people who don’t pay attention to robots.txt. So then the AP goes, “Ha, we were right!”

So I think Google allowing itself to be drawn into an argument about technical details is a mistake here. Better to keep the focus on the disproportionate harm to positive, legitimate use caused by an attempt to guard against a small number of true pirates.

technomage (profile)

July 16, 2009 at 9:30 am

Re: The response I expect to hear...

While I agree with your overall assumption, I know that those “unscrupulous” people will find ways around the paywalls as well. The newspapers are trying to make a blanket rule against everyone, but the problem is any blanket always has holes. The fact that these newspaper corps refuse to use the tools already at hand, but would rather force the government to get involved to make new rules and tools, shows just how out of touch they really are concerning technology. Creating new standards to benefit only one thing does not solve the underlying issue: People want news, they want it now, and they don’t want to jump through hoops to get it.

Yakko Warner

July 16, 2009 at 9:45 am

Re: The response I expect to hear...

That may be true, but their argument will be with the aggregators that do not respect robots.txt, not with Google (which, according to the point of the blog, is a “polite aggregator” and respects robots.txt).

Not that you can trust the newspapers not to confuse “any other random (misbehaving, non-robots.txt-respecting) search engine” with “Google”.

I more expect to see one or more of the following:

* Newspapers write an invalid robots.txt file that ends up allowing Google to index their site, and they blame Google for their own technical ineptitude.

* Newspapers complain that they have to write a robots.txt file or META tag on their page, and demand Google adopt an “opt-in” policy for indexing (rather than this “opt-out” policy that is robots.txt).

* Some newspapers do everything correctly, stop their content from being indexed, and then blame Google when their traffic goes down (especially compared to the sites that aren’t blocking Google, which end up getting more traffic)

* Newspapers block their content from being indexed, but other papers or sites take the same stories and republish them and allow them to be indexed, and the original papers blame Google for indexing the republishing sites.

* Newspapers ignore this blog post completely and simply continue to blame Google for indexing content they don’t want indexed.

stat_insig (profile)

July 16, 2009 at 9:46 am

Re: The response I expect to hear...

Well……… There is an easy solution if you don’t want people to access your content….. keep it offline!

MattP

July 16, 2009 at 2:50 pm

Re: The response I expect to hear...

FTA: “REP isn’t specific to Google; all major search engines honor its commands.”

Anonymous Coward

July 16, 2009 at 9:34 am

The reason the robots.txt argument is flawed is that the newspapers don’t want Google to remove them from search. They want to show up in Google. They just want Google to pay them as well. It’s nothing more than a money grab. Google says, “If you have good, useful content, we will rate you high in our index. You will get traffic and we will make a little money from ads on the search page.” Newspapers say, “Sounds like a good deal. Except, we want you to pay us as well.” Google says, “No, we don’t think we should have to pay you. If you don’t like the deal you can opt out.” Newspapers say, “No, we like the deal, but you still have to pay us.” Google says, “WTF.”

Ryan Z

July 16, 2009 at 9:37 am

Re: Re:

Well, that doesn’t make the argument flawed, does it? It just means the AP doesn’t have a leg to stand on, except the politicians they’ve paid off, of course.

Hulser (profile)

July 16, 2009 at 10:34 am

Re: Re: Re:

Well, that doesn’t make the argument flawed, does it? It just means the AP doesn’t have a leg to stand on

Exactly. Google knows that the AP knows about robots.txt. So the purpose of the Google blog post is not to let the AP know about robots.txt, it’s to let everyone else know that the AP knows about robots.txt, which will result in undermining the AP’s arguments for a legislative “solution”.

John Doe

July 16, 2009 at 9:46 am

Re: Re:

This is exactly what is going on. Newspapers are trying to get the government to point their guns at Google to make them hand over bags full of money. It is an attempt at “legal” extortion.

Anonymous Coward

July 16, 2009 at 9:55 am

Re: Re:

I completely agree with Google here. These newspaper companies are being evil and selfish. If they don’t want Google linking to them then MAKE A ROBOTS.TXT file or TELL Google to remove them from its index. I’m sure Google will be more than glad to remove them from its index. But you can’t force someone to use your product and then force them to pay for it. That’s like when the RIAA tried to force people to buy their music (and not boycott it) and then they tried to force them to pay for it ( http://www.techdirt.com/articles/20090616/1527385253.shtml ). Nonsense.

Ryan (profile)

July 16, 2009 at 9:51 am

yeah but

the problem with robots.txt is that nothing in the specification involves people paying the AP.

The AP doesn’t want people to stop using their content – they want to change the way the web works so that they can be paid whenever they think they should be.

They know that blocking google would be devastating to their industry, so instead they bitch and whine hoping that somebody will pay them to shut up.

duderino

July 16, 2009 at 9:54 am

sad

It’s just sad that Google has to keep making these kinds of public statements while the AP doesn’t read them and keeps digging itself a bigger hole.

Ryan (profile)

July 16, 2009 at 10:01 am

i wish

I wish that Google would call the papers’ bluff and completely remove them from search results, offering to re-include them only if and when they put up a robots.txt file.

They’ll never do it, users would complain about not being able to find their news, but man would it be hilarious.

Anonymous Coward

July 16, 2009 at 10:30 am

Re: i wish

I have a better idea. Remove them from the search results/news/everything and make them pay to come back in.

Anonymous Coward

July 16, 2009 at 10:10 am

Where can I write to the politicians that these newspapers are lobbying? Everyone should write to them and explain the technology, and how absurd it is for these stupid newspapers to come crying to them for money grabs from Google.

Anonymous Coward

July 16, 2009 at 10:17 am

ACAP "protocol"

have you looked through the ACAP document? it’s like 40 pages long, and all they do is explain how to use robots.txt for the first 35 pages or so. then they introduce a few new tags for inline meta types and markup for robots.txt… while disallowing *.

i encourage these guys to disallow * just so they can die faster and get replaced by better news outlets (newscientist, courthousenews, etc).

besides, nothing prevents someone from using a spider that sets the user agent as one of the standard IE/FF user agent strings. then you’re stuck taking a javascript or IP address route which are also both unreliable.
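To make the user agent point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the "BadScraper" name, the browser-style user agent string and the example.com URL are all made up for illustration; none of them come from the ACAP document or from any real site. It only shows that a rule aimed at a named bot stops applying once the same client announces itself with a different string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named scraper, allow everyone else.
ROBOTS_TXT = """\
User-agent: BadScraper
Disallow: /

User-agent: *
Disallow:
"""

# Hypothetical URL and browser-style user agent, for illustration only.
STORY_URL = "http://news.example.com/story.html"
BROWSER_UA = "Mozilla/5.0 (Windows; U; MSIE 7.0)"


def allowed_for(user_agent: str, url: str = STORY_URL) -> bool:
    """Return what a robots.txt-respecting client would conclude for this UA."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The scraper is blocked only while it is honest about its name.
    print("BadScraper:", allowed_for("BadScraper"))
    print("Browser-like UA:", allowed_for(BROWSER_UA))
```

The check runs entirely on the client side, so it depends on the client both reading robots.txt and telling the truth about who it is.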

Anonymous Coward

July 16, 2009 at 10:18 am

Fine block your sites from Google – and I’ll just find another – no big deal. That’s the POINT of Google – finding another site.

Anonymous Coward

July 16, 2009 at 10:31 am

Google also does not respect robots.txt 100% of the time either, often indexing internal pages that are blocked by the robots file because of direct external links.

Do no evil. Right.

Anonymous Coward

July 16, 2009 at 10:43 am

Re: Re:

“Google also does not respect robots.txt 100% of the time either”

Do you have any examples of this? Or are you just making things up.

Hulser (profile)

July 16, 2009 at 10:44 am

Re: Re:

Google also does not respect robots.txt 100% of the time

You tell me which is a more compelling argument…

A) We’re really pissed off that Google is linking to our web site but we can’t be bothered to implement a simple technical solution that would stop this.

B) We don’t want Google linking to our web site, but they’re ignoring our configuration and linking to it anyway.

Because the AP is choosing option A, it’s all but irrelevant whether Google respects robots.txt 100% of the time. Right now, the ball is in the AP’s court.

Anonymous Coward

July 16, 2009 at 10:46 am

Re: Re: Re:

It’s time for everyone to boycott the AP.

The Buzz Saw (profile)

July 16, 2009 at 10:47 am

Re: Re:

I’d be interested in seeing proof of this statement. I’m not trying to be obnoxious or anything; I’m just genuinely interested to see this happen. Several people have mentioned that Google does not always honor robots.txt, but I have never seen any matching evidence.

Source?

Ryan

July 16, 2009 at 10:56 am

Re: Re: Re:

Yeah, I don’t see this…Google is going to code their bots the same way, so it’ll treat every site the same way. Unless they added in exceptions to specific sites, although I don’t know why they’d do that. Do they have a shit list of webmasters they don’t like that they periodically update in their scrapers? Seems to me like an exception would be an improperly used robots.txt file.

Ryan (profile)

July 16, 2009 at 11:22 am

google DOES follow robots.txt

“Google also does not respect robots.txt 100% of the time either”

I think you misunderstand crawling vs indexing. Robots.txt says don’t crawl my site. It doesn’t mean Google can’t index it – it just means they won’t cache it, or visit it, or anything.

They will still show it in the search results, but only as a URL – with no snippet or text under it.

You’re thinking of the noindex tag if you don’t want to be listed.
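A rough sketch of the noindex side of that distinction, in Python with only the standard library. It looks for a robots META tag in a page's HTML and reports whether "noindex" is among its directives. The class and function names and the sample pages are hypothetical, and this is not a claim about how any search engine actually implements the check.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def may_index(html: str) -> bool:
    """True unless the page carries a robots meta tag containing 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical pages for illustration.
BLOCKED_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
OPEN_PAGE = "<html><head><title>Story</title></head></html>"

if __name__ == "__main__":
    print(may_index(BLOCKED_PAGE))
    print(may_index(OPEN_PAGE))
```

Note that a crawler can only see this tag if it is allowed to fetch the page in the first place, so a page that is disallowed in robots.txt never gets its noindex tag read.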

Anonymous Coward

July 16, 2009 at 11:53 am

Re: google DOES follow robots.txt

Robots.txt has many commands that can say many different things INCLUDING don’t index my site. See the post on Google’s blog.

“Webmasters who do not wish their sites to be indexed can and do use the following two lines to deny permission:

User-agent: *
Disallow: /”

http://googlepublicpolicy.blogspot.com/2009/07/working-with-news-publishers.html

They can have their website not INDEXED on Google if they so choose just by a simple robots.txt file.
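For anyone who wants to see what those two lines do to a crawler that honors them, here is a minimal sketch using Python's standard urllib.robotparser. The helper name and the example.com URL are hypothetical. It shows only what a polite client concludes from the file, not what Google's crawler does internally.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted from Google's blog post.
DENY_ALL = """\
User-agent: *
Disallow: /
"""

# For contrast: an empty Disallow value permits everything.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://news.example.com/story.html"  # hypothetical URL
    print(is_allowed(DENY_ALL, "Googlebot", url))
    print(is_allowed(ALLOW_ALL, "Googlebot", url))
```

The file is only consulted by clients that choose to consult it, which is the "only a request" point made at the top of the thread.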

william

July 16, 2009 at 1:08 pm

Okay, let’s put it this way. problem = opportunity = money.

The Internet and search engines are fine the way they are with REP, etc. However, newspapers want a share of THAT “internet money” without having to do any work or use their brains to come up with something new and novel.

What do they do? Create an artificial problem by pretending they know nothing about the current Internet technology. Create another standard that’s inferior to what we have right now. Whine to create pressure to make people use them.

Then everyone will have to PAY THEM to NOT USE that sh*t standard.

Business model or extortion? You tell me.

MattP

July 16, 2009 at 2:57 pm

Opportunity Lost

“Today, more than 25,000 news organizations across the globe make their content available in Google News and other web search engines. They do so because they want their work to be found and read — Google delivers more than a billion consumer visits to newspaper web sites each month.”

I’m sure there are plenty of sources wanting a share of 1 billion visitors a month. Let AP die and move on.

Google To Newspapers: Here, Let Me Introduce You To Robots.txt

from the snappy dept

Comments on “Google To Newspapers: Here, Let Me Introduce You To Robots.txt”
