Preventing Search Engines From Indexing Your CS Webpages


INTRO

There might be times you want to maintain webpages on our CS webserver, but you don't want search engines like Google to index those pages. For instance, you might have private pages you want to make available only via links you send to specific people.

You may have come across the idea of using a robots.txt file to keep your web pages from being indexed. However, CS doesn't have the resources to offer that to all of our web site users.

An Easy Solution

Instead of using a robots.txt file, you can easily add an "X-Robots-Tag" header to your files yourself.

This works best if you put all of the affected files into their own directory, separate from everything else.

Then you just need to create a file named .htaccess in that directory (note the leading dot in ".htaccess"!) and put the following line in it:

    Header add X-Robots-Tag "noindex"

That will add the "X-Robots-Tag: noindex" header to every file in that directory.
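
If you want to confirm that the header is actually being served, one quick check (a sketch; the URL below is just a placeholder for one of your own pages) is to fetch the response headers with curl:

    curl -I https://www.cs.example.edu/~you/private/page.html

You should see "X-Robots-Tag: noindex" among the headers in the output.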

Is this foolproof?

No. Keep in mind that the web pages you maintain are still publicly available; they're just not indexable by well-behaved search engines. A misbehaving (or simply buggy) web crawler can ignore the header and index your pages anyway.


So, the above is a quick, easy way to discourage most search engines from indexing your files. But if you want to be sure you prevent all of them from doing so, there is a better way to protect your pages from being indexed:

Require a login to access the webpage

You can require a login to access your web pages. Then the only way a search engine could index a page would be for a person to enter the password.

If only JHU affiliates need access to your webpages, you can password-protect them per our Shibboleth page.
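
As a rough sketch only (consult our Shibboleth page for the exact directives our server expects), a typical mod_shib setup in a .htaccess file looks something like this:

    AuthType shibboleth
    ShibRequestSetting requireSession 1
    Require valid-user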

Otherwise, if you need to give access to people both inside and outside of JHU, see the following page for information:

https://httpd.apache.org/docs/2.4/howto/auth.html

You would add all of the directives to a .htaccess file placed in the same directory as the files.
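
For example, a minimal Basic Auth configuration might look like the following (a sketch: the AuthUserFile path and the username are placeholders, and the password file should ideally live outside any web-served directory):

    AuthType Basic
    AuthName "Restricted Files"
    AuthBasicProvider file
    AuthUserFile "/home/you/.htpasswd"
    Require valid-user

You would create the password file with Apache's htpasswd utility, for example:

    htpasswd -c /home/you/.htpasswd someuser

(The -c flag creates the file; omit it when adding more users to an existing file.)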

Then you'd just need to share the login password with anyone with a legitimate need to access the files. This might not work for all circumstances, but if it will work for you, it's a better approach than making the files fully public and just asking, for instance, Google not to index them.

Turning Off Directory Indexing

Something else to consider: make sure visitors can't list the contents of the web directories in your website. Fortunately, CS disables Directory Indexing by default. However, if you've previously enabled Directory Indexing for any of your CS web directories, consider disabling that feature.
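
If you did turn it on at some point, disabling it again is typically a one-line addition to the .htaccess file in the affected directory (a sketch, assuming the server allows overriding Options there):

    Options -Indexes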