Preventing Search Engines From Indexing Your CS Webpages
Introduction
There might be times when you want to maintain webpages on our CS webserver but don't want search engines like Google to index those pages. For instance, you might have private pages that you only want to make available via links you send to specific people.
You may have come across the idea of using a robots.txt file to keep your web pages from being indexed. However, CS does not have the resources to support that for all of our website users.
An Easy Solution
Instead of using a robots.txt file, you can add X-Robots-Tag headers to your files yourself.
This works best if you put all of the affected files into their own directory, separate from everything else.
Then you just need to create a file named .htaccess in that directory (note the leading dot in ".htaccess"), and put the following line in it:
Header add X-Robots-Tag "noindex"
Then save that file.
That will add the "X-Robots-Tag: noindex" header to every file in that directory.
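If you only want to keep certain files out of search results, you can wrap the directive in a FilesMatch block instead. The following is a minimal sketch, assuming mod_headers is available (the Header directive above already relies on it) and that you only want the header sent for HTML files:

# Send the noindex header only for HTML files in this directory
<FilesMatch "\.html?$">
    Header add X-Robots-Tag "noindex"
</FilesMatch>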
Is this foolproof?
No. Keep in mind that the web pages you maintain are still publicly available; they are just not indexable by well-behaved search engines. A misbehaving (or simply buggy) web crawler can ignore the header and index your pages anyway.
So, the above is a quick, easy way to keep most search engines from indexing your files. But if you want to make sure you prevent all of them from doing so, the better way to protect your pages from being indexed is to ...
Require a login to access the webpage
To more securely prevent unauthorized access to your web pages, you can require a login. A search engine (or a person) would then have to enter a password before it could reach, let alone index, your pages.
If only JHU affiliates need access to your webpages, you can password-protect them per our Shibboleth page.
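As a rough sketch only (the exact directives depend on how our servers are configured, so follow the Shibboleth page above), a .htaccess file for a Shibboleth-protected directory typically looks something like this, assuming the Shibboleth SP Apache module (mod_shib) is in use:

# Require a Shibboleth login session for everything in this directory
AuthType shibboleth
ShibRequestSetting requireSession 1
Require valid-user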
Otherwise, if you need people from anywhere to have access to your pages, visit the following website for information:
https://httpd.apache.org/docs/2.4/howto/auth.html
You would add all of the directives to a .htaccess file placed in the same directory as the files.
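For example, a .htaccess file using Apache's basic authentication might look like the following sketch. The AuthUserFile path is just a placeholder; point it at the password file you create with the htpasswd utility, ideally kept outside your public web directory:

# Prompt for a username and password before serving anything in this directory
AuthType Basic
AuthName "Restricted pages"
# Replace this placeholder path with the real location of your password file
AuthUserFile /full/path/to/.htpasswd
Require valid-user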
And if you add a password file, please remember to set the proper permissions on it. Permissions information can be found on our page regarding file permissions for CS webpages.
Then you would just need to share the login password with anyone who has a legitimate need to access the files. This might not work in all circumstances, but if it works for you, it's a better approach than making the files fully public and just asking, for instance, Google not to index them.
You can apply password protection to either an entire directory and its contents (and all of its subdirectories and their contents, etc.) or to specific files within a directory.
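To protect only specific files rather than the whole directory, you can wrap the authentication directives in a Files block. A sketch, using a hypothetical file name private.html:

# Only private.html requires a login; other files in the directory stay public
<Files "private.html">
    AuthType Basic
    AuthName "Restricted file"
    AuthUserFile /full/path/to/.htpasswd
    Require valid-user
</Files>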
Turning Off Directory Indexing
Something else to consider: make sure visitors cannot list the contents of the web directories in your website. Fortunately, CS disables Directory Indexing by default. However, if you have previously enabled Directory Indexing for any of your CS web directories, consider disabling it.
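If you do need to turn it off yourself, a single line in the directory's .htaccess file is usually enough, assuming the server allows the Options directive to be overridden there:

# Disable automatic directory listings for this directory
Options -Indexes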