Permalinks

People have generally discovered on today’s Web that, even if they just right-click on an item and choose “Copy Link Location” as offered by their browser, they get a URL of such a sort that, when they embed it into a posting or into a site of their own, within some time the real URL changes, and the one they copied / dragged no longer works.

This is why URLs / Hyperlinks have long been reinvented in the form of so-called “Permalinks”.

All my postings and pages offer Permalinks to the reader. On a blog posting, right-clicking or tap-holding on the Title will offer to Copy the Permalink. For pages, right-clicking or tap-holding on the entry in the Sidebar or Header of my Main Page will do the same.


 

(Edit 01/14/2017 : This blog is generated by a collection of PHP server-scripts – aka CGI scripts – and uses HTML-5. This is a form of HTML which advises content-providers against using the old formatting tags for Italic, Underline and Bold, instead suggesting Strong or Emphasized.

Because this blogging engine observes HTML-5 100%, the pieces of text which seem underlined are generally hyperlinks, which the reader can click on.

When we use a Desktop or Laptop, our browser allows us to hover over these hyperlinks with our mouse, and displays a bubble which describes what sort of link it is.

But, when we use a tablet or a smart-phone to read a Web-page, there is no hover-support, because we usually do not use a Bluetooth Mouse. In such a case, some readers might overlook the fact that each underlined segment of text is in fact a link, unless they were to tap on it.

Generally, readers do not tap on random places within pages of text, unless they already know that the pages contain hyperlinks.)


 

(Edit 06/05/2017 : )

I suppose that there is another piece of information which I can offer, which actually describes permalinks.

According to more old-fashioned thinking in HTML and Web-design, a site is organized into folders, which contain either HTML Files or CGI-Scripts. URLs would put the folder-names into their path, leading up to the file-name of either the HTML File or the CGI-Script.

According to that rule, the following URL should have a nonsensical meaning:

http://dirkmittler.homeip.net/blog/archives/3051

According to ages-old wisdom, my site has, as its root folder, a folder named ‘blog’, which supposedly has a sub-folder named ‘archives’, and that one, another sub-folder named ‘3051’. Further, it would seem that this URL does not specify what file, or what type of file, to open, whether belonging to the folder ‘archives’ or to the folder ‘3051’.

This way of representing URLs is often used today, by sites that actually manage a large collection of pages. The reason these URLs work is the fact that, before executing server-side CGI-Scripts, sophisticated Web-servers apply “Rewrite Rules”. These Rewrite Rules are specific to one site – such as to my blog – and consist of ‘Regular Expressions’, by which the server recognizes patterns in the URL, and by which it replaces each pattern systematically with another pattern. So the above URL gets rewritten by my Web-server, to do exactly as the following URL does, without your browser getting to see that this happens:

http://dirkmittler.homeip.net/blog/?p=3051

What the browser would be requesting with the above URL, is that the default CGI-Script for the root folder be executed on the server, using the GET-Method, and setting the parameter ‘p’ to the value ‘3051’.
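
As a minimal sketch of both halves of this mechanism, the following PHP fragment shows the sort of Regular Expression a Rewrite Rule consists of, and the GET-parameter which the default script finally receives. A real server (Apache’s mod_rewrite, for instance) applies such a rule itself, before the script ever runs; the fragment is an illustration only:

    <?php
    // Illustration only: the kind of pattern-replacement a Rewrite Rule performs.
    $original  = '/blog/archives/3051';
    $rewritten = preg_replace('#^/blog/archives/(\d+)$#', '/blog/?p=$1', $original);
    // $rewritten is now '/blog/?p=3051'.

    // Inside the default CGI-Script, the parameter then simply appears as:
    $post_id = isset($_GET['p']) ? (int)$_GET['p'] : 0;   // 3051, in this example
    ?>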

The latter types of (more old-fashioned) URLs generally also work, with two main drawbacks:

  1. They are humanly undecipherable,
  2. Web-Masters are likely to change them. If they were to change, these would no longer qualify as permalinks.

‘WordPress’ offers its bloggers a selection of types of permalinks, including:

http://dirkmittler.homeip.net/blog/2017/06/Linear-Predictive-Coding/

Again, there is no file in any folder ‘…/2017/06/’ by the name ‘Linear-Predictive-Coding’. But subscribers to newspapers and some blogs benefit from this rewrite rule, because if they were Copying and Pasting numerous URLs, they would be able to tell at a glance which one was which. This can be more useful to some readers, than just to see that one of them was ‘…/3051’. And so this specific feature, of ‘A Descriptive URL’, has also become synonymous in some people’s minds with the concept of ‘Permalinks’.

If the reader needs to know in greater detail how this works, there is an external explanation of it, specific to WordPress. It highlights the fact that PHP-Scripts can access what the original URL was, using the key 'REQUEST_URI', which indexes the (environment) array $_SERVER.

(Update 08/15/2015 : )

This last detail is important, because it means that the CGI-script, regardless of whether it’s written in the language ‘PHP’ or not, has access to the original URL’s text, and is therefore able to analyze that to whatever level of complexity is required, in order to determine what HTML document to send to the browser, when that URL is opened.

Therefore, even with Rewriting, the URL remains a mechanism to pass parameters from the browser, to the CGI-script.
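
For example, a hypothetical PHP-Script could parse the pre-rewrite URL itself, roughly like so; the pattern and the variable-names are only assumptions:

    <?php
    // The original, un-rewritten URL survives in $_SERVER['REQUEST_URI'].
    $uri = $_SERVER['REQUEST_URI'];                  // e.g. '/blog/archives/3051'

    if (preg_match('#/archives/(\d+)#', $uri, $m)) {
        $post_id = (int)$m[1];                       // decide which document to serve
    }
    ?>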


 

(As Of 06/05/2017 : )

I suppose I should add another detail. When a ‘WordPress’ blogger changes his Permalink-Type, he is mainly changing what type of permalinks the site generates, because the site’s PHP-Scripts generate URLs as references to its own components. But in general, the rewrite rules and CGI-Scripts are flexible enough to parse any of the types, if they arrive as a URL. What the scripts are also coded to do, is recognize whether the requested URL is using a different type of permalink from the currently-selected type. If so, the script makes sure that a permalink of the current type displays in the browser’s URL-field.

The reason this is done is the fact that some readers will actually use the current URL which the browser displays, when they Copy and Paste URLs, and in that case the blogger would rather have it that his currently-chosen type of permalink is what the reader receives.

This can be accomplished when the server sends the browser an HTML-page which is mainly empty, except for a Header which instructs the browser to request a redirect, with a time-delay of zero (= Client-Pull).
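
A minimal sketch of such a zero-delay Client-Pull, as a PHP-Script might print it; the canonical URL shown is only an example:

    <?php
    // The permalink of the currently-selected type, for this same posting:
    $canonical = 'http://dirkmittler.homeip.net/blog/?p=3051';

    // A mainly-empty page, whose Header asks the browser to redirect at once.
    echo '<html><head>';
    echo '<meta http-equiv="refresh" content="0; url=' .
         htmlspecialchars($canonical) . '">';
    echo '</head><body></body></html>';
    ?>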


 

(Update 08/15/2018 : )

Given the reality that rewrite-compatible URLs are both commonplace today and versatile, I suppose that the question could be asked of why, then, the older ‘GET’ Method is still used at all. And my answer would be as follows:

URLs that are suitable for rewriting either need to be a part of the HTML-document that the browser has loaded, or need to be provided by JavaScript in some way, so that when the user clicks on the resulting link, that URL is ready to be loaded in replacement of what the browser already has.

At the same time, the Actions of GUI-buttons can be programmed, to execute some JavaScript.

But browsers are not able to provide URLs suitable for rewriting natively, when the user Submits a simple HTML Form. And in some cases, simplicity in Web-design, such as using the Form-tags in order to collect some user-originated information, still wins the day. And it often still does so, either through the ‘GET’ or the ‘POST’ Method.
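
The following self-contained PHP page sketches that simple, Form-based approach; the file-name ‘search.php’ and the field-name ‘q’ are assumptions, not anything my blog actually uses:

    <?php
    // search.php - a hypothetical page which both displays the Form,
    // and handles the 'GET' Request which Submitting it produces.
    if (isset($_GET['q'])) {
        echo 'You searched for: ' . htmlspecialchars($_GET['q']);
        // The browser's URL-field now shows something like: search.php?q=permalinks
    } else {
        echo '<form method="GET" action="search.php">' .
             '<input type="text" name="q">' .
             '<input type="submit" value="Search">' .
             '</form>';
    }
    ?>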


 

One idea which I can visualize happening today, though, is just slightly more complex than either using the ‘POST’ or the ‘GET’ Method:

A subscriber’s Web-page could first submit some text to the server via the ‘POST’ Method. The HTML which the corresponding server-script prints would both have the naked URL of this script, and contain a redirect request to the browser, whose URL already points to a document kept on the server. This second URL could either be a ‘GET’ URL or a permalink, meant to be served by a different, second server-script, but meant always to serve up the same on-line content, identified by a shorter ID-code in the URL.
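
A hypothetical sketch of that sequence follows; store_document() and the URLs are assumptions, and an HTTP Location-header with status 303 stands in here for whichever redirect-mechanism the site actually uses:

    <?php
    // Script 1: receives the 'POST', stores the text, and redirects the
    // browser to a second URL, which will always serve the same content.
    if ($_SERVER['REQUEST_METHOD'] === 'POST') {
        $id = store_document($_POST['text']);   // hypothetical; returns a short ID-code
        header('Location: /blog/view.php?id=' . urlencode($id), true, 303);
        exit;
    }
    ?>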

This last idea would be in keeping with the present fashion-trend, by which personal devices and messages mainly communicate either URLs or URIs, and by which much of our content would be kept on cloud-servers. But this last idea automatically introduces complexity, in the form of Access Control issues, because in certain cases, one subscriber would presumably not want any other subscriber to be able to fetch the first subscriber’s document.


 

Believe it or not, even those sorts of problems often have solutions in Computer Science, although the solutions that exist also tend to be more complex than what most regular users might care to imagine. Specifically, the ‘GET’ Method appends information to the actual URL, which even in the case of ‘SSL’ or ‘TLS’ may not be encrypted, yet the requests which such URLs make need to be secured. Similarly, ‘AJAX’ will often send requests to a server-script that are not encrypted by default, yet need to be secured.

And, even if the assumption was made that a client always establishes an encrypted Web-socket to the server, before even specifying the URL to retrieve, the following fact should be considered:

In the server log files, the URL of every HTTP request, regardless of whether it came in as an http:// or as an httpS:// URL, is logged in full. This means that if credit-card numbers or passwords had ever been submitted using the ‘GET’ Method, they would end up written in server-logs, in clear-text, for potentially anybody to read!

I believe that the way this (validation, not encryption) problem is most often solved, is similar in its nature to how Challenge-Response Authentication works. In principle, two components are needed for this to work:

  1. A piece of data which is a shared secret between the client and the server – for example, a password.
  2. A smaller piece of data, which will simply never be reused. This smaller piece of data may be communicated openly.

In short, the smaller piece of data gets appended to the shared secret, and the result hashed. In my linked posting, my main focus was to use a date-time stamp as the ‘challenge’. But in certain cases, the actual date-time stamp may be considered too lengthy to communicate, and instead, a single 32-bit integer may be used, which always advances (by 1). In such a case, this integer is also referred to as a ‘Nonce’. The ‘GET’ URL needs to contain:

  • Whatever field identifies the piece of information to be retrieved,
  • The smaller, non-repeating piece of data (The Nonce),
  • The hash-code.

What the server needs to do is (a short sketch follows this list):

  • Look up the object to be retrieved in a database and find out which User owns it,
  • Look up what the last successfully-used Nonce for that User was, which the submitted Nonce must be greater than,
  • Recompute the hash-code based on the additional, shared secret of the User, and compare the result with the submitted hash-code,
  • If successful, enter the submitted Nonce as the last, successfully-used Nonce for the User.
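
A minimal sketch of both sides, in PHP; SHA-256 stands in for whichever hashing algorithm is actually chosen, and all the names are illustrative:

    <?php
    // Client side: append the Nonce to the shared secret, and hash the result.
    function make_token(string $sharedSecret, int $nonce): string {
        return hash('sha256', $sharedSecret . $nonce);
    }

    // Server side: verify a request such as  ?p=3051&nonce=42&token=...
    function verify_request(string $sharedSecret, int $lastNonce,
                            int $nonce, string $token): bool {
        if ($nonce <= $lastNonce) {
            return false;                    // replayed or stale Nonce
        }
        // Constant-time comparison of the recomputed and the submitted Hash-Codes.
        return hash_equals(make_token($sharedSecret, $nonce), $token);
    }
    // On success, the server would store $nonce as the last successfully-used one.
    ?>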

I suppose that this scheme could get messed up in some way, if the same User had more than one Session with the server at the same time. Also, at some point, the last successfully-used Nonce must be communicated to the client to begin with, let’s say when the client logged in, or when the client pulled up a page from which the ‘GET’ Request or the ‘AJAX’ Request is to be made. More correctly, if the assumption can be made that such a scheme runs entirely from a Web-browser, then Web-applications can be written which assume that the last-used Nonce can be communicated to the client securely via SSL, so that when the client makes a request to the server-script, (that Nonce + 1) can be used to compute the Hash-Code, which will be visible. But according to the same assumption, the SSL-secured HTML document can also just communicate a new shared secret to the client at any time, so that the Nonce would not need to become a large number.

This scheme could be made more viable, if the shared secret used was not actually the password. One reason would be the degree of distrust that many users would have, of having transmitted a hash-code of their password, regardless of how secure the hashing-algorithm is supposed to be. Another is the fact that, as I just described it, the solution would not scale well enough. The solution presented on this page might need to be applied to a large number of client-devices, which are not always Web-browsers.

Instead, the shared secret could be a Session-Key, that is held uniquely by any one client-device, for any one User. The modified set of data which the ‘GET’ URL needs to submit would become:

  • Whatever fields identify the piece of information to be retrieved,
  • The Session ID,
  • The Nonce,
  • (If ‘Bcrypt’ is being used,) The Level Of Difficulty with which the Hash was computed,
  • The Hash-Code.

Instead of looking up the last-used Nonce as an entry that belongs to one User, the server would need to look up the last-used Nonce as an entry belonging to one Session…
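
That per-Session lookup might resemble the following; the table- and column-names, and the PDO connection $pdo, are assumptions:

    <?php
    // Look up the Session-Key and the last-used Nonce by Session, not by User.
    $stmt = $pdo->prepare(
        'SELECT session_key, last_nonce FROM sessions WHERE session_id = ?');
    $stmt->execute([$sessionId]);
    $row = $stmt->fetch();    // $row['session_key'] now acts as the shared secret
    ?>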

Further, if the hashing algorithm to use happened to be ‘Bcrypt’, this is an algorithm that can hash [1..72] bytes of text, with an explicitly-supplied 16-byte “Salt”, to arrive at a 24-byte Hash-Code. In this case, the Nonce could be fed to the algorithm, as the Salt to use. But just to avoid any possible overlap with other uses, I would add an established constant such as 2^31 to the Nonce named in the ‘GET’ URL, or in the ‘AJAX’ Call, to arrive at the Salt which is actually fed to ‘Bcrypt’.
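
A hypothetical sketch of deriving the Salt from the Nonce, using PHP’s crypt() in its ‘$2y$’ (Bcrypt) mode, and assuming 64-bit PHP 7 or later. Since both the client and the server derive the same Salt-string from the same Nonce, the encoding only needs to be deterministic, not canonical:

    <?php
    // Re-map standard Base64 output onto Bcrypt's own './A-Za-z0-9' alphabet.
    function bcrypt_b64(string $bytes): string {
        return strtr(rtrim(base64_encode($bytes), '='),
            'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/',
            './ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789');
    }

    function hash_with_nonce(string $sessionKey, int $nonce, int $cost = 10): string {
        // Add 2^31 to the Nonce (as suggested above), and place the result in
        // the low bytes of a 16-byte Salt; the most-significant bytes stay zero.
        $salt16 = str_pad(pack('J', $nonce + 2147483648), 16, "\0", STR_PAD_LEFT);
        // 16 bytes encode to exactly the 22 Salt-characters which Bcrypt expects.
        $salt = sprintf('$2y$%02d$%s', $cost, bcrypt_b64($salt16));
        return crypt($sessionKey, $salt);      // a 60-character Bcrypt string
    }
    ?>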

(Update 08/16/2018 : )

I should add that, if the purpose of the challenge-response approach is to create mere URLs or URIs, then an important design-objective is to keep the actual cyphers as short as possible. Therefore, even though I know that ‘Bcrypt’ allows for a (16-byte) Salt, the maximum unsigned (4-byte) value is close to 4 billion, which takes 10 decimal digits to express. If the URLs only contain short decimal-notation values, then the system is not broken. But, if the URL actually needed to state a 20-decimal-digit value, then I’d consider this system to be broken. So I wouldn’t conclude that I actually need to use 8 bytes, out of the 16 available bytes. The most-significant bits would simply remain zeroes.

Along the same lines, even though ‘Bcrypt’ generates a (24-byte == 192-bit) Hash-Code, nothing would prevent a Software Engineer from only using a (16-byte == 128-bit) sub-field of the original Hash-Code, as if that was the Hash-Code. Doing so would keep these validated URLs shorter, and might even thwart some yet-unknown attacker’s attempts to break the Hash-Code, because such a hypothetical attacker would actually be missing 8 bytes of the final Hash-Code used.
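
For instance, a sketch, in which $full24 is assumed to hold the 24 decoded bytes of the Hash-Code:

    <?php
    $short16   = substr($full24, 0, 16);   // keep a 128-bit sub-field only
    $url_token = bin2hex($short16);        // 32 hexadecimal digits in the URL
    ?>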

Dirk

 
