  • rewriteTextLinks() - a function to make links in blocks of text "clickable"
    11/22/2010 5:05PM

    I'm posting this in the hopes that it may be of some benefit. Basically, I need to replace URLs in plain text with clickable links. My first attempts at writing a regular expression to do this were not all that successful. Granted, I really didn't try that hard.

    With the work that I'm doing for the AppUpdate.com and Miles-by-Motorcycle.com sites, the need to process links inside free-flowing text has become much more acute. My old regexes were just no longer cutting it.

    What makes pulling URLs out of text challenging has to do with how people tend to write the links as parts of sentences such as:

    "Check out this cool link, http://formvista.com/gallery.html, and let me know what you think."

    A comma is a valid character in a link. Unfortunately, so is a period:

    "Cool site. http://formvista.com/."

    To make matters worse, users are also fond of enclosing links in parentheses, such as:

    "I have found that boot in the Sidi Canyon Gortex boot (http://www.motorcycle-superstore.com/1/1/36/6535/ITEM/Sidi-Canyon-Gore-Tex-Boots.aspx)."

    which was posted on the Miles-By-Motorcycle.com site, and as of this writing, is not correctly identified by the current regex used by the formVista forum code.

    And users might get perverse and put punctuation inside the parentheses by mistake, such as:

    "Check it out. (http://formvista.com.)."
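
    To make the problem concrete, here is a minimal sketch of the naive approach. The pattern is an assumption for illustration ("http(s):// followed by anything that isn't whitespace"), not the regex formVista used. It picks up the trailing punctuation in each of the examples above:

    ```php
    <?php
    // Naive pattern (an assumption for illustration): http(s):// followed by
    // any run of non-whitespace characters.
    $naive = '#https?://\S+#';

    $samples = array(
       'Check out this cool link, http://formvista.com/gallery.html, and let me know what you think.',
       'Cool site. http://formvista.com/.',
       'Check it out. (http://formvista.com.).',
    );

    $found = array();

    foreach ( $samples as $text )
       {
       preg_match( $naive, $text, $m );   // first match only
       $found[] = $m[0];
       print( $m[0] . "\n" );
       }

    // prints:
    //   http://formvista.com/gallery.html,
    //   http://formvista.com/.
    //   http://formvista.com.).
    ```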

    Deciding that I really needed to fix this so users can post links however they feel like it and "have the code do the right thing", I spent a few days googling around trying to see if I could find a better regular expression for URLs in blocks of plain text. Unfortunately, none of the regexes I found handled the edge cases I was looking to handle.

    From a technical purist's point of view, parsing URLs in blocks of plain text is not solvable using regular expressions. Some have stated that it can be solved using a compiler-compiler like YACC, but I'm not sure about that because there is no way to figure out if a period or a comma is a punctuation mark or if it's a valid portion of the URL.

    Using a combination of regexes I found online as a rough starting point, I decided to bite the bullet and try to come up with a solution of my own that addresses the most common cases that occur in forums or text comments. I am not a regex wizard by any stretch of the imagination, but I think my solution seems to work better than any I've seen. I'm certain, however, that with further testing it'll be clear that it needs further refinement.

    My initial solution, which is inadequate and is currently used in formVista, attempts to parse the URLs to see if they are correct. I realized that there's no need to do this. For a block of text, all that's really necessary is to identify anything that looks like a link and link it, or potentially return it for verification by some calling function.

    As such, my new solution does not attempt to validate the link. It concerns itself with attempting to find the beginning and end of a link in a block of text, taking into account that the user may have added punctuation. Heuristically, URLs with trailing "." or "," characters are very rare, so erring on the side of assuming trailing "." or "," characters are sentence enders is probably valid. For these outlier cases, formVista includes something I call fvCode, which is similar to BBCode and allows you to specify a link exactly by enclosing it in [url]...[/url] tags.
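
    Stripped of everything else, the trailing punctuation heuristic amounts to something like the sketch below. splitTrailingPunctuation() is a hypothetical helper written only for illustration; the actual function does this inside the regex using look-ahead assertions rather than as a separate step.

    ```php
    <?php
    /**
     * Hypothetical helper (not part of rewriteTextLinks): split a candidate
     * link into the URL proper and any trailing ".", ",", ";" or ":".
     */
    function splitTrailingPunctuation( $candidate )
       {
       $url      = rtrim( $candidate, '.,;:' );
       $trailing = (string) substr( $candidate, strlen( $url ) );

       return array( $url, $trailing );
       }

    list( $url, $trailing ) = splitTrailingPunctuation( 'http://formvista.com/gallery.html,' );

    // $url is 'http://formvista.com/gallery.html' and $trailing is ','
    ```

    The cost is the same as described above: a URL that really does end in a period gets truncated, which is the outlier case the [url]...[/url] tags are there for.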

    Since these are plain links and not used in any src="" lines, I don't suspect there is any XSS issue, but to be on the safe side I require the domain portion of the link to be plain text. This code also does not support Unicode domains.
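
    Because the regex accepts any scheme (xyz://foo/bar will be matched), a more cautious caller might want to restrict which schemes get linked and HTML-escape the URL before putting it into the href. The following is only a sketch: rewriteLinkSafely() and its scheme allowlist are my assumptions, not part of the released code, and it assumes the surrounding text has not already been HTML-escaped.

    ```php
    <?php
    /**
     * Hypothetical alternative to the _rewriteLink() callback. The allowlist
     * of schemes is an assumption; adjust it to taste.
     *
     * @param array $matches same layout as for _rewriteLink(): [1] is the
     *                       optional open paren, [2] is the URL.
     */
    function rewriteLinkSafely( $matches )
       {
       $url = $matches[2];

       // if there is an explicit scheme, only link the ones we trust.
       if ( preg_match( '#^(\w+)://#', $url, $scheme ) &&
            ! in_array( strtolower( $scheme[1] ), array( 'http', 'https', 'ftp' ), true ) )
          {
          return $matches[0];   // leave the text exactly as it was
          }

       // escape &, quotes and angle brackets for use in HTML.
       $escaped = htmlspecialchars( $url, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8' );

       return $matches[1] . '<a href="' . $escaped . '">' . $escaped . '</a>';
       }
    ```

    It would be passed to preg_replace_callback() in place of '_rewriteLink'.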

    A period at the end of a domain name is always punctuation. (e.g. http://formvista.com.) However, a period at the end of the path or query string may or may not be. In this case we err on the side of assuming that it is.

    One of the big difficulties I had in evaluating the larger regular expressions you find on the net is that no one seems to document their expressions. Given how cryptic and error prone regular expression syntax is, it's no wonder that most regexes don't seem to do what the author intended.

    I put together a solution that seems to work in the test cases I've come up with. As I've mentioned, the intent is to find things that look like links, not to validate that they are correctly formatted links. I have applied some heuristics where I thought they made sense. I have also documented the regex extensively which makes it easier to figure out what the edge cases are.
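
    As an aside, PCRE's x (extended) modifier is another way to keep documentation next to the pattern: whitespace in the pattern is ignored and # starts a comment that runs to the end of the line. The pattern below is a deliberately tiny illustration of that style (scheme and host only), not the regex used by rewriteTextLinks. It uses ~ as the delimiter so that # is free for comments.

    ```php
    <?php
    $regex = '~
        (\w+)://            # capture 1: the scheme, any run of word characters
        (                   # capture 2: the host
          [\w-]+            #   first label
          (?: \. [\w-]+ )*  #   then zero or more dot-separated labels
        )
    ~x';

    preg_match( $regex, 'Cool site. http://formvista.com.', $m );

    // $m[1] is 'http' and $m[2] is 'formvista.com'; the trailing period is
    // left out because a dot only matches when another label follows it.
    ```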

    I've put it up here for download - rewriteTextLinks-0.9.tar.gz. It's released under a BSD-style license so you are free to do with it as you see fit. I've included a test harness.

    The function is ridiculously simple to use:

    <?php  include_once( "rewriteTextLinks.php" );  
    
    $text = "here is some text with (http://formvista.com/about.html) embedded link, http://dtlink.com.";  
    
    $linkedText = rewriteTextLinks( $text );  
    
    print( $linkedText );  
    ?> 

    I have not put it into production use yet, but that's next on the todo list. If you have any questions, comments or find edge cases that this does not handle, please register for an account here and post to the forum.

    Here's a listing of the code:

    <?php
    
    /*
    ---------------------------------------------------------------
    Copyright (c) 2010, DTLink,LLC
    All rights reserved.
    
    Redistribution and use in source and binary forms, with or
    without modification, are permitted provided that the following
    conditions are met:
    
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    
    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.
    
    * Neither the name of the <ORGANIZATION> nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.
    
    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
    CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
    INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
    CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
    USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
    AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
    IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
    THE POSSIBILITY OF SUCH DAMAGE.
    ---------------------------------------------------------------
    */
    
    /**
    * Rewrite Links in plain text string as HTML links. 
    *
    * @package rewriteTextLinks
    * @copyright DTLink, LLC 2010
    * @author Yermo Lamers
    *
    * @link http://formvista.com/web-and-ajax-development.html
    * @link http://formvista.com/fv-b-12-170/rewriteTextLinks-----a-function-to-make-links-in-blocks-of-text--quot-clickable-quot-.html
    *
    * @since 2010-11-22
    *
    * @version 0.9
    */
    
    /**
     * rewrites URL's in a string of text with the URL wrapped in link tags.
     *
     * This function finds substrings in a block of plain text that appear to be URLs
     * and rewrites them as HTML links. 
     *
     * This function is intended to be used to process text entered by users into blog comments,
     * discussion forums, and the like. 
     *
     * The problem of processing links in free form text is a bit challenging as it is not possible
     * to create a correct parser using regular expressions for this purpose. As such, this is a 99%
     * solution that involves a bit of heuristics to cover the typical ways users enter links into
     * posts.
     *
     * For example, a link such as http://formvista.com?var=value. is a valid URL even with the '.' included.
     * As a result, there is no way to determine from the link itself whether or not the '.' is significant
     * for the link or is just a mark of punctuation. The same applies to commas. For example, "This site, 
     * http://formvista.com?var=value, is cool."
     *
     * It is also important to note that the regular expression used here does not do any
     * validation. In the context of its intended use, anything that looks like a link is
     * assumed to be a link. For example xyz://foo/bar will be matched.
     *
     * @param string $source_string String possibly containing embedded URLs
     * @return string string with URLs replaced with <a> </a> wrapped links.
     *
     * @see test_rewriteTextLinks.php
     */
    
    function rewriteTextLinks( $source_string ) 
       {
    
       $source_string = ' ' . $source_string; // makes it easier to create the look-behind assertion to find the
                                              // start of a link.
    
    	/*
    	=========================================================================================
    
       $regex = '#';         // using # as the regex delimiter.
    
       $regex .= '(?<=[\s\<])';   // preceded by any whitespace character or open <. 
                                  // 
                                  // We don't want to rewrite URL's enclosed in ", or any that are part of an already present 
                                  // <a href="..">url</a> tag.
    
       $regex .= '(\([\s]*)?'; // matching pattern \1, handle links enclosed in parentheses, which is complicated by the fact
                               // that parens are valid characters in URLs and are in fact used on some sites, but are also
                               // often used in forums by users to separate out the url from the text.
                               //
                               // followed by 0 or more spaces, to catch links in parentheses such as (http://formvista.com)
                               // see conditional non-capturing sub pattern at the end which references this one. 
                               //
                               // imperfect heuristic, something like ( http://formvista.com) is much more likely than
                               // having a case like ( http://formvista.com/(foo) ).
    
       $regex .= '(';          //   matching pattern \2 which will match the whole URL
    
       // SCHEME -----------------------
    
       $regex .= '(?:([\w]+?://)|(www\.)|(mailto:)|(ftp\.))';  //   SCHEME (dots escaped so only a literal "." matches)
                                                             // 
                                                             // followed by one or more strings of word characters, ungreedy 
                                                             // matching, to make up the scheme part, followed by :// which 
                                                             // means it will match any scheme http://blah xyz://blah etc.
                                                             //
                                                             // imperfect heuristic. consider any string that begins with
                                                             // www., mailto: or ftp. to also be a URL.
    
       $regex .= '(?:';
    
       // DOMAIN -------------------------
    
       $regex .= '[\w\d@-]';            // DOMAIN. 
                                        //
                                        // restrict to alphanums and "normal" characters. Doesn't validate that
                                        // the domain is correctly formed. Will match http://test, http://1, etc. For domain only
                                        // urls the match will stop here.
                                        //
                                        // http://user@domain
                                        // http://domain:8080
                                        //
       $regex .= '|';                    // OR match a terminating character followed by the negative assertion
                                        // below, otherwise the last character gets consumed. This also has the positive 
                                        // benefit that it correctly prevents the . from being included in a domain only
       $regex .= '[.,;:]';              // name.
    
       $regex .= '(?!';            // look ahead and make certain that it's NOT 
          
       $regex .= '[\s/?.,;:]|(\))?'; // followed by a whitespace character/ or ? or any terminal or 1 close parenthesis.
                                     // parentheses are never allowed in the domain part.
                                     //
                                     // another heuristic. http://formvista.com.. should not match either period or any 
                                     // combination http://formvista.com;. http://formvista.com,: etc. This only applies
                                     // to the domain part. double dots, commas and the like are allowed in the path and
                                     // query string portions of the link and therefore cannot be considered a terminator.
    
       $regex .= '([\s]|$)';      // followed by a whitespace character or end of line
          
       $regex .= ')';             // end of negative look ahead looking for URL end.
    
       $regex .= ')+';            // end of DOMAIN part.
    
       // PATH AND/OR QUERY_STRING ----------------------------
    
       $regex .= '(?:((/)|(\?))'; // leading slash of path or query string start. The PATH and QUERY_STRING is handled in
                                  // one expression. Basically we don't care what the URL consists of, we just care that
                                  // we can more often than not correctly recognize its end in plain text. 
    
       $regex .= '(?:';        // non-capturing sub-pattern. start of pattern that matches everything from the end of 
                               // domain to the end of the URL
    
       $regex .= '[\w\\x80-\\xff\#$%&~/=?@\[\](+-]';   // followed by any word character, extended ascii code character,
                                                       // any of #$%&~/=?@[](+-
                                                       //
                                                       // NOTE absence of > and < characters which are not allowed in 
                                                       // URLs.
    
    
       $regex .= '|';          // OR set up a heuristic for handling the end of a link when it's not a space such as
                               // http://formvista.com/test. (e.g. period), or (http://formvista.com/test) or 
                               // http://formvista.com/test, etc.
       
       $regex .= '[.,;:]';     // match one of these characters and
    
       $regex .= '(?!';        // look ahead and make certain that it's NOT 
          
       $regex .= '[\s]|(\))?';    // followed by a whitespace character or 0 or 1 close parenthesis.
    
       $regex .= '([\s]|$)';      // followed by a whitespace character or end of line
          
       $regex .= ')';             // end of negative look ahead looking for URL end.
          
       $regex .= '|';             // OR handle special case that we matched an open paren (i.e. link is in parens
                                  // such as (http://formvista.com/test.html)
       
       $regex .= '(?(1)\)(?![\s<.,;:]|$)|\))';   // conditional subpattern if we matched an open parenthesis before 
                                                 // the scheme in the first parenthesized expression, then match a 
                                                 // close parenthesis NOT followed by a whitespace character, <.,;: 
                                                 // or end of line OR match a close parenthesis.
                                                 //
                                                 // i.e. if we didn't open with a paren as in (http://formvista.com) then 
                                                 // we'll match   all close parens unconditionally even if the url ends in one, 
                                                 // otherwise we match close parens that are NOT followed by a space, period, 
                                                 // etc. Another heuristic.
    
       $regex .= ')*';   //   END OF PATH, which is optional
    
       $regex .= ')?';   // END OF MATCH INCLUDING / or ?, which may or may not be present. (side effect, any other /'s or 
                         // ?'s are consumed by the PATH clause above.)
    
       $regex .= ')';      // end of whole URL match.
    
       $regex .= '#is';
    
    	======================================================================================
    	*/
    
       $regex = '#(?<=[\s\<])(\([\s]*)?((?:([\w]+?://)|(www\.)|(mailto:)|(ftp\.))(?:[\w\d@-]|[.,;:](?![\s/?.,;:]|(\))?([\s]|$)))+(?:((/)|(\?))(?:[\w\x80-\xff\#$%&~/=?@\[\](+-]|[.,;:](?![\s]|(\))?([\s]|$))|(?(1)\)(?![\s<.,;:]|$)|\)))*)?)#is';
    
       // END --------------------------------
    
       $transformed_string = preg_replace_callback( $regex, '_rewriteLink', $source_string );
    
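       // remove the space we added above. Note that trim() also strips any other
       // leading or trailing whitespace that was in the original text.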
    
       $transformed_string = trim( $transformed_string );
    
       return $transformed_string;
    
       }   // end of rewriteTextLinks()
    
    // -----------------------------------------------------------
    
    /**
    * callback function to rewrite matched URL's
    *
    * @param array $matches matches array from preg_replace_callback. URL is in $matches[2]. A matched open paren may be in $matches[1]
    * @return string rewritten URL.
    */
    
    function _rewriteLink( $matches )
       {
    
       // foreach( $matches as $name => $value )   print( "$name => $value\n" );
    
       // matches[1] is the, optional, opening paren
    
       return $matches[1] . "<a href=\"" . $matches[2] . "\">" . $matches[2] . "</a>";
    
       }
    
    ?>
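
    One thing to watch: per the SCHEME comment, the regex also treats strings beginning with a bare www. or ftp. as links, and _rewriteLink() puts the match straight into the href. Without a scheme, a browser would presumably resolve such an href relative to the current page. A possible fix is sketched below; hrefForMatch() is a hypothetical helper, not part of rewriteTextLinks-0.9, and the choice of http:// and ftp:// as defaults is my assumption.

    ```php
    <?php
    /**
     * Hypothetical helper: choose an absolute href for a match that may
     * begin with a bare "www." or "ftp." prefix.
     */
    function hrefForMatch( $url )
       {
       // an explicit scheme (http://, xyz://, ...) or mailto: is used as is.
       if ( preg_match( '#^(?:\w+://|mailto:)#i', $url ) )
          return $url;

       // assumption: a bare ftp. host means FTP, anything else means HTTP.
       if ( stripos( $url, 'ftp.' ) === 0 )
          return 'ftp://' . $url;

       return 'http://' . $url;
       }
    ```

    In _rewriteLink() this would mean using hrefForMatch( $matches[2] ) for the href while keeping $matches[2] as the visible link text.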
    
  • The problem with using regex's to parse URLs in text
    11/20/2010 8:51PM
  • Regex Library Site
    11/20/2010 6:20PM

    I hate writing regexes.

    Here's a library site of pre-built regexes: http://regexlib.com/

  • Links about PHP APC
    11/15/2010 8:05PM
  • Application Servers for/in PHP
    11/14/2010 7:29PM