HTML to database with a Perl script
Clean Start
At work, I was tasked to come up with a solution to a problem presented by a web developer who needed some 1,200 static HTML files stripped of the HTML markup, leaving behind only text. Because 1,200 files introduce too much complexity and are generally cumbersome to keep on a web server, a database-driven approach was necessary. Only a few Perl scripts would be needed to serve the pages. Although 1,200 files is perhaps an attractive approach to a low-tech solution, its simple bulk is unruly. Also, editing the text as database entries would be much simpler, employing only a few more fairly simple-to-implement Perl scripts.
In the solution described in this article, a Perl script strips the HTML markup with a simple regular expression and creates the text files that are put into a database.
The Solution
Initially, an HTML file employing three iframes is created (Listing 1) [1]. Each window is given a name with the name
attribute (line 7). A simple target attribute within an anchor element sends the data to the specified iframe. The first iframe, designated menu
, is a list of reference categories. The links from the menu
iframe are targeted to the iframe below it, list
(line 11), which contains the commands for the categories representing the language that was identified in the menu
iframe.
Listing 1
HTML iframes
01 <table border="0" cellpadding="0" cellspacing="0" align="center"> 02 <tr> 03 <td align="left" valign="top"> 04 <table border="0" cellpadding="0" cellspacing="0" align="center"> 05 <tr> 06 <td align="left" valign="top"> 07 <iframe width="150" height="249" src="menu.html" name="menu"></iframe> 08 </td> 09 </tr><tr> 10 <td align="left" valign="top"> 11 <iframe width="150" height="249" src="list.html" name="list"></iframe> 12 </td> 13 </tr> 14 </table> 15 <td align="left" valign="top"> 16 <iframe width="400" height="500" src="data.html" name="data"> </iframe> 17 </td> 18 </tr> 19 </table>
Once the list of commands is loaded, the target attribute is again used in a list of links to display the language reference for one command. Designated data
(line 16), this HTML file is the end of the decision tree – an intuitively obvious interface.
Originally, the data was displayed in a static HTML file. The Perl scripts employed once the flat files are inserted into the database are simple lists, as well, but with the capability to introduce features like hit counts, revision notes, and other cool and interesting features.
The decision was made that each line of each file should be a table row, with the content of each file being a table. The name of the file is also the table name, which is added to a lookup table for easy access later.
The beauty and flexibility of Perl is demonstrated very well in this small construction. The entire process took about a minute to perform on my little server and resulted in an easily referenced database, owing to the simplicity of the database model.
By nesting three foreach
loops (Listing 2, lines 51, 55, 65), you can perform the step-by-step operation with a minimum of fuss, cycling first through @dirlist
for the directory names, and then through @filelist
for each file, which is then opened and read. The file data is split on the newlines and placed into the @split_text
array. For each $filelist[$filelist_count]
, you add a table and populate it if the if
statement evaluates to true.
Listing 2
The Master Database
001 #!/usr/bin/perl 002 003 use CGI; 004 use DBI; 005 use CGI::Carp (fatalsToBrowser); 006 007 $tv = new CGI; 008 009 ### make the directory list 010 @dirlist = (CssBackgroundProperties, 011 CssBorderProperties, 012 CssClassificationProperties, 013 CssFontProperties, 014 CssClassificationProperties, 015 CssFontProperties, 016 CssMarginProperties, 017 CssPaddingProperties, 018 CssTextProperties, 019 DomCollectionsAndArrays, 020 DomEventHandlers, 021 DomMethods, 022 DomObjects, 023 DomProperties, 024 HtmlAttributes, 025 HtmlEventHandlers, 026 HtmlTags, 027 JavaScriptControlStatements, 028 JavaScriptGlobalFunctions, 029 JavaScriptObjectMethods, 030 JavaScriptObjectProperties, 031 JavaScriptObjects, 032 JavaScriptOperators, 033 JavaScriptStatements); 034 035 ### connect to the database 036 $dbh = DBI->connect('DBI:mysql:host=mysql.domain.com;database=databasename', 'username', 'password', {'RaiseError' => 1}) or die "Cannot Connect to Database"; 037 038 ### create the master table 039 $query = qq{CREATE TABLE master ( 040 id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, 041 tablename VARCHAR (200) NOT NULL, 042 category VARCHAR (200) 043 )}; 044 $sth = $dbh->prepare($query); 045 $sth->execute(); 046 047 $count = "0"; 048 $dirlist_count = "0"; 049 $filelist_count = "0"; 050 $type = "text"; 051 foreach (@dirlist) { 052 while ($thisfile = <$dirlist[$dirlist_count]/*.txt>) { 053 push @filelist, $thisfile; 054 } 055 foreach (@filelist) { 056 open (FILEHANDLE, "$dirlist[$dirlist_count]/$filelist[$filelist_count]"); 057 $text = ""; 058 $newtext = ""; 059 while (read (FILEHANDLE, $newtext, 1)) { 060 $text .= $newtext; 061 } 062 close FILEHANDLE; 063 @split_text = split "\n", $text; 064 $split_text_count = "0"; 065 foreach (@split_text) { 066 $split_text[$split_text_count] =~ s/\n+//g; 067 $this_split = $split_text[$split_text_count]; 068 $this_split =~ s/[ ]+//g; 069 if ($this_split != "") { 070 $query = qq{CREATE TABLE $filelist[$filelist_count] ( 071 type VARCHAR (200) NOT NULL, 072 text BLOB 073 )}; 074 $sth = $dbh->prepare($query); 075 $sth->execute(); 076 077 $query = qq{INSERT INTO $filelist[$filelist_count] ( 078 type, 079 text 080 ) VALUES ( 081 '$type', 082 '$split_text[$split_text_count]' 083 )}; 084 $sth = $dbh->prepare($query); 085 $sth->execute(); 086 087 $query = qq{INSERT INTO master ( 088 tablename, 089 category 090 ) VALUES ( 091 '$filelist[$filelist_count]', 092 '$dirlist[$dirlist[$count]' 093 )}; 094 $sth = $dbh->prepare($query); 095 $sth->execute(); 096 097 ++$split_text_count; 098 ++$count; 099 } 100 ++$filelist_count; 101 } 102 ++$dirlist_count; 103 } 104 105 ### print da shtuff 106 print qq{Content-type: text/html\n\n}; 107 print qq{I'm Done!<br>}; 108 print qq{I have created $count tables}; 109 exit;
The @dirlist
array holds the names of the directories in which the files are placed, and each file within the directory is glob'd for the filenames, which are placed in @filelist
(lines 52, 53). A foreach
loop then opens and reads each file, byte-by-byte, the content of which is placed in the aptly name scalar $text
.
Next, the code splits $text
on the newlines and adds each new line of text into @split_text
, removes the newline characters (line 66), and proceeds to make a copy of each item within $this_split
to later test for whether the array item has no data (lines 63-68). Because each original text file has several blank lines, so will the @split_text
array have empty items (file rows) that you won't want to see as part of the database.
Every space is removed from the copied array item. If the copied item is empty, the script will not do anything but continue on to the next item in @split_text
through the use of the aforementioned foreach
loop.
If $this_split != ""
evaluates to true, the program creates a table using the filename as the table name and populates it with rows, each row of the text file being a row in the table. Each table name is then placed in the master lookup table.
Conclusion
Replacing the many HTML files that were used to display the menu items with one Perl script is a great solution. The only change you would have to make to the iframe is to the src
attributes, which will now be .cgi
files, of course designating a Perl script.
As touched on earlier, you can add features like hit counts, notes, and credits for those that work on the online reference just created. A very wide range of options like colors, fonts, and general overall look can be tailored to each individual user through the use of maybe one or two new tables or a more involved table column alteration.
The approach of looping through a multitiered decision tree made this project fun and was completed in one morning. The entire reference, now in a database, is served in an intuitively obvious way which makes it a joy to work with for users. Since this project was completed, many new features have been employed.
One of the greatest new capabilities that has become possible and very popular recently is a simple and easy-to-use language translator engine. Real-time language translators can translate the original English language entries on the fly into pretty much any language spoken today. These language translators can be accessed from a browser or a service, such as an application employed from a Perl script.
Happy coding!
Infos
- Code for this article: ftp://ftp.linux-magazine.com/pub/listings/admin-magazine.com/60/
Buy this article as PDF
(incl. VAT)
Buy ADMIN Magazine
Subscribe to our ADMIN Newsletters
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Most Popular
Support Our Work
ADMIN content is made possible with support from readers like you. Please consider contributing when you've found an article to be beneficial.