« Previous 1 2
Security data analytics and visualization with R
Data Analysis
R XML Translator
The last scenario is less about visualization, although it is an effort to make machine-readable code human-readable, which in itself is a sort of visualization.
In my line of work, data from tools such as vulnerability and service scanners, as well as threat intelligence and information sharing resources, is exported in XML format. XML is fabulous for machine-to-machine conversations, but a human analyst might not have the means to "translate" such a file, such as the appropriate application to import it with or the XSLT needed to render it in a human-readable manner via a browser, for example.
Another file type that falls into this category is the STIX (Structured Threat Information eXpression) file, which was developed by Mitre and often shared via the TAXII (Trusted Automated eXchange of Indicator Information) transport mechanism. Although I've written an R STIX translator, I do not have a good sample file to share for a demonstration (most STIX-based threat intelligence information is not meant for public consumption), so I need a similar exemplar.
Luckily, my friend and compatriot in all things security tooling, NJ Ouchn (@toolswatch ), wrote vFeed [16], which exports Common Vulnerabilities and Exposures (CVE) as XML data. Per the toolswatch GitHub page, the vFeed framework "provides extra structured detailed third-party references and technical characteristics for a CVE entry through an extensible XML schema. It also improves the reliability of CVEs by providing a flexible and comprehensive vocabulary for describing the relationship with other standards and security references." With vFeed I ran
vfeedcli.py export CVE_2013_3893
which resulted in the file shown in Figure 7. This is the raw XML that I want to make a bit more friendly for analysis.
Herein lurks a golden opportunity to share a bit about the R XML
and xlsx
packages and how to use them. The XML package includes tools for parsing and generating XML, whereas the xlsx package allows you to read and write XLSX files (similar to using XLConnect to control spreadsheet appearance with data formats, fonts, colors, and border settings). To that end, the script in Listing 3 manipulates CVE_2013_3893.xml
so that a human analyst can read a quick summary as a nicely formatted spreadsheet.
Listing 3
vFeedConverter.0.2.R
01 ## vFeed XML (machine) to XLSX (human) converter 02 ## Russ McRee, @holisticinfosec 03 04 library(XML) 05 library(data.table) 06 library(xlsx) 07 08 ## Set working directory 09 setwd("~/R/vFeed") 10 11 ### Remove all the XML comment entities in vFeed file without parsing the XML 12 ## Read in the vFeed file 13 txt <- readLines(file("CVE_2013_3893.xml")) 14 15 ## Regex to remove line with <!-- 16 txt <- gsub("(?i)<!--[^>]*>","", txt) 17 18 ## Write results back to XML file 19 writeLines(txt, file("CVE_2013_3893_summary.xml")) 20 21 ## XML processing 22 doc <- xmlTreeParse("CVE_2013_3893_summary.xml") 23 rootNode <- xmlRoot(doc) 24 25 Name <- xmlSApply(rootNode[[1]][["name"]],xmlValue) 26 Summary <- xmlSApply(rootNode[[2]][["summary"]],xmlValue) 27 CVE_Reference <- xmlSApply(rootNode[[2]][["cve_ref"]],xmlValue) 28 29 data <- data.table(Name,Summary,CVE_Reference) 30 outwb <- createWorkbook() 31 32 # Define cell styles within workbook 33 csTitle <- CellStyle(outwb) + Font(outwb, heightInPoints=14, isBold=TRUE) 34 csSubTitle <- CellStyle(outwb) + Font(outwb, heightInPoints=12, isItalic=TRUE, isBold=FALSE) 35 csTableColNames <- CellStyle(outwb) + Font(outwb, isBold=TRUE) + Alignment(wrapText=TRUE) + Border(color="black", position=c("TOP", "BOTTOM"), pen=c("BORDER_THIN", "BORDER_THICK")) 36 csBody <- CellStyle(outwb) + Alignment(wrapText=TRUE) 37 38 # Create sheet 39 sheet <- createSheet(outwb, sheetName = "vFeed data") 40 41 ## Sheet title 42 rows <- createRow(sheet,rowIndex=1) 43 sheetTitle <- createCell(rows,colIndex=2) 44 setCellValue(sheetTitle[[1,1]], "Threat Intelligence & Engineering") 45 setCellStyle(sheetTitle[[1,1]], csTitle) 46 47 # Sheet subtitle 48 rows <- createRow(sheet,rowIndex=2) 49 sheetSubTitle <- createCell(rows,colIndex=2) 50 setCellValue(sheetSubTitle[[1,1]], "vFeed to XLSX") 51 setCellStyle(sheetSubTitle[[1,1]], csSubTitle) 52 53 # Body 54 rows <- addDataFrame(data,sheet,startRow=4, startColumn=1, colnamesStyle = csTableColNames, colStyle=list('2'=csBody, '3'=csBody)) 55 setColumnWidth(sheet,colIndex=1,colWidth=5) 56 setColumnWidth(sheet,colIndex=2,colWidth=30) 57 setColumnWidth(sheet,colIndex=3,colWidth=50) 58 setColumnWidth(sheet,colIndex=4,colWidth=75) 59 setColumnWidth(sheet,colIndex=5,colWidth=75) 60 61 saveWorkbook(outwb, "vFeed.xlsx")
Once again I call the appropriate packages and set the working directory in lines 4-9. The vFeed XML extract has numerous XML comments in the file body. This makes for more complicated parsing, and the comments aren't necessary for the summary, so I simply wiped them out. To remove all the XML comment entities I first read the vFeed file into txt
as raw text (line 13) then use regex to remove all lines with comment tags <!--
:
txt <- gsub("(?i)<!--[^>]*>","", txt)
(line 16). I then write the results back to XML file with writeLines
(line 19). Now with a cleaner XML file to parse, I do some XML processing with the XML
package. If you'd like a great primer, read Tobi Bosede's Informit article [17]. Line 22 parses the XML and generates an R structure representing the XML tree with xmlTreeParse
. The xmlRoot
function (line 23) provides access to the top-level XMLNode object generated with xmlTreeParse
and assigned to the doc
variable.
The Name
, Summary
, and CVE_Reference
variables pull the content needed for the summary from the data now defined in the rootNode
variable. For example, line 26 uses xmlSApply
, an sapply
function wrapper for XML.
If I were to run xmlName(rootNode)
, I would discover that the node is named vFeed
. With xmlSize(rootNode)
, I would learn that there are two child nodes. To learn their names, I would use
xmlName(rootNode[[1]]) xmlName(rootNode[[2]])
resulting in release and entry .
After further enumerating the subnodes of the children, I know I want to return the xmlValue
of "name"
, "summary"
, and "cve_ref"
. I write the results of the three xmlSApply
runs to the data
variable (line 29) to create a data table, then I begin the createWorkbook
function from the xlsx package.
The cell style variables in lines 33-36 are self-evident: They help define title appearance, where we want bold and italics applied, and text wrapping. Note that on Linux systems, the text wrapping definition doesn't take when you open the resulting XLSX file with Gnumeric or LibreOffice Calc; however, it works quite nicely on Microsoft Excel.
The remainder of the script creates the worksheet and its name; sets the title and subtitle, including calls to the cell style (cs
) variables defined earlier; then defines the body data frame, including column width by position. The last line saves the results as a XLSX file. The results are seen in Figure 8. This spreadsheet output works a lot better if you're a person.
Whereas the XML file might have hundreds of nodes and children – as is often the case with STIX files – they'd all be tidily rendered in the spreadsheet as long as you've enumerated and defined them properly in your R script.
Although all this information might seem somewhat imposing, if you're neither an information security analyst nor an R programmer, I do hope this introduction to R has intrigued you at a minimum. Use this as an opportunity to install R, RStudio, the system dependencies, and the packages I used, and experiment with the scripts and the example datasets.
If you're properly motivated, you can read the Cookbook for R [9] and buy the R Cookbook [18] and R Graphics Cookbook [19]. Also remember that the Coursera Data Science offerings are a fabulous way to get underway with R. If you're ready to work specifically with information security data and R, remember to refer to the Data Driven Security website [20], which also includes related Python activities.
If you subscribe to the principles of "a picture is worth a thousand words," you should now be sufficiently compelled to begin your R practice or further enhance that which you've already established. Cheers, and feel free to ping me if you have questions.
Infos
- Coursera Data Science courses: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage
- Tufte, Edward R. Beautiful Evidence , Chapter 5. Graphics PR, 2006
- Jacobs, Jay, and Bob Rudis. Data-Driven Security: Analysis, Visualization and Dashboards . Wiley, 2014
- toolsmith – Jay and Bob Strike Back: Data-Driven Security: http://holisticinfosec.blogspot.com/2014/09/toolsmith-jay-and-bob-strike-back-data.html
- Code for this article: ftp://ftp.linux-magazin.com/pub/listings/admin-magazine.com/24
- R Project: http://www.r-project.org
- CRAN mirrors: http://cran.r-project.org/mirrors.html
- RStudio: http://www.rstudio.com/
- Cookbook for R : http://www.cookbook-r.com/
- R-bloggers: http://www.r-bloggers.com
- MAGento server MAGMI plugin – RFI: http://www.exploit-db.com/exploits/35052/
- Help topics: http://docs.ggplot2.org/current/
- "A Very Quick Introduction to ggplot2" by Christophe Ladroue: http://chrisladroue.com/extra/ggplot2Intro/introductionGGplot2/
- "R vs Excel for Data Analysis" by Chris Leonard: http://r-dir.com/blog/2013/11/r-vs-excel-for-data-analysis.html
- "Introduction to dplyr" by Hadley Wickham: http://rpubs.com/hadley/dplyr-intro
- vFeed at GitHub: https://github.com/toolswatch/vFeed/
- "Working with XML Data in R" by Tobi Bosede, http://www.informit.com/articles/article.aspx?p=2215520
- Teetor, Paul. R Cookbook . O'Reilly Media, 2011
- Chang, Winston. R Graphics Cookbook . O'Reilly Media, 2013
- Data-driven security: http://datadrivensecurity.info/
« Previous 1 2
Buy this article as PDF
(incl. VAT)