Security data analytics and visualization with R

Data Analysis

R XML Translator

The last scenario is less about visualization, although it is an effort to make machine-readable code human-readable, which in itself is a sort of visualization.

In my line of work, data from tools such as vulnerability and service scanners, as well as threat intelligence and information sharing resources, is exported in XML format. XML is fabulous for machine-to-machine conversations, but a human analyst might not have the means to "translate" such a file, such as the appropriate application to import it with or the XSLT needed to render it in a human-readable manner via a browser, for example.

Another file type that falls into this category is the STIX (Structured Threat Information eXpression) file, which was developed by Mitre and often shared via the TAXII (Trusted Automated eXchange of Indicator Information) transport mechanism. Although I've written an R STIX translator, I do not have a good sample file to share for a demonstration (most STIX-based threat intelligence information is not meant for public consumption), so I need a similar exemplar.

Luckily, my friend and compatriot in all things security tooling, NJ Ouchn (@toolswatch ), wrote vFeed [16], which exports Common Vulnerabilities and Exposures (CVE) as XML data. Per the toolswatch GitHub page, the vFeed framework "provides extra structured detailed third-party references and technical characteristics for a CVE entry through an extensible XML schema. It also improves the reliability of CVEs by providing a flexible and comprehensive vocabulary for describing the relationship with other standards and security references." With vFeed I ran

vfeedcli.py export CVE_2013_3893

which resulted in the file shown in Figure 7. This is the raw XML that I want to make a bit more friendly for analysis.

Figure 7: Raw vFeed XML data.

Herein lurks a golden opportunity to share a bit about the R XML and xlsx packages and how to use them. The XML package includes tools for parsing and generating XML, whereas the xlsx package allows you to read and write XLSX files (similar to using XLConnect to control spreadsheet appearance with data formats, fonts, colors, and border settings). To that end, the script in Listing 3 manipulates CVE_2013_3893.xml so that a human analyst can read a quick summary as a nicely formatted spreadsheet.

Listing 3

vFeedConverter.0.2.R

01 ## vFeed XML (machine) to XLSX (human) converter
02 ## Russ McRee, @holisticinfosec
03
04 library(XML)
05 library(data.table)
06 library(xlsx)
07
08 ## Set working directory
09 setwd("~/R/vFeed")
10
11 ### Remove all the XML comment entities in vFeed file without 
       parsing the XML
12 ## Read in the vFeed file
13 txt <- readLines(file("CVE_2013_3893.xml"))
14
15 ## Regex to remove line with <!--
16 txt <- gsub("(?i)<!--[^>]*>","", txt)
17
18 ## Write results back to XML file
19 writeLines(txt, file("CVE_2013_3893_summary.xml"))
20
21 ## XML processing
22 doc <- xmlTreeParse("CVE_2013_3893_summary.xml")
23 rootNode <- xmlRoot(doc)
24
25 Name <- xmlSApply(rootNode[[1]][["name"]],xmlValue)
26 Summary <- xmlSApply(rootNode[[2]][["summary"]],xmlValue)
27 CVE_Reference <- xmlSApply(rootNode[[2]][["cve_ref"]],xmlValue)
28
29 data <- data.table(Name,Summary,CVE_Reference)
30 outwb <- createWorkbook()
31
32 # Define cell styles within workbook
33 csTitle <- CellStyle(outwb) 
   + Font(outwb, heightInPoints=14, isBold=TRUE)
34 csSubTitle <- CellStyle(outwb) 
   + Font(outwb, heightInPoints=12, isItalic=TRUE, isBold=FALSE)
35 csTableColNames <- CellStyle(outwb) 
   + Font(outwb, isBold=TRUE) + Alignment(wrapText=TRUE) 
   + Border(color="black", position=c("TOP", "BOTTOM"), 
     pen=c("BORDER_THIN", "BORDER_THICK"))
36 csBody <- CellStyle(outwb) + Alignment(wrapText=TRUE)
37
38 # Create sheet
39 sheet <- createSheet(outwb, sheetName = "vFeed data")
40
41 ## Sheet title
42 rows <- createRow(sheet,rowIndex=1)
43 sheetTitle <- createCell(rows,colIndex=2)
44 setCellValue(sheetTitle[[1,1]], "Threat Intelligence & Engineering")
45 setCellStyle(sheetTitle[[1,1]], csTitle)
46
47 # Sheet subtitle
48 rows <- createRow(sheet,rowIndex=2)
49 sheetSubTitle <- createCell(rows,colIndex=2)
50 setCellValue(sheetSubTitle[[1,1]], "vFeed to XLSX")
51 setCellStyle(sheetSubTitle[[1,1]], csSubTitle)
52
53 # Body
54 rows <- addDataFrame(data,sheet,startRow=4, 
   startColumn=1, colnamesStyle = csTableColNames, 
   colStyle=list('2'=csBody, '3'=csBody))
55 setColumnWidth(sheet,colIndex=1,colWidth=5)
56 setColumnWidth(sheet,colIndex=2,colWidth=30)
57 setColumnWidth(sheet,colIndex=3,colWidth=50)
58 setColumnWidth(sheet,colIndex=4,colWidth=75)
59 setColumnWidth(sheet,colIndex=5,colWidth=75)
60
61 saveWorkbook(outwb, "vFeed.xlsx")

Once again I call the appropriate packages and set the working directory in lines 4-9. The vFeed XML extract has numerous XML comments in the file body. This makes for more complicated parsing, and the comments aren't necessary for the summary, so I simply wiped them out. To remove all the XML comment entities I first read the vFeed file into txt as raw text (line 13) then use regex to remove all lines with comment tags <!--:

txt <- gsub("(?i)<!--[^>]*>","", txt)

(line 16). I then write the results back to XML file with writeLines (line 19). Now with a cleaner XML file to parse, I do some XML processing with the XML package. If you'd like a great primer, read Tobi Bosede's Informit article [17]. Line 22 parses the XML and generates an R structure representing the XML tree with xmlTreeParse. The xmlRoot function (line 23) provides access to the top-level XMLNode object generated with xmlTreeParse and assigned to the doc variable.

The Name, Summary, and CVE_Reference variables pull the content needed for the summary from the data now defined in the rootNode variable. For example, line 26 uses xmlSApply, an sapply function wrapper for XML.

If I were to run xmlName(rootNode), I would discover that the node is named vFeed . With xmlSize(rootNode), I would learn that there are two child nodes. To learn their names, I would use

xmlName(rootNode[[1]])
xmlName(rootNode[[2]])

resulting in release and entry .

After further enumerating the subnodes of the children, I know I want to return the xmlValue of "name", "summary", and "cve_ref". I write the results of the three xmlSApply runs to the data variable (line 29) to create a data table, then I begin the createWorkbook function from the xlsx package.

The cell style variables in lines 33-36 are self-evident: They help define title appearance, where we want bold and italics applied, and text wrapping. Note that on Linux systems, the text wrapping definition doesn't take when you open the resulting XLSX file with Gnumeric or LibreOffice Calc; however, it works quite nicely on Microsoft Excel.

The remainder of the script creates the worksheet and its name; sets the title and subtitle, including calls to the cell style (cs) variables defined earlier; then defines the body data frame, including column width by position. The last line saves the results as a XLSX file. The results are seen in Figure 8. This spreadsheet output works a lot better if you're a person.

Figure 8: The resulting vFeed XLSX human-readable summary.

Whereas the XML file might have hundreds of nodes and children – as is often the case with STIX files – they'd all be tidily rendered in the spreadsheet as long as you've enumerated and defined them properly in your R script.

Although all this information might seem somewhat imposing, if you're neither an information security analyst nor an R programmer, I do hope this introduction to R has intrigued you at a minimum. Use this as an opportunity to install R, RStudio, the system dependencies, and the packages I used, and experiment with the scripts and the example datasets.

If you're properly motivated, you can read the Cookbook for R [9] and buy the R Cookbook [18] and R Graphics Cookbook [19]. Also remember that the Coursera Data Science offerings are a fabulous way to get underway with R. If you're ready to work specifically with information security data and R, remember to refer to the Data Driven Security website [20], which also includes related Python activities.

If you subscribe to the principles of "a picture is worth a thousand words," you should now be sufficiently compelled to begin your R practice or further enhance that which you've already established. Cheers, and feel free to ping me if you have questions.

Infos

  1. Coursera Data Science courses: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage
  2. Tufte, Edward R. Beautiful Evidence , Chapter 5. Graphics PR, 2006
  3. Jacobs, Jay, and Bob Rudis. Data-Driven Security: Analysis, Visualization and Dashboards . Wiley, 2014
  4. toolsmith – Jay and Bob Strike Back: Data-Driven Security: http://holisticinfosec.blogspot.com/2014/09/toolsmith-jay-and-bob-strike-back-data.html
  5. Code for this article: ftp://ftp.linux-magazin.com/pub/listings/admin-magazine.com/24
  6. R Project: http://www.r-project.org
  7. CRAN mirrors: http://cran.r-project.org/mirrors.html
  8. RStudio: http://www.rstudio.com/
  9. Cookbook for R : http://www.cookbook-r.com/
  10. R-bloggers: http://www.r-bloggers.com
  11. MAGento server MAGMI plugin – RFI: http://www.exploit-db.com/exploits/35052/
  12. Help topics: http://docs.ggplot2.org/current/
  13. "A Very Quick Introduction to ggplot2" by Christophe Ladroue: http://chrisladroue.com/extra/ggplot2Intro/introductionGGplot2/
  14. "R vs Excel for Data Analysis" by Chris Leonard: http://r-dir.com/blog/2013/11/r-vs-excel-for-data-analysis.html
  15. "Introduction to dplyr" by Hadley Wickham: http://rpubs.com/hadley/dplyr-intro
  16. vFeed at GitHub: https://github.com/toolswatch/vFeed/
  17. "Working with XML Data in R" by Tobi Bosede, http://www.informit.com/articles/article.aspx?p=2215520
  18. Teetor, Paul. R Cookbook . O'Reilly Media, 2011
  19. Chang, Winston. R Graphics Cookbook . O'Reilly Media, 2013
  20. Data-driven security: http://datadrivensecurity.info/

The Author

Russ McRee, GSE, MSISE, directs a Security Assessment team for the Microsoft Cloud Infrastructure & Operations organization. He writes "toolsmith," a monthly column for the ISSA Journal, and has spoken at conferences such as DEFCON, BlueHat, Black Hat, DerbyCon, SANSFIRE, and others.

Russ is a SANS Internet Storm Center handler, and his work includes service in the Washington State Guard as the Cybersecurity Advisor to the Washington Military Department. Russ (mailto:russ@holisticinfosec.org) advocates a holistic approach to the practice of information assurance and, as such, maintains http://holisticinfosec.org.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus