diff --git a/website/static/aeri_stacktraces/incidents_analysis.pdf b/website/static/aeri_stacktraces/incidents_analysis.pdf new file mode 100644 index 0000000000000000000000000000000000000000..bea9ad3fbe30bca218c62c3426b7f162e7b890ab Binary files /dev/null and b/website/static/aeri_stacktraces/incidents_analysis.pdf differ diff --git a/website/static/aeri_stacktraces/incidents_analysis.rmd b/website/static/aeri_stacktraces/incidents_analysis.rmd new file mode 100644 index 0000000000000000000000000000000000000000..c08e3fc31149527d10f3f9781e7b04b20ec37d37 --- /dev/null +++ b/website/static/aeri_stacktraces/incidents_analysis.rmd @@ -0,0 +1,581 @@ +--- +title: "StackTraces -- Incidents" +subtitle: "R Analysis document" +author: "Boris Baldassari -- Castalia Solutions" +output: + pdf_document: + toc: yes + toc_depth: 3 + keep_tex: true + extra_dependencies: + - grffile + html_document: + toc: yes + toc_depth: 2 + word_document: + toc: yes + toc_depth: '2' +--- + +```{r init, message=FALSE, echo=FALSE, cache=FALSE} +library(ggplot2) +library(ggthemes) +library(knitr) + +library(kableExtra) +options(knitr.table.format = "latex") + +library(parsedate) +library(magrittr) + +require(xts) +``` + +```{r init.read, message=FALSE, echo=FALSE, cache=TRUE} +# Read csv file +file.in <- "incidents_extract.csv" +myincidents <- read.csv(file.in, header=T, quote='"') +file.in.bundles <- "incidents_bundles_extract.csv" +mybundles <- read.csv(file.in.bundles, header=T, quote='"') + +# Create xts object +myincidents <- myincidents[myincidents$timestamp != '',] +myincidents <- myincidents[myincidents$savedOn != '',] +myp.xts <- xts(x = myincidents, order.by = parse_iso_8601(myincidents$timestamp)) +``` + +# Introduction + +## About this dataset + +The [Automated Error Reporting](https://wiki.eclipse.org/EPP/Logging) (AERI) system retrieves [information about exceptions](https://www.codetrails.com/error-analytics/manual/). 
It is installed by default in the [Eclipse IDE](http://www.eclipse.org/ide/) and has helped hundreds of projects better support their users and resolve bugs. + +This dataset is a dump of all records over a couple of years, with useful information about the exceptions and environment. + +* **Generated date**: `r date()` +* **First date**: `r first(index(myp.xts))` +* **Last date**: `r last(index(myp.xts))` +* **Number of incidents**: `r nrow(myp.xts)` +* **Number of attributes**: `r ncol(myp.xts)` + +## Terminology + +* **Incidents** When an exception occurs and is trapped by the AERI system, it constitutes an incident (or error report). An incident can be reported by several different people, can be reported multiple times, and can be linked to different environments. +* **Problems** As soon as an error report arrives on the server, it will be analyzed and subsequently assigned to one or more problems. A problem thus represents a set of (similar) error reports which usually have the same root cause – for example a bug in your software. (Extract from the [AERI system documentation](https://www.codetrails.com/error-analytics/manual/concepts/error-reports-problems-bugs-projects.html)) + +This dataset targets only the Incidents of the AERI dataset. There is another dedicated document for the Problems. + +## Privacy concerns + +We value privacy and intend to do everything we can to prevent misuse of the dataset. If you think we failed somewhere in the process, please [let us know](https://www.crossminer.org/contact) so we can do better. + +The AERI system itself doesn't gather much private information, and takes great care of it. This dataset goes a step further and removes all identifiable information. + +* There is **no email address** in this dataset, **nor any UUID**. +* People not willing to share their traces with the AERI system can tick the private option. 
This choice has been respected, and all classes that do not belong to a public hierarchy have been hidden thanks to an anonymisation mechanism. + +The anonymisation technique used basically encrypts information and then throws away the private key. Please refer to the [documentation published on GitHub](https://github.com/borisbaldassari/data-anonymiser) for more details. + + +## About this document + +This document is an [R Markdown document](http://rmarkdown.rstudio.com) and is composed of both text (like this one) and dynamically computed information (mostly in the Analysis section below) executed on the data itself. This ensures that the information is always synchronised with the data, and serves as a test suite for the dataset. + + +# Structure of data + +The plugin collects a [lot of useful information](https://www.codetrails.com/error-analytics/manual/misc/sent-data.html). We only use a subset of it, as required by research interest and privacy protection concerns. + +The Incidents dataset comes in two flavours: `All incidents`, in JSON format, and `Incidents extract`, in CSV format. There is also a list of bundles discovered in the data dump with their version and number of attached incidents. + + +## All incidents (JSON) + +**All incidents** is the most complete dataset, with all attributes, stacktraces and bundles. Since the stacktraces and bundles structures are too complex for CSV, only the JSON export contains them. The dataset comes as a quite large compressed archive, with one JSON file per incident. This represents a total of `r nrow(myincidents)` files (incidents). 
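+
+As an illustration, one incident file could be loaded in R roughly as follows. This is a minimal sketch, not part of the dataset tooling: it assumes the `jsonlite` package is installed, and `incident.json` is a hypothetical name standing for any file extracted from the archive. The field names are those of the structure shown in this section.
+
+```r
+library(jsonlite)
+
+# 'incident.json' is a hypothetical file name; use any file from the extracted archive.
+incident <- fromJSON("incident.json", simplifyVector = FALSE)
+
+# Top-level attributes of the incident (eclipseBuildId, kind, osgiOs, ...).
+names(incident)
+
+# Each stacktrace is a list of frames carrying the cN, mN, fN and lN fields.
+# This assumes the incident holds at least one stacktrace.
+frames <- incident$stacktraces[[1]]
+sapply(frames, function(f) paste(f$cN, f$mN, sep = "."))
+```
+
+Passing `simplifyVector = FALSE` returns plain nested lists, which may be easier to walk than the data frames `jsonlite` builds by default.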
+ + +The structure of an incident file is exemplified below: + + { + "eclipseBuildId": "4.6.1.M20160907-1200", + "eclipseProduct": "org.eclipse.epp.package.jee.product", + "fingerprint": "cd03d068798d141412b1d1605892fbec", + "fingerprint2": "12166d864efb7adcccc187034deb7dbf", + "javaRuntimeVersion": "1.8.0_112-b15", + "kind": "NORMAL", + "osgiArch": "x86_64", + "osgiOs": "Windows7", + "osgiOsVersion": "6.1.0", + "osgiWs": "win32", + "presentBundles": [ + [ "bundle" ] + ], + "savedOn": "2016-11-08T10:23:01.914Z", + "severity": "UNKNOWN", + "stacktraces": [ + [ "stacktrace" ] + ], + "status": { + "code": 0, + "fingerprint": "98631af2ddb2d197ebdca532f19d082b", + "message": "Failed to retrieve default libraries for C:\\Program Files\\Java\\jre1.8.0_111", + "pluginId": "org.eclipse.jdt.launching", + "pluginVersion": "3.8.100.v20160505-0636", + "severity": 4 + }, + "timestamp": "2016-11-08T10:22:59.204Z" + } + +The structure used in MongoDB for stacktraces has been kept as is: each line of the stacktrace is an object whose fields hold all the information relevant to that line. Each stacktrace is an array of such objects, as shown below: + + [ + { + "cN": "sun.net.www.http.HttpClient", + "mN": "parseHTTPHeader", + "fN": "HttpClient.java", + "lN": 786 + } + ] + +Bundles have the following format: + + { + "name": "org.eclipse.egit.core", + "version": "4.1.1.201511131810-r" + } + + +## Incidents extract (CSV) + +The **Incidents extract** CSV dataset provides the same information as the full JSON dataset, excluding complex structures that cannot be easily formatted in CSV: stacktraces, bundles, products. + +Attributes are: ``r names(myincidents)``. + +Examples are provided at the end of this file to demonstrate how to use it in R. + + +## Bundles extract (CSV) + +The **Bundles extract** CSV dataset lists the Eclipse bundles and versions associated with incidents, with the number of incidents for each pair. 
+ +```{r bundles.table, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +#library(DT) +#datatable(mybundles[seq(1,10),], options=list(paging=F)) +kable(mybundles[seq(1,30),]) +``` + +# Attributes + +## Message + +* Description: A short text summarising the error. +* Type: String + +## Code + +* Description: The numeric status code logged with the error. +* Type: Integer + +```{r attr.code.init, warning=FALSE, echo=FALSE, cache=TRUE} +mysum <- summary(myincidents$code) +``` + +Statistical summary: + +* Range [ `r format(mysum[[1]], scientific = FALSE)` : `r format(mysum[[6]], scientific = FALSE)` ] +* 1st Quartile `r mysum[[2]]` +* Median `r mysum[[3]]` +* Mean `r format(mysum[[4]], scientific = FALSE)` +* 3rd Quartile `r format(mysum[[5]], scientific = FALSE)` + +## Severity + +* Description: An estimate by the user reporting the error about its perceived severity. +* Type: Factors + +```{r attr.severity.init, warning=FALSE, echo=FALSE, cache=TRUE} +mysum <- summary(myincidents$severity) +``` + +Distribution: + +* CRITICAL `r mysum[[c('CRITICAL')]]` +* MAJOR `r mysum[[c('MAJOR')]]` +* MINOR `r mysum[[c('MINOR')]]` +* NO_BUG `r mysum[[c('NO_BUG')]]` +* TRIVIAL `r mysum[[c('TRIVIAL')]]` +* UNKNOWN `r mysum[[c('UNKNOWN')]]` + + +## Kind {#attr_kind} + +* Description: The type of error recorded, as identified by the AERI system. 
+* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.kind, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +kinds <- table(myincidents$kind) +kinds <- kinds[kinds != 0] +kinds <- kinds[order(kinds, decreasing = TRUE)] +t <- lapply(names(kinds), function(x) paste('* ', x, ' (count: ', kinds[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +**Notes** + +There are different kinds of incidents described in the [official documentation](https://www.codetrails.com/error-analytics/manual/concepts/incident-kinds.html): + +* Normal Error: Normal errors are all exceptions that were reported by a client but that are not of a kind defined below. Common examples of a normal error are a `NullPointerException` or `IllegalArgumentException`. + - An `OutOfMemoryError` is a special kind of exception. Unlike for normal errors, the stack frame (implicitly) throwing the exception is only sometimes indicative of the root cause of the problem. + - A `StackOverflowError` is a special kind of exception, whose unique characteristic is a repeating pattern of stack frames near the top of the stack trace. +* UI Freeze: A UI freeze is caused by a long-running operation or even a deadlock on the UI thread. +* Third-Party Error: Third-party errors are reports that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault. +* Third-Party UI Freeze: Third-Party UI Freezes are UI freezes that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault. + +## Plugin ID {#attr_plugin_id} + +* Description: The ID of the Eclipse plugin that threw the exception. 
+* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.plugin.id, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +occurences.max.pi <- 500 +pis <- data.frame(table(myincidents$pluginId)) +pis <- pis[order(-pis$Freq),] +pis.top <- pis[pis[,c('Freq')] >= occurences.max.pi,] +t <- lapply(pis.top$Var1, function(x) paste('* ', x, ' (count: ', pis.top[pis.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the most impacted plugin IDs in the dataset: + +```{r attr.plugin.id.plot, echo=FALSE, cache=TRUE} +ggplot(head(pis.top, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("Plugin IDs") + ggtitle("Distribution of most impacted plugin IDs in dataset") +``` + + +## Plugin version {#attr_plugin_version} + +* Description: The version of the Eclipse plugin that threw the exception. +* Type: Factors + +```{r attr.pluginversion.init, echo=F, cache=TRUE} +occurences.max.pv <- 500 +mypvs <- data.frame(table(myincidents$pluginVersion)) +mypvs <- mypvs[order(-mypvs$Freq),] +mypvs.top <- mypvs[mypvs[,c('Freq')] >= occurences.max.pv,] +``` + +There are `r nrow(mypvs)` different values found in the dataset for this attribute. The following bar plot only displays the values with at least `r occurences.max.pv` occurrences: + +```{r attr.pluginversion.plot, echo=FALSE, message=FALSE, cache=TRUE} +mypvs.df <- as.data.frame(mypvs.top) +ggplot(head(mypvs.df, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("Plugin version") + ggtitle("Distribution of top Eclipse plugin versions in dataset") +``` + + + +## Status fingerprint {#attr_status_fingerprint} + +* Description: An identifier for the status of the incident. Used for [duplicate detection](https://www.codetrails.com/error-analytics/manual/features/server/duplicate-detection.html). 
+* Type: String + + +## Incident fingerprint {#attr_fingerprint} + +* Description: An identifier for the incident. Used for [duplicate detection](https://www.codetrails.com/error-analytics/manual/features/server/duplicate-detection.html). +* Type: String + + +## Incident fingerprint2 {#attr_fingerprint2} + +* Description: An identifier for the incident. Used for [duplicate detection](https://www.codetrails.com/error-analytics/manual/features/server/duplicate-detection.html). +* Type: String + + +## Timestamp {#timestamp} + +* Description: The time of creation of the incident. +* Type: Date (ISO8601) + +```{r attr.ts, echo=FALSE, cache=TRUE} +myp.xts.ts <- xts(x = data.frame(c = rep.int(1,nrow(myincidents))), order.by = parse_iso_8601(myincidents$timestamp)) +``` + +Dates range from `r first(index(myp.xts.ts))` to `r last(index(myp.xts.ts))`. + +```{r attr.ts.plot, echo=FALSE, cache=TRUE} +#xts.ts <- as.xts(apply.weekly(myp.xts.ts, sum)) +xts.ts <- apply.weekly(myp.xts.ts, sum) +autoplot(xts.ts, geom='line') + + theme_bw() + ylab("Incidents Timestamp") + ggtitle("Weekly number of Incidents Timestamp") +``` + + +## Saved On {#attr_saved_on} + +* Description: The time of last save of the incident. +* Type: Date (ISO8601) + +```{r attr.savedOn, echo=FALSE, cache=TRUE} +myp.xts.savedOn <- xts(x = data.frame(c = rep.int(1,nrow(myincidents))), order.by = parse_iso_8601(myincidents$savedOn)) +``` + +Dates range from `r first(index(myp.xts.savedOn))` to `r last(index(myp.xts.savedOn))`. + +```{r attr.savedOn.plot, echo=FALSE, cache=TRUE} +xts.savedOn <- as.xts(apply.weekly(myp.xts.savedOn, sum)) +autoplot(xts.savedOn, geom='line') + + theme_bw() + ylab("Incidents SavedOn") + ggtitle("Weekly number of Incidents SavedOn") +``` + + +## OSGi Architecture {#attr_osgi_arch} + +* Description: The architecture of the host, as specified in the OSGi bundle definition. 
+* Type: Factors + +Possible values found in the dataset for this attribute are: + +```{r attr.osgi.arch, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +archs <- table(myincidents$osgiArch) +archs <- archs[order(archs, decreasing = TRUE)] +t <- lapply(names(archs), function(x) paste('* ', x, ' (count: ', archs[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Distribution of architectures: + +```{r osgiArch, echo=FALSE, message=FALSE, cache=TRUE} +archs.df <- as.data.frame(archs) +ggplot(head(archs.df, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("OSGi Architecture") + ggtitle("Distribution of OSGi Architectures in dataset") +``` + + +## OSGi OS {#attr_osgi_os} + +* Description: The host operating system, as reported in OSGi. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.osgi.os, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +oses <- table(myincidents$osgiOs) +oses <- oses[order(oses, decreasing = TRUE)] +t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the various operating systems used in the dataset: + +```{r attr.osgi.os.plot, echo=FALSE, cache=TRUE} +oses.df <- as.data.frame(oses) +ggplot(head(oses.df, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("OSGi Operating System") + ggtitle("Distribution of OSGi OS in dataset") +``` + + +## OSGi OS Version {#attr_osgi_os_version} + +* Description: The host operating system version, as reported in OSGi. 
+* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.osgi.os.version, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +occurences.max.osv <- 500 +oses <- data.frame(table(myincidents$osgiOsVersion)) +oses <- oses[order(-oses$Freq),] +oses.top <- oses[oses[,c('Freq')] >= occurences.max.osv,] +t <- lapply(oses.top$Var1, function(x) paste('* ', x, ' (count: ', oses.top[oses.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the various operating system versions used in the dataset: + +```{r attr.osgi.os.version.plot, echo=FALSE, cache=TRUE} +ggplot(head(oses.top, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("OSGi Operating System Version") + ggtitle("Distribution of most used OSGi OS versions in dataset") +``` + + +## OSGi Window Manager {#attr_osgi_ws} + +* Description: The Window Manager used by the host, as reported in OSGi. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.osgi.ws, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +oses <- table(myincidents$osgiWs) + +# We don't want an empty col name. 
+names(oses)[names(oses) == ''] <- 'UNKNOWN' + +oses <- oses[order(oses, decreasing = TRUE)] +t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the various Window managers used in the dataset: + +```{r attr.osgi.ws.plot, echo=FALSE, cache=TRUE} +oses.df <- as.data.frame(oses) +ggplot(head(oses.df, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("OSGi Window Managers") + ggtitle("Distribution of OSGi Window managers in dataset") +``` + +## Eclipse Build ID {#attr_eclipse_build_id} + +* Description: The Build ID of the Eclipse instance running when the exception occurred. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.eb.id, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +occurences.max.ebi <- 500 +ebis <- data.frame(table(myincidents$eclipseBuildId)) +ebis <- ebis[order(-ebis$Freq),] +ebis.top <- ebis[ebis[,c('Freq')] >= occurences.max.ebi,] +t <- lapply(ebis.top$Var1, function(x) paste('* ', x, ' (count: ', ebis.top[ebis.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the most used Eclipse Build IDs in the dataset: + +```{r attr.eb.id.plot, echo=FALSE, cache=TRUE} +ggplot(head(ebis.top, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("Eclipse Builds") + ggtitle("Distribution of most used Eclipse Build IDs in dataset") +``` + +## Eclipse Product {#attr_eclipse_product} + +* Description: The Eclipse product impacted by the exception. 
+* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.ep, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +occurences.max.eps <- 500 +eps <- data.frame(table(myincidents$eclipseProduct)) +eps <- eps[order(-eps$Freq),] +``` + +There are `r nrow(eps)` different values found in the dataset for this attribute. The following list and bar plot only display the values with at least `r occurences.max.eps` occurrences: + +```{r attr.ep.table, message=FALSE, echo=FALSE, warning=FALSE, results='asis', cache=TRUE} +eps.top <- eps[eps[,c('Freq')] >= occurences.max.eps,] +t <- lapply(eps.top$Var1, function(x) paste('* ', x, ' (count: ', eps.top[eps.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the most used Eclipse Products in the dataset: + +```{r attr.ep.plot, echo=FALSE, cache=TRUE} +ggplot(head(eps.top, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("Eclipse Products") + ggtitle("Distribution of most used Eclipse Products in dataset") +``` + + + +## Java runtime version {#attr_javaruntime} + +* Description: The Java runtime of the host. +* Type: Factors + +```{r jrv.kable.init, echo=F, cache=TRUE} +occurences.max.jrv <- 500 +myjrvs <- data.frame(table(myincidents$javaRuntimeVersion)) +myjrvs <- myjrvs[order(-myjrvs$Freq),] +myjrvs.top <- myjrvs[myjrvs[,c('Freq')] >= occurences.max.jrv,] +``` + +There are `r nrow(myjrvs)` different values found in the dataset for this attribute. 
The following bar plot only displays the values with at least `r occurences.max.jrv` occurrences: + +```{r jrv.kable, eval=FALSE, include=FALSE, results='asis', cache=TRUE} +kable(data.frame(myjrvs.top), row.names = F) %>% + kable_styling(full_width = T, latex_options = c("striped", "hold_position")) +``` + +```{r jrv.plot, echo=FALSE, message=FALSE, cache=TRUE} +myjrvs.df <- as.data.frame(myjrvs.top) +ggplot(head(myjrvs.df, 30), aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_tufte() + xlab("Java runtime version") + ggtitle("Distribution of top Java runtime versions in dataset") +``` + +## Comment Quality + +* Description: An estimate of the user comment's quality (thoughtfulness). User comments help people better understand the context of the exception. +* Type: Factors + +```{r attr.cq.init, warning=FALSE, echo=FALSE, cache=TRUE} +mysum <- summary(myincidents$commentQuality) +``` + +Distribution: + +* HIGH `r mysum[[c('HIGH')]]` +* MEDIUM `r mysum[[c('MEDIUM')]]` +* LOW `r mysum[[c('LOW')]]` +* UNKNOWN `r mysum[[c('UNKNOWN')]]` + + +# Using the dataset + +## Reading CSV file + +Reading file from `r file.in`. + +``` +myincidents <- read.csv(file.in, header=T) +myincidents[,c("bug", "status")] <- NULL +``` + +There are ``r ncol(myincidents)`` columns and ``r nrow(myincidents)`` entries in this dataset: + +```{r examples.ncol, echo=T} +ncol(myincidents) +``` + +```{r examples.nrow, echo=T} +nrow(myincidents) +``` + +Names of columns: + +```{r examples.names, echo=T} +names(myincidents) +``` + + +## Using time series (xts) + +The dataset needs to be converted to an `xts` object. We can use either of the two dates: `timestamp` or `savedOn`. + +``` +require(xts) +myp.xts <- xts(x = myincidents, order.by = parse_iso_8601(myincidents$savedOn)) +``` + +## Plot time series + +Plot the number of weekly saves (attribute savedOn). 
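+
+A minimal sketch of how the weekly series could be computed before plotting, assuming `myincidents` has been read as above. The names `saved`, `ones.xts` and `weekly.xts` are illustrative only and do not appear elsewhere in this document:
+
+```r
+require(xts)
+require(parsedate)
+require(ggplot2)
+
+# Keep only the incidents that carry a savedOn date.
+saved <- myincidents[myincidents$savedOn != '', ]
+
+# One observation of value 1 per incident, indexed by its savedOn date.
+ones.xts <- xts(x = rep.int(1, nrow(saved)), order.by = parse_iso_8601(saved$savedOn))
+
+# Summing the ones per week gives the weekly number of saves.
+weekly.xts <- apply.weekly(ones.xts, sum)
+autoplot(weekly.xts, geom = 'line') + theme_bw() + ylab("Incidents saved per week")
+```
+
+Counting with a column of ones keeps the aggregation generic: swapping `apply.weekly` for `apply.monthly` would give monthly totals with no other change.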
+ +```{r use.ts.plot, echo=FALSE, cache=TRUE} +autoplot(xts.savedOn, geom='line') + + theme_bw() + ylab("Incidents SavedOn") + ggtitle("Weekly number of Incidents SavedOn") +``` + diff --git a/website/static/aeri_stacktraces/problems_analysis.pdf b/website/static/aeri_stacktraces/problems_analysis.pdf new file mode 100644 index 0000000000000000000000000000000000000000..76aedb436fbcfca6fbdbbc96736927e18aa15315 Binary files /dev/null and b/website/static/aeri_stacktraces/problems_analysis.pdf differ diff --git a/website/static/aeri_stacktraces/problems_analysis.rmd b/website/static/aeri_stacktraces/problems_analysis.rmd new file mode 100644 index 0000000000000000000000000000000000000000..987aee11131751da57017b4cf2346dd77a27ccf6 --- /dev/null +++ b/website/static/aeri_stacktraces/problems_analysis.rmd @@ -0,0 +1,571 @@ +--- +title: "StackTraces -- Problems" +subtitle: "R Analysis document" +author: "Boris Baldassari -- Castalia Solutions" +output: + pdf_document: + toc: yes + toc_depth: 3 + keep_tex: true + extra_dependencies: + - grffile + html_document: + toc: yes + toc_depth: 2 + word_document: + toc: yes + toc_depth: '2' +--- + +```{r init, message=FALSE, echo=FALSE} +library(ggplot2) +library(ggthemes) +library(knitr) +library(kableExtra) +library(parsedate) +library(magrittr) + +# Read csv file +file.in <- "../problems_extract.csv" +myproblems <- read.csv(file.in, header=T) + +# Create xts object +require(xts) +myp.xts <- xts(x = myproblems, order.by = parse_iso_8601(myproblems$createdOn)) +``` + +# Introduction + +## About this dataset + +The [Automated Error Reporting](https://wiki.eclipse.org/EPP/Logging) (AERI) system retrieves [information about exceptions](https://www.codetrails.com/error-analytics/manual/). It is installed by default in the [Eclipse IDE](http://www.eclipse.org/ide/) and has helped hundreds of projects better support their users and resolve bugs. 
+ +This dataset is a dump of all records over a couple of years, with useful information about the exceptions and environment. + +* **Generated date**: `r date()` +* **First date**: `r first(index(myp.xts))` +* **Last date**: `r last(index(myp.xts))` +* **Number of problems**: `r nrow(myp.xts)` +* **Number of attributes**: `r ncol(myp.xts)` + +## Terminology + +* **Incidents** When an exception occurs and is trapped by the AERI system, it constitutes an incident (or error report). An incident can be reported by several different people, can be reported multiple times, and can be linked to different environments. +* **Problems** As soon as an error report arrives on the server, it will be analyzed and subsequently assigned to one or more problems. A problem thus represents a set of (similar) error reports which usually have the same root cause – for example a bug in your software. (Extract from the [AERI system documentation](https://www.codetrails.com/error-analytics/manual/concepts/error-reports-problems-bugs-projects.html)) + +This dataset targets only the Problems of the AERI dataset. There is another dedicated document for the Incidents. + +## Privacy concerns + +We value privacy and intend to do everything we can to prevent misuse of the dataset. If you think we failed somewhere in the process, please [let us know](https://www.crossminer.org/contact) so we can do better. + +The AERI system itself doesn't gather much private information, and takes great care of it. This dataset goes a step further and removes all identifiable information. + +* There is **no email address** in this dataset, **nor any UUID**. +* People not willing to share their traces with the AERI system can tick the private option. This choice has been respected, and all classes that do not belong to a public hierarchy have been hidden thanks to an anonymisation mechanism. + +The anonymisation technique used basically encrypts information and then throws away the private key. 
Please refer to the [documentation published on GitHub](https://github.com/borisbaldassari/data-anonymiser) for more details. + + +## About this document + +This document is an [R Markdown document](http://rmarkdown.rstudio.com) and is composed of both text (like this one) and dynamically computed information (mostly in the Analysis section below) executed on the data itself. This ensures that the documentation is always synchronised with the data, and serves as a test suite for the dataset. + + +# Structure of data + +The plugin collects a [lot of useful information](https://www.codetrails.com/error-analytics/manual/misc/sent-data.html). We only use a subset of it, as required by research interest and privacy protection concerns. + +The Problems dataset comes in two flavours: `All problems`, in JSON format, and `Problems extract`, in CSV format. + +## All problems (JSON) + +**All problems** is the most complete dataset, with all attributes, stacktraces and bundles. Since the stacktraces and bundles structures are too complex for CSV, only the JSON export contains them. The dataset comes as a quite large compressed archive, with one JSON file per problem. This represents a total of `r nrow(myproblems)` files (problems). 
+ +The structure of a problem file is exemplified below: + + { + "summary": "NoStackTrace in RedeliveryErrorHandler.logFailedDelivery", + "kind": "NORMAL", + "v1status", + "osgiArch": "x86_64", + "osgiOs": "MacOSX", + "osgiOsVersion": "10.9.4", + "osgiWs": "cocoa", + "createdOn": "2014-09-14T05:39:21.554Z", + "modifiedOn": "2014-09-14T05:39:21.554Z", + "savedOn": "2016-05-23T07:22:10.479Z", + "eclipseBuildId": "4.4.0.I20140606-1215", + "eclipseProduct": "org.eclipse.epp.package.standard.product", + "javaRuntimeVersion": "1.8.0-b132", + "numberOfIncidents": 0, + "numberOfReporters": 74, + "products": [ + { product }, + { product } + ], + "bundles": [ + { bundle }, + { bundle } + ], + "stacktraces": [ + [ "stacktrace for incident" ], + [ "stacktrace for cause" ], + [ "stacktrace for exception" ] + ] + } + +The structure used in MongoDB for stacktraces has been kept as is: each line of the stacktrace is an object whose fields hold all the information relevant to that line. Each stacktrace is an array of such objects, as shown below: + + [ + { + "cN": "sun.net.www.http.HttpClient", + "mN": "parseHTTPHeader", + "fN": "HttpClient.java", + "lN": 786 + } + ] + +Bundles have the following format: + + { + "bundleFrequency": 1, + "bundleName": "org.eclipse.egit.core", + "bundleVersion": "4.1.1.201511131810-r" + } + +Products have the following format: + + { + "buildId": "4.5.2.M20160212-1500", + "frequency": 3, + "productId": "org.eclipse.epp.package.jee.product" + } + + +## Problems extract (CSV) + +The **Problems extract** CSV dataset provides the same information as the full JSON dataset, excluding complex structures that cannot be easily formatted in CSV: stacktraces, bundles, products. + +Attributes are: ``r names(myproblems)``. + +Examples are provided at the end of this file to demonstrate how to use it in R. + + +# Attributes + +## Summary + +* Description: A short text summarising the error. 
+* Type: String + +## Number of reporters {#attr_number_of_reporters} + +* Description: The number of people who reported this incident or problem. +* Type: Integer + +```{r numberRep.init, warning=FALSE, echo=FALSE} +mysum <- summary(myproblems$numberOfReporters) +``` + +Statistical summary: + +* Range [ `r mysum[[1]]` : `r format(mysum[[6]], scientific = FALSE)` ] +* 1st Quartile `r mysum[[2]]` +* Median `r mysum[[3]]` +* Mean `r format(mysum[[4]], scientific = FALSE)` +* 3rd Quartile `r format(mysum[[5]], scientific = FALSE)` + + +## Number of incidents {#attr_number_of_incidents} + +* Description: The number of times this problem was identified in incidents. +* Type: Integer + +```{r numberInc.init, warning=FALSE, echo=FALSE} +mysum <- summary(myproblems$numberOfIncidents) +``` + +Statistical summary: + +* Range [ `r mysum[[1]]` : `r format(mysum[[6]], scientific = FALSE)` ] +* 1st Quartile `r mysum[[2]]` +* Median `r mysum[[3]]` +* Mean `r format(mysum[[4]], scientific = FALSE)` +* 3rd Quartile `r format(mysum[[5]], scientific = FALSE)` +* NAs `r mysum[[7]]` + +## V1 Status {#attr_status} + +* Description: The status of the problem attached to the error report. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.status, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +statuses <- table(myproblems$v1status) +t <- lapply(names(statuses), function(x) paste('* ', x, ' (count: ', statuses[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Note: The name of this attribute in the original file is `v1status`. + +## Kind {#attr_kind} + +* Description: The type of error recorded, as identified by the AERI system. 
+* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.kind, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +kinds <- table(myproblems$kind) +t <- lapply(names(kinds), function(x) paste('* ', x, ' (count: ', kinds[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +**Notes** + +There are different kinds of incidents described in the [official documentation](https://www.codetrails.com/error-analytics/manual/concepts/incident-kinds.html): + +* Normal Error: Normal errors are all exceptions that were reported by a client but that are not of a kind defined below. Common examples of a normal error are a `NullPointerException` or `IllegalArgumentException`. + - An `OutOfMemoryError` is a special kind of exception. Unlike for normal errors, the stack frame (implicitly) throwing the exception is only sometimes indicative of the root cause of the problem. + - A `StackOverflowError` is a special kind of exception, whose unique characteristic is a repeating pattern of stack frames near the top of the stack trace. +* UI Freeze: A UI freeze is caused by a long-running operation or even a deadlock on the UI thread. +* Third-Party Error: Third-party errors are reports that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault. +* Third-Party UI Freeze: Third-Party UI Freezes are UI freezes that were received by the Codetrails Error Analytics Server, which deemed neither the configured projects nor their dependencies at fault. + + +## Created On {#attr_created_on} + +* Description: The time of first appearance of the problem in an incident. +* Type: Date (ISO8601) + +```{r attr.createdOn, echo=FALSE} +myp.xts.createdOn <- xts(x = data.frame(c = rep.int(1,nrow(myproblems))), order.by = parse_iso_8601(myproblems$createdOn)) +``` + +Dates range from `r first(index(myp.xts.createdOn))` to `r last(index(myp.xts.createdOn))`. 
+ +```{r attr.createdOn.plot} +xts.createdOn <- as.xts(apply.weekly(myp.xts.createdOn, sum)) +autoplot(xts.createdOn, geom='line') + + theme_bw() + ylab("Problems CreatedOn") + ggtitle("Weekly number of Problems CreatedOn") +``` + + +## Modified On {#attr_modified_on} + +* Description: The time of last update of the problem in an incident. +* Type: Date (ISO8601) + +```{r attr.modifiedOn, echo=FALSE} +myp.xts.modifiedOn <- xts(x = data.frame(c <- rep.int(1,nrow(myproblems))), order.by = parse_iso_8601(myproblems$modifiedOn)) +``` + +Dates range from `r first(index(myp.xts.modifiedOn))` to `r last(index(myp.xts.modifiedOn))`. + +```{r attr.modifiedOn.plot} +xts.modifiedOn <- as.xts(apply.weekly(myp.xts.modifiedOn, sum)) +autoplot(xts.modifiedOn, geom='line') + + theme_bw() + ylab("Problems ModifiedOn") + ggtitle("Weekly number of Problems ModifiedOn") +``` + + +## Saved On {#attr_saved_on} + +* Description: The time of last save of the problem. +* Type: Date (ISO8601) + +```{r attr.savedOn, echo=FALSE} +myp.xts.savedOn <- xts(x = data.frame(c <- rep.int(1,nrow(myproblems))), order.by = parse_iso_8601(myproblems$savedOn)) +``` + +Dates range from `r first(index(myp.xts.savedOn))` to `r last(index(myp.xts.savedOn))`. + +```{r attr.savedOn.plot} +xts.savedOn <- as.xts(apply.weekly(myp.xts.savedOn, sum)) +autoplot(xts.savedOn, geom='line') + + theme_bw() + ylab("Problems SavedOn") + ggtitle("Weekly number of Problems SavedOn") +``` + + +## OSGi Architecture {#attr_osgi_arch} + +* Description: The architecture of the host, as specified in the OSGi bundle definition. 
+* Type: Factors + +Possible values found in the dataset for this attribute are: + +```{r attr.osgi.arch, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +archs <- table(myproblems$osgiArch) +t <- lapply(names(archs), function(x) paste('* ', x, ' (count: ', archs[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Distribution of architectures: + +```{r osgiArch, echo=FALSE, message=FALSE} +archs.df <- as.data.frame(archs) +ggplot(archs.df, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("OSGi Architecture") + ggtitle("Repartition of OSGi Architectures in dataset") +``` + + +## OSGi OS {#attr_osgi_os} + +* Description: The host operating system, as reported in OSGi. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.osgi.os, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +oses <- table(myproblems$osgiOs) +oses <- oses[order(oses, decreasing = TRUE)] +t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the various operating systems used in the dataset: + +```{r attr.osgi.os.plot, echo=FALSE} +oses.df <- as.data.frame(oses) +ggplot(oses.df, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("OSGi Operating System") + ggtitle("Repartition of OSGi OS in dataset") +``` + + +## OSGi OS Version {#attr_osgi_os_version} + +* Description: The host operating system version, as reported in OSGi.
+* Type: Factors + +The most frequent values found in the dataset for this attribute are: + +```{r attr.osgi.os.version, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +occurences.max.osv <- 500 +oses <- data.frame(table(myproblems$osgiOsVersion)) +oses <- oses[order(-oses$Freq),] +oses.top <- oses[oses[,c('Freq')] >= occurences.max.osv,] +t <- lapply(oses.top$Var1, function(x) paste('* ', x, ' (count: ', oses.top[oses.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the most frequent operating system versions used in the dataset: + +```{r attr.osgi.os.version.plot, echo=FALSE} +ggplot(oses.top, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("OSGi Operating System Version") + ggtitle("Repartition of most used OSGi OS versions in dataset") +``` + + +## OSGi Window Manager {#attr_osgi_ws} + +* Description: The Window Manager used by the host, as reported in OSGi. +* Type: Factors + +The possible values found in the dataset for this attribute are: + +```{r attr.osgi.ws, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +oses <- table(myproblems$osgiWs) +oses <- oses[order(oses, decreasing = TRUE)] +t <- lapply(names(oses), function(x) paste('* ', x, ' (count: ', oses[[x]], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +Visualisation of the various Window managers used in the dataset: + +```{r attr.osgi.ws.plot, echo=FALSE} +oses.df <- as.data.frame(oses) +ggplot(oses.df, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("OSGi Window Managers") + ggtitle("Repartition of OSGi Window managers in dataset") +``` + + +## Eclipse Product {#attr_eclipse_product} + +* Description: The Eclipse product impacted by the exception.
+* Type: Factors + +```{r attr.ep.init, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +occurences.max.ep <- 500 +eps <- data.frame(table(myproblems$eclipseProduct)) +eps <- eps[order(-eps$Freq),] +``` + +There are `r nrow(eps)` different values found in the dataset for this attribute. The following list and bar plot only display the values with at least `r occurences.max.ep` occurrences: + +```{r attr.ep, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +# Keep only the products at or above the threshold defined for this attribute +eps.top <- eps[eps[,c('Freq')] >= occurences.max.ep,] +t <- lapply(eps.top$Var1, function(x) paste('* ', x, ' (count: ', eps.top[eps.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +```{r attr.ep.plot, echo=FALSE} +ggplot(eps.top, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("Eclipse Products") + ggtitle("Repartition of most used Eclipse Products in dataset") +``` + + +## Eclipse Build ID {#attr_eclipse_build_id} + +* Description: The Build ID of the Eclipse instance running when the exception occurred. +* Type: Factors + +```{r attr.eb.id.init, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +occurences.max.ebi <- 500 +ebis <- data.frame(table(myproblems$eclipseBuildId)) +ebis <- ebis[order(-ebis$Freq),] +``` + +There are `r nrow(ebis)` different values found in the dataset for this attribute.
The following list and bar plot only display the values with at least `r occurences.max.ebi` occurrences: + +```{r attr.eb.id, message=FALSE, echo=FALSE, warning=FALSE, results='asis'} +# Keep only the build IDs at or above the threshold defined for this attribute +ebis.top <- ebis[ebis[,c('Freq')] >= occurences.max.ebi,] +t <- lapply(ebis.top$Var1, function(x) paste('* ', x, ' (count: ', ebis.top[ebis.top$Var1 == x,c("Freq")], ")", sep='')) +t <- paste(t, collapse="\n") +cat(t) +``` + +```{r attr.eb.id.plot, echo=FALSE} +ggplot(ebis.top, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("Eclipse Builds") + ggtitle("Repartition of most used Eclipse Build IDs in dataset") +``` + + +## Java runtime version {#attr_javaruntime} + +* Description: The Java runtime of the host. +* Type: Factors + +```{r jrv.kable.init} +occurences.max.jrv <- 500 +myjrvs <- data.frame(table(myproblems$javaRuntimeVersion)) +myjrvs <- myjrvs[order(-myjrvs$Freq),] +myjrvs.top <- myjrvs[myjrvs[,c('Freq')] >= occurences.max.jrv,] +``` + +There are `r nrow(myjrvs)` different values found in the dataset for this attribute. The following bar plot only displays the values with at least `r occurences.max.jrv` occurrences: + +```{r jrv.kable, eval=FALSE, include=FALSE, results='asis'} +kable(data.frame(myjrvs.top), row.names = F) %>% + kable_styling(full_width = T, latex_options = c("striped", "hold_position")) +``` + +```{r jrv.plot, echo=FALSE, message=FALSE} +myjrvs.df <- as.data.frame(myjrvs.top) +ggplot(myjrvs.df, aes(x=reorder(Var1, Freq), y=Freq)) + geom_bar(stat='identity') + coord_flip() + + theme_minimal() + xlab("Java runtime version") + ggtitle("Repartition of top Java runtime versions in dataset") +``` + + +# Using the dataset + +## Reading CSV file + +Reading file from `r file.in`.
+ +```{r examples.init, echo=T} +myproblems <- read.csv(file.in, header=T) +myproblems[,c("bug", "status")] <- NULL +``` + +There are ``r ncol(myproblems)`` columns and ``r nrow(myproblems)`` entries in this dataset: + +```{r examples.ncol, echo=T} +ncol(myproblems) +``` + +```{r examples.nrow, echo=T} +nrow(myproblems) +``` + +Names of columns: + +```{r examples.names, echo=T} +names(myproblems) +``` + + +## Using time series (xts) + +The dataset needs to be converted to an `xts` object. Any of the three date attributes (`createdOn`, `modifiedOn`, `savedOn`) can serve as the index; here we use `createdOn`: + +``` +require(xts) +myp.xts <- xts(x = myproblems, order.by = parse_iso_8601(myproblems$createdOn)) +``` + + +## Raw Reporters + +Let's plot the number of reporters for each error report on a timeline. + +```{r xts.plot.reporters} +xts.reporters <- xts(as.integer(myp.xts[,c("numberOfReporters")]), order.by = index(myp.xts)) +autoplot(xts.reporters, geom='line') + + theme_minimal() + ylab("Number of Reporters") + xlab("Time") + ggtitle("Raw number of distinct reporters") +``` + +## Weekly reporters + +The previous plot used the `xts` object as is, with one point for each error report. On a timeline this can be misleading when several submissions arrive within a short period of time, compared to sparse time ranges. We'll use the `apply.weekly` function from `xts` to aggregate the values into weekly totals. + +Summing the `numberOfReporters` attribute over each week gives the following plot: + +```{r xts.weekly.reporters} +xts.reporters.weekly <- as.xts(apply.weekly(xts.reporters, sum)) +autoplot(xts.reporters.weekly, geom='line') + + theme_minimal() + ylab("Number of Reporters") + xlab("Time") + ggtitle("Weekly number of distinct reporters") +``` + +## Raw Number of Incidents + +Let's plot the number of incidents for each error report on a timeline.
+ +```{r xts.plot.incidents} +xts.incidents <- xts(as.integer(myp.xts[,c("numberOfIncidents")]), order.by = index(myp.xts)) +autoplot(xts.incidents, geom='line') + + theme_minimal() + ylab("Number of Incidents") + xlab("Time") + ggtitle("Raw number of reported incidents") +``` + + +## Weekly Number of Incidents + +The previous plot used the `xts` object as is, with one point for each error report. On a timeline this can be misleading when several submissions arrive within a short period of time, compared to sparse time ranges. We'll use the `apply.weekly` function from `xts` to aggregate the values into weekly totals. + +Summing the `numberOfIncidents` attribute over each week gives the following plot: + +```{r xts.weekly.incidents} +xts.incidents.weekly <- as.xts(apply.weekly(xts.incidents, sum)) +autoplot(xts.incidents.weekly, geom='line') + + theme_minimal() + ylab("Number of Incidents") + xlab("Time") + ggtitle("Weekly number of reported incidents") +``` + + +## Scatter plot + +The following scatter plot compares the number of incidents reported with the number of distinct reporters. + +```{r numberInc.qplot, warning=FALSE} +qplot(myproblems$numberOfReporters, myproblems$numberOfIncidents) + + theme_minimal() + ylab("Number of Incidents") + + xlab("Number of Reporters") + ggtitle("Number of Reporters vs. Number of Incidents reported to AERI") +``` +
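
## Weekly aggregation on synthetic data

The weekly aggregation used in the sections above can be tried without the CSV extract. The following is a minimal sketch, assuming only that `xts` is installed: the `toy` data frame and its values are made up for illustration and are not taken from the dataset, and base R `as.POSIXct` stands in for `parse_iso_8601` so that the block has no other dependency.

```r
library(xts)

# Synthetic records (hypothetical values, not from the AERI extract):
# three fall in the same calendar week, one in the following week.
toy <- data.frame(
  createdOn = c("2018-01-02T10:00:00Z", "2018-01-03T11:30:00Z",
                "2018-01-04T09:15:00Z", "2018-01-10T16:45:00Z"),
  numberOfIncidents = c(3L, 1L, 2L, 5L)
)

# Parse the ISO 8601 timestamps with base R, in UTC
toy.time <- as.POSIXct(toy$createdOn, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Build the time series: one observation per record
toy.xts <- xts(toy$numberOfIncidents, order.by = toy.time)

# Sum the observations within each week, yielding one row per week
toy.weekly <- apply.weekly(toy.xts, sum)
print(toy.weekly)
```

Each row of `toy.weekly` is indexed by the timestamp of the last observation in its week, which is why the weekly plots place their points at irregular dates.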