---
title: "cmpsR-vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{cmpsR-vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "figures/vignette-",
  echo=FALSE,
  warning = FALSE,
  message = FALSE,
  fig.width = 7,
  fig.height = 5,
  out.width = "80%"
)

library(tidyverse)
theme_set(theme_bw())
```


The `cmpsR` package is an implementation of the Congruent Matching Profile Segments (CMPS) method (Chen et al. 2019). In general, it can be used for objective comparison of striated tool marks, but in our examples, we mainly use it for bullet signatures comparison. The CMPS score is expected to be large if two signatures are similar. So it can also be considered as a feature that measures the similarity of two bullet signatures.

## Installation

You can install the released version of cmpsR from [CRAN](https://CRAN.R-project.org) with:

```{r, eval=FALSE, echo=TRUE}
install.packages("cmpsR")
```

And the development version from [GitHub](https://github.com/) with:

```{r, eval=FALSE, echo=TRUE}
# install.packages("devtools")
devtools::install_github("willju-wangqian/cmpsR")
```

## Example and Summary of the Algorithm

In this section we use a known match (KM) compasison (of two bullets) to illustrate the main ideas of CMPS algorithm and to showcase the `cmpsR` implementation.
The `cmpsR` package includes a simple data set of 12 bullet signatures generated from two bullets (each bullet has 6 bullet signatures). These bullet data come from the James Hamby Consecutively Rifled Ruger Barrel Study (Brundage 1998; Hamby, Brundage, and Thorpe 2009; Hamby et al. 2019), and the two bullets included in `cmpsR` are just a subset of Hamby set 252. 
These bullet data in their original format can also be found in Chapter 3.5 of [Open Forensic Science in R](https://sctyner.github.io/OpenForSciR/bullets.html#case-study-1).

```{r, echo=TRUE}
library(cmpsR)
data("bullets")
```

A comparison of two bullets is considered as a match if two bullets are fired from the same barrel (come from the same source). The gun barrel used in the Hamby study has 6 lands, and during the firing process striation marks will be engraved on the bullet by these lands. A bullet signature is a numerical representation of the striation marks engraved by a land. This is why each bullet can generate 6 bullet signatures. Two bullet signatures are a match if they are originally engraved by the same land in a gun barrel. Therefore, two bullets of a known-match comparison will have 36 pairwise bullet signature comparisons, and 6 of them are matching bullet signature comparisons while 30 of them are non-matching bullet signature comparisons. 

Here we plot the twelve bullet signatures of the two bullets. The bullet signatures are aligned so that the top figure and the bottom figure of the same column are a matching bullet signature comparison.

```{r signature_plot}
signatures <- bullets %>% unnest(sigs)
signatures <- signatures %>% mutate(
  bulletland = factor(bulletland, levels=c(paste(rep(c(1,2), each=6), c(1:6,2:6,1), sep="-")))
)
signatures %>% ggplot(aes(x = x/1000, y = sig)) + geom_line() + facet_wrap(~bulletland, ncol=6) +
  theme_bw() +
  xlab("Length in mm") +
  ylab("Relative height in micron")
```

To further illustrate the idea of the CMPS algorithm, let's consider one matching bullet signature comparison: bullet signature of bullet 1 land 2 and bullet signature of bullet 2 land 3 (the second column), and compute the CMPS score of this comparison: 

```{r, echo=TRUE}
library(cmpsR)
data("bullets")

x <- bullets$sigs[bullets$bulletland == "2-3"][[1]]$sig
y <- bullets$sigs[bullets$bulletland == "1-2"][[1]]$sig

cmps <- extract_feature_cmps(x, y, include = "full_result")
cmps$CMPS_score
```

And we have the plot of `x` and `y`.

<!-- # ```{r plot_example, fig.cap="A KM Comparison, x and y", out.width="80%", fig.keep="hold", fig.align='center'} -->

```{r plot_example, fig.cap="A KM Comparison, x and y", fig.align='center'}
# knitr::include_graphics("man/figures/step0.png")
signatures %>% filter(bulletland %in% c("2-3", "1-2")) %>% 
  ggplot(aes(x = x/1000, y = sig)) + geom_line(aes(color = bulletland)) +
  theme_bw() +
  xlab("Length in mm") +
  ylab("Relative height in micron") +
  labs(color = "bullet land")
```

#### Main Idea

The main idea of the CMPS method is that: 

1. we take the first signature as the comparison signature (`x` or bullet signature of "2-3") and cut it into consecutive and non-overlapping basis segments of the same length. In this case, we set the length of a basis segment to be 50 units, and we have 22 basis segments in total for bullet signature `x`.

```{r plot_cut_x, fig.cap="Cut x into consecutive and non-overlapping basis segments of the same length. Only 4 basis segments are shown here", out.width="80%", fig.keep="hold", fig.align='center'}
segments <- get_segs(x, len = 50)
nseg <- length(segments$segs)
df <- data.frame(value = unlist(segments$segs),
                 segs = rep(1:nseg, each = 50),
                 index = unlist(segments$index))
df$segs_tag <- paste("seg", df$segs)
df.partx <- data.frame(value = unlist(segments$segs),
                       segs = 0,
                       index = unlist(segments$index),
                       segs_tag = "x")
df.party <- data.frame(value = y,
                       segs = 0,
                       index = 1:length(y),
                       segs_tag = "y")
cutt <- sapply(segments$index, function(idx) {idx[1]})
rbind(df.partx, df) %>% filter(segs <= 4) %>% 
  ggplot() +
  geom_line(aes(x = index, y = value)) +
  geom_vline(xintercept = cutt, color = "red") +
  facet_grid(segs_tag ~ .) +
  xlab("position") +
  ylab("signature") +
  ggtitle("Cut Comparison Signature x into Consecutive and Non-overlapping Basis Segments")
```

2. for each basis segment, we compute the cross-correlation function (ccf) between the basis segment and the reference signature (`y` or bullet signature of "1-2")

```{r plot_y_and_seg, fig.cap="y and 7th basis segment", out.width="80%", fig.keep="hold", fig.align='center'}
seg3 <- df %>% filter(segs == 7)
seg3$segs <- "segment 7"
df.y <- data.frame(value = y, segs = "y", index = 1:length(y))
df.y <- rbind(df.y, seg3[,-4])
df.y$segs <- factor(df.y$segs, levels = c("y", "segment 7"))
df.y %>% filter(!is.na(value)) %>% 
  ggplot() +
  geom_line(aes(x = index, y = value, color = segs)) +
  labs(x = "position", y = "signature", color = "label") +
  ggtitle("Reference Signature y and the 7th Segment") +
  scale_color_manual(values = c("black", "red"))
```

```{r plot_ccf_y_seg, fig.cap="the cross-correlation function (ccf) between y and segment 7", out.width="80%", fig.keep="hold", fig.align='center'}
ccrpeaks <- get_ccr_peaks(y, segments = segments, 50, nseg = 7, npeaks = 1)
df.ccf <- data.frame(value = ccrpeaks$ccr$ccf, index = ccrpeaks$adj_pos)

df.ccf %>% ggplot() +
  geom_line(aes(index, value)) + 
  geom_vline(xintercept = ccrpeaks$peaks_pos, color = "red") +
  xlab("position") +
  ylab("ccf") +
  ggtitle("CCF of y and the 7th Basis Segment")
```

* for the `ccf` curve, the `position` represents the shift of the segment. A negative value means a shift to the left, a positive value means a shift to the right, and 0 means no shift (the segment stays at its original position in the reference signature);
* we are interested in the peaks in the ccf curve and the positions of those peaks (as indicated by the red vertical line in the plot above). In other words, if we shift the segment, which position would give us the "best fit"?

3. If two signatures are from a KM comparison, most of the basis segments should agree with each other on the position of the best fit. Then these segments are called the "**Congruent Matching Profile Segments (CMPS)**".

  Ideally, if two signatures are identical, we are expecting the position of the highest peak in the ccf curve remains the same across all ccf curves (we only show 7 segments here);

```{r plot_x_itself, fig.cap="ideal case: compare x to itself. The highest peak has value 1 and is marked by the blue dot", out.width="80%", fig.keep="hold", fig.align='center'}
comp <- x
npeaks <- 1
seg_outlength <- 50

ccf.df <- do.call(rbind, lapply(6:12, function(nss) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments, seg_outlength = seg_outlength, nseg = nss, npeaks = npeaks)
  df.ccf <- data.frame(value = ccrpeaks$ccr$ccf, index = ccrpeaks$adj_pos)
  df.ccf$segs <- nss
  df.ccf
}))
peak.df <- do.call(rbind, lapply(6:12, function(nss) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments, seg_outlength = seg_outlength, nseg = nss, npeaks = npeaks)
  df.ccf <- data.frame(value = ccrpeaks$peaks_heights, index = ccrpeaks$peaks_pos)
  df.ccf$segs <- nss
  df.ccf
}))

ccf.df %>% 
  ggplot() +
  geom_line(aes(x = index, y = value)) +
  geom_point(data = peak.df, aes(x = index, y = value),
             color = "blue") +
  geom_vline(xintercept = 0, color = "red") +
  facet_grid(segs ~ .) +
  # ggtitle("Real Case: x compares to y") +
  ggtitle("Ideal Case: x Compares to Itself") +
  xlab("position") +
  ylab("ccf")
```

  But in the real case, the basis segments might not achieve a final agreement, but we have the majority; 

```{r plot_real_xy, fig.cap="real case: compare x to y. The 5 highest peaks are marked by the blue dots", out.width="80%", fig.keep="hold", fig.align='center'}
comp <- y
npeaks <- 5
seg_outlength <- 50

ccf.df <- do.call(rbind, lapply(6:12, function(nss) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments, seg_outlength = seg_outlength, nseg = nss, npeaks = npeaks)
  df.ccf <- data.frame(value = ccrpeaks$ccr$ccf, index = ccrpeaks$adj_pos)
  df.ccf$segs <- nss
  df.ccf
}))
peak.df <- do.call(rbind, lapply(6:12, function(nss) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments, seg_outlength = seg_outlength, nseg = nss, npeaks = npeaks)
  df.ccf <- data.frame(value = ccrpeaks$peaks_heights, index = ccrpeaks$peaks_pos)
  df.ccf$segs <- nss
  df.ccf
}))

ccf.df %>% 
  ggplot() +
  geom_line(aes(x = index, y = value)) +
  geom_point(data = peak.df, aes(x = index, y = value),
             color = "blue") +
  geom_vline(xintercept = -6, color = "red") +
  facet_grid(segs ~ .) +
  ggtitle("Real Case: x compares to y") +
  # ggtitle("Ideal Case: x compares to itself") +
  xlab("position") +
  ylab("ccf")
```

  We mark the 5 highest peaks for each ccf curve because the position of the "highest peak" might not be the best one. 

4. each ccf curve votes for 5 candidate positions, then we ask two questions in order to obtain the CMPS number/score:

* which position receives the most votes? -> the best position (indicated by the red vertical line)

* how many segments have voted for the best position? -> CMPS score

  If we focus on these 7 segments only, and have a very short tolerance zone, the CMPS number is 6.
  
  (If we consider all 22 segments, and have a default tolerance zone (+/- 25 units), the CMPS number is 20.)

5. false positive: how can the segments vote more wisely? -> Multi Segment Lengths Strategy

* by increasing the segment length, one can reduce the number of "false positive" peaks.

* the first scale level is the original length of segment 7; for the second scale level, we double its length while keeping its center. That is, we include 25 more units from both the left and right side of the segment 7 to obtain a segment of 100 units length. For the third scale level, we double the segment length again to obtain a segment of length 200.

```{r plot_false_positive, fig.cap="Multi Segment Lengths Strategy - increasing the segment length could decrease the number of false positive peaks in ccf curves", out.width="80%", fig.keep="hold", fig.align='center'}
comp <- y
nseg <- 7
npeaks_set <- c(5, 3, 1)

multi.seg <- do.call(rbind, lapply(c(50, 100, 200), function(scale) {
  tt <- get_seg_scale(segments, nseg, out_length = scale)
  df.tmp <- data.frame(value = tt$aug_seg, index=tt$aug_idx, scale=paste("length of", scale))
}))

multi.seg$scale <- factor(multi.seg$scale, 
                          levels = c("length of 50", "length of 100", "length of 200"))

# rbind(data.frame(value = x, index = 1:length(x), scale = "x"), multi.seg) 
p1 <- multi.seg %>% filter(!is.na(value)) %>% 
  ggplot() +
  geom_line(aes(index, value)) +
  facet_grid(scale ~ .) + 
  ylab("signature") +
  xlab("position") +
  ggtitle("Segment 7 in 3 Different Scale Levels")

out_length <- c(50, 100, 200)

multi.df <- do.call(rbind, lapply(1:3, function(scale) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments,
                            seg_outlength = out_length[scale],
                            nseg = nseg, npeaks = npeaks_set[scale])
  df.tmp <- data.frame(value = ccrpeaks$ccr$ccf, index = ccrpeaks$adj_pos)
  df.tmp$scale <- paste("scale level", scale)
  df.tmp
}))
multi.peak.df <- do.call(rbind, lapply(1:3, function(scale) {
  ccrpeaks <- get_ccr_peaks(comp, segments = segments, seg_outlength = out_length[scale],
                            nseg = nseg, npeaks = npeaks_set[scale])
  df.tmp <- data.frame(value = ccrpeaks$peaks_heights, index = ccrpeaks$peaks_pos)
  df.tmp$scale <- paste("scale level", scale)
  df.tmp
}))
p2 <- multi.df %>% ggplot() +
  geom_line(aes(x=index, y=value)) +
  geom_point(data = multi.peak.df, aes(x = index, y = value),
           color = "blue") +
  geom_vline(xintercept = -7, color = "red") +
  facet_grid(scale ~ .) +
  scale_x_continuous(breaks=seq(-400,800,100)) +
  ylab("ccf") +
  xlab("position") +
  ggtitle("ccf of segment 7 and y in 3 different scales")

p1
p2
```

* we choose five peaks at scale level 1; three peaks at scale level 2; one peak at scale level 3

  the peak shared by all three scale levels is a **consistent correlation peak** (ccp). And the position of the ccp is our best choice. Sometimes a ccp might not be found. 
  Trying to identify a ccp for each basis segment is called a "multi segment lengths" strategy.
  
  The following plots (generated by `cmpsR::cmps_segment_plot`) summarize the information of the two above plots. It shows that segment 7 finds a consistent correlation peak (ccp) at a position near 0 (position `-6`).
  
    ```{r plot_ccf_seg7, echo=TRUE}
    cmps <- extract_feature_cmps(x, y, include = "full_result")
    cmps_plot_list <- cmpsR::cmps_segment_plot(cmps, seg_idx = 7)
    ggpubr::ggarrange(plotlist = unlist(cmps_plot_list, recursive = FALSE),
                      nrow = 3, ncol = 2)
    ```
  
* In this case, since segment 7 identifies a ccp, it casts a vote for position `-6`. Then we ask two questions:
    + which position receives the most votes (within a tolerance zone specified by `Tx`)?
    + how many segments have voted for this position? -> CMPS score
    
* by default, CMPS algorithm uses the multi-segment lengths strategy. Use `?cmpsR::extract_feature_cmps` to learn more about the function, including its default settings.

6. If we follow the procedure described above (using the multi-segment lengths strategy) and investigate all 22 basis segments, we can find that 18 of them have cast a vote for position `0` $\pm 25$ (since `Tx = 25` by default). Therefore, for this KM bullet signature comparison, the CMPS score is 18.

```{r, echo=TRUE, eval=TRUE}
cmps <- extract_feature_cmps(x, y, seg_length = 50, Tx = 25, 
                     npeaks_set = c(5, 3, 1), include = "full_result")
cmps$CMPS_score
```

* Segment 6 doesn't cast a vote. Take a look at the following plot to find out why.

    ```{r plot_ccf_seg6, echo=TRUE, eval=TRUE}
    cmps_plot_list <- cmpsR::cmps_segment_plot(cmps, seg_idx = 6)
    ggpubr::ggarrange(plotlist = unlist(cmps_plot_list, recursive = FALSE),
                      nrow = 3, ncol = 2)
    ```

  It doesn't identify a consistent correlation peak.

7. If we have a KNM (known non-match) comparison, e.g. compare bullet signature 2-3 with 1-3:

```{r, eval=TRUE, echo=TRUE}
land23 <- bullets$sigs[bullets$bulletland == "2-3"][[1]]
land13 <- bullets$sigs[bullets$bulletland == "1-3"][[1]]

cmps_knm <- extract_feature_cmps(land23$sig, land13$sig, seg_length = 50, Tx = 25, 
                     npeaks_set = c(5, 3, 1), include="full_result")
cmps_knm$CMPS_score
```

#### Full Comparison Between Two Bullets

`extract_feature_cmps()` can also be used in a pipeline fashion. The following code performs a full comparison of two bullets. That is, it evaluates all 36 pairwise bullet signature comparisons and computes the CMPS scores.

```{r eval = TRUE, echo=TRUE}
library(tidyverse)
library(cmpsR)

data("bullets")

lands <- unique(bullets$bulletland)

comparisons <- data.frame(expand.grid(land1 = lands[1:6], land2 = lands[7:12]),
                          stringsAsFactors = FALSE)

comparisons <- comparisons %>%
  left_join(bullets %>% select(bulletland, sig1=sigs),
            by = c("land1" = "bulletland")) %>%
  left_join(bullets %>% select(bulletland, sig2=sigs),
            by = c("land2" = "bulletland"))

comparisons <- comparisons %>% mutate(
  cmps = purrr::map2(sig1, sig2, .f = function(x, y) {
    extract_feature_cmps(x$sig, y$sig, include = "full_result")
  })
)

comparisons <- comparisons %>%
  mutate(
    cmps_score = sapply(comparisons$cmps, function(x) x$CMPS_score),
    cmps_nseg = sapply(comparisons$cmps, function(x) x$nseg)
  )
  
cp1 <- comparisons %>% select(land1, land2, cmps_score, cmps_nseg)
cp1  
```

The following plot summarizes the CMPS scores computed above.

```{r plot_all_pairwise}
comparisons <- comparisons %>%
  mutate(
    bulletA = gsub("(\\d)-\\d", "\\1", land1),
    landA = gsub("\\d-(\\d)", "\\1", land1),
    bulletB = gsub("(\\d)-\\d", "\\1", land2),
    landB = gsub("\\d-(\\d)", "\\1", land2)
  )

dframe <- comparisons %>% select(-sig1, -sig2)

cc.idx <- c(6,7, 14, 21, 28, 35)
dframe$samesource <- FALSE
dframe$samesource[cc.idx] <- TRUE

dframe <- dframe %>% mutate(
  landA = paste0("L", landA),
  landB = paste0("L", landB),
  landB = factor(landB, levels = paste0("L", c(2:6,1))),
  bulletA = paste0("Bullet ", bulletA),
  bulletB = paste0("Bullet ", bulletB)
)

dframe %>% ggplot(aes(x = landA, y = landB, fill = cmps_score)) + 
  geom_tile() + 
  geom_tile(aes(colour="same land"), fill=NA, data = dframe %>% filter(samesource), size=1) + 
  scale_fill_gradient2("CMPS score", low = "gray80", high = "darkorange", midpoint = 6) + 
  scale_colour_manual("Source", values="darkorange") +
  facet_grid(bulletB ~ bulletA) + xlab("Bullet1 Lands") + 
  ylab("Bullet2 Lands") + 
  geom_text(aes(label=cmps_score)) +
  theme_bw() +
  theme(aspect.ratio = 1)
```

## Reference

Chen, Zhe, Wei Chu, Johannes A Soons, Robert M Thompson, John Song, and Xuezeng Zhao. 2019. “Fired Bullet Signature Correlation Using the Congruent Matching Profile Segments (CMPS) Method.” Forensic Science International, December, #109964. doi.org/10.1016/j.forsciint.2019.109964.

Brundage, David J. 1998. “The Identification of Consecutively Rifled Gun Barrels.” AFTE Journal 30 (3): 438–44.

Hamby, James E., David J. Brundage, Nicholas D. K. Petraco, and James W. Thorpe. 2019. “A Worldwide Study of Bullets Fired From 10 Consecutively Rifled 9mm RUGER Pistol Barrels—Analysis of Examiner Error Rate.” Journal of Forensic Sciences 64 (2): 551–57. doi.org/10.1111/1556-4029.13916}.

Hamby, James E., David J. Brundage, and James W. Thorpe. 2009. “The Identification of Bullets Fired from 10 Consecutively Rifled 9mm Ruger Pistol Barrels: A Research Project Involving 507 Participants from 20 Countries.” AFTE Journal 41 (2): 99–110.