LCS

A strand of DNA consists of a string of molecules called bases, where the possible bases are adenine, guanine, cytosine, and thymine.

Representing each of these bases by its initial letter, we can express a strand of DNA as a string over the finite set $\{A, C, G, T\}$.

One reason to compare two strands of DNA is to determine how “similar” the two strands are, as some measure of how closely related the two organisms are.

String Similarity

We can, and do, define similarity in many ways. For example, we can say that two DNA strands are similar:

- If one is a substring of the other.
- Alternatively, we could say that two strands are similar if the number of changes needed to turn one into the other is small.
- Yet another way to measure the similarity of strands $S_{1}$ and $S_{2}$ is by finding a third strand $S_{3}$ in which the bases in $S_{3}$ appear in each of $S_{1}$ and $S_{2}$.
    1. These bases must appear in the same order, but not necessarily consecutively.
    2. The longer the strand $S_{3}$ we can find, the more similar $S_{1}$ and $S_{2}$ are.

Example:

$S_{1} = A C C G G T C G A G T G C G C G G A A G C C G G C C G A A$
$S_{2} = G T C G T T C G G A A T G C C G T T G C T C T G T A A A$
$S_{3} = G T C G T C G G A A G C C G G C C G A A$

Subsequence

We formalize this last notion of similarity as the longest-common-subsequence problem. A subsequence of a given sequence is just the given sequence with zero or more elements left out.

Formally, given a sequence $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩$ , another sequence $Z = ⟨ z_{1}, z_{2}, \dots, z_{k} ⟩$ is a subsequence of $X$ if there exists a strictly increasing sequence $⟨ i_{1}, i_{2}, \dots, i_{k} ⟩$ of indices of $X$ such that for all $j = 1, 2, \dots, k$ , we have $x_{i_{j}} = z_{j}$ .

In other words, given $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩$ , we have $Z = ⟨ x_{i_{1}}, \dots, x_{i_{k}} ⟩$ such that

$$
i_{1}, \dots, i_{k} \in \{1, \dots, m\} \land i_{1} < i_{2} < \dots < i_{k}
$$

Example: $Z = ⟨ B, C, D, B ⟩$ is a subsequence of $X = ⟨ A, B, C, B, D, A, B ⟩$ with corresponding index sequence $⟨ 2, 3, 5, 7 ⟩$
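The definition can be checked mechanically with a greedy left-to-right scan. The sketch below is only illustrative: the function name `subsequence_indices` is an assumption introduced here, and it reports the index sequence in the 1-based convention used above.

```python
def subsequence_indices(X, Z):
    """Return 1-based indices i_1 < ... < i_k with x_{i_j} = z_j, or None."""
    indices = []
    pos = 0  # next 0-based position of X to inspect
    for z in Z:
        # Greedily take the earliest remaining occurrence of z in X.
        while pos < len(X) and X[pos] != z:
            pos += 1
        if pos == len(X):
            return None  # z cannot be matched: Z is not a subsequence of X
        indices.append(pos + 1)  # store the 1-based index
        pos += 1
    return indices


print(subsequence_indices("ABCBDAB", "BCDB"))  # [2, 3, 5, 7]
```

The greedy choice is safe: taking the earliest match for each $z_{j}$ never rules out a later match.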

Common Subsequence

Given two sequences $X$ and $Y$ , we say that a sequence $Z$ is a common subsequence of $X$ and $Y$ if $Z$ is a subsequence of both $X$ and $Y$

Example: if $X = ⟨ A, B, C, B, D, A, B ⟩$ and $Y = ⟨ B, D, C, A, B, A ⟩$ , the sequence $⟨ B, C, A ⟩$ is a common subsequence of both $X$ and $Y$ .

The sequence $⟨ B, C, A ⟩$ is not a longest common subsequence (LCS) of $X$ and $Y$ , however, since it has length $3$ and the sequence $⟨ B, C, B, A ⟩$ , which is also common to both $X$ and $Y$ , has length $4$ . The sequence $⟨ B, C, B, A ⟩$ is an LCS of $X$ and $Y$ , as is the sequence $⟨ B, D, A, B ⟩$ since $X$ and $Y$ have no common subsequence of length $5$ or greater.

Problem Statement - Longest Common Subsequence

In the longest-common-subsequence problem, we are given two sequences (input)

$$
X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩ \land Y = ⟨ y_{1}, y_{2}, \dots, y_{n} ⟩
$$

and wish to find a maximum-length common subsequence $W$ of $X$ and $Y$ (output).

Observation 1: Multiple possible LCS

Let $L C S (X, Y)$ denote the set of all the longest common subsequences of two sequences $X$ and $Y$ . This set may contain several distinct subsequences, all of the same maximum length.

Observation 2: Brute force is not an option $T (n) = Θ (2^{n})$

In a brute-force approach to solve the LCS problem:

We would enumerate all subsequences of $X$
Check each subsequence to see whether it is also a subsequence of $Y$
Keep track of the longest subsequence we find.

Generating all the possible subsequences of $X$ , checking whether each one is also a subsequence of $Y$ , and storing the longest one requires exponential time. This is because a sequence of $n$ elements has $2^{n}$ subsequences, as each element $x_{i}$ either appears or not.
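The brute-force procedure above can be sketched as follows. This is a minimal illustration, not a practical algorithm: the names `is_subsequence` and `brute_force_lcs` are assumptions introduced here, and the enumeration relies on `itertools.combinations`, which yields the elements of each candidate in their original order. Returning a set also illustrates Observation 1.

```python
from itertools import combinations


def is_subsequence(Z, Y):
    """True if Z can be obtained from Y by deleting zero or more elements."""
    it = iter(Y)
    # `z in it` advances the iterator, so matches must occur in order.
    return all(z in it for z in Z)


def brute_force_lcs(X, Y):
    """Return the set of all longest common subsequences, by enumeration."""
    for k in range(len(X), -1, -1):  # try the longest candidates first
        found = {
            "".join(cand)
            for cand in combinations(X, k)  # every length-k subsequence of X
            if is_subsequence(cand, Y)
        }
        if found:
            return found  # k = 0 always succeeds with the empty sequence


print(sorted(brute_force_lcs("ABCBDAB", "BDCABA")))
```

Even on this small input the loop may inspect up to $2^{7} = 128$ candidates, which is why the approach does not scale.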

Step 1 - Characterizing the longest common subsequence

Remember, we can apply dynamic programming here if we can express the solution in a polynomial count of sub-problems.

The LCS problem has an optimal-substructure property, as the following theorem shows. As we shall see, the natural classes of subproblems correspond to pairs of “prefixes” of the two input sequences.

Prefix

Given a sequence $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩$ , we define the $k$ -th prefix of $X$ , for $k \leq m$ , as the prefix of $X$ of length $k$ , $X^{k} = ⟨ x_{1}, x_{2}, \dots, x_{k} ⟩$

Example: $X = ⟨ A, B, C, B, D, A, B ⟩$ , then

$X^{4} = ⟨ A, B, C, B ⟩$
And $X^{0}$ is the empty sequence
Also, when $k = m$ , the prefix corresponds to the whole sequence.

In general $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩$ has $m + 1$ prefixes, and by reducing the $L C S$ problem to the $L C S$ problem on prefixes the complexity goes down to $O (m n)$ sub-problems.

Where $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩$ and $Y = ⟨ y_{1}, y_{2}, \dots, y_{n} ⟩$

Theorem 15.1: Optimal Substructure of an LCS

Let's restate Theorem 15.1 for analysis purposes.

Let $X = ⟨ x_{1}, x_{2}, \dots, x_{m} ⟩ \land Y = ⟨ y_{1}, y_{2}, \dots, y_{n} ⟩$ be two sequences and $W = ⟨ w_{1}, w_{2}, \dots, w_{k} ⟩ \in L C S (X, Y)$ be one of the possible $L C S$

$W$ is an optimal solution of the whole problem. The theorem states when $W$ , or its prefix $W^{k - 1}$ , is also an optimal solution of a smaller problem on prefixes of $X$ and $Y$ . Then:

- If $x_{m} = y_{n}$ , then:
    1. The last characters coincide, $w_{k} = x_{m} = y_{n}$
    2. The prefix of this common sequence is an LCS of the prefixes of $X$ and $Y$ , $W^{k - 1} \in L C S (X^{m - 1}, Y^{n - 1})$
- If $x_{m} \neq y_{n}$ , then:
    1. If $w_{k} \neq x_{m}$ , then $W^{k} \in L C S (X^{m - 1}, Y)$ (recall that $W^{k} = W$ )
    2. If $w_{k} \neq y_{n}$ , then $W^{k} \in L C S (X, Y^{n - 1})$

Demonstration

Proof by contradiction:

Part 1.1: $w_{k} = x_{m} = y_{n}$
- Suppose this were not true. Then we could append $x_{m} = y_{n}$ to $W$ , obtaining $W x_{m}$ , which is still a common subsequence of $X$ and $Y$ .
- But $W x_{m}$ has length $k + 1$ , so it is longer than $W$ , which is absurd because $W \in L C S (X, Y)$ is already an optimal solution.
- Then, $w_{k} = x_{m} = y_{n}$ is true.

Part 1.2: $W^{k - 1} \in L C S (X^{m - 1}, Y^{n - 1})$
- $W^{k - 1}$ is a common subsequence of $X^{m - 1}$ and $Y^{n - 1}$ . Suppose it were not a longest one: then there would be some $W^{'} \in L C S (X^{m - 1}, Y^{n - 1})$ with $| W^{'} | > | W^{k - 1} | = k - 1$
- Since we know that $w_{k} = x_{m} = y_{n}$ , appending $w_{k}$ to both sequences gives $| W^{k - 1} w_{k} | < | W^{'} w_{k} |$ , and both are common subsequences of $X$ and $Y$ .
- But $W^{k - 1} w_{k} = W$ , as it is the prefix concatenated with the last character.
- We reach a contradiction, as $| W | < | W^{'} w_{k} |$ and $W$ would not be an optimal solution anymore.

Part 2.1: $x_{m} \neq y_{n} \land w_{k} \neq x_{m} \Rightarrow W^{k} \in L C S (X^{m - 1}, Y)$
- Since $w_{k} \neq x_{m}$ , $W$ does not use $x_{m}$ , so it is a common subsequence of $X^{m - 1}$ and $Y$ .
- Suppose there exists some $W^{'} \in L C S (X^{m - 1}, Y)$ with $| W^{'} | > | W |$ . Then $W^{'}$ is also a common subsequence of $X$ and $Y$ , which is not possible as $W$ is by definition an optimal solution of maximum length.

Part 2.2: $x_{m} \neq y_{n} \land w_{k} \neq y_{n} \Rightarrow W^{k} \in L C S (X, Y^{n - 1})$
- Symmetric to Part 2.1: $W$ is a common subsequence of $X$ and $Y^{n - 1}$ , and any $W^{'} \in L C S (X, Y^{n - 1})$ with $| W^{'} | > | W |$ would also be a common subsequence of $X$ and $Y$ , which is not possible as $W$ is by definition an optimal solution of maximum length.

Conclusion

To sum up:

Thanks to this we managed to express the $L C S (X, Y)$ in terms of sub-problems, now we have a polynomial way to construct our solution.
The way that Theorem 15.1 characterizes the longest common subsequences tells us that an LCS of two sequences contains within it an LCS of prefixes of the two sequences.
This means we can start from prefixes of length $i = 1$ , and then proceed towards $i = m$ ( $m$ as the last index)
Thus, the LCS problem has an optimal-substructure property. A recursive solution also has the overlapping sub-problems property, as we shall see in a moment.

Step 2 - Recursive Solution

A recursive solution

If $x_{m} = y_{n}$
1. Find an $L C S (X^{m - 1}, Y^{n - 1})$
2. Appending $x_{m} = y_{n}$ , to this LCS yields an $L C S (X, Y)$
Else, we must solve two subproblems
1. Finding an $L C S (X^{m - 1}, Y)$
2. Finding an $L C S (X, Y^{n - 1})$
3. The longer of the two is an $L C S (X, Y)$ , and these cases exhaust all possibilities.

Our recursive solution to the LCS problem involves establishing a recurrence for the value of an optimal solution. Let us define $c [i, j]$ to be the length of an $L C S (X^{i}, Y^{j})$

$$
c[i, j] = \begin{cases} 0 & \text{if } i = 0 \lor j = 0 \\ c[i - 1, j - 1] + 1 & \text{if } i, j > 0 \land x_{i} = y_{j} \\ \max \{ c[i - 1, j], c[i, j - 1] \} & \text{if } i, j > 0 \land x_{i} \neq y_{j} \end{cases}
$$

The condition $x_{i} = y_{j}$ rules out some sub-problems: when the last characters match, we only need $c [i - 1, j - 1]$ . Overall there are $(m + 1) (n + 1) = Θ (m \cdot n)$ distinct subproblems, one for each pair of prefixes.
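To see why this recurrence calls for dynamic programming, it can be evaluated directly, without storing any results. The sketch below is an illustration only: the name `lcs_length_naive` and the call counter are assumptions added here to expose the repeated work.

```python
def lcs_length_naive(X, Y):
    """Evaluate the recurrence for c[m, n] directly; return (length, calls)."""
    calls = 0

    def c(i, j):
        nonlocal calls
        calls += 1
        if i == 0 or j == 0:
            return 0
        if X[i - 1] == Y[j - 1]:  # x_i = y_j (Python strings are 0-indexed)
            return c(i - 1, j - 1) + 1
        return max(c(i - 1, j), c(i, j - 1))

    return c(len(X), len(Y)), calls


print(lcs_length_naive("AB", "CD"))  # (0, 11)
```

For two sequences of length $2$ with no common character there are only $3 \cdot 3 = 9$ distinct subproblems, yet the recursion makes $11$ calls: some subproblems are solved more than once, and the gap grows quickly with the input length.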

Step 3 & 4 - Bottom Up

Computing the length of an LCS: recall that $c [i, j]$ is the length of an $L C S (X^{i}, Y^{j})$

| Operation | `BU_LCS(X, Y) -> Pair(b,c)` |
| --- | --- |
| Input | Sequences $X, Y$ |
| Output | Tables $b, c$ with $c [m, n]$ containing the length of an $L C S (X, Y)$ |
| $c [0 \dots m, 0 \dots n]$ | is a 2D vector that saves the lengths of $L C S (X^{i}, Y^{j})$ |
| $b [1 \dots m, 1 \dots n]$ | is a 2D vector that helps us construct an optimal solution |
| $b [i, j]$ | Points to the table entry corresponding to the optimal sub-problem solution chosen when computing $c [i, j]$ |

$$
\text{movements on the board} = \left\{ \begin{matrix} \nwarrow & x_{i} = y_{j} & LCS(X^{i}, Y^{j}) \leadsto LCS(X^{i - 1}, Y^{j - 1}) \\ \uparrow & x_{i} \neq y_{j} & LCS(X^{i}, Y^{j}) \leadsto LCS(X^{i - 1}, Y^{j}) \\ \leftarrow & x_{i} \neq y_{j} & LCS(X^{i}, Y^{j}) \leadsto LCS(X^{i}, Y^{j - 1}) \end{matrix} \right.
$$

LCS movement on matrix B

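```python
def BU_LCS(x, y):
    """Return tables (b, c); c[m][n] is the length of an LCS of x and y."""
    m = len(x)
    n = len(y)
    # c[i][j] = length of an LCS of the prefixes X^i and Y^j
    c = [[None] * (n + 1) for _ in range(m + 1)]
    # b[i][j] = arrow chosen when computing c[i][j] (row 0 and column 0 unused)
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):  # when j = 0 the prefix of y is empty
        c[i][0] = 0
    for j in range(1, n + 1):  # when i = 0 the prefix of x is empty
        c[0][j] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Python strings are 0-indexed, so x_i is x[i - 1]
            if x[i - 1] == y[j - 1]:  # CASE 1
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "↖"
            elif c[i - 1][j] >= c[i][j - 1]:  # CASE 2
                c[i][j] = c[i - 1][j]
                b[i][j] = "↑"
            else:  # CASE 3
                c[i][j] = c[i][j - 1]
                b[i][j] = "←"
    return b, c
```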
**Final Time Complexity** $T(n)= \Theta(m) + \Theta(n) + \Theta(n \cdot m) = \Theta(n \cdot m)$
* Polynomial

![examplebulcs](https://github.com/PayThePizzo/DataStrutucures-Algorithms/blob/main/Resources/examplebulcs.png?raw=TRUE)

Now that we have found the length of an LCS, we want to display which one it could be!

### Printing
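Constructing an LCS

Let's print an LCS!
* We start from position $(m,n)$ and, following the arrows in $b$, decrease $i$, $j$, or both at each step.
* We only print if there's an oblique arrow, which marks a common character.
* Since the recursive call happens before the print, we first walk from the bottom of the table to the top,
and the characters are printed on the way back, in left-to-right order.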

```python
def printLCSAux(X, b, i, j):
    if i > 0 and j > 0:  # neither prefix is empty
        if b[i][j] == "↖":  # x_i = y_j is a common char
            printLCSAux(X, b, i - 1, j - 1)  # first we deal with the subproblem
            print(X[i - 1], end="")  # x_i is X[i - 1] with 0-based indexing
        elif b[i][j] == "↑":  # x_i is NOT part of this LCS
            printLCSAux(X, b, i - 1, j)
        else:  # "←": y_j is NOT part of this LCS
            printLCSAux(X, b, i, j - 1)
```
**Final Time Complexity** $T(n)= \mathcal{O}(i+j)$
* At every function call, we decrease at least one of the two parameters.

```python
def printLCS(X, Y):
    b, c = BU_LCS(X, Y)  # c is not needed for printing
    printLCSAux(X, b, len(X), len(Y))
```
**Final Time Complexity** $T(n)= \Theta(n \cdot m) + \mathcal{O}(n + m)= \Theta(n \cdot m)$
* Building the tables dominates; walking back through the LCS only adds a linear term.


### Improve memory
We can reduce the memory usage through two different optimizations

#### First Method

In the LCS algorithm, for example, we can eliminate the $b$ table altogether.

Each $c[i,j]$  entry **depends on only three other $c$ table entries**:
1. $c[i-1,j-1]$
2. $c[i-1,j]$
3. $c[i,j-1]$

Given the value of $c[i,j]$, we can determine in $O(1)$ time which of
these three values was used to compute $c[i,j]$ without inspecting table $b$.

Thus, we can reconstruct an $LCS$ in $\mathcal{O}(m+n)$ time using a procedure similar to `printLCSAux`.
The order of the tests matters: the diagonal move must be the last resort, because
$c[i,j] = c[i-1,j-1] + 1$ can also hold when $x_i \neq y_j$.

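```python
def printLCSAux(X, c, i, j):
    if i > 0 and j > 0:
        if c[i][j] == c[i - 1][j]:  # same length without x_i: move up
            printLCSAux(X, c, i - 1, j)
        elif c[i][j] == c[i][j - 1]:  # same length without y_j: move left
            printLCSAux(X, c, i, j - 1)
        else:  # c[i][j] is larger than both, so x_i = y_j
            printLCSAux(X, c, i - 1, j - 1)
            print(X[i - 1], end="")  # x_i is X[i - 1] with 0-based indexing
```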
Although we save $\Theta(n*m)$ space by this method, the auxiliary 
space requirement for computing an LCS does not asymptotically decrease, since 
we need $\Theta(n*m)$ space for the $c$ table anyway.
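The same walk can also be written iteratively and made to return the LCS instead of printing it. The sketch below is an assumption-laden illustration: `lcs_from_c` is a name introduced here, and it rebuilds the $c$ table inline only so that the block is self-contained.

```python
def lcs_from_c(X, Y):
    """Build c bottom-up, then walk back from c[m][n] without a b table."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])

    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if c[i][j] == c[i - 1][j]:  # same length without x_i: move up
            i -= 1
        elif c[i][j] == c[i][j - 1]:  # same length without y_j: move left
            j -= 1
        else:  # larger than both neighbours, so x_i = y_j belongs to the LCS
            out.append(X[i - 1])
            i -= 1
            j -= 1
    return "".join(reversed(out))  # characters were collected back to front


print(lcs_from_c("ABCBDAB", "BDCABA"))  # BCBA
```

A loop avoids deep recursion on long inputs. Which LCS is returned depends on the order of the tests; with this order the example yields `BCBA`.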


#### Second Method

We can, however, reduce the asymptotic space requirements for `BU_LCS`,
since it needs only two rows of table $c$ at a time:
* The row being computed, 
* and the previous row

This improvement works if **we need only the length of an LCS**, which can then be computed
with $\mathcal{O}(n)$ extra space; if we need to reconstruct
the elements of an LCS, the smaller table does not keep enough information to
retrace our steps in $\mathcal{O}(m+n)$ time.
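A minimal sketch of the two-row idea follows. The name `lcs_length_two_rows` is an assumption, and swapping the inputs so that the rows follow the shorter sequence is an extra refinement not stated above; it is valid because the LCS length is symmetric in its arguments.

```python
def lcs_length_two_rows(X, Y):
    """Return the length of an LCS while keeping only two rows of c."""
    if len(Y) > len(X):
        X, Y = Y, X  # rows get the shorter length: O(min(m, n)) space
    prev = [0] * (len(Y) + 1)  # row i - 1; row 0 is all zeros
    for x in X:
        curr = [0] * (len(Y) + 1)  # row i; curr[0] = 0 for the empty prefix
        for j, y in enumerate(Y, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr  # the current row becomes the previous one
    return prev[len(Y)]


print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```

The running time is still $\Theta(n \cdot m)$; only the memory footprint changes.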


## Step 3 & 4 - Top Down

```python
def TD_LCSAux(x, y, c, i, j):
    if c[i][j] == -1:  # sub-problem not solved yet
        if i == 0 or j == 0:
            c[i][j] = 0
        elif x[i - 1] == y[j - 1]:  # x_i = y_j (0-based Python indexing)
            c[i][j] = TD_LCSAux(x, y, c, i - 1, j - 1) + 1
        else:
            c[i][j] = max(TD_LCSAux(x, y, c, i - 1, j),
                          TD_LCSAux(x, y, c, i, j - 1))
    return c[i][j]
```
**Final Time Complexity** $T(n)= \mathcal{O}(n \cdot m)$
* This is directly proportional to the possible sub-problems 

```python
def TD_LCS(X, Y):
    m = len(X)
    n = len(Y)
    # every entry starts at -1, meaning "sub-problem not solved yet"
    c = [[-1] * (n + 1) for _ in range(m + 1)]
    return TD_LCSAux(X, Y, c, m, n)
```
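**Final Time Complexity** $T(n)= \Theta(n \cdot m)$
* Initializing $c$ alone costs $\Theta(n \cdot m)$. If the two strings are identical, the recursion itself solves only the $\mathcal{O}(n)$ sub-problems on the diagonal rather than all $\mathcal{O}(n^{2})$ of them.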
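The number of sub-problems the top-down approach actually solves can be observed with a memoized variant. The sketch below swaps the explicit $c$ table for `functools.lru_cache` from the standard library; this is a substitution made here for illustration, and `td_lcs_cached` is a hypothetical name.

```python
from functools import lru_cache


def td_lcs_cached(X, Y):
    """Return (LCS length, number of distinct sub-problems solved)."""

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 or j == 0:
            return 0
        if X[i - 1] == Y[j - 1]:  # x_i = y_j (0-based Python indexing)
            return c(i - 1, j - 1) + 1
        return max(c(i - 1, j), c(i, j - 1))

    length = c(len(X), len(Y))
    # currsize counts the cached (i, j) pairs, i.e. the solved sub-problems
    return length, c.cache_info().currsize


print(td_lcs_cached("ABCD", "ABCD"))  # (4, 5)
```

For identical strings of length $n$ only the $n + 1$ diagonal entries are solved, and no table has to be initialized up front. Long inputs may exceed Python's recursion limit, so this remains a sketch rather than a drop-in replacement.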

![example lcs td](https://github.com/PayThePizzo/DataStrutucures-Algorithms/blob/main/Resources/exlcstd.png?raw=TRUE)
