CS494 Lecture Notes - The Floyd-Warshall Algorithm
- James S. Plank
- Directory: /home/plank/cs494/Notes/Floyd
- Original notes: March, 2016
- Most recent revision:Wed Nov 3 14:34:41 EDT 2021
What these lecture notes cover
- Reference material.
- The algorithm summarized.
- Working through a detailed example.
- Exploration #1: The-Tips problem from Topcoder, Floyd-Warshall vs DFS.
- Exploration #2: All-pairs flow, Floyd-Warshall vs Dijkstra.
Reference Material
As always, the writeup in Wikipedia is simultaneously too much and too little. These notes are meant to augment theWikipedia writeup, not replace it.The Algorithm Summarized
You are going to manage a matrix of shortest paths. Call it SP, and SP[i][j], at the end of the algorithm, will contain the shortest path from node ito node j. You initialize SP to be the adjacency matrix for a graph. If there isno edge from i to j, then initialize SP[i][j] to be infinity or an appropriately high sentinel value. You should also initialize SP[i][i] to be zero.I'll now summarize the algorithm.
- For each node i in the graph:
- For each pair of nodes x, y in the graph:
- Check to see if SP[x][i] + SP[i][y] is less thanSP[x][y].
- If so, what we've discovered is a shorter path from x to y than theone we currently know, and that path goes through node i.
- In that case, update SP[x][y] to equal SP[x][i] + SP[i][y].
- For each pair of nodes x, y in the graph:
I'm not proving here that the Floyd-Warshall algorithm works, by the way. I'm just telling you that it works, and how to do it.
An Example
This is the same example as in the Wikipedia page (at least as of March, 2016. If Wikipediachanges, go ahead and use the Wayback Machine to make it match up). Here's the graphYou'll note first that it has negative edges. That's ok, as long as there are no negativecycles in the graph (which there aren't). Now, we're going to work through the algorithm,and what I'll do is at each step, show you SP both in graphical form and as a matrix.I'm going to sentinelize the lack of an edge with ∞.
1 2 3 4 ---------------------- 1 | 0 ∞ -2 ∞ 2 | 4 0 3 ∞ 3 | ∞ ∞ 0 2 4 | ∞ -1 ∞ 0 |
Step 1: i = 1. We start with i = 1. We next have to iteratethrough every pair of nodes (x,y), and test to see if the path from x toy through node i is shorter than SP[x][y]. Obviously, we can ignore the cases when x=i, y=i or x=y.Here are the values that we test: I've colored the "winners" red in each case:
x | y | SP[x][1] | SP[1][y] | SP[x][1]+SP[1][y] | Old SP[x][y] | New SP[x][y] |
223344 | 342423 | 44∞∞∞∞ | -2∞∞∞∞-2 | 2∞∞∞∞∞ | 342423 | 242423 |
As you can see, there is one place where SP is improved.I'll update the drawing and the matrix below. I've colored thechanged edges/values in red, and I colored node 1 in green to showthat it was the intermediate node in these paths:
1 2 3 4 ---------------------- 1 | 0 ∞ -2 ∞ 2 | 4 0 2 ∞ 3 | ∞ ∞ 0 2 4 | ∞ -1 ∞ 0 |
Step 2: i = 2. Let's make the same table as before, only now with i = 2:
x | y | SP[x][2] | SP[2][y] | SP[x][2]+SP[2][y] | Old SP[x][y] | New SP[x][y] |
113344 | 341413 | ∞∞∞∞-1-1 | 2∞4∞42 | ∞∞∞∞31 | -2∞∞2∞∞ | -2∞∞231 |
There are two places where SP is improved.I'll update the drawing and the matrix below:
1 2 3 4 ---------------------- 1 | 0 ∞ -2 ∞ 2 | 4 0 2 ∞ 3 | ∞ ∞ 0 2 4 | 3 -1 1 0 |
Step 3: i = 3. Here's our table:
x | y | SP[x][3] | SP[3][y] | SP[x][3]+SP[3][y] | Old SP[x][y] | New SP[x][y] |
112244 | 241412 | -2-22211 | ∞2∞2∞∞ | ∞0∞4∞∞ | ∞∞4∞3-1 | -20443-1 |
Once again there are two places where SP is improved.I'll update the drawing and the matrix below:
1 2 3 4 ---------------------- 1 | 0 ∞ -2 0 2 | 4 0 2 4 3 | ∞ ∞ 0 2 4 | 3 -1 1 0 |
The Last Step: i = 4. Here's our table:
x | y | SP[x][4] | SP[4][y] | SP[x][4]+SP[4][y] | Old SP[x][y] | New SP[x][y] |
112233 | 231312 | 004422 | -11313-1 | -117551 | ∞-242∞∞ | -1-24251 |
We're done -- the final drawing and matrix are below. As you can see, threevalues were changed, and there are no more big values on the graph at all.
1 2 3 4 ---------------------- 1 | 0 -1 -2 0 2 | 4 0 2 4 3 | 5 1 0 2 4 | 3 -1 1 0 |
The matrix now has your all-pairs shortest paths. If any of the diagonal entries are negative,then your graph has negative cycles, and the matrix is invalid.
Other problems solved by Floyd-Warshall
As noted in the Wikipedia page, you can solve other problems with this technique, includingthe following two, which we'll explore below:- Calculate the transitive closure of a binary relation. A binary relationR over a set of numbers is a definition such that for each i and jin the set, R(i,j) equals 0 or 1. The transitive closure of R is a newrelation R' such that if R(i,j) = 1, then R'(i,j) also equals 1.Moreover, R'(i,j) is minimally transitive.The transitive property means that if R'(i,k)=1 andR'(k,j)=1, then R'(i,j)=1. The "minimally transitive" part means that of allpossible definitions of R', we choose the one that has the maximum number of i,jpairs where R'(i,j)=0.
I know that was mathy, but we'll give it a practical instance below.
- Calculate the maximum flow paths between every pair of nodes in a directed, weighted graph. You can see where this problem could have practicality in traffic analysis.
Exploration #1: The-Tips problem from Topcoder, Floyd-Warshall vs DFS.
Statement
This is the 500-point problem from Qualification Round 1B of the Topcoder Open competitionfrom 2015. Here is the problem statement. Here's a summary, in case Topcoder's servers are down.- Fred's parents have hidden N easter eggs in their house. The eggs arenumbered 0 to N-1, because Fred's parents understand that the world is zero-indexed.
- You are given a vector of integers called probability. Its size is N.
- Without any clues, Fred's probability of finding egg i is probability[i] percent.
- Inside of some eggs, Fred's parents have written on paper the location of some other eggs.
- The information about these eggs is stored in a vector of strings called clues.
- For each i, j, if clues[i][j] is 'Y', then egg i contains the location of egg j. If clues[i][j] is 'N', then egg i does not contain the location of egg j.
- Your job is to return the expected number of eggs that you will find.
- N is a number between 1 and 50.
Examples
# | clues | probability | Answer | Comments |
0 | { "Y" } | { 50 } | 0.5 | Pretty straightforward. One egg. 50% chance of finding it. |
1 | { "YN", "NY" } | { 100, 50 } | 1.5 | The clues are worthless. 100% chance of egg 0, and 50% chance of egg 1. |
2 | { "YYY", "NYY", "NNY" } | { 100, 0, 0 } | 3.0 | Fred is finding egg 0, and with the clues, will also find eggs 1 and 2 |
3 | { "NNN", "NNN", "NNN" } | { 0, 0, 0 } | 0.0 | Fred's parents are mean. |
4 | { "NYYNYYNNNN", "NNNNYNNNYN", "YNNYYYYYNN", "YYNYNNNNYN", "NYNNNNNNNY", "YNYYNNYNNY", "NYNNYYYYYY", "NYYYYNNNNN", "YYNYNNYYYN", "NNYYNYNYYY"} | { 11, 66, 99, 37, 64, 45, 21, 67, 71, 62 } | 9.999891558057332 | I'm not working through this one by hand. |
5 | { "NNY", "NNY", "NNN" } | { 50, 50, 1 } | 1.7525 | I'll go over this one below. |
6 | { "NY", "NN" } | { 50, 50 } | 1.25 | If you find egg 0 (50%), you'll find egg 1. If you don't find egg 0 (50%), then you have a 50% chance of finding egg 1. The eggspected value is .5*2 + .25*1 = 1.25. |
Solution
This problem has a solution based on connectivity, that you can hack up with DFS or with Floyd-Warshall. It's not the easiest solution to see, so let me try to walk you through it.We are going to define p_{i} to be the probability that you find egg i, either independently or as the result of finding another egg and using clues. Let's use example 6 above to give real numbers to the problem:
- p_{0} equals 0.5. You either find it or you don't.
- p_{1} equals 0.75. You either find it on your own (50%), or you don't find it on your own (50%) but you do find egg 0 and can use the clue (50)%. 0.5 + 0.25 = 0.75.
First calculate a connectivity matrixnamed C of zero's and ones: if C[j][i] is equal to one, then egg i is reachablefrom egg j using clues. Now, you go through the following algorithm to find p_{i}:
- Set p_{i} to 0:
- For each egg j, if egg i is reachable from egg j, then increment p_{i} by (1 - p_{i}) * probability[j]. Read this as follows:
- My probability of finding egg i because I found eggs 0 through j-1 is p_{i}.
- Therefore, the probability of not finding egg i because of eggs 0 through j-1 is (1 - p_{i}). This probability encompasses either not finding eggs with indices < j or finding such eggs, but there are not clues to get to egg i.
- My probability of finding egg j without any clues is probability[j].
- So, if egg i is reachable from egg j, then my probability of finding egg i because I found eggs 0 through j is p_{i} + (1 - p_{i})*probability[j].
- If egg i is not reachable from egg j, then my probability of finding egg i because I found eggs 0 through j remains p_{i}.
Let's now walk through example 5, which is the following graph:
Our connectivity matrix is:
1 0 10 1 10 0 1We're going to calculate all of the p_{i}'s with one pass through the eggs:
- We start with p = { 0, 0, 0 }.
- We note from the matrix, that nodes 0 and 2 are reachable from node 0, whose probability is 0.50. This makes p become { 0.5, 0, 0.5 }.
- Nodes 1 and 2 are reachable from node 1, whose probability is also 0.50. This makes p become { 0.5, 0.5, 0.75 }. The 0.75 is calculated as .5 + (1 - .5)*.5.
- Finally, node 2 is reachable from node 2, whose probability is 0.01. That means that p_{2} is set to 0.75 + (1-0.75)*0.01 = 0.7525, and the final valueof p is { 0.5, 0.5, 0.7525 }.
- So the final expected value is the sum of the values in p, which is 1.7525.
Now, where does Floyd-Warshall come in? You use it to create the connectivity matrix.The clues are a binary relationship, and the connectivity matrix is the transitive closure.So, you start with the adjacency matrix (clues), and then your Floyd-Warshall calculationis the following, which feels too much like Python for my tastes (I'm using variables tomatch the code below):
For all nodes v: For all pairs of nodes i,j: If you can't get from node i to node j (i.e. C[i][j] is zero): But you can get from i to v and v to j: Then set C[i][j] to one, saying you can get from node i to j after all.
I programmed a Floyd-Warshall solution for this inThe-Tips-Floyd-If.cpp.Here's the Floyd-Warshall code to calculate C, then the calculation of p, and the calculation of the final return value:
/* The Floyd-Warshall Calculation -- sorry that I've changed variable names from above: */ for (v = 0; v < C.size(); v++) { for (i = 0; i < C.size(); i++) { for (j = 0; j < C.size(); j++) { if (!C[i][j] && C[i][v] && C[v][j]) C[i][j] = 1; } } } /* Calculate the values of p from the probabilities and reachability: */ p.resize(C.size(), 0); for (i = 0; i < C.size(); i++) { x = probability[i]; x /= 100.0; for (j = 0; j < C.size(); j++) { if (C[i][j]) p[j] += ((1 - p[j]) * x); } } /* Calculate the final return value */ x = 0; for (i = 0; i < C.size(); i++) x += p[i]; return x;} |
I have also programmed up a solution that uses bit arithmetic instead of the ifstatement. This solution doesn't do anything fancy with the bits; it simply replacesthe if statement with bit arithmetic. This is inThe-Tips-Floyd-Bits.cpp. Here's the Floyd-Warshall part:
for (v = 0; v < C.size(); v++) { for (i = 0; i < C.size(); i++) { for (j = 0; j < C.size(); j++) { C[i][j] |= (C[i][v] & C[v][j]); } } } |
Think about the tradeoffs with the two pieces of code. With the if statement,you don't evaluate C[i][v] or C[j][v] when C[i][j] is one. That saves some work. On the flip side, if statements involve comparisons andjumps, which require more instructions than the simple bit arithmetic.
Here's the relevant assembly code for each (Intel core i5, compiled with -O2). I find it hard to read, but it does back up the comments above.
if (!C[i][j] && C[i][v] && C[v][j]) C[i][j] = 1;LFB2189: pushq %rbpLCFI6: movq %rsp, %rbpLCFI7: movl %ecx, %eax movq (%rdi), %rcx movslq %esi,%rsi leaq (%rsi,%rsi,2), %rsi movq (%rcx,%rsi,8), %rsi movslq %edx,%rdx leaq (%rsi,%rdx), %r8 cmpb $0, (%r8) jne L17 cltq cmpb $0, (%rsi,%rax) je L17 leaq (%rax,%rax,2), %rax movq (%rcx,%rax,8), %rax cmpb $0, (%rdx,%rax) je L17 movb $1, (%r8) .align 4,0x90L17: leave ret | C[i][j] |= (C[i][v] & C[v][j]);LFB2189:pushq%rbpLCFI6:movq%rsp, %rbpLCFI7:movq(%rdi), %rdimovslq%esi,%rsileaq(%rsi,%rsi,2), %rsimovq(%rdi,%rsi,8), %rsimovslq%edx,%rdxmovslq%ecx,%rcxleaq(%rcx,%rcx,2), %raxmovq(%rdi,%rax,8), %raxmovzbl(%rdx,%rax), %eaxandb(%rsi,%rcx), %alorb%al, (%rsi,%rdx)leaveret |
One last tweak -- suppose I want to move the c[i][v] lookup out of the inner loop.Why? Because if it's false, I won't have to do the inner loop at all. This isin The-Tips-Floyd-Bits-2.cpp:
for (v = 0; v < C.size(); v++) { for (i = 0; i < C.size(); i++) { if (C[i][v]) { for (j = 0; j < C.size(); j++) { C[i][j] |= C[v][j]; } } } } |
Remember this -- it will come in handy for your lab. All of these work on the topcoderexamples:
UNIX> for i in 0 1 2 3 4 5 6 ; do # I'm using bash here> tip-fw-if $i | tail -n 1> tip-fw-bits $i | tail -n 1> tip-fw-bits-2 $i | tail -n 1> done0.50000000000.50000000000.50000000001.50000000001.50000000001.50000000003.00000000003.00000000003.00000000000.00000000000.00000000000.00000000009.99989155819.99989155819.99989155811.75250000001.75250000001.75250000001.25000000001.25000000001.2500000000UNIX>I've timed these on my MacBook (2.4 Ghz Core i5), and the results make sense. Because these matrices are dense, C[i][j] is true a lot of the time, and that makes the if statement exit early a lot. On the flip side,C[i][v] is also true a lot, so moving it out of the inner loop doesn'teliminate the inner loop very much.
Using DFS Instead of Floyd-Warshall
Instead of Floyd-Warshall, you can simply do a DFS for every starting node. To do that, I have programmed four variants. They all make an adjacency list representation of thegraph in Adj, and then do a DFS from each node v to calculate which nodes are reachable from v. Here are the variants:- The-Tips-DFS-R.cpp: This is a standard recursiveDFS, which you call on every starting node (clearing out the visited vector each time):
void DFS::DoDFS(int n, int from){ int j; if (visited[n]) return; p[n] += ((1 - p[n]) * pr[from]); visited[n] = 1; for (j = 0; j < Adj[n].size(); j++) { DoDFS(Adj[n][j], from); }}
- The-Tips-DFS-R2.cpp: This does an extra level of pruning. It checks visited *before* you make the DFS call:
for (j = 0; j < Adj[n].size(); j++) { if (!visited[Adj[n][j]]) DoDFS(Adj[n][j], from); }
- The-Tips-DFS-Stack-1.cpp: This one manages a stack explicitly instead of doing recursion. It calls push_back() to push a valueonto the stack and pop_back() to pop it.
- The-Tips-DFS-Stack-2.cpp: This one also manages a stack, but it keeps the current index in sp, rather than calling push_back() and pop_back().
The timing results may surprise you:
They surprise me, as I thought that the stack version with the index would be the best, butinstead, it's the pruned recursion. I don't have time to explore it.
Packing Bits Together for Instruction-Level Parallelism
Now, recall this version of the Floyd-Warshall algorithm, which used bit arithmetic,and moved the checking of C[i][v] out of the inner loop (in The-Tips-Floyd-Bits-2.cpp):for (v = 0; v < C.size(); v++) { for (i = 0; i < C.size(); i++) { if (C[i][v]) { for (j = 0; j < C.size(); j++) { C[i][j] |= C[v][j]; } } } } |
Instead of storing one bit per char, what if we packed each char with 8 bits? In other words, we can store an entire row C with ceil(C.size()/8) char's.Let's use an example:
01234567 ---------0 | 100010011 | 011000002 | 101010003 | 000100004 | 100010005 | 001001006 | 000000107 | 00100001In our previous Floyd-Warshall implementations, each entry of the adjacency matrix consumeda char. What if instead we had each row of the adjacency matrix be a single byte,and each entry consumes a bit? Look and see what that lets us do in that inner loop.Suppose v equals 0 and i equals 2. You'll note that C[i][v] equals one.That means that for each value of j, we will OR C[i][j] with C[v][j] and set the result in C[i][j]. Here's row i (2) of C:
2 | 10101000And here's row v (0):
0 | 10001001Since both rows are bytes, we can do the OR in one instruction, which will set row 2to be:
2 | 10101001Do you see how that can improve performance? Instead of doing 8 OR operations in the innerloop, you only have to do one!
The graph below shows how we use this trick for 1, 2, 4, 8 and 16-byte data types. In that lattercase, I'm using the Microsoft Intrinsic instruction _mm_or_si128(), which compiles to code that uses the 128-bit OR instruction from the Intel SSE2 instruction set. Make sure you pay attention to the units on theY axis -- the savings are HUGE.
This last graph shows the original Floyd-Warshall implementation, the fastest DFS implementation,and the 128-bit SIMD implementation -- that's an amazing difference, is it not?
An aside for 2017 -- a student, Gregory Rouleau, got a bit enthused pursuing the understandingof performance in The Tips. I'm including his writeups. I love it when students lock intoa research question!
- Writeup #1
- Writeup #2
- His code "TheTips-DFS-Stack-2-V3"
All Pairs Max Flow Paths
This is too much fun -- read on.The problem that we're going to solve is, given an adjacency matrix of a weighted, directed graph, compute the flow of themaximum flow path between every pair of nodes. We're going to constrain flow values so that theyare between 0 and 255. I've drawn an example graph on the left, and its all-pairs max-flow path graph on the right. I've drawn the edges that have been updated with better flow values in red:
As you can imagine, we'll solve the problem with the Floyd-Warshall algorithm. To do so, I'vewritten a driver program in AP-Flow-Main.cpp. I have three programs that solve theproblem:
- ap-flow-fw, implemented in AP-Flow-FW.cpp, solves itwith the Floyd-Warshall algorithm.
- ap-flow-d, implemented in AP-Flow-Dijkstra.cpp, solves it by applying Dijkstra's algorithm to every starting node (this is similar to my Network Flowlecture notes in CS302, if you remember).
- ap-flow-simd, solves it with Floyd-Warshall, but it uses SIMD instructions to speed up the inner loop. You'll get to write this in your lab.
usage: AP-Flow nodes seed(- to read from stdin) print(Y|N|H) |
If you give it a seed, it will simply create a random graph. Otherwise, it reads the graph from standard input. If print is "Y", then it will print the graph's adjacency matrixbefore the flow calculation, and the all-pairs max-flow path matrix after the flow calculation.For example, the graph above was generated with a seed of 12.
UNIX> ap-flow-fw 3 12 YAdjacency Matrix: 255 37 64 93 255 52 98 62 255Flow Matrix: 255 62 64 93 255 64 98 62 255UNIX>
There is a header file inAP-Flow.h:class APFlow { public: int N; uint8_t *Adj; uint8_t *Flow; void CalcFlow();}; |
N is the number of nodes. Adj is the adjacency matrix, where Adj[i*N+j]stores the weight of the edge from i to j.Flow will be the flow matrix. You will create it by calling CalcFlow(). WhenCalcFlow() finishes then Flow[i*N+j]will contain the flow of the maximum flow path from i to j.AP-Flow-Main.cpp creates an instance of the class, andinitializes N and Adj. It then calls CalcFlow(). If print is 'Y',then it prints out the two matrices. If print is 'H', then it prints out the MD5 hash ofFlow. That allows us to check correctness without having to look at giant matrices:
UNIX> ap-flow-fw 5 1 YAdjacency Matrix: 255 8 41 52 19 44 255 1 11 5 27 44 255 49 60 29 12 108 255 115 53 29 11 29 255Flow Matrix: 255 44 52 52 52 44 255 44 44 44 53 44 255 52 60 53 44 108 255 115 53 44 52 52 255UNIX> ap-flow-fw 5 1 H3B65A77BC185B3A15603EF2268873233UNIX> ap-flow-d 5 1 H3B65A77BC185B3A15603EF2268873233UNIX>BTW, the diagonal elements are always given a weight of 255.
The Floyd-Warshall solution in AP-Flow-FW.cpp is straightforward, and is very much like our other Floyd-Warshall solutions. What we're doing now, is saying thatif the path from i to j through v has a higher flow value thanwhat we currently know (which is in Flow[i*N+j]), then update it:
#include "AP-Flow.h"void APFlow::CalcFlow(){ int i, j, v, f; for (i = 0; i < N*N; i++) Flow[i] = Adj[i]; for (v = 0; v < N; v++) { for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { /* f is the flow from i to j through v */ f = (Flow[i*N+v] < Flow[v*N+j]) ? Flow[i*N+v] : Flow[v*N+j]; if (f > Flow[i*N+j]) Flow[i*N+j] = f; } } }} |
I won't go through the Dijkstra solution, but suffice it to say that it is slower than Floyd-Warshall:
UNIX> shsh-3.2$ for i in 160 320 480 960 ; do csh -c "time ap-flow-fw $i 1 N"; done0.016u 0.000s 0:00.01 100.0%0+0k 0+0io 0pf+0w0.118u 0.000s 0:00.12 91.6%0+0k 0+0io 0pf+0w0.381u 0.001s 0:00.38 100.0%0+0k 0+0io 0pf+0w2.905u 0.004s 0:02.91 99.6%0+0k 0+0io 0pf+0wsh-3.2$ for i in 160 320 480 960 ; do csh -c "time ap-flow-d $i 1 N"; done0.039u 0.000s 0:00.04 75.0%0+0k 0+0io 0pf+0w0.193u 0.000s 0:00.19 100.0%0+0k 0+0io 0pf+0w0.516u 0.001s 0:00.51 100.0%0+0k 0+0io 0pf+0w3.036u 0.004s 0:03.04 99.6%0+0k 0+0io 0pf+0wsh-3.2$ exitUNIX>Now, as with the connectivity problem above, we can use SIMD to make this faster. I'm going to illustrate by annotating the Floyd-Warshall solution:
#include "AP-Flow.h"void APFlow::CalcFlow(){ int i, j, v, f; for (i = 0; i < N*N; i++) Flow[i] = Adj[i]; for (v = 0; v < N; v++) { for (i = 0; i < N; i++) { /* Create a vector alli, which is 16 instances of Flow[i*N+v] */ for (j = 0; j < N; j += 16) { /* Load Flow[v*N+j] through Flow[v*N+j+15] to vector vv */ /* Create fv, which is the flow from i to each of j through j+15 through v. This is simply the min of alli and vv. */ /* Load Flow[i*N+j] through Flow[i*N+j+15] to vector iv */ /* Create rv, which is the max of each value of fv and iv. */ /* Store rv into Flow[i*N+j] through Flow[i*N+j+15] */ } } }} |
Let me illustrate. Suppose that N is 16, and suppose that row i is:
30 95 101 255 104 106 69 106 11 109 73 75 108 7 15 37Let's suppose that v is 2, and let's also suppose that row v is:
119 66 255 62 80 4 47 123 48 99 22 35 100 31 13 99Now, the flow from i to v is 101. The current flow from i to 0 is 30.However, I can get from i to v with a flow of 101, and from v to 0 with a flow of 119. That means that I can get from i to 0 through v with a flow of 101, and I should update entry zero in row i to be 101.
On the flip side, the flow from i to 1 is 95, and the flow from v to 1 is 66, so I can't improve my flow by going through v.
This should give you a flavor of how row i gets updated using the flow from i to v and row v. Now, let's do it with SIMD:
alli 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101 101vv 119 66 255 62 80 4 47 123 48 99 22 35 100 31 13 99fv 101 66 101 62 80 4 47 101 48 99 22 35 100 31 13 99iv 30 95 101 255 104 106 69 106 11 109 73 75 108 7 15 37rv 101 95 101 255 104 106 69 106 48 109 73 75 108 31 15 99You'll store rv as the new value of row i.
That is your job -- to write AP-Flow-SIMD.cpp, which solves the all-pairs max-flow paths problem using Floyd-Warshalland SIMD. You may assume that N is always a multiple of 16. My program simply exitsif it is not:
UNIX> ap-flow-simd 17 1 HFor SIMD, N must be a multiple of 16UNIX>The only SIMD routines that you need to solve this problem are_mm_set1_epi8(),_mm_min_epu8() and_mm_max_epu8().
How much faster is this? Oh my.....