On link-informed graph traversal¶
The all-simple-paths algorithm as implemented in nx.all_simple_paths()
inside the
networkx
(abbreviated below as nx
) package version 2.2 uses a
depth-first traversal scheme to find all possible paths from a start node to one or more
end nodes [1].
For example, let nodes A
-F
represent unitigs in a De Bruijn graph created from sequencing
reads of transcripts:
An all-simple-paths traversal starting at A
will return the paths ABDE
, ABDF
, ACDE
,
and ACDF
. However, what if the sequenced reads that were used to create this graph only
originated from two paths: ABDE
and
ACDF
? Can some of these sequencing reads be used to restrict the paths returned by the
all-simple-paths algorithm?
Mccortex provides a data structure called “links” for annotating De Bruijn graphs in Cortex
format. In the example above, links can be used to store information on a read that covers both the
A
–> B
and D
–> E
junctions. Cortexpy can use these links to performed a
“link-informed” (that is, a restricted) all-simple-paths traversal.
[1] | Cortexpy currently uses a copy of nx.all_simple_paths()
named _all_simple_paths_graph() . |
Link-informed graph traversal in cortexpy
¶
Cortexpy uses networkx
algorithms¶
Cortexpy represents Cortex graphs as nx.DiGraph
objects [2]. This
allows the easy application of networkx
algorithms to Cortex graphs.
cortexpy
achieves link-informed traversal by wrapping a Cortex graph in a
LinkedGraphTraverser
object, which modifies the behavior of the
__getitem__()
method. To understand why this works, let us first take a look at the
all-simple-paths algorithm:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | def _all_simple_paths_graph(G, source, targets, cutoff):
"""From networkx.algorithms.simple_paths"""
visited = collections.OrderedDict.fromkeys([source])
stack = [iter(G[source])]
while stack:
children = stack[-1]
child = next(children, None)
if child is None:
stack.pop()
visited.popitem()
elif len(visited) < cutoff:
if child in visited:
continue
if child in targets:
yield list(visited) + [child]
visited[child] = None
if targets - set(visited.keys()): # expand stack until find all targets
stack.append(iter(G[child]))
else:
visited.popitem() # maybe other ways to child
else: # len(visited) == cutoff:
for target in (targets & (set(children) | {child})) - set(visited.keys()):
yield list(visited) + [target]
stack.pop()
visited.popitem()
|
The key line here is the highlighted line 18. This is the line that appends an iterator of a node’s
successors to the stack of nodes to visit. The algorithm asks the graph object G
for the successor nodes of child
by calling the
__getitem__()
method of G
:
G[child]
This means that we can restrict the paths returned by _all_simple_paths_graph()
by
restricting the successor nodes returned by G
.
[2] | The implementation is not perfect and could use some improvement. |
LinkedGraphTraverser
restricts simple paths using links¶
The __getitem__()
method of
LinkedGraphTraverser
restricts the returned
successors using the following rules:
- If no link information exists for the query node, then return all successors.
- Otherwise, return only successors that are consistent with the links encountered on the path from start to query node.
- If the query node is annotated with links, pick up all links.
- For each successor node, only retain links that are consistent with the path taken from the start to this successor node.
- For each successor node, drop links that are no longer relevant to the successor node (i.e. links that have expired)