On link-informed graph traversal¶
The all-simple-paths algorithm as implemented in nx.all_simple_paths() inside the
networkx (abbreviated below as nx) package version 2.2 uses a
depth-first traversal scheme to find all possible paths from a start node to one or more
end nodes [1].
For example, let nodes A-F represent unitigs in a De Bruijn graph created from sequencing
reads of transcripts:
An all-simple-paths traversal starting at A will return the paths ABDE, ABDF, ACDE,
and ACDF. However, what if the sequenced reads that were used to create this graph only
originated from two paths: ABDE and
ACDF? Can some of these sequencing reads be used to restrict the paths returned by the
all-simple-paths algorithm?
Mccortex provides a data structure called “links” for annotating De Bruijn graphs in Cortex
format. In the example above, links can be used to store information on a read that covers both the
A –> B and D –> E junctions. Cortexpy can use these links to performed a
“link-informed” (that is, a restricted) all-simple-paths traversal.
| [1] | Cortexpy currently uses a copy of nx.all_simple_paths()
named _all_simple_paths_graph(). |
Link-informed graph traversal in cortexpy¶
Cortexpy uses networkx algorithms¶
Cortexpy represents Cortex graphs as nx.DiGraph objects [2]. This
allows the easy application of networkx algorithms to Cortex graphs.
cortexpy achieves link-informed traversal by wrapping a Cortex graph in a
LinkedGraphTraverser object, which modifies the behavior of the
__getitem__() method. To understand why this works, let us first take a look at the
all-simple-paths algorithm:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | def _all_simple_paths_graph(G, source, targets, cutoff):
"""From networkx.algorithms.simple_paths"""
visited = collections.OrderedDict.fromkeys([source])
stack = [iter(G[source])]
while stack:
children = stack[-1]
child = next(children, None)
if child is None:
stack.pop()
visited.popitem()
elif len(visited) < cutoff:
if child in visited:
continue
if child in targets:
yield list(visited) + [child]
visited[child] = None
if targets - set(visited.keys()): # expand stack until find all targets
stack.append(iter(G[child]))
else:
visited.popitem() # maybe other ways to child
else: # len(visited) == cutoff:
for target in (targets & (set(children) | {child})) - set(visited.keys()):
yield list(visited) + [target]
stack.pop()
visited.popitem()
|
The key line here is the highlighted line 18. This is the line that appends an iterator of a node’s
successors to the stack of nodes to visit. The algorithm asks the graph object G
for the successor nodes of child by calling the
__getitem__() method of G:
G[child]
This means that we can restrict the paths returned by _all_simple_paths_graph() by
restricting the successor nodes returned by G.
| [2] | The implementation is not perfect and could use some improvement. |
LinkedGraphTraverser restricts simple paths using links¶
The __getitem__() method of
LinkedGraphTraverser restricts the returned
successors using the following rules:
- If no link information exists for the query node, then return all successors.
- Otherwise, return only successors that are consistent with the links encountered on the path from start to query node.
- If the query node is annotated with links, pick up all links.
- For each successor node, only retain links that are consistent with the path taken from the start to this successor node.
- For each successor node, drop links that are no longer relevant to the successor node (i.e. links that have expired)