I have been working on a couple of Rust applications lately: boarder, a clipboard MCP, and mcp-coder-bench, a runner for claude-code that lets me benchmark a task with an MCP present or not. I use Claude Code for a lot of development and find that it creates a fair amount of duplicate code. The agent constantly writes new code, and reading existing code costs context, so we sometimes end up with the same function implemented multiple times. I wanted something to catch this quickly and easily.
Why build another tool?
There are already a number of solutions - rust-code-analysis from Mozilla, jscpd for general-purpose duplicate finding, SonarQube / SonarCloud for enterprise-grade analysis. They are all good tools and I have used most of them, but none of them really felt right in a Rust codebase. Even rust-code-analysis, which is written in Rust, isn't a cargo subcommand. I basically wanted cargo clippy but for duplication: a native tool that understands Rust's AST and can plug into CI with a single command. It didn't exist, so I built it.
What is code duplication, actually?
This depends on what you perceive as duplication. The simplest definition is exact duplicate text: when lines match exactly, the code is duplicated. Text-based tools like jscpd do this well, and it is enough when we are dealing with copy-paste duplication.
But what about different names and the same behaviour? sum_positive and count_above_zero could do the same thing, for example: a model may have written each of them for the same need, but because it is non-deterministic it produced two different functions with the same structure. I wanted to catch these even more than the copy-paste cases.
AST Normalization
Take the following two examples:
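Something along these lines (the bodies here are illustrative, but the identifier names match the walkthrough below):

```rust
fn sum_positive(values: Vec<i32>) -> i32 {
    let mut sum = 0;
    for item in values {
        if item > 0 {
            sum += item;
        }
    }
    sum
}

fn count_above_zero(numbers: Vec<i32>) -> i32 {
    let mut total = 0;
    for val in numbers {
        if val > 0 {
            total += val;
        }
    }
    total
}
```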
Looking at them, we can see that the variable names are different and so are the function names, but the pattern is the same: taking a vector and summing it. A tool that matches exact text would fail to see the similarity, as each line differs.
cargo-dupes parses each file into a Rust AST using syn. It then walks every function, method, and closure, and transforms each one into a normalized AST - a tree where the names are replaced in a way that lets the structure shine through.
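As a rough sketch of that first step (assuming syn with the full and visit features enabled; FnCollector and collect_functions are illustrative names, not the actual cargo-dupes internals):

```rust
use syn::visit::Visit;

/// Collects every free function in a parsed file; the real tool also walks
/// impl methods and closures via the other visit_* hooks.
struct FnCollector {
    functions: Vec<syn::ItemFn>,
}

impl<'ast> Visit<'ast> for FnCollector {
    fn visit_item_fn(&mut self, node: &'ast syn::ItemFn) {
        self.functions.push(node.clone());
        // Keep walking so items nested inside the function are visited too.
        syn::visit::visit_item_fn(self, node);
    }
}

fn collect_functions(source: &str) -> syn::Result<Vec<syn::ItemFn>> {
    let file = syn::parse_file(source)?;
    let mut collector = FnCollector { functions: Vec::new() };
    collector.visit_file(&file);
    Ok(collector.functions)
}
```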
Replacing identifiers with placeholders
The normalizer replaces identifiers (variable, function, type, lifetime, label) with sequentially indexed placeholders. So in the first example, when we see sum, it becomes Placeholder(Variable, 0); when we see item, it becomes Placeholder(Variable, 1).
If we do the same for the second function, we get total -> Placeholder(Variable, 0) and val -> Placeholder(Variable, 1). We are left with only structural relationships: the names are gone, and the two functions have an identical placeholder structure.
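A minimal sketch of that renaming step, assuming a simple map from names to positional indices (the Renamer type and the shared index counter are illustrative, not the exact cargo-dupes implementation):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, Hash, PartialEq, Eq)]
enum IdentKind {
    Variable,
    Function,
    Type,
    Lifetime,
    Label,
}

#[derive(Default)]
struct Renamer {
    seen: HashMap<(IdentKind, String), usize>,
}

impl Renamer {
    /// Returns the positional index for an identifier, assigning the next
    /// free index the first time a (kind, name) pair is seen.
    fn placeholder(&mut self, kind: IdentKind, name: &str) -> usize {
        let next = self.seen.len();
        *self.seen.entry((kind, name.to_owned())).or_insert(next)
    }
}

fn main() {
    let mut renamer = Renamer::default();
    assert_eq!(renamer.placeholder(IdentKind::Variable, "sum"), 0);
    assert_eq!(renamer.placeholder(IdentKind::Variable, "item"), 1);
    // Re-encountering a name reuses its index, so the structure stays stable.
    assert_eq!(renamer.placeholder(IdentKind::Variable, "sum"), 0);
}
```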
Erasing literal values
A literal has its value erased but its type preserved, so 0 and 49 both become Literal(Int). This way, two functions that use the same literal type but different constants are still considered duplicates (they could usually be replaced by a single function that takes the literal value as a parameter).
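In code, that erasure step looks roughly like this (assuming syn's Lit type on the input side; LitKind is an illustrative name for the normalized result):

```rust
#[derive(Debug, Hash, PartialEq, Eq)]
enum LitKind {
    Int,
    Float,
    Str,
    Bool,
    Char,
    Byte,
    ByteStr,
    Other,
}

/// Drops the literal's value and keeps only its kind, so `0` and `49`
/// both normalize to LitKind::Int.
fn erase_literal(lit: &syn::Lit) -> LitKind {
    match lit {
        syn::Lit::Int(_) => LitKind::Int,
        syn::Lit::Float(_) => LitKind::Float,
        syn::Lit::Str(_) => LitKind::Str,
        syn::Lit::Bool(_) => LitKind::Bool,
        syn::Lit::Char(_) => LitKind::Char,
        syn::Lit::Byte(_) => LitKind::Byte,
        syn::Lit::ByteStr(_) => LitKind::ByteStr,
        _ => LitKind::Other, // syn::Lit is non-exhaustive (e.g. verbatim literals)
    }
}
```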
Preserving structure
Operators and control flow are preserved exactly: + stays distinct from *, if/else stays distinct from match, for stays distinct from while. This is where the semantic meaning lives; changing an operator changes what the code does.
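For example, two functions that differ only in an operator stay distinct (the Binary/Add/Mul notation in the comments is just illustrative):

```rust
// After identifier renaming the placeholders line up, but the preserved
// operator keeps the two bodies structurally different:
fn double(x: i32) -> i32 { x + x } // Binary(Add, Placeholder(Variable, 0), Placeholder(Variable, 0))
fn square(x: i32) -> i32 { x * x } // Binary(Mul, Placeholder(Variable, 0), Placeholder(Variable, 0))
```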
Types as placeholders
Types use the same system: paths are replaced with positional TypePlaceholder nodes (e.g. i32 and String both become such placeholders). Single-segment types become a single placeholder, while multi-segment paths like std::io::Error become a sequence of TypePlaceholder nodes. This means fn foo(x: Foo) -> Bar and fn bar(a: Baz) -> Qux are structurally identical: each takes one parameter of some type and returns another.
Macros as opaque nodes
Macro invocations become opaque nodes. When the normalizer encounters something like println!("hello"), it doesn't try to expand the macro; it just records Opaque. This means println!("hello") and println!("world") are treated as identical (both are just Opaque), and println!("x") and eprintln!("x") are also identical. This is the trade-off of working with syn at the pre-expansion level: macros are essentially black boxes. It is definitely not perfect, but the alternative - running full macro expansion - would be much more complicated and would require pulling in something like cargo-expand.
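A small illustration of the consequence (Opaque is the node described above):

```rust
// Both bodies normalize to a single Opaque node, so these two functions are
// indistinguishable to the exact-duplicate pass, even though they print
// different text to different streams.
fn greet() { println!("hello"); }  // body: [Opaque]
fn warn()  { eprintln!("world"); } // body: [Opaque]
```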
The result
After normalization, each function becomes a tree of NormalizedNode variants - there are around 50 of them, covering the concepts present in the code (LetBinding, ForLoop, etc.). The result is an exact structural skeleton of the code.
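As an illustration, here are the kinds of variants such an enum might contain (the exact names and shapes in cargo-dupes will differ; IdentKind and LitKind are the sketch types from earlier):

```rust
#[derive(Debug, Hash, PartialEq, Eq)]
enum BinOp {
    Add,
    Mul,
    Gt,
    // ... the rest of Rust's binary operators, preserved exactly
}

#[derive(Debug, Hash, PartialEq, Eq)]
enum NormalizedNode {
    Placeholder(IdentKind, usize), // renamed identifier
    TypePlaceholder(usize),        // renamed type path segment
    Literal(LitKind),              // value erased, kind kept
    LetBinding { pat: Box<NormalizedNode>, init: Option<Box<NormalizedNode>> },
    ForLoop { pat: Box<NormalizedNode>, iter: Box<NormalizedNode>, body: Vec<NormalizedNode> },
    If { cond: Box<NormalizedNode>, then_branch: Vec<NormalizedNode>, else_branch: Option<Vec<NormalizedNode>> },
    Binary { op: BinOp, lhs: Box<NormalizedNode>, rhs: Box<NormalizedNode> },
    Opaque, // macro invocation, not expanded
    // ... and so on, roughly 50 variants in total
}
```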
Here is roughly what this looks like for the two functions above (a sketch; the exact node names in cargo-dupes may differ slightly):
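```
Fn
├─ Signature: one parameter and a return type, both reduced to TypePlaceholder nodes
├─ LetBinding(Placeholder(Variable, 0), Literal(Int))            // sum / total
├─ ForLoop(Placeholder(Variable, 1), Placeholder(Variable, 2))   // item / val over the input
│  └─ If(Binary(Gt, Placeholder(Variable, 1), Literal(Int)))
│     └─ AddAssign(Placeholder(Variable, 0), Placeholder(Variable, 1))
└─ Placeholder(Variable, 0)                                      // tail expression
```

Both sum_positive and count_above_zero reduce to the same tree.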
The names are gone and so are the values, but the structure is preserved.
From normalized tree to fingerprint
Now that we have the normalized tree, we can take the signature and the tree together to create a hash - this is our fingerprint. It’s a straightforward Hash of the NormalizedNode tree using Rust’s DefaultHasher, giving us a 64-bit value. Two functions with the same fingerprint are structurally the same after normalization.
Since the fingerprint includes both the function signature and the body, two functions with the same body but different arities or parameter structures won't be falsely grouped together. Grouping by fingerprint ends up being just a hash map lookup - we normalize each function once, hash it once, and then group.
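In sketch form, the fingerprinting and grouping step looks like this (generic here over anything that implements Hash; the hasher choice matches the DefaultHasher mentioned above):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash a normalized tree (signature + body) into a 64-bit fingerprint.
fn fingerprint<T: Hash>(tree: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    tree.hash(&mut hasher);
    hasher.finish()
}

/// Group function names by fingerprint; any bucket with more than one entry
/// is a set of exact structural duplicates.
fn group_by_fingerprint<T: Hash>(functions: Vec<(String, T)>) -> HashMap<u64, Vec<String>> {
    let mut groups: HashMap<u64, Vec<String>> = HashMap::new();
    for (name, tree) in functions {
        groups.entry(fingerprint(&tree)).or_default().push(name);
    }
    groups
}
```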
Looking back at our examples: sum_positive and count_above_zero have the same fingerprint despite having completely different function names and variable names. We found exact structural duplicates.
How to use it?
cargo-dupes is available as a cargo subcommand. After installing it, you can run it on any Rust project:
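Assuming the usual cargo subcommand conventions, that looks like:

```sh
cargo install cargo-dupes
cargo dupes   # analyze the current project for duplicate functions
```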
It tells you which functions are duplicates, where they are, and groups them by fingerprint. You can also get JSON output for machine consumption, statistics only, or use the check subcommand in CI:
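For example, in CI (using the threshold flag described below; the exact flags for JSON and stats output are in the README):

```sh
# Fail the build if exact duplication crosses the configured threshold.
cargo dupes check --max-exact-percent 5.0
```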
There’s also support for percentage-based thresholds (--max-exact-percent 5.0), excluding test code (--exclude-tests), ignoring specific fingerprints with documented reasons, and configuration through dupes.toml or Cargo.toml metadata. The full details are in the README.
What’s next?
This post covered exact duplicate detection - two functions that are structurally identical after normalization. But the more interesting cases are often near-duplicates: functions that are 90% the same but differ in one branch or one expression. These are where the real refactoring opportunities hide.
In Part 2, I’ll cover how cargo-dupes uses the Sørensen-Dice coefficient for near-duplicate detection, how parallel tree traversal works for scoring structural similarity, and dive deeper into the macro trade-off and what can be done about it.
The project is open source at mpecan/cargo-dupes. Contributions, issues, and feedback are welcome.