I've released cdbscan, a C library implementing the DBSCAN clustering algorithm from the original KDD-96 paper: https://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf
The project is hosted on GitHub: https://github.com/charmitro/cdbscan
Why Another DBSCAN Implementation?
I created cdbscan while studying density-based clustering algorithms. I needed a lightweight C implementation that:
- Provides a clean C API without external dependencies
- Achieves O(n log n) performance through KD-tree acceleration
- Supports multiple distance metrics out of the box
Getting Started
Installation is straightforward with make:
```
git clone https://github.com/charmitro/cdbscan
cd cdbscan
make
sudo make install
```
Then use it in your C programs:
```c
#include <cdbscan.h>

// Create points
cdbscan_point_t *points = cdbscan_create_points(100, 2);
// ... fill in point coordinates ...

// Set parameters
cdbscan_params_t params = {
    .eps = 0.5,
    .min_pts = 5,
    .dist_type = CDBSCAN_DIST_EUCLIDEAN,
    .use_kdtree = 1  // Enable O(n log n) performance
};

// Cluster
int num_clusters = cdbscan_cluster(points, 100, params);
```
Key Features
Multiple Distance Metrics: Choose from Euclidean, Manhattan, Minkowski, Cosine, or provide your own:
```c
// Use Manhattan distance
params.dist_type = CDBSCAN_DIST_MANHATTAN;

// Or provide a custom distance function
params.dist_type = CDBSCAN_DIST_CUSTOM;
params.custom_dist = my_distance_func;
```
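As an illustration, a custom metric might look like the sketch below. The callback signature shown is an assumption on my part; check `cdbscan.h` for the exact prototype expected by `custom_dist`:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical custom metric: Chebyshev (L-infinity) distance.
 * The (a, b, dims) signature is assumed, not taken from cdbscan.h. */
double chebyshev_dist(const double *a, const double *b, size_t dims) {
    double max = 0.0;
    for (size_t i = 0; i < dims; i++) {
        double d = fabs(a[i] - b[i]);
        if (d > max)
            max = d;
    }
    return max;
}
```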
KD-tree Acceleration: Enable O(n log n) performance for large datasets:
```c
params.use_kdtree = 1;  // Consistent speedup across dataset sizes
```
Data Normalization: Built-in preprocessing functions:
```c
// Min-max normalization
cdbscan_normalize_minmax(points, num_points);

// Z-score normalization
cdbscan_normalize_zscore(points, num_points);
```
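For reference, min-max normalization rescales each coordinate into [0, 1] via x' = (x - min) / (max - min). A minimal sketch over a plain array (not the library's `cdbscan_point_t` layout, which isn't assumed here):

```c
#include <stddef.h>

/* Min-max normalization over one coordinate axis: rescales values
 * into [0, 1]. Illustrative only; cdbscan_normalize_minmax works on
 * cdbscan_point_t, whose layout is not assumed here. */
void minmax_normalize(double *v, size_t n) {
    if (n == 0)
        return;
    double min = v[0], max = v[0];
    for (size_t i = 1; i < n; i++) {
        if (v[i] < min) min = v[i];
        if (v[i] > max) max = v[i];
    }
    double range = max - min;
    if (range == 0.0)
        return;  /* all values identical; nothing to rescale */
    for (size_t i = 0; i < n; i++)
        v[i] = (v[i] - min) / range;
}
```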
Parameter Estimation: Find a good eps value using k-distance graphs:
```c
cdbscan_kdist_result_t *result = cdbscan_estimate_eps(points, num_points, 4);
printf("Suggested eps: %f\n", result->suggested_eps);
```
Testing: The library includes tests that verify it follows the paper's definitions:
```
$ make test
Running specification tests...
==============================
Test: Core Point Identification
[PASS] Core point test PASSED
Test: Density Reachability
[PASS] Density reachability test PASSED
[SUCCESS] All specification tests passed!
```
How It Works
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse regions. Unlike k-means, it:
- Discovers clusters of arbitrary shape
- Identifies outliers as noise
- Doesn't require specifying the number of clusters upfront
The algorithm works by classifying points as:
- Core points: have at least MinPts neighbors within the eps radius
- Border points: within eps of a core point but not core themselves
- Noise: neither core nor border
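The core-point test from the paper's definitions can be sketched with brute-force neighbor counting (the names here are illustrative, not cdbscan's internals):

```c
#include <stddef.h>

typedef struct { double x, y; } pt;

/* Squared Euclidean distance; comparing against eps^2 avoids sqrt. */
static double dist2(pt a, pt b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

/* A point is core if at least min_pts points lie within eps of it.
 * As in the KDD-96 definition, the eps-neighborhood includes the
 * point itself. */
int is_core(const pt *pts, size_t n, size_t i, double eps, size_t min_pts) {
    size_t count = 0;
    for (size_t j = 0; j < n; j++)
        if (dist2(pts[i], pts[j]) <= eps * eps)
            count++;
    return count >= min_pts;
}
```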
The KD-tree implementation uses median selection to build balanced trees, giving O(n log n) average-case performance versus O(n²) for the brute-force approach.
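A simplified sketch of median-split construction for 2-D points is below. It sorts each level with `qsort` (so the build is O(n log² n); a linear-time median selection brings it down to O(n log n)); the node layout is illustrative, not cdbscan's:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct kdnode {
    double p[2];
    struct kdnode *left, *right;
} kdnode;

static int g_axis;  /* split axis used by cmp_pts (not thread-safe) */

static int cmp_pts(const void *a, const void *b) {
    const double *pa = a, *pb = b;
    double d = pa[g_axis] - pb[g_axis];
    return (d > 0) - (d < 0);
}

/* Build a balanced KD-tree: sort on the current axis, put the median
 * at the node, and recurse on the two halves, alternating axes. */
kdnode *kd_build(double (*pts)[2], size_t n, int depth) {
    if (n == 0)
        return NULL;
    g_axis = depth % 2;
    qsort(pts, n, sizeof pts[0], cmp_pts);
    size_t mid = n / 2;
    kdnode *node = malloc(sizeof *node);
    node->p[0] = pts[mid][0];
    node->p[1] = pts[mid][1];
    node->left  = kd_build(pts, mid, depth + 1);
    node->right = kd_build(pts + mid + 1, n - mid - 1, depth + 1);
    return node;
}
```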
Use Cases
Use cases include:
- Geographic data clustering: Finding hotspots in location data
- Anomaly detection: Identifying outliers in sensor readings
- Image segmentation: Grouping similar pixels
- Customer segmentation: Finding natural groupings in behavioral data
Performance
Performance features:
- Zero external dependencies
- Optional KD-tree acceleration
- Memory-efficient point representation
- Thread-safe when processing multiple datasets in parallel
Benchmarks show consistent speedup with the KD-tree enabled:
- 1,000 points: 1.89x speedup
- 5,000 points: 1.68x speedup
- 10,000 points: 1.37x speedup
- 20,000 points: 1.24x speedup
Contributing
Contributions are welcome. Some ideas:
- R-tree implementation for better performance
- Python/Rust bindings
- Additional distance metrics
- GPU acceleration support
Submit pull requests to: https://github.com/charmitro/cdbscan
GPL-3.0 licensed.