Charalampos Mitrodimas

DBSCAN clustering in C

I've released cdbscan, a C library implementing the DBSCAN clustering algorithm from the original KDD-96 paper: https://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf

The project is hosted on GitHub: https://github.com/charmitro/cdbscan

Why Another DBSCAN Implementation?

I created cdbscan while studying density-based clustering algorithms. I needed a lightweight, dependency-free C implementation that:

  • Provides a clean C API without external dependencies
  • Achieves O(n log n) performance through KD-tree acceleration
  • Supports multiple distance metrics out of the box

Getting Started

Installation is straightforward with make:

git clone https://github.com/charmitro/cdbscan
cd cdbscan
make
sudo make install

Then use it in your C programs:

#include <cdbscan.h>

// Create points
cdbscan_point_t *points = cdbscan_create_points(100, 2);
// ... fill in point coordinates ...

// Set parameters
cdbscan_params_t params = {
    .eps = 0.5,
    .min_pts = 5,
    .dist_type = CDBSCAN_DIST_EUCLIDEAN,
    .use_kdtree = 1  // Enable O(n log n) performance
};

// Cluster
int num_clusters = cdbscan_cluster(points, 100, params);

Key Features

Multiple Distance Metrics: Choose from Euclidean, Manhattan, Minkowski, Cosine, or provide your own:

// Use Manhattan distance
params.dist_type = CDBSCAN_DIST_MANHATTAN;

// Or provide custom distance function
params.dist_type = CDBSCAN_DIST_CUSTOM;
params.custom_dist = my_distance_func;

KD-tree Acceleration: Enable O(n log n) performance for large datasets:

params.use_kdtree = 1;  // Consistent speedup across dataset sizes

Data Normalization: Built-in preprocessing functions:

// Min-max normalization
cdbscan_normalize_minmax(points, num_points);

// Z-score normalization
cdbscan_normalize_zscore(points, num_points);

Parameter Estimation: Find a good eps value using k-distance graphs:

cdbscan_kdist_result_t *result = cdbscan_estimate_eps(points, num_points, 4);
printf("Suggested eps: %f\n", result->suggested_eps);

Testing: The library includes tests that verify it follows the paper's definitions:

make test

Running specification tests...
==============================
Test: Core Point Identification
[PASS] Core point test PASSED

Test: Density Reachability
[PASS] Density reachability test PASSED

[SUCCESS] All specification tests passed!

How It Works

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters as dense regions separated by sparse regions. Unlike k-means, it:

  • Discovers clusters of arbitrary shape
  • Identifies outliers as noise
  • Doesn't require specifying the number of clusters upfront

The algorithm works by classifying points as:

  • Core points: have at least MinPts neighbors within eps radius
  • Border points: within eps of a core point but not core themselves
  • Noise: neither core nor border

The KD-tree implementation uses median selection for balanced trees, providing O(n log n) average case performance compared to O(n²) for the brute force approach.

Use Cases

Use cases include:

  • Geographic data clustering: Finding hotspots in location data
  • Anomaly detection: Identifying outliers in sensor readings
  • Image segmentation: Grouping similar pixels
  • Customer segmentation: Finding natural groupings in behavioral data

Performance

Performance features:

  • Zero external dependencies
  • Optional KD-tree acceleration
  • Memory-efficient point representation
  • Thread-safe for parallel processing multiple datasets

Benchmarks show consistent speedup with KD-tree:

  • 1,000 points: 1.89x speedup
  • 5,000 points: 1.68x speedup
  • 10,000 points: 1.37x speedup
  • 20,000 points: 1.24x speedup

Contributing

Contributions are welcome. Some ideas:

  • R-tree implementation for better performance
  • Python/Rust bindings
  • Additional distance metrics
  • GPU acceleration support

Submit pull requests to: https://github.com/charmitro/cdbscan

GPL-3.0 licensed.

Thoughts? Leave a comment.