GeoPandas

GeoPandas automation and integration for geospatial data analysis and visualization

Category: productivity Source: K-Dense-AI/claude-scientific-skills

GeoPandas is a community skill for geospatial data analysis using the GeoPandas library, covering spatial DataFrames, geometry operations, coordinate projections, spatial joins, and choropleth map visualization for geographic data science.

What Is This?

Overview

This skill provides patterns for analyzing geographic data with a pandas interface extended for spatial operations. It covers GeoDataFrame creation from shapefiles, GeoJSON, and other spatial formats; geometry operations including buffer, union, intersection, and overlay; coordinate reference system (CRS) management with projection transformations; spatial joins that merge datasets based on geographic relationships; and choropleth and point map visualization. It lets data scientists perform spatial analysis with familiar pandas syntax while relying on Shapely for the underlying geometry operations.

Who Should Use This

This skill serves data scientists analyzing datasets with geographic components, urban planners working with zoning and infrastructure spatial data, and researchers performing spatial statistics on environmental or demographic datasets. It is also well suited for analysts who need to combine tabular business data with geographic boundaries for reporting.

Why Use It?

Problems It Solves

Standard pandas cannot handle geometry columns or perform spatial operations like intersection and buffering. Reading spatial file formats like shapefiles and GeoJSON requires specialized parsers. Joining datasets by geographic proximity rather than shared keys needs spatial index support. Creating thematic maps from data requires integrating geometry rendering with statistical coloring.

Core Highlights

GeoDataFrame extends pandas DataFrame with a geometry column for spatial operations. Geometry methods perform buffer, dissolve, overlay, and spatial filtering. CRS management handles projection transformations between coordinate systems. Plot method generates choropleth maps with built-in matplotlib integration.
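
As a minimal sketch of the overlay operation mentioned above, the snippet below intersects two polygon layers. The layer names, attribute columns, and coordinates are hypothetical and chosen only for illustration; both layers are assumed to share a metric CRS.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical layers in UTM Zone 18N (coordinates in meters, illustrative only)
habitat = gpd.GeoDataFrame(
    {"habitat": ["wetland"]},
    geometry=[box(0, 0, 100, 100)],
    crs="EPSG:32618",
)
development = gpd.GeoDataFrame(
    {"zone": ["residential"]},
    geometry=[box(50, 50, 150, 150)],
    crs="EPSG:32618",
)

# Keep only the area where both layers overlap; attributes from both sides are carried over
conflict = gpd.overlay(habitat, development, how="intersection")
conflict["area_m2"] = conflict.geometry.area
print(conflict[["habitat", "zone", "area_m2"]])
```

Other values of `how`, such as "union" or "difference", follow the same call shape.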

How to Use It?

Basic Usage

import geopandas as gpd
from shapely.geometry import Point

counties = gpd.read_file("counties.shp")
print(f"Rows: {len(counties)}")
print(f"CRS: {counties.crs}")
print(f"Columns: {list(counties.columns)}")

points = gpd.GeoDataFrame({
    "name": ["Office", "Warehouse"],
    "value": [100, 200]},
    geometry=[Point(-74.006, 40.713),
              Point(-73.935, 40.730)],
    crs="EPSG:4326")

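# Project to UTM Zone 18N first so the buffer distance is in meters;
# buffering in EPSG:4326 would interpret the distance as degrees.
projected = points.to_crs(epsg=32618)
print(f"Projected CRS: {projected.crs}")

buffered = projected.buffer(1000)  # 1 km radius around each point
print(f"Buffer area: {buffered.area.sum():.0f} m^2")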

Real-World Examples

import geopandas as gpd

class SpatialAnalyzer:
    def __init__(self, regions_path: str):
        self.regions = gpd.read_file(regions_path)

    def spatial_join(self,
                     points: gpd.GeoDataFrame
                     ) -> gpd.GeoDataFrame:
        if points.crs != self.regions.crs:
            points = points.to_crs(self.regions.crs)
        return gpd.sjoin(points, self.regions,
                         how="left",
                         predicate="within")

    def aggregate_by_region(
            self,
            points: gpd.GeoDataFrame,
            value_col: str,
            region_col: str) -> gpd.GeoDataFrame:
        joined = self.spatial_join(points)
        agg = joined.groupby(region_col).agg(
            count=(value_col, "size"),
            total=(value_col, "sum"),
            mean=(value_col, "mean")
        ).reset_index()
        return self.regions.merge(
            agg, on=region_col, how="left")

    def find_within(self, geometry,
                     target: gpd.GeoDataFrame
                     ) -> gpd.GeoDataFrame:
        return target[
            target.geometry.within(geometry)]

# `points` is the GeoDataFrame built in Basic Usage
analyzer = SpatialAnalyzer("districts.geojson")
result = analyzer.spatial_join(points)
print(f"Joined: {len(result)} rows")
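
The class above needs a file on disk. As a self-contained sketch of the same join, group, and merge steps, the snippet below builds two small layers in memory and plots the result as a choropleth. The district names, boxes, and values are made up for illustration, and the Agg backend is an assumption so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)

import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical regions and points, both in EPSG:4326
regions = gpd.GeoDataFrame(
    {"district": ["west", "east"]},
    geometry=[box(-74.1, 40.6, -74.0, 40.8), box(-74.0, 40.6, -73.9, 40.8)],
    crs="EPSG:4326",
)
sites = gpd.GeoDataFrame(
    {"value": [100, 200, 50]},
    geometry=[
        Point(-74.006, 40.713),
        Point(-73.935, 40.730),
        Point(-73.950, 40.700),
    ],
    crs="EPSG:4326",
)

# Same steps as aggregate_by_region: join points to polygons, group, merge back
joined = gpd.sjoin(sites, regions, how="left", predicate="within")
agg = joined.groupby("district").agg(
    count=("value", "size"),
    total=("value", "sum"),
).reset_index()
summary = regions.merge(agg, on="district", how="left")

# Color each district by its aggregated total
ax = summary.plot(column="total", legend=True, edgecolor="black")
ax.set_title("Total value by district")
```

Calling `ax.figure.savefig(...)` afterwards would write the map to disk.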

Advanced Tips

Project to a local metric CRS before computing areas or distances so that results are in meters rather than degrees. For example, use EPSG:32618 (UTM Zone 18N) when working with northeastern US data. Use the spatial index exposed through sindex for efficient spatial queries on large GeoDataFrames. Use dissolve to merge detailed polygons into larger regions by an attribute column, for example census tracts into counties, and pass an aggregate function to summarize the remaining columns at the same time.
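
The dissolve and sindex tips can be sketched as follows. The tract layout, county labels, and population figures are hypothetical, and the unit-square coordinates are only meant to keep the arithmetic easy to follow.

```python
import geopandas as gpd
from shapely.geometry import box

# Four hypothetical unit-square "tracts" in a metric CRS, two per county
tracts = gpd.GeoDataFrame(
    {
        "county": ["A", "A", "B", "B"],
        "population": [10, 20, 30, 40],
    },
    geometry=[
        box(0, 0, 1, 1),
        box(1, 0, 2, 1),
        box(0, 1, 1, 2),
        box(1, 1, 2, 2),
    ],
    crs="EPSG:32618",
)

# Merge tract polygons per county and sum the numeric column
counties = tracts.dissolve(by="county", aggfunc="sum")
print(counties[["population"]])

# Use the spatial index to find tracts intersecting a query window;
# the result holds positional indices into `tracts`
window = box(1.5, 0.5, 1.8, 0.8)
hits = tracts.sindex.query(window, predicate="intersects")
print(tracts.iloc[hits])
```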

When to Use It?

Use Cases

Build a demographic analysis tool that joins census data to geographic boundaries for choropleth visualization. Create a site selection model that scores locations based on proximity to amenities such as transit stops or retail centers. Implement an environmental analysis pipeline that overlays habitat regions with development zones.
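
For the site selection case, one possible approach is a nearest-neighbor join that attaches the distance to the closest amenity. The sketch below assumes both layers are already in a metric CRS; the site names, stop names, coordinates, and the scoring formula are illustrative assumptions, not part of the skill.

```python
import geopandas as gpd
from shapely.geometry import Point

crs = "EPSG:32618"  # metric CRS; coordinates below are arbitrary values in meters

candidates = gpd.GeoDataFrame(
    {"site": ["S1", "S2"]},
    geometry=[Point(0, 0), Point(1000, 0)],
    crs=crs,
)
stops = gpd.GeoDataFrame(
    {"stop": ["T1", "T2"]},
    geometry=[Point(0, 300), Point(1000, 50)],
    crs=crs,
)

# Attach the nearest stop to each candidate, with the distance in meters
scored = gpd.sjoin_nearest(candidates, stops, how="left", distance_col="dist_m")

# Hypothetical score: closer stops give values nearer to 1
scored["score"] = 1.0 / (1.0 + scored["dist_m"] / 100.0)
print(scored[["site", "stop", "dist_m", "score"]])
```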

Related Topics

Geospatial data science, spatial joins, choropleth mapping, coordinate reference systems, and shapefile processing.

Important Notes

Requirements

Python with geopandas and shapely installed, plus an I/O engine such as pyogrio or fiona. Both engines read spatial file formats through the GDAL library. Matplotlib for built-in plotting. pyproj for coordinate reference system transformations between geographic and projected systems.

Usage Recommendations

Do: always check and align CRS before spatial operations between datasets. Use projected coordinate systems for metric calculations. Check geometries with is_valid before operations to avoid topology errors. Use dissolve to merge adjacent polygons that share attribute values.
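
A minimal sketch of the validity check, using a deliberately self-intersecting polygon. The parcel data is hypothetical, and repairing with make_valid is one option among several; it may change the geometry type, for example from Polygon to MultiPolygon.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A self-intersecting "bowtie" polygon next to a valid square (illustrative data)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
square = Polygon([(3, 0), (5, 0), (5, 2), (3, 2)])
parcels = gpd.GeoDataFrame(
    {"id": [1, 2]},
    geometry=[bowtie, square],
    crs="EPSG:32618",
)

# Flag rows whose geometry is not topologically valid
print(parcels[~parcels.is_valid])

# Repair the geometries and swap the repaired column in
repaired = parcels.set_geometry(parcels.geometry.make_valid())
print(repaired.is_valid.all())
```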

Don't: perform area or distance calculations in geographic coordinates (EPSG:4326), which gives results in degrees or square degrees. Ignore CRS mismatches between datasets in spatial joins. Load very large shapefiles entirely into memory when filtering with the bbox parameter of read_file can reduce the loaded dataset. Skip geometry validity checks, since invalid geometries can cause silent failures in overlay operations.

Limitations

GeoDataFrames are held in memory, so large spatial datasets can consume significant memory; vector datasets with millions of features may be better served by a spatial database such as PostGIS. Complex geometry operations like overlay can be slow on detailed polygon boundaries with many vertices. Interactive map rendering requires additional libraries such as folium beyond the built-in matplotlib plotting.