Flocode: Engineering Insights 🌊
Posts
#027 - Data Structures for Civil/Structural Engineers: Pandas 01

#027 - Data Structures for Civil/Structural Engineers: Pandas 01

Learn How to Streamline Your Engineering Workflow with Pandas, a Popular Python Data Science Library.

April 12, 2024

Hi everyone,

This is a longer read than usual because I want to give you a comprehensive overview of Data Structures and pandas, specifically for engineering. The use cases are broad and highly applicable to many engineering scenarios. This is a 15 minute read, it might feel pretty dense at times but I highly recommend you take the time and go through this article with an open mind, it could change how you approach your work.

This is just a small snapshot of Pandas, Flocode’s courses will be digging into the details in more granular detail. The code from the examples in this article is available here.

Data Structures for Engineers

As engineers, we spend a lot of time dealing with with complex, large or partial datasets – whether experimental results, sensor measurements, or survey data. Efficiently storing, manipulating, cleaning and analyzing this structured data is how we gain insights and make informed decisions. This is where data structures like Python's pandas come into play.

Pandas’ intuitive syntax and purposeful integration with other Python libraries allows us to tackle data problems, uncover hidden insights, and make data-driven decisions with a new level of efficiency and accuracy. It’s absolutely worthwhile learning for engineering.

Understanding Data Structures

Data structures are the fundamental building blocks that organize and store data in a computer's memory. In the context of data analysis, well-designed data structures facilitate efficient manipulation, retrieval, and processing of data.

Different data structures are optimized for various use cases and operations. One such powerful data structure is the DataFrame provided by the pandas library. DataFrames are our focus in this article.

DataFrames are the workhorse of Pandas, providing a flexible and powerful way to store and manipulate structured data.

What is Pandas?

Pandas is a very popular open-source Python library for data analysis that introduces two key data structures: ‘Series’ (1 Dimensional) and ‘DataFrame’ (2 Dimensional). While Series is ideal for working with one-dimensional labeled data, the true star is the DataFrame.

A pandas DataFrame represents a tabular, spreadsheet-style dataset with rows and columns. However, unlike spreadsheets with rigid data types, each DataFrame column can contain data of different types – integers, floats, strings, dates, and more (even other DataFrames, wait what?).

Under the hood, a DataFrame is essentially a collection of Series objects sharing the same index (row labels). This labeled data architecture, combined with vectorized operations and automatic data alignment, makes DataFrames incredibly flexible and expressive for working with heterogeneous, real-world datasets.

Think of a DataFrame as a table where each column is a Series, and all Series share the same row labels. This organized structure, along with optimized calculations (i.e. vectorizing operations) and automatic data matching, makes DataFrames incredibly flexible and easy to use.

Why Pandas?

Working with data using pandas simplifies routine tasks when working with data and provides powerful data manipulation capabilities. Here are a few key reasons why you should consider pandas:

Performance: DataFrames leverage fast NumPy arrays for data storage, enabling high speed computation when compared to traditional Python data structures like lists, tuples, sets, and dictionaries. This is related to vectorization.
Data Wrangling: With hundreds of built-in methods (special pre-built formulas), pandas excels at data cleaning, preprocessing, reshaping, merging, and transformation tasks. It’s designed to streamline typical data analysis tasks. I’ve included numerous examples in the methods section at the end of this article.
Missing Data Handling: Methods to identify, fill, or drop missing data with ease.
Data Analysis: Robust group by, resampling, window operations, and statistical functionality out-of-the-box.
Time Series Support: Optimized for working with date/time data and time-based operations. This can be a nightmare if you operate within the standard Gregorian calendar here on earth.
Data Integration: Read/write data to various flat files (CSV, Excel, SQL databases) and formats (JSON, HDF5).
Data Visualization: Easy integration with Python visualization libraries like Matplotlib, Seaborn, Plotly and a billion others.

Diving into DataFrames

Creating DataFrames from various sources

One of the key strengths of Pandas is its ability to create DataFrames from a wide range of data sources, making it an invaluable tool for engineers working with diverse data formats.

CSV files: Comma-separated value (CSV) files are a common data exchange format, and Pandas provides a straightforward way to read them into a DataFrame using the pd.read_csv() function.
Excel files: Engineering data often resides in Excel spreadsheets, and Pandas seamlessly handles this with the pd.read_excel() function, allowing engineers to import data from various sheet formats effortlessly.
SQL databases: Pandas integrates well with relational databases through the pd.read_sql() function, enabling engineers to query and retrieve data directly from SQL databases into a DataFrame. This is likely less common for most engineers but the use cases for SQL are vast and increasing. More on this in another post in the future.

Exploring DataFrames

Once a DataFrame is created, Pandas offers a rich set of tools for exploring and understanding the data. You can easily access rows and columns using intuitive indexing and slicing techniques, similar to those used in Python lists and NumPy arrays.

Accessing rows and columns: Pandas provides multiple methods for selecting specific rows and columns, such as using labels, integer positions, or boolean indexing.
Filtering and sorting data: Engineers can filter DataFrames based on complex conditions, ensuring they work with only the relevant data. Sorting data by one or more columns is also a breeze, enabling efficient data exploration and analysis.
Handling missing values: Real-world data often contains missing or incomplete values, and Pandas offers robust mechanisms for detecting, filtering, and imputing missing data, ensuring data integrity and reliability. This is very helpful when trying to clean up visualizations or remove noise from plots.

Manipulating DataFrames

Beyond data exploration, Pandas empowers engineers to transform and manipulate DataFrames in powerful ways, unlocking new possibilities for data analysis and modeling.

Adding and deleting columns: Engineers can easily add new columns to a DataFrame, either by calculating values from existing columns or by incorporating external data sources. Removing unnecessary columns is equally straightforward, streamlining the data for specific analyses.
Merging and concatenating DataFrames: Pandas excels at combining multiple DataFrames, either by merging based on common keys (similar to database joins) or by concatenating along rows or columns, enabling engineers to integrate data from diverse sources.
Applying functions and calculations: Pandas provides a rich set of built-in functions for performing complex calculations and transformations on DataFrames. Additionally, engineers can apply custom functions element-wise, row-wise, or column-wise, enabling advanced data processing and feature engineering.

I’m just providing snapshots of the capabilities here, Pandas can do so many different things.

Real-World Engineering Applications

My opinion on this topic is that pandas transforms data analysis from a chore into a powerful, insightful experience. Unlike Excel, which can become cumbersome with large datasets, pandas offers the flexibility to deeply understand your data. This is especially valuable in forensic engineering (and i mean in the broader sense of exploring data), where it enables the manipulation and analysis of complex datasets.

With pandas, you can swiftly derive informed solutions that might otherwise be deemed too complex or time-consuming, making data not just manageable but a potent tool in your analytical arsenal.

The value of fast and iterative visualizations is huge and one of my favourite aspects of using Python for engineering. I feel like I can get an intuitive feel for data/analyses much faster when I’m visually engaged.

Let’s look a at a few potential use cases.

Case Study 1: Structural Health Monitoring

Structural Health Monitoring (SHM) is a common practice in structural engineering, ensuring the safety and integrity of infrastructure such as bridges, buildings, and dams. Pandas is a useful tool for this purpose, streamlining the data analysis process and enabling engineers to manage and analyze data, helping stakeholder make informed decisions.

Collecting sensor data: SHM systems typically involve a network of sensors that continuously collect data on various structural parameters, such as strain, displacement, and vibration. These sensors generate vast amounts of data, which can be challenging to manage and analyze using traditional tools. For example, when you download a temporal log of beam/member vibrations over a 1 week period, you could have millions of datapoints. Vectorizing this data with pandas can have a big impact on the usefulness and flexibility of such data. You can actually use the data instead of watching your Excel sheet grind your work laptop to a halt. No more restarting your machine every 10 minutes!
Preprocessing and cleaning data with Pandas: Pandas offers powerful data cleaning and preprocessing capabilities, allowing engineers to handle missing values, remove outliers, and transform raw sensor data into a format suitable for analysis. With Pandas' built-in functions, engineers can efficiently clean and prepare their data, ensuring the integrity and reliability of their analyses. This is night and day more efficient than in Excel.
Analyzing trends and anomalies: Once the data is cleaned and preprocessed, Pandas provides a robust set of tools for analyzing trends, detecting anomalies, and identifying potential structural issues. Engineers can leverage Pandas' powerful data manipulation and visualization capabilities to uncover patterns, correlations, and deviations that might indicate structural degradation or potential failures.
The vast majority of Pythons visualization libraries work with Pandas.You can check out that massive list here: PyViz - All Tools.

Case Study 2: Analyzing Flow Data

This case study performs typical data analysis operations on synthetic time series data representing flow measurements over time (e.g., river flow rates). First take a look at the code below and then I’ll summarize it.

If you want to check out the raw data and play with it, it’s saved here.

Key Steps:

1. Load Data: Load 'flows.csv' into a DataFrame, parsing 'time' column as datetime.

2. Initial Exploration: Glimpse data structure (`df.head()`), check data types and missing values (`df.info()`), and get summary statistics (`df.describe()`).

3. Filtering: Filter rows with 'flow' > 35 to identify high flow rates.

4. Grouping and Aggregation: Group data by day and calculate daily mean flow.

5. Visualization: Plot flow data (using matplotlib) over time to visualize trends and patterns.

6. Resampling: Resample hourly data to daily frequency using `resample()`.

7. Missing Data Handling: Identify (`df.isnull().sum()`) and drop rows with missing values (`df.dropna()`).

8. Output Clean Data: Write cleaned DataFrame to 'cleaned_flows.csv'.

With Pandas' built-in functions, engineers can efficiently clean and prepare their data, it makes life much easier. Managing data through code, makes it easier to stay organized and sane.

Hydrology data is one of my most common use cases for Pandas. More examples on this in the future.

Case Study 3: Construction Project Management

Most of us get buried in paperwork, submittals and documentation processes during the construction of large projects. Pandas can play a pivotal role in streamlining data management and analysis for construction project managers. It’s worth thinking about how to automate mind numbingly boring aspects of your work load. Is it possible? Does it make sense for you? If so, get on with it!

Importing project data from various sources: Construction projects typically involve data from multiple sources, such as schedule updates, material inventories, cost reports, and subcontractor records. Pandas' ability to read data from various formats, including CSV, Excel, and databases, enables project managers to consolidate and integrate data from diverse sources.
Merging and aggregating data with Pandas: Once the data is imported, Pandas' powerful merging and concatenation capabilities allow project managers to combine and aggregate data from different sources and types, providing a comprehensive view of the project's status. This integrated data can be further analyzed and manipulated to identify potential bottlenecks, resource conflicts, efficiencies or budget problems.
Generating reports and visualizations: Pandas integrates with other Python libraries, such as Matplotlib and Seaborn, enabling project managers to generate reports and visualizations from their data. These visual representations can help communicate project progress, highlight critical issues, and support decision-making.
Automation and Integration: I get asked about this a lot. Automating documentation control processes. We will talk more about this in the future. Your success in this arena is all about how you structure and organize your data. You need to understand the process from start to finish in excruciating detail. Pandas can be an important cog in this machine, some others to help you include:
1. Selenium: A powerful tool used for controlling web browsers through the program. It's often used for automating testing for web applications.
2. pdfplumber: A library that makes extracting text, images, and metadata from PDFs easy.
3. smtplib: Python's built-in library for sending emails using the Simple Mail Transfer Protocol (SMTP).
4. python-docx: A Python library for creating and updating Microsoft Word (.docx) files.
5. xlwings: A Python library that allows you to automate Excel on Windows and macOS and perform tasks such as reading and writing data, formatting cells, and more.
6. pyexcel: A library that provides one API to read, manipulate and write data in various excel formats.

Case Study 4: Analyzing AISC Steel Beam Data with Pandas

The American Institute of Steel Construction (AISC) provides a database of steel beam shapes and their properties. This database is often used to select appropriate beam sections for designs.

Let's explore how Pandas can be leveraged to analyze and manipulate this data effectively.

First, we import the necessary Python libraries:

Next, we read the AISC database from an Excel file into a Pandas DataFrame, while performing several useful operations, which I will explain below.

The Problem:

Let’s say you need to find a steel beam that can withstand a specified point load applied at midspan. The beam must not only be strong enough to handle the load but also must comply with space constraints that limit its depth. Additionally, to optimize for cost and efficiency, you want the lightest beam that meets these requirements.

Load: 80 kips (point load).
Live Load Factor: 1.6 (used to factor the live load).
Beam Span: 50 feet (the length over which the load is supported).
Maximum Allowable Beam Depth: 36 inches (to fit within spatial constraints in the design).
Beam Type: W shapes (standard structural shape for steel beams in construction).

This is the output:

Required Section Modulus: 533.3333333333334 in^3The lightest suitable W shape beam that meets the requirements:   AISC_Manual_Label     d     Zx      W57           W36X150  35.9  581.0  150.0

Here's a brief summary of each pandas method used in the script and their roles in processing the dataset to find the suitable beam:

pd.read_excel():
1. Purpose: Reads data from an Excel file into a pandas DataFrame.
2. Usage: Loads the AISC shapes database from an Excel file, specifying the sheet name to ensure the correct data is accessed.
apply():
1. Purpose: Applies a function along an axis of the DataFrame.
2. Usage: Converts specified columns ('d', 'Z_x', 'W') to numeric types to ensure that calculations on these columns are accurate and do not run into type-related errors.
sort_values():
1. Purpose: Sorts the DataFrame based on the values of one or more columns.
2. Usage: Sorts the DataFrame by the 'W' column in ascending order to find the lightest beam, making it easier to select the beam with the minimum weight that meets the specified criteria.
query():
1. Purpose: Queries the DataFrame using a boolean expression.
2. Usage: Filters the DataFrame to include only those rows where the depth ('d') is less than or equal to the maximum allowable depth, the plastic section modulus ('Z_x') multiplied by the assumed material yield stress is greater than or equal to the required moment capacity, and the beam type starts with 'W' (for W shapes).
head():
1. Purpose: Returns the first n rows of the DataFrame.
2. Usage: After sorting by weight, head(1) is used to select the lightest beam that meets all the specified conditions.

These methods work together to load, clean, filter, sort, and display the data necessary for selecting an appropriate beam based on the given requirements and dimensional constraints.

These are just hypothetical operations but the point is that by leveraging Pandas' powerful data manipulation and analysis capabilities, engineers can efficiently explore and gain insights from any data source.

Advantages of using Pandas over traditional tools

While traditional tools like spreadsheets and databases have their strengths, Pandas offers several advantages for engineers:

Handling large datasets efficiently: Pandas is designed to handle large datasets with ease, leveraging efficient data structures and optimized algorithms. This capability is particularly valuable in engineering applications, where growing data volumes can quickly become overwhelming for traditional tools.
Automating repetitive tasks: With Pandas' powerful data manipulation and analysis tools, engineers can automate many repetitive tasks, such as data cleaning, transformation, and calculations. This automation not only saves time but also reduces the risk of human error, ensuring consistent and reliable results. You can read more about automation here: #018 - Automation for Civil/Structural Engineers | Selenium
Integration with other Python libraries: Pandas is part of the larger Python ecosystem, integrating with libraries like NumPy for numerical computing, Matplotlib for data visualization, and scikit-learn for machine learning. This integration enables engineers to leverage the full power of Python for end-to-end data analysis and modeling workflows.

Pandas Methods

Below I have compiled some of the most commonly used methods in pandas.

There are many more methods (currently 398!) provided by the pandas library. I had a look and I have no idea what most of them do! In reality, I use about 30 of them quite often.

These are just a few of the many methods provided by the pandas library.

“Investing time in learning pandas can pay big dividends in terms of productivity and efficiency when working with large, complex datasets.”

- Me

Closing

It’s a lot to take in, I know. But we are only scratching the surface of what’s possible with pandas, and data structures in general.

We will be digging much deeper and looking at more specific applications in the future.

By embracing pandas, you'll not only accelerate your data-driven engineering workflows but also open doors to scalable, Pythonic data analysis and automation.

Flocode is here to empower you with practical Python skills and a collaborative learning environment to progress your designs, streamline workflows and automate tasks in your day-to-day engineering.

Explore Flocode's course library and join a supportive network of engineers taking their skills to new heights.

This was a long one, thanks for your time.

I am working hard on the course curriculum. It’s coming together and being shaped by terrific feedback from some Beta participants. If you are interested in joining the Beta Program, click here.

Keep innovating and I’ll see you in the next one.

James 🌊

Useful Links

10 minutes to pandas — pandas 2.2.1 documentation

Reply

or to participate.