Beginner · 30 min · Mac & PC

Essential File Formats for DH

Choose the right file formats for your digital humanities (DH) projects, with a focus on formats that Python can read directly for analysis.

What You'll Learn:

  • Understand text formats for DH research and Python analysis
  • Choose appropriate data formats (CSV, JSON, XML) for different projects
  • Handle encoding issues and character set problems
  • Convert between formats effectively for cross-platform compatibility

Understanding file formats is crucial for DH work, especially when you’ll be using Python for text analysis, data visualization, and mapping. This guide focuses on the formats you’ll encounter most often and how to work with them effectively.

Why File Formats Matter in DH

Knowledge Check: Format Problems

You're trying to analyze 500 Victorian novels in Python, but your data is stored in Microsoft Word documents. What's the main problem?

Common Problems from Wrong Formats

  • Data that won’t open in Python: Binary formats, proprietary formats
  • Text with garbled characters: Encoding mismatches, special characters
  • Images too large for web display: Uncompressed formats, wrong resolution
  • Files that colleagues can’t open: Platform-specific formats, missing software
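The first problem in the list can often be spotted before any analysis starts. The sketch below is a rough heuristic, not a full format detector: it checks whether the start of a file decodes as UTF-8 text. The function name `looks_like_utf8_text` and the file names are made up for illustration, and the fake `.docx` relies on the fact that Word's .docx format is a ZIP archive.

```python
import codecs
import tempfile
import zipfile
from pathlib import Path


def looks_like_utf8_text(path, sample_size=4096):
    """Return True if the first bytes of the file decode cleanly as UTF-8.

    Heuristic only: binary formats usually contain NUL bytes or byte
    sequences that are invalid in UTF-8 near the start of the file.
    """
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    if b"\x00" in sample:
        return False
    # An incremental decoder tolerates a multi-byte character that was
    # cut in half at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Demonstration with two throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    text_path = Path(tmp) / "novel.txt"
    text_path.write_text("A quiet café in Paris.", encoding="utf-8")

    docx_path = Path(tmp) / "novel.docx"
    with zipfile.ZipFile(docx_path, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")

    text_ok = looks_like_utf8_text(text_path)
    docx_ok = looks_like_utf8_text(docx_path)

print(text_ok, docx_ok)
```

A file that fails this check probably needs converting to plain text before Python's text tools can work with it.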

Text Formats for DH Research

Plain Text (.txt) - The Foundation

📄 Plain Text

Best for: Research notes, cleaned text data, Python scripts

✅ Advantages:
  • Universal compatibility
  • Small file size
  • Version control friendly
  • Python reads easily
❌ Limitations:
  • No formatting
  • No images
  • No structure beyond paragraphs
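To illustrate "Python reads easily", here is a minimal sketch that counts words in a string of plain text using only the standard library. The helper name `word_counts` is an assumption for this example, and the tokenizing rule (runs of letters, with internal apostrophes allowed) is one simple choice among many.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count runs of letters (apostrophes allowed)."""
    words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())
    return Counter(words)


sample = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
counts = word_counts(sample)
print(counts.most_common(3))
```

With a real file, the text would typically come from something like `Path("novel.txt").read_text(encoding="utf-8")`, where `novel.txt` is a placeholder name.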

Markdown (.md) - Structured Text

Perfect for: Documentation, research notes with basic formatting

📝 Markdown Example:

```markdown
# Research Notes: Victorian Novels

## Data Sources

- **Project Gutenberg**: Public domain texts
- **HathiTrust**: Digitized books
- *Last updated: 2024-09-26*

### Analysis Todo

1. Clean OCR errors
2. Extract character names
3. Run sentiment analysis
```

Tools: VS Code (excellent support), Typora, any text editor

Data Formats for Analysis

CSV (Comma-Separated Values) - Your Best Friend

🧮 Interactive: Build a CSV

You're cataloging a collection of medieval manuscripts. Let's build a proper CSV structure:

Column headers (first row):
manuscript_id,title,date_created,language
Sample data row:
MS001,Book of Kells,800,Latin

Why CSV is Perfect for DH:

  • Python loves it: pandas.read_csv()
  • Spreadsheet compatible: Excel, Google Sheets
  • Human readable: Can edit in any text editor
  • Version control: Git tracks changes line by line
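The manuscript catalog above can also be written and read back with Python's built-in `csv` module, which is handy when pandas is not installed. This is a sketch only: the second row is a made-up entry, included to show that a field containing a comma gets quoted automatically.

```python
import csv
import io

fieldnames = ["manuscript_id", "title", "date_created", "language"]
rows = [
    {"manuscript_id": "MS001", "title": "Book of Kells",
     "date_created": "800", "language": "Latin"},
    # Hypothetical entry: the comma in the title forces quoting.
    {"manuscript_id": "MS002", "title": "Psalter, hypothetical entry",
     "date_created": "1200", "language": "Latin"},
]

# Write to an in-memory buffer so the example needs no files on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back: every value comes back as a string.
loaded = list(csv.DictReader(io.StringIO(csv_text)))
print(loaded[1]["title"])
```

When writing to a real file instead of a buffer, open it with `newline=""` and `encoding="utf-8"` so line endings and accented characters are handled consistently.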

JSON (JavaScript Object Notation) - Structured Data

📊 Same Data, Different Formats:

CSV Format:
title,author,year,genre
"Pride and Prejudice","Jane Austen",1813,"Romance"
"Frankenstein","Mary Shelley",1818,"Gothic"
JSON Format:
{
  "novels": [
    {
      "title": "Pride and Prejudice",
      "author": "Jane Austen", 
      "year": 1813,
      "genre": "Romance"
    },
    {
      "title": "Frankenstein",
      "author": "Mary Shelley",
      "year": 1818, 
      "genre": "Gothic"
    }
  ]
}

When to use JSON:

  • Nested data: Books with multiple authors, chapters, etc.
  • API data: Many digital libraries provide JSON
  • Configuration files: Settings for your Python projects
  • Web applications: JavaScript can read JSON directly
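The "nested data" case is where JSON earns its place. The sketch below uses an invented record (all titles and names are placeholders) with a list of authors and a list of chapters, which would be awkward to fit into a single CSV row.

```python
import json

# Hypothetical record: one book, several authors, several chapters.
record = {
    "title": "Example Anthology",
    "authors": ["First Author", "Second Author"],
    "year": 1850,
    "chapters": [
        {"number": 1, "title": "Arrival at the Café"},
        {"number": 2, "title": "The Letter"},
    ],
}

# ensure_ascii=False keeps accented characters readable in the output.
text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)

# Loading the text gives back ordinary Python dicts and lists.
restored = json.loads(text)
second_author = restored["authors"][1]
chapter_titles = [chapter["title"] for chapter in restored["chapters"]]
print(second_author, chapter_titles)
```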

Encoding and Character Issues

The UTF-8 Standard

Encoding Detective

You open a text file containing French poetry and see: "caf√©" instead of "café". What's the problem?

Python Encoding Solutions

🐍 Python Encoding Best Practices:

```python
# Always specify encoding when opening files
with open('french_poetry.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Check encoding if you're not sure
import chardet

with open('mystery_file.txt', 'rb') as file:
    raw_data = file.read()

encoding = chardet.detect(raw_data)
print(f"Detected encoding: {encoding['encoding']}")

# Convert if necessary
with open('mystery_file.txt', 'r', encoding=encoding['encoding']) as file:
    text = file.read()

# Save as UTF-8
with open('clean_file.txt', 'w', encoding='utf-8') as file:
    file.write(text)
```
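The garbled "caf√©" from the Encoding Detective question can be reproduced in a few lines, which may help make the cause concrete: the file's bytes are UTF-8, but something read them using a different encoding. The sketch below uses Python's built-in `mac_roman` and `cp1252` codecs; the variable names are just for illustration.

```python
original = "café"

# UTF-8 stores "é" as two bytes. Reading those bytes with the wrong
# encoding turns them into two unrelated characters.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("mac_roman")
print(garbled)

# The same bytes misread as Windows-1252 look different again.
windows_view = utf8_bytes.decode("cp1252")
print(windows_view)

# If the wrong encoding is known, reversing the mistake restores the text.
repaired = garbled.encode("mac_roman").decode("utf-8")
print(repaired)
```

The repair step only works when you know (or correctly guess) which wrong encoding was applied, so it is better to open the file with `encoding='utf-8'` in the first place.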

Practical Exercise: Format Decision Tree

🌳 Choose the Right Format

For each scenario, select the best file format:

Scenario 1: Research Notes

You're taking notes while reading 50 academic articles about Victorian literature. You want to include quotes, citations, and your own analysis.

Scenario 2: Metadata Collection

You're cataloging 200 historical photographs with information like date, location, people, and keywords for each image.

Scenario 3: Text Analysis Data

You have 1000 poems with complex metadata including multiple authors, publication info, themes, and full text for Python analysis.

Format Conversion Tools

Command Line Conversion

🖥️ Practice Format Conversion

Try these conversion commands:

pandoc notes.md -o notes.pdf ← Convert Markdown to PDF (needs a PDF engine such as LaTeX installed)
in2csv --sheet "Sheet1" in.xlsx > out.csv ← Excel to CSV (in2csv is part of the csvkit package)
jq '.' data.json ← Pretty-print JSON

Python Conversion

🐍 Convert Formats with Python:

```python
import pandas as pd
import json

# Excel to CSV (reading .xlsx files also requires the openpyxl package)
df = pd.read_excel('research_data.xlsx')
df.to_csv('research_data.csv', index=False)

# CSV to JSON (one object per row)
df = pd.read_csv('research_data.csv')
df.to_json('research_data.json', orient='records', indent=2)

# JSON to CSV
with open('research_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
df = pd.DataFrame(data)
df.to_csv('converted_data.csv', index=False)
```
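JSON files often wrap their records in an outer object, like the `"novels"` key in the example earlier in this guide. In that case the list has to be pulled out before it can become a table. The following is a standard-library sketch of that JSON to CSV step, assuming every record has the same four flat fields.

```python
import csv
import io
import json

json_text = """
{
  "novels": [
    {"title": "Pride and Prejudice", "author": "Jane Austen",
     "year": 1813, "genre": "Romance"},
    {"title": "Frankenstein", "author": "Mary Shelley",
     "year": 1818, "genre": "Gothic"}
  ]
}
"""

data = json.loads(json_text)
records = data["novels"]  # unwrap the list of row-like objects

# Write the records as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["title", "author", "year", "genre"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

Records with nested lists or objects need a decision about how to flatten them (for example, joining multiple authors with a semicolon) before they fit into CSV columns.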

Real-World DH Workflow

📚 Case Study: Digital Edition Project

1. Original Documents
  • Format: Scanned PDFs
  • Challenge: Not machine-readable

2. OCR Processing
  • Output: Plain text (.txt)
  • Issue: OCR errors, inconsistent formatting

3. Manual Correction
  • Format: Markdown (.md)
  • Benefit: Structure + human readability

4. Analysis Preparation
  • Format: UTF-8 Plain Text
  • Ready for: Python analysis, NLP tools

5. Publication
  • Formats: HTML (web), PDF (print)
  • Generated from: Markdown source
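Part of the cleanup between OCR output and analysis-ready text can be automated. The sketch below shows a few common normalization steps; the function name `clean_ocr_text` and the specific rules are assumptions for illustration, not the workflow of any particular project, and real OCR output usually still needs manual checking.

```python
import re
import unicodedata


def clean_ocr_text(raw):
    """Apply simple, conservative cleanup rules to OCR output."""
    # Use one consistent Unicode form for accented characters.
    text = unicodedata.normalize("NFC", raw)
    # Use one consistent line ending.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop spaces and tabs at the ends of lines.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Rejoin words split across lines: "manu-\nscript" -> "manuscript".
    # Caveat: this also merges genuine compounds such as "well-\nknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single line break inside a paragraph becomes a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Three or more line breaks become one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


raw_page = "The  manu-\nscript was\ncopied.\n\n\nNext para-\ngraph."
cleaned = clean_ocr_text(raw_page)
print(cleaned)
```

The cleaned string can then be saved with `encoding='utf-8'` as the analysis-ready file for step 4.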

Troubleshooting Common Issues

Problem 1: “File won’t open in Python”

🚨 Error Messages:
  • UnicodeDecodeError
  • PermissionError
  • pandas.errors.EmptyDataError
✅ Solutions:
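```python
# Try different encodings, strictest first. latin-1 goes last because it
# accepts any byte sequence and therefore never raises UnicodeDecodeError.
encodings = ['utf-8', 'cp1252', 'latin-1']
for encoding in encodings:
    try:
        with open('problematic_file.txt', 'r', encoding=encoding) as f:
            text = f.read()
        # "Success" only means the bytes decoded; still check the text by eye.
        print(f"Success with {encoding}")
        break
    except UnicodeDecodeError:
        print(f"Failed with {encoding}")
```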
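The other two errors in the list are not about encoding. `PermissionError` concerns access to the file, and `pandas.errors.EmptyDataError` is raised when `read_csv` finds nothing to parse. A quick pre-flight check can narrow down the cause before loading; the sketch below uses only the standard library, and `diagnose_file` is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path


def diagnose_file(path):
    """Return a short description of the most likely loading problem."""
    path = Path(path)
    if not path.exists():
        return "missing: check the path and the working directory"
    if not path.is_file():
        return "not a file: the path points to a folder"
    if not os.access(path, os.R_OK):
        return "permission denied: check the file's access rights"
    if path.stat().st_size == 0:
        return "empty: the file contains no data"
    return "ok: the file exists, is readable, and is not empty"


# Demonstration with throwaway files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    empty_path = Path(tmp) / "empty.csv"
    empty_path.touch()
    good_path = Path(tmp) / "notes.txt"
    good_path.write_text("Some notes", encoding="utf-8")

    results = {
        "missing": diagnose_file(Path(tmp) / "no_such_file.txt"),
        "folder": diagnose_file(tmp),
        "empty": diagnose_file(empty_path),
        "good": diagnose_file(good_path),
    }

for name, message in results.items():
    print(f"{name}: {message}")
```

This check cannot catch every case. For example, a file that another program has locked may pass the check and still fail to open.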

Problem 2: “Garbled characters”

🔧 Character Fixing:
```python
# Fix common encoding issues
import ftfy

garbled_text = "caf√©"
fixed_text = ftfy.fix_text(garbled_text)
print(fixed_text)  # Intended output: café
```

Best Practices Summary

📋 DH File Format Guidelines

📝 Text & Notes
  • Research notes: Markdown
  • Clean text data: UTF-8 plain text
  • Documentation: Markdown or plain text
📊 Data & Analysis
  • Tabular data: CSV (UTF-8)
  • Hierarchical data: JSON
  • Large datasets: Parquet or HDF5
🔧 Technical
  • Always specify encoding in Python
  • Use UTF-8 for new files
  • Test with sample data first
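"Test with sample data first" can be as simple as parsing only the first few rows of a large file before committing to a full run. The sketch below simulates a large CSV with a generator; `preview_rows`, the column names, and the row values are all invented for illustration.

```python
import csv
from itertools import chain, islice


def preview_rows(lines, n=5):
    """Parse only the first n data rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    return list(islice(reader, n))


# Simulated large file: a header plus 10,000 generated lines.
header = "article_id,newspaper,date,tags\n"
body = (
    f"A{i:04d},Example Gazette,1900-01-01,politics\n"
    for i in range(1, 10001)
)

sample = preview_rows(chain([header], body), n=3)
for row in sample:
    print(row["article_id"], row["newspaper"])
```

With a real file, pass the open file object instead (opened with `newline=""` and `encoding="utf-8"`). In pandas, the `nrows` parameter of `read_csv` serves the same purpose.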

Proficiency Check: File Formats

You're starting a project to analyze 10,000 historical newspaper articles. The data includes article text, publication date, newspaper name, and topic tags. What's the BEST format for Python analysis?

Next Steps

Understanding file formats prepares you for command line work, where you’ll learn to navigate files and run Python scripts efficiently from the terminal.


Remember: The right file format can save hours of debugging and conversion work. When in doubt, choose simple, open formats that Python can read easily.