Beginner 30 min Mac & PC

Essential File Formats for DH

Choose the right file formats for your digital humanities projects. Focus on Python-compatible formats for analysis.

What You'll Learn:

Understand text formats for DH research and Python analysis
Choose appropriate data formats (CSV, JSON, XML) for different projects
Handle encoding issues and character set problems
Convert between formats effectively for cross-platform compatibility

Understanding file formats is crucial for DH work, especially when you’ll be using Python for text analysis, data visualization, and mapping. This guide focuses on the formats you’ll encounter most often and how to work with them effectively.

Why File Formats Matter in DH

Knowledge Check: Format Problems

You're trying to analyze 500 Victorian novels in Python, but your data is stored in Microsoft Word documents. What's the main problem?

The files are too large
Python can't easily read Word documents for text analysis
Word documents are expensive to store

Exactly! Python excels at analyzing plain text, CSV, and JSON files. Word documents contain formatting and metadata that make text extraction complex and unreliable.

The main issue is accessibility - Python needs simple, structured formats for reliable text analysis. Word documents are designed for human reading, not programmatic analysis.

Common Problems from Wrong Formats

Data that won’t open in Python: Binary formats, proprietary formats
Text with garbled characters: Encoding mismatches, special characters
Images too large for web display: Uncompressed formats, wrong resolution
Files that colleagues can’t open: Platform-specific formats, missing software

Text Formats for DH Research

Plain Text (.txt) - The Foundation

📄 Plain Text

Best for: Research notes, cleaned text data, Python scripts

✅ Advantages:

Universal compatibility
Small file size
Version control friendly
Python reads easily

❌ Limitations:

No formatting
No images
No structure beyond paragraphs
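As a rough illustration of why plain text suits Python, the sketch below writes a one-sentence sample to a temporary file, reads it back as UTF-8, and counts word frequencies. The file name `sample_novel.txt` and the sample sentence are placeholders, not part of any real corpus.

```python
from collections import Counter
from pathlib import Path
import re
import tempfile

# Placeholder text standing in for a cleaned novel file.
sample = "It was the best of times, it was the worst of times."

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_novel.txt"  # hypothetical file name
    path.write_text(sample, encoding="utf-8")
    # Reading plain text takes one call; always name the encoding.
    text = path.read_text(encoding="utf-8")

# Lowercase the text and split it into word tokens (\w+ also matches accented letters).
words = re.findall(r"\w+", text.lower())
word_counts = Counter(words)

print(word_counts["times"])
print(len(words))
```

With a real project you would skip the temporary file and point `Path` at your own `.txt` files.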

Markdown (.md) - Structured Text

Perfect for: Documentation, research notes with basic formatting

  📝 Markdown Example:
  ```markdown
  # Research Notes: Victorian Novels
  ## Data Sources
  - **Project Gutenberg**: Public domain texts
  - **HathiTrust**: Digitized books
  - *Last updated: 2024-09-26*
  
  ### Analysis Todo
  1. Clean OCR errors
  2. Extract character names
  3. Run sentiment analysis
  ```

Tools: VS Code (excellent support), Typora, any text editor

Data Formats for Analysis

CSV (Comma-Separated Values) - Your Best Friend

🧮 Interactive: Build a CSV

You're cataloging a collection of medieval manuscripts. Let's build a proper CSV structure:

Column Headers (first row): manuscript_id, title, date_created, language

Sample Data Row: MS001, Book of Kells, 800, Latin

CSV Preview:

manuscript_id,title,date_created,language
MS001,Book of Kells,800,Latin

Why CSV is Perfect for DH:

Python loves it: pandas.read_csv()
Spreadsheet compatible: Excel, Google Sheets
Human readable: Can edit in any text editor
Version control: Git tracks changes line by line
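The manuscript preview above can be read with Python's built-in `csv` module, shown here as a dependency-free alternative to `pandas.read_csv()`. This is a minimal sketch: the data lives in a string so the example runs on its own, and the file name in the comment is hypothetical.

```python
import csv
import io

csv_text = """manuscript_id,title,date_created,language
MS001,Book of Kells,800,Latin
"""

# With a real file you would use something like:
#   open("manuscripts.csv", newline="", encoding="utf-8")   # hypothetical file name
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    # Every CSV value arrives as a string; convert numbers explicitly.
    year = int(row["date_created"])
    print(f'{row["manuscript_id"]}: {row["title"]} ({year}, {row["language"]})')
```

Each row becomes a dictionary keyed by the column headers, which is why clear, consistent header names matter.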

JSON (JavaScript Object Notation) - Structured Data

📊 Same Data, Different Formats:

CSV Format:

title,author,year,genre
"Pride and Prejudice","Jane Austen",1813,"Romance"
"Frankenstein","Mary Shelley",1818,"Gothic"

JSON Format:

{
  "novels": [
    {
      "title": "Pride and Prejudice",
      "author": "Jane Austen", 
      "year": 1813,
      "genre": "Romance"
    },
    {
      "title": "Frankenstein",
      "author": "Mary Shelley",
      "year": 1818, 
      "genre": "Gothic"
    }
  ]
}

When to use JSON:

Nested data: Books with multiple authors, chapters, etc.
API data: Many digital libraries provide JSON
Configuration files: Settings for your Python projects
Web applications: JavaScript can read JSON directly
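To make the "nested data" point concrete, here is a small sketch that loads the novels example with the standard `json` module. The `edition_years` field is an illustrative addition, not part of the example above; it shows a list nested inside a record, which a flat CSV row cannot hold directly.

```python
import json

json_text = """
{
  "novels": [
    {"title": "Pride and Prejudice", "author": "Jane Austen",
     "year": 1813, "genre": "Romance"},
    {"title": "Frankenstein", "author": "Mary Shelley",
     "year": 1818, "genre": "Gothic",
     "edition_years": [1818, 1831]}
  ]
}
"""

data = json.loads(json_text)

for novel in data["novels"]:
    # .get() supplies a default when a record lacks the optional field.
    editions = novel.get("edition_years", [])
    print(novel["title"], novel["year"], editions)

# Filtering works like any list of dictionaries.
gothic_titles = [n["title"] for n in data["novels"] if n["genre"] == "Gothic"]
print(gothic_titles)
```

Note that numbers such as `year` arrive as integers, unlike CSV, where every value is read as text.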

Encoding and Character Issues

The UTF-8 Standard

Encoding Detective

You open a text file containing French poetry and see: "cafÃ©" instead of "café". What's the problem?

The file is corrupted
Wrong character encoding (probably Latin-1 vs UTF-8)
Wrong font is being used

Correct! UTF-8 stores the é character as two bytes (0xC3 0xA9), and each byte is being displayed as a separate Latin-1 character. This is a classic encoding mismatch.

This is a character encoding issue - the same bytes are being interpreted using different character sets, causing special characters to display incorrectly.
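The mismatch can be reproduced in a few lines. This sketch encodes "café" as UTF-8, decodes the same bytes as Latin-1 to produce the garbled form, then reverses the mistake. The reversal only works when no bytes were lost along the way.

```python
original = "café"

# UTF-8 stores é as two bytes: 0xC3 0xA9.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)   # b'caf\xc3\xa9'

# Reading those bytes as Latin-1 turns each byte into its own character.
garbled = utf8_bytes.decode("latin-1")
print(garbled)      # cafÃ©

# Undo the mistake: back to the raw bytes, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)     # café
```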

Python Encoding Solutions

  🐍 Python Encoding Best Practices:
  ```python
  # Always specify encoding when opening files
  with open('french_poetry.txt', 'r', encoding='utf-8') as file:
      text = file.read()
  
  # Detect the encoding if you're not sure (requires: pip install chardet)
  import chardet
  with open('mystery_file.txt', 'rb') as file:
      raw_data = file.read()
  result = chardet.detect(raw_data)
  detected = result['encoding'] or 'utf-8'  # detect() may return None
  print(f"Detected encoding: {detected} (confidence {result['confidence']:.2f})")
  
  # Re-read the file using the detected encoding
  with open('mystery_file.txt', 'r', encoding=detected) as file:
      text = file.read()
  
  # Save as UTF-8
  with open('clean_file.txt', 'w', encoding='utf-8') as file:
      file.write(text)
  ```

Practical Exercise: Format Decision Tree

🌳 Choose the Right Format

For each scenario, select the best file format:

Scenario 1: Research Notes

You're taking notes while reading 50 academic articles about Victorian literature. You want to include quotes, citations, and your own analysis.

Word Document
Markdown
Plain Text

Scenario 2: Metadata Collection

You're cataloging 200 historical photographs with information like date, location, people, and keywords for each image.

Excel Spreadsheet
CSV File
JSON File

Scenario 3: Text Analysis Data

You have 1000 poems with complex metadata including multiple authors, publication info, themes, and full text for Python analysis.

CSV File
JSON File
XML File

Format Conversion Tools

Command Line Conversion

🖥️ Practice Format Conversion

Try these conversion commands:

pandoc notes.md -o notes.pdf ← Convert Markdown to PDF

in2csv --sheet "Sheet1" in.xlsx > out.csv ← Excel to CSV (in2csv is part of the csvkit suite)

jq '.' data.json ← Pretty-print JSON

Python Conversion

  🐍 Convert Formats with Python:
  ```python
  import pandas as pd
  import json
  
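  # Excel to CSV (reading .xlsx requires the openpyxl package)
  df = pd.read_excel('research_data.xlsx')
  df.to_csv('research_data.csv', index=False, encoding='utf-8')
  
  # CSV to JSON (force_ascii=False keeps characters like é readable)
  df = pd.read_csv('research_data.csv', encoding='utf-8')
  df.to_json('research_data.json', orient='records', indent=2, force_ascii=False)
  
  # JSON to CSV
  with open('research_data.json', 'r', encoding='utf-8') as f:
      data = json.load(f)
  df = pd.DataFrame(data)
  df.to_csv('converted_data.csv', index=False, encoding='utf-8')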
  ```

Real-World DH Workflow

📚 Case Study: Digital Edition Project

Original Documents

Format: Scanned PDFs

Challenge: Not machine-readable

OCR Processing

Output: Plain text (.txt)

Issue: OCR errors, inconsistent formatting

Manual Correction

Format: Markdown (.md)

Benefit: Structure + human readability

Analysis Preparation

Format: UTF-8 Plain Text

Ready for: Python analysis, NLP tools

Publication

Formats: HTML (web), PDF (print)

Generated from: Markdown source

Troubleshooting Common Issues

Problem 1: “File won’t open in Python”

🚨 Error Messages:

UnicodeDecodeError
PermissionError
pandas.errors.EmptyDataError

✅ Solutions:

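```python
# Try different encodings, strictest first.
# latin-1 accepts every byte sequence, so it goes last as a fallback;
# a "success" with it does not guarantee the text is correct.
encodings = ['utf-8', 'cp1252', 'latin-1']
for encoding in encodings:
    try:
        with open('problematic_file.txt', 'r', encoding=encoding) as f:
            text = f.read()
        print(f"Success with {encoding}")
        break
    except UnicodeDecodeError:
        print(f"Failed with {encoding}")
```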

Problem 2: “Garbled characters”

🔧 Character Fixing:

```python
# Fix common encoding issues (requires: pip install ftfy)
import ftfy

garbled_text = "cafÃ©"
fixed_text = ftfy.fix_text(garbled_text)
print(fixed_text)  # Output: café
```

Best Practices Summary

📋 DH File Format Guidelines

📝 Text & Notes

Research notes: Markdown
Clean text data: UTF-8 plain text
Documentation: Markdown or plain text

📊 Data & Analysis

Tabular data: CSV (UTF-8)
Hierarchical data: JSON
Large datasets: Parquet or HDF5

🔧 Technical

Always specify encoding in Python
Use UTF-8 for new files
Test with sample data first

Proficiency Check: File Formats

You're starting a project to analyze 10,000 historical newspaper articles. The data includes article text, publication date, newspaper name, and topic tags. What's the BEST format for Python analysis?

Individual Word documents for each article
Excel spreadsheet with all data
CSV file with UTF-8 encoding
JSON with nested topic arrays

Excellent choice! CSV is perfect for tabular data like this. It loads quickly into pandas, works across platforms, and UTF-8 handles any special characters in historical text.

For this type of tabular data, CSV with UTF-8 encoding is ideal. It's Python-friendly, efficient for large datasets, and handles the text encoding issues common in historical documents.
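One practical wrinkle is that each article can carry several topic tags, while a CSV cell holds a single value. A common workaround, sketched below, is to join the tags with a separator inside one column and split them again on load. The column names, the `;` separator, and the two sample rows are all invented for illustration.

```python
import csv
import io

TAG_SEPARATOR = ";"  # assumed convention; pick a character that never appears in a tag
FIELDS = ["publication_date", "newspaper", "topics", "text"]

# Invented sample rows, not real articles.
articles = [
    {"publication_date": "1851-05-01", "newspaper": "Example Gazette",
     "topics": ["exhibition", "london"],
     "text": "The doors opened at noon, and crowds followed."},
    {"publication_date": "1851-05-02", "newspaper": "Sample Courier",
     "topics": ["weather"],
     "text": "Rain fell across the city."},
]

# Write: join each tag list into one cell. The csv module quotes
# any text that contains commas, so article text stays intact.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
for article in articles:
    writer.writerow(dict(article, topics=TAG_SEPARATOR.join(article["topics"])))

# Read: split the cell back into a list of tags.
buffer.seek(0)
loaded = []
for row in csv.DictReader(buffer):
    row["topics"] = row["topics"].split(TAG_SEPARATOR) if row["topics"] else []
    loaded.append(row)

print(loaded[0]["topics"])
```

If the tags themselves grow into structured records (for example, a tag plus a confidence score), that is usually the point where JSON becomes the better fit.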

Next Steps

Understanding file formats prepares you for command line work, where you’ll learn to navigate files and run Python scripts efficiently from the terminal.

Remember: The right file format can save hours of debugging and conversion work. When in doubt, choose simple, open formats that Python can read easily.