Essential File Formats for DH
Choose the right file formats for your digital humanities projects. Focus on Python-compatible formats for analysis.
What You'll Learn:
- Understand text formats for DH research and Python analysis
- Choose appropriate data formats (CSV, JSON, XML) for different projects
- Handle encoding issues and character set problems
- Convert between formats effectively for cross-platform compatibility
Understanding file formats is crucial for DH work, especially when you’ll be using Python for text analysis, data visualization, and mapping. This guide focuses on the formats you’ll encounter most often and how to work with them effectively.
Why File Formats Matter in DH
Knowledge Check: Format Problems
You're trying to analyze 500 Victorian novels in Python, but your data is stored in Microsoft Word documents. What's the main problem?
Exactly! Python excels at analyzing plain text, CSV, and JSON files. Word documents contain formatting and metadata that make text extraction complex and unreliable.
The main issue is accessibility - Python needs simple, structured formats for reliable text analysis. Word documents are designed for human reading, not programmatic analysis.
Common Problems from Wrong Formats
- Data that won’t open in Python: Binary formats, proprietary formats
- Text with garbled characters: Encoding mismatches, special characters
- Images too large for web display: Uncompressed formats, wrong resolution
- Files that colleagues can’t open: Platform-specific formats, missing software
Text Formats for DH Research
Plain Text (.txt) - The Foundation
📄 Plain Text
Best for: Research notes, cleaned text data, Python scripts
✅ Advantages:
- Universal compatibility
- Small file size
- Version control friendly
- Python reads easily
❌ Limitations:
- No formatting
- No images
- No structure beyond paragraphs
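To make "Python reads easily" concrete, here is a minimal sketch that writes a one-sentence sample to a temporary plain text file, reads it back, and counts words. The file name `sample.txt` and the simple punctuation stripping are assumptions for illustration; a real project would likely use a proper tokenizer.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Illustrative sample sentence (opening of Dickens's "A Tale of Two Cities").
sample = "It was the best of times, it was the worst of times."

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_text(sample, encoding="utf-8")
    # Reading plain text needs nothing beyond the standard library.
    text = path.read_text(encoding="utf-8")

# Very rough tokenization: lowercase, drop commas and periods, split on spaces.
words = text.lower().replace(",", "").replace(".", "").split()
counts = Counter(words)
print(counts.most_common(3))
```

Because the file holds nothing but characters, the whole pipeline is a read, a split, and a count.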
Markdown (.md) - Structured Text
Perfect for: Documentation, research notes with basic formatting
📝 Markdown Example:
```markdown
# Research Notes: Victorian Novels

## Data Sources
- **Project Gutenberg**: Public domain texts
- **HathiTrust**: Digitized books
- *Last updated: 2024-09-26*

### Analysis Todo
1. Clean OCR errors
2. Extract character names
3. Run sentiment analysis
```
Tools: VS Code (excellent support), Typora, any text editor
Data Formats for Analysis
CSV (Comma-Separated Values) - Your Best Friend
🧮 Interactive: Build a CSV
You're cataloging a collection of medieval manuscripts. Let's build a proper CSV structure:
Column Headers (first row): manuscript_id,title,date_created,language
Sample Data Row: MS001,Book of Kells,800,Latin
CSV Preview:
manuscript_id,title,date_created,language
MS001,Book of Kells,800,Latin
Why CSV is Perfect for DH:
- Python loves it: pandas.read_csv()
- Spreadsheet compatible: Excel, Google Sheets
- Human readable: Can edit in any text editor
- Version control: Git tracks changes line by line
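As a rough illustration of how little code a CSV needs, the sketch below parses the manuscript columns with Python's built-in `csv` module. The second row (`MS002`) is made up for this example to show how quotes protect a comma inside a field.

```python
import csv
import io

# MS001 comes from the preview above; MS002 is a hypothetical row
# added only to show a quoted field that contains a comma.
csv_text = (
    "manuscript_id,title,date_created,language\n"
    "MS001,Book of Kells,800,Latin\n"
    'MS002,"Psalter, fragment",1150,Latin\n'
)

# DictReader uses the first row as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    # Every value arrives as a string; convert numbers explicitly.
    print(row["manuscript_id"], row["title"], int(row["date_created"]))
```

With a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.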
JSON (JavaScript Object Notation) - Structured Data
📊 Same Data, Different Formats:
CSV Format:
title,author,year,genre
"Pride and Prejudice","Jane Austen",1813,"Romance"
"Frankenstein","Mary Shelley",1818,"Gothic"
JSON Format:
{
  "novels": [
    {
      "title": "Pride and Prejudice",
      "author": "Jane Austen",
      "year": 1813,
      "genre": "Romance"
    },
    {
      "title": "Frankenstein",
      "author": "Mary Shelley",
      "year": 1818,
      "genre": "Gothic"
    }
  ]
}
When to use JSON:
- Nested data: Books with multiple authors, chapters, etc.
- API data: Many digital libraries provide JSON
- Configuration files: Settings for your Python projects
- Web applications: JavaScript can read JSON directly
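The "nested data" case is where JSON earns its place. The following sketch, using only the standard `json` module, stores a book with two authors as a list and then shows what has to happen to squeeze the same record into a flat, CSV-style row. The record and the `"; "` separator are illustrative choices, not a fixed convention.

```python
import json

# A book with more than one author: the authors field is a list.
record = {
    "title": "Lyrical Ballads",
    "authors": ["William Wordsworth", "Samuel Taylor Coleridge"],
    "year": 1798,
}

# Serialize to JSON text and read it back; the list survives intact.
json_text = json.dumps(record, indent=2, ensure_ascii=False)
loaded = json.loads(json_text)
print(loaded["authors"][1])

# Flattening for CSV means inventing a separator inside one cell.
flat_row = {**loaded, "authors": "; ".join(loaded["authors"])}
print(flat_row["authors"])
```

If you find yourself inventing separators like this in many columns, that is usually a sign the data wants to be JSON.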
Encoding and Character Issues
The UTF-8 Standard
Encoding Detective
You open a text file containing French poetry and see: "caf√©" instead of "café". What's the problem?
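Correct! In UTF-8 the é character is stored as two bytes (0xC3 0xA9), and each byte is being displayed as a separate character from a single-byte encoding. The "√©" pattern is what Mac OS Roman produces; Latin-1 or Windows-1252 would show "Ã©" instead. This is a classic encoding mismatch.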
This is a character encoding issue - the same bytes are being interpreted using different character sets, causing special characters to display incorrectly.
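You can reproduce this kind of mismatch yourself. The sketch below encodes "café" as UTF-8 and then decodes the same bytes with two single-byte codecs from Python's standard library: `mac_roman` gives the "√©" pattern, while `latin-1` gives "Ã©". The variable names are just for illustration.

```python
# Encode once, then decode the same bytes with different codecs.
original = "café"
utf8_bytes = original.encode("utf-8")  # é becomes two bytes: C3 A9

# Each single-byte codec shows the two bytes as two separate characters.
as_mac_roman = utf8_bytes.decode("mac_roman")
as_latin1 = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(" "))
print(as_mac_roman)
print(as_latin1)

# Reversing the wrong decoding recovers the text, as long as no bytes were lost.
repaired = as_mac_roman.encode("mac_roman").decode("utf-8")
print(repaired)
```

The shape of the garbling is a useful clue: it often tells you which wrong encoding was applied.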
Python Encoding Solutions
🐍 Python Encoding Best Practices:
```python
# Always specify encoding when opening files
with open('french_poetry.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Check encoding if you're not sure
# (chardet is a third-party package: pip install chardet)
import chardet

with open('mystery_file.txt', 'rb') as file:
    raw_data = file.read()

# detect() returns a best guess, not a guarantee
encoding = chardet.detect(raw_data)
print(f"Detected encoding: {encoding['encoding']}")

# Convert if necessary
with open('mystery_file.txt', 'r', encoding=encoding['encoding']) as file:
    text = file.read()

# Save as UTF-8
with open('clean_file.txt', 'w', encoding='utf-8') as file:
    file.write(text)
```
Practical Exercise: Format Decision Tree
🌳 Choose the Right Format
For each scenario, select the best file format:
Scenario 1: Research Notes
You're taking notes while reading 50 academic articles about Victorian literature. You want to include quotes, citations, and your own analysis.
Scenario 2: Metadata Collection
You're cataloging 200 historical photographs with information like date, location, people, and keywords for each image.
Scenario 3: Text Analysis Data
You have 1000 poems with complex metadata including multiple authors, publication info, themes, and full text for Python analysis.
Format Conversion Tools
Command Line Conversion
🖥️ Practice Format Conversion
Try these conversion commands:
pandoc notes.md -o notes.pdf
← Convert Markdown to PDF
in2csv --sheet "Sheet1" in.xlsx > out.csv
← Excel to CSV
jq '.' data.json
← Pretty-print JSON
Python Conversion
🐍 Convert Formats with Python:
```python
import json

import pandas as pd

# Excel to CSV
df = pd.read_excel('research_data.xlsx')
df.to_csv('research_data.csv', index=False)

# CSV to JSON (one object per row)
df = pd.read_csv('research_data.csv')
df.to_json('research_data.json', orient='records', indent=2)

# JSON to CSV
with open('research_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
df = pd.DataFrame(data)
df.to_csv('converted_data.csv', index=False)
```
Real-World DH Workflow
📚 Case Study: Digital Edition Project
Original Documents
Format: Scanned PDFs
Challenge: Not machine-readable
OCR Processing
Output: Plain text (.txt)
Issue: OCR errors, inconsistent formatting
Manual Correction
Format: Markdown (.md)
Benefit: Structure + human readability
Analysis Preparation
Format: UTF-8 Plain Text
Ready for: Python analysis, NLP tools
Publication
Formats: HTML (web), PDF (print)
Generated from: Markdown source
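The cleanup between OCR output and analysis-ready text can be partly scripted. What follows is only a sketch of two conservative fixes, rejoining words split across line breaks and unwrapping hard-wrapped lines; the function name `tidy_ocr_text` and the sample string are hypothetical, and real OCR errors (misread letters, stray marks) still need manual correction.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Apply a few conservative cleanups to OCR output (illustrative only)."""
    # Rejoin words hyphenated across a line break: "manu-\nscript" -> "manuscript".
    # Caveat: a genuine compound that happens to break at the hyphen loses it too.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat blank lines as paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse line breaks and repeated spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


# Made-up sample resembling hard-wrapped OCR output.
sample = "The manu-\nscript was  copied\nin Latin.\n\n\nA second para-\ngraph."
result = tidy_ocr_text(sample)
print(result)
```

Keeping the raw OCR file untouched and writing the cleaned text to a new UTF-8 file makes it easy to rerun the cleanup when you refine the rules.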
Troubleshooting Common Issues
Problem 1: “File won’t open in Python”
🚨 Error Messages:
UnicodeDecodeError
PermissionError
pandas.errors.EmptyDataError
✅ Solutions:
```python
# Try different encodings, strictest first.
# latin-1 accepts any byte sequence, so it goes last as a fallback.
encodings = ['utf-8', 'cp1252', 'latin-1']

for encoding in encodings:
    try:
        with open('problematic_file.txt', 'r', encoding=encoding) as f:
            text = f.read()
        print(f"Success with {encoding}")
        break
    except UnicodeDecodeError:
        print(f"Failed with {encoding}")
```
Problem 2: “Garbled characters”
🔧 Character Fixing:
```python
# Fix common encoding issues
import ftfy

garbled_text = "caf√©"
fixed_text = ftfy.fix_text(garbled_text)
print(fixed_text)  # Output: "café"
```
Best Practices Summary
📋 DH File Format Guidelines
📝 Text & Notes
- Research notes: Markdown
- Clean text data: UTF-8 plain text
- Documentation: Markdown or plain text
📊 Data & Analysis
- Tabular data: CSV (UTF-8)
- Hierarchical data: JSON
- Large datasets: Parquet or HDF5
🔧 Technical
- Always specify encoding in Python
- Use UTF-8 for new files
- Test with sample data first
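One way to "test with sample data first" is to preview a few lines with an explicit encoding before loading a whole file. This is a minimal sketch: `preview_lines` is a hypothetical helper, and the temporary `sample.csv` stands in for your real data.

```python
import itertools
import tempfile
from pathlib import Path


def preview_lines(path, n=3, encoding="utf-8"):
    """Return the first n lines of a text file, decoded with an explicit encoding."""
    with open(path, "r", encoding=encoding) as handle:
        # islice stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for line in itertools.islice(handle, n)]


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.csv"  # stand-in for a real data file
    sample_path.write_text(
        "title,author,year,genre\n"
        '"Pride and Prejudice","Jane Austen",1813,"Romance"\n'
        '"Frankenstein","Mary Shelley",1818,"Gothic"\n',
        encoding="utf-8",
    )
    preview = preview_lines(sample_path, n=2)

for line in preview:
    print(line)
```

If the preview raises `UnicodeDecodeError` or shows garbled characters, you have found the encoding problem before it reached your analysis.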
Proficiency Check: File Formats
You're starting a project to analyze 10,000 historical newspaper articles. The data includes article text, publication date, newspaper name, and topic tags. What's the BEST format for Python analysis?
Excellent choice! CSV is perfect for tabular data like this. It loads quickly into pandas, works across platforms, and UTF-8 handles any special characters in historical text.
For this type of tabular data, CSV with UTF-8 encoding is ideal. It's Python-friendly, efficient for large datasets, and handles the text encoding issues common in historical documents.
Next Steps
Understanding file formats prepares you for command line work, where you’ll learn to navigate files and run Python scripts efficiently from the terminal.
Remember: The right file format can save hours of debugging and conversion work. When in doubt, choose simple, open formats that Python can read easily.