How you organize your data files depends on the requirements of your research or project.
Use consistent methods when naming, describing (metadata), and storing your data files.
Prior to data collection, consider the following:
The top-level folder should include a title, unique ID, and year (YYYY).
Subfolders should use a concise, documented naming convention that could include experiment runs, dataset versions, or group personnel.
Ensure that your data files are organized and named correctly so that they are easily identifiable.
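The folder guidance above can be sketched in Python. This is a minimal illustration, not a prescribed convention: the underscore separator, the lowercase hyphenated title, and the sample title, ID, and subfolder names are all assumptions chosen for the example.

```python
import re
from pathlib import Path


def slugify(text):
    """Lowercase the text and replace runs of non-alphanumerics with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def top_level_name(title, unique_id, year):
    """Build a top-level folder name from a title, unique ID, and year (YYYY)."""
    if not 1000 <= year <= 9999:
        raise ValueError("year must have four digits (YYYY)")
    return f"{slugify(title)}_{unique_id}_{year}"


def create_project_tree(root, title, unique_id, year, subfolders):
    """Create the top-level folder and its subfolders; return the top-level path."""
    top = Path(root) / top_level_name(title, unique_id, year)
    top.mkdir(parents=True, exist_ok=True)
    for name in subfolders:
        (top / name).mkdir(exist_ok=True)
    return top
```

For instance, `top_level_name("Soil Moisture Survey", "P0042", 2023)` returns `"soil-moisture-survey_P0042_2023"`, and `create_project_tree(".", "Soil Moisture Survey", "P0042", 2023, ["run-01", "dataset-v1"])` would create that folder with two subfolders. Whatever convention you choose, document it alongside the data.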
To maximize the ability to share, preserve, and re-use datasets and other digital files, carefully consider the format you use for them.
Researchers should plan for both hardware and software obsolescence and choose file formats that ensure long-term operability and access.
Formats more likely to be accessible in the future are:
Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format.
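As one illustration of migrating data while keeping the original, the sketch below copies a table out of a SQLite database into CSV text using only the Python standard library. SQLite is an assumption standing in for "the original software format", and the table name and sample rows are hypothetical. The database is only read, so the original stays as it was.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as CSV and return the row count."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    writer = csv.writer(out)
    # The first CSV line holds the column names.
    writer.writerow([column[0] for column in cursor.description])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Demonstration with an in-memory database and hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, depth_cm INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [("A", 10), ("B", 25)])
buffer = io.StringIO()
rows_written = export_table(conn, "readings", buffer)
conn.close()
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")` so the copy is stored as UTF-8 plain text.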
Format | Highest Confidence | Medium Confidence | Lowest Confidence |
---|---|---|---|
Text | Plain text -- US-ASCII, UTF-8, UTF-16 with BOM (.txt); SGML with included DTD (.sgm, .sgml); XML with included schema (.xml); PDF/A-1 -- ISO 19005-1 (.pdf) | Plain text -- ISO 8859-x (.txt); Rich Text Format 1.x (.rtf); Cascading Style Sheets (.css); HTML (.html, .htm); LaTeX with referenced files (.latex); OpenDocument Text (.odt, .sxw); MS Word 2007+ (OOXML) (.docx); PDF with fonts embedded (.pdf) | Microsoft Word (.doc); WordPerfect (.wpd); all others |
Spreadsheet or Database | Comma- or tab-separated values (.csv, .tsv, .txt); delimited text; SIARD: Software Independent Archiving of Relational Databases (.siard) | dBASE (.dbf); OpenDocument Spreadsheet (.ods); MS Excel 2007+ (OOXML) (.xlsx) | Excel (.xls); all others |
Digitized Books, Maps, Paper, etc. | JPEG2000 -- lossless (.jp2); TIFF -- uncompressed (.tiff); PDF/A-1 -- ISO 19005-1 (.pdf) | n/a | All others |
Graphics and Images | TIFF -- uncompressed or CCITT 4 compressed (.tiff); JPEG2000 -- lossless compression (.jp2); PNG (.png) -- 24-bit true color | TIFF -- compressed (.tiff); JPEG (.jpg); JPEG2000 -- lossy compression (.jp2); GIF (.gif); Digital Negative DNG (.dng); BMP (.bmp); PNG (.png) -- 8-bit indexed | Photoshop (.psd); Encapsulated PostScript (.eps); MrSID (.sid); RAW files; all others |
Digital Audio | BWAV LPCM (.bwav, .wav) 24-bit, 96 kHz; AIFF -- PCM (.aif, .aiff) LPCM codec; WAV -- PCM (.wav) LPCM codec | SUN audio -- uncompressed (.au, .snd); Standard MIDI (.mid); Free Lossless Audio Codec (.flac); Apple Lossless Audio Codec (ALAC) (.m4a); MP3 (.mp3); Advanced Audio Coding (.mp4) | SUN audio -- uncompressed (.au, .snd); Standard MIDI (.mid); Free Lossless Audio Codec (.flac); Apple Lossless Audio Codec (ALAC) (.m4a); MP3 (.mp3); Advanced Audio Coding (.mp4) |
Digital Video | FFV1/Matroska (.mkv); AVI -- uncompressed (.avi); QuickTime -- uncompressed, motion JPEG (.mov); uncompressed .mxf; MPEG-4 (.mp4) H.264 | MPEG-1, MPEG-2 (.mp1, .mp2); Ogg Theora (.ogv, .ogg); ProRes (.mov); Motion JPEG 2000 (.jp2) | Windows Media Video (.wmv); RealVideo (.rm, .rv); all others |
Presentation | PDF/A-1 -- ISO 19005-1 (.pdf) | OpenDocument Presentation (.odp); MS PowerPoint 2007+ (OOXML) (.pptx) | PowerPoint (.ppt); all others |
Containers | Zip -- no compression; .tar | Zip -- compressed | All others |
Quantitative and Statistical Data | Comma- or tab-separated values (.csv, .tsv, .txt); structured text or markup file containing metadata information: Data Documentation Initiative (.ddi), XML (.xml), JSON (.json); SIARD: Software Independent Archiving of Relational Databases (.siard); HDF5 (.hdf) | SPSS (.sav, .sps, .spv, .spo); SAS (.sas, .sas7bdat); R (.R); HDF4 (.hdf) | Excel (.xls); other proprietary formats |
Email | MBOX; EML | MSG; PST | All others |
Table reused courtesy of University of Washington Libraries: Preferred File Formats—UW Libraries. (n.d.). Retrieved March 6, 2023, from https://www.lib.washington.edu/preservation/preservation_services/digitization-and-digital-preservation/preferred-file-formats
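A table like the one above can also be used programmatically, for example to flag files in a project folder that may be worth migrating. The sketch below is a hypothetical helper, not part of the UW Libraries guidance. Its mapping is a deliberately partial subset of the table, limited to extensions that appear in a single column. Extensions such as .pdf, .tiff, or .txt are left out because their tier depends on how the file was produced, which a file name cannot reveal.

```python
from pathlib import Path

# Partial, illustrative subset of the table: extensions listed in one column only.
CONFIDENCE_BY_EXTENSION = {
    ".csv": "highest", ".tsv": "highest", ".siard": "highest",
    ".rtf": "medium", ".odt": "medium", ".docx": "medium",
    ".ods": "medium", ".xlsx": "medium", ".odp": "medium", ".pptx": "medium",
    ".doc": "lowest", ".wpd": "lowest", ".xls": "lowest",
    ".ppt": "lowest", ".psd": "lowest", ".wmv": "lowest",
}


def confidence_for(filename):
    """Return the confidence tier for a file name, or None if it is not in the subset."""
    return CONFIDENCE_BY_EXTENSION.get(Path(filename).suffix.lower())


def files_to_review(filenames):
    """Return the names whose tier is known and is not 'highest', sorted by name."""
    return sorted(
        name for name in filenames
        if confidence_for(name) in ("medium", "lowest")
    )
```

Here `confidence_for("results.CSV")` returns `"highest"`, `confidence_for("notes.doc")` returns `"lowest"`, and `confidence_for("scan.pdf")` returns `None`, signalling that the file needs a manual check against the table.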