Friday, August 17, 2007

About SourceSafe physical file names and file numbers

If you have worked with SourceSafe long enough, you have probably heard about the physical file names associated with files and projects in a SourceSafe database. The SourceSafe command "ss physical" can be used to display the physical file name associated with a logical file or folder from the VSS database.

If you look at the free SourceSafe tools from the www.ezds.com site you'll find an SSNPL utility whose description says that every SourceSafe file has a number, a physical file name, and a logical file name; ssnpl converts from one to the other.

If you have asked yourself what the number is good for, the answer is simple: when storing a reference to a physical file in another database file, VSS will use the file number instead of the physical file name, because it requires less space on disk.

So, the questions that remain are: how does VSS generate these physical file names, and how is the file number calculated?

Let's start our investigation by creating a new database ("mkss.exe C:\temp\vss"). The database is initially empty, and in the data folder there is only one file, data\a\aaaaaaaa. If you use the "ss.exe physical $/" command you'll find out that the physical name AAAAAAAA is used by the database root.

From this MSDN article you'll find that the content of the data\aaaaaaaa.cnt file in the SourceSafe database reflects the physical name of the last file added to the database. Indeed, the initial content of this file is 'AAAAAAAA' - a sign that the last created file or folder was the database root.

Let's add a file to the database. You'll see that the content of the aaaaaaaa.cnt file is now BAAAAAAA, and SourceSafe has created a new file on disk, data\b\baaaaaaa.

Let's continue adding new files to the database one by one, each time looking at the content of aaaaaaaa.cnt to see how the new physical files are named. It's easy to see the files are named BAAAAAAA, CAAAAAAA, DAAAAAAA, etc., each file ending up in the data subfolder identified by the first letter of the filename (data\b\baaaaaaa, data\c\caaaaaaa, data\d\daaaaaaa, etc.). After the last letter of the alphabet is reached (ZAAAAAAA), the first letter wraps back to A and the second letter advances: ABAAAAAA, BBAAAAAA, CBAAAAAA, etc., with files being created on disk in data\a\abaaaaaa, data\b\bbaaaaaa, data\c\cbaaaaaa, etc.
So, the SourceSafe naming scheme and location of physical files act as a 26-way hashtable that distributes the files on disk across the a-z subfolders of the data folder.
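The folder layout described above can be sketched in a few lines of Python. This is only an illustration of the observed pattern, not code taken from SourceSafe; the function name physical_file_path is made up for this example.

```python
from pathlib import PureWindowsPath


def physical_file_path(data_dir: str, physical_name: str) -> PureWindowsPath:
    """Return where a physical file is expected to live on disk.

    Assumption (from the observations above): the file sits in the
    subfolder of the data folder named after the first letter of its
    physical name, and names appear in lowercase on disk.
    """
    name = physical_name.lower()
    return PureWindowsPath(data_dir) / name[0] / name


# Example with the test database used in this post.
print(physical_file_path(r"C:\temp\vss\data", "CBAAAAAA"))
```

PureWindowsPath is used so the sketch builds Windows-style paths even when it is run on another operating system.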

If you run the SSNPL tool on the files added to the database ("ssnpl.exe $/ C:\temp\vss\data", etc.) you'll see their numbers are incremental: 0 (the database root), then 1, 2, 3, ..., 25, 26, 27, etc. for each added file.

It's fairly easy now to deduce how the numbering scheme works:

  • each file added to the database gets the next available number in sequence: 0, 1, 2, etc.
  • the physical file names are composed of 8 letter characters, using letters A-Z: [L0][L1][L2][L3][L4][L5][L6][L7]
  • the number associated with a file is: (L0 - 'A') + 26 * (L1 -'A') + 26^2 * (L2 - 'A') + 26^3 * (L3 - 'A') + ....

Basically the physical file name is a base-26 representation of the file number, with each base-26-digit represented in A-Z range instead of 0-9A-P.
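As a minimal sketch of the formula above (my own illustration in Python, not the code ssnpl uses), the conversion in both directions could look like this. The function names are made up for the example.

```python
import string

LETTERS = string.ascii_uppercase  # 'A'..'Z' act as the 26 digits
NAME_LENGTH = 8                   # physical names have 8 letters


def number_to_physical(number: int) -> str:
    """Convert a file number to its physical name (least significant letter first)."""
    if not 0 <= number < 26 ** NAME_LENGTH:
        raise ValueError("number does not fit in an 8-letter name")
    letters = []
    for _ in range(NAME_LENGTH):
        number, digit = divmod(number, 26)
        letters.append(LETTERS[digit])
    return "".join(letters)


def physical_to_number(name: str) -> int:
    """Convert a physical name back to its file number."""
    name = name.upper()
    if len(name) != NAME_LENGTH or any(c not in LETTERS for c in name):
        raise ValueError("expected 8 letters in the range A-Z")
    # (L0 - 'A') + 26 * (L1 - 'A') + 26^2 * (L2 - 'A') + ...
    return sum((ord(c) - ord("A")) * 26 ** i for i, c in enumerate(name))


print(number_to_physical(26))          # ABAAAAAA
print(physical_to_number("BAAAAAAA"))  # 1
```

Note that the first letter is the least significant digit, which is why consecutive files land in different a-z subfolders.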

You should now be able to convert easily between file numbers and physical file names.

As for the mapping between physical file names and logical paths in the database, that is a bit more complex, so I'll leave it for another time...

6 comments:

Tony Steer said...

Did you ever follow up this rather brilliant article with info on how to get from physical file name to logical file name?

Alin Constantin said...

Hi Tony,

Unfortunately I didn't have time.

I know how to get the mapping, but since the structures involved are internal I can't simply publish their format (for legal considerations). I have to do some reverse engineering, which takes time...

In case you want to do it, here are some ideas (double check the numbers for yourself, I may have miscalculated):

1) figure out the format of the physical file for a file in the database (at least the parent folder). In an empty database, add a new file with a distinct name in the database root. The file will have physical ID==BAAAAAAA. Open this file with a hex editor; you'll find the physical name of the parent folder (AAAAAAAA) at offset 0x450 in the file, and the real name of the file at offset 0x4BD.

2) figure out the format of a physical folder in the database. Now create a folder in the root with a distinct name, e.g. 123456789012. Open its physical file (CAAAAAAA), and notice that at offset 0x190 there is AAAAAAAA - the physical name of the parent (the root). At offset 0x42 there is the name of the folder. And at offset 0x8C there is $/, which seems to be the real name of the parent.
A physical file for a folder has a fixed-size header, followed by fixed-size entries for its subfolders and files. Add more files/folders in the root and notice that the size of the AAAAAAAA file increases each time by 412 (0x19C) bytes. Walking backwards and subtracting the entry sizes, it seems the entries start at offset 828 (0x33C) in the physical file. In each entry, the real name of the subfolder/file seems to be at offset 0x62, and its physical name at offset 0x88.

If you want to figure out the physical name of $/Folder1/Folder2/Folder3/..., you'd have to start with the root (AAAAAAAA), open its physical file, start from offset 0x33C and walk the entries of size 0x19C, locating the directory entry for Folder1 by looking at offset 0x62 in each entry. Once you find the entry, get its physical file name from offset 0x88, open this new physical file and repeat the steps to locate the entry for Folder2, etc.
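The lookup of one path component could be sketched in Python roughly like this. Treat it as a rough sketch only: the offsets are the unverified ones quoted above, and the code additionally assumes that names are stored as NUL-terminated ASCII strings and compared case-insensitively, which I have not confirmed. The function names are made up for the example.

```python
from typing import Optional

# Offsets as quoted above; unverified, double check them on a real database.
HEADER_SIZE = 0x33C            # where the directory entries seem to start
ENTRY_SIZE = 0x19C             # size of one directory entry
ENTRY_NAME_OFFSET = 0x62       # real name of the child, inside an entry
ENTRY_PHYSICAL_OFFSET = 0x88   # physical name of the child, inside an entry
PHYSICAL_NAME_LENGTH = 8


def read_c_string(data: bytes, offset: int, limit: int) -> str:
    """Read a string of at most `limit` bytes, stopping at the first NUL (assumption)."""
    raw = data[offset:offset + limit]
    return raw.split(b"\0", 1)[0].decode("ascii", errors="replace")


def find_child_physical_name(folder_data: bytes, child_name: str) -> Optional[str]:
    """Scan the entries of a folder's physical file for `child_name`."""
    last_start = len(folder_data) - ENTRY_SIZE
    for start in range(HEADER_SIZE, last_start + 1, ENTRY_SIZE):
        entry = folder_data[start:start + ENTRY_SIZE]
        name = read_c_string(entry, ENTRY_NAME_OFFSET,
                             ENTRY_PHYSICAL_OFFSET - ENTRY_NAME_OFFSET)
        if name.lower() == child_name.lower():
            return read_c_string(entry, ENTRY_PHYSICAL_OFFSET, PHYSICAL_NAME_LENGTH)
    return None
```

To resolve a whole path you would read the root's physical file into memory, call find_child_physical_name for Folder1, open the physical file it returns, and repeat for each following component.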

Reverse, starting with a physical file and figuring out the real path is done like this:
- open the physical file. Figure out if it's a physical file for a folder or for a file (I don't remember how - exercise for reader :-))
- if it's for a file, get the file name from the header at 0x4BD, get the physical name of the parent at 0x450. Continue to next step for the parent.
- if it's for a folder, get the name of the folder at 0x42 and the name of its parent at 0x8C (you can also use the physical name of the parent at 0x190 and repeat this step until you reach the root)

This works if the folders have names shorter than 31 characters - longer names are stored in names.dat, so you'd have to reverse engineer the format of that file, and figure out where in the directory entry the index of the names.dat entry with the long name is stored.

Anonymous said...

Well, I can tell you how to get from Physical filename to logical filename:

SS.exe physical $/ -r -o"C:\FileNames.log"

It will produce a logfile containing all the names.

Alin Constantin said...

@David: I think you're missing the point. The article was about how "ss physical" and ssnpl.exe get to display their results...

Fwiw, if you're interested in the logical path of just one physical file, I wouldn't be enumerating all the files in the database with ss physical then searching in the result. It's inefficient if you have a large database. You can simply use "ssnpl physical_filename database_path"

Anonymous said...

Alin,

Great blog post! I am extending an existing program a little bit that does SourceSafe conversions by reading the physical files directly. I haven't seen any documentation on the SourceSafe file structure out there besides some existing code that reverse engineers it. So it was nice to stumble across your blog post.

In your comment to Tony you mentioned that the offset 0x450 for files contains the parent physical name. This appears to be the case on many of the files I open, but on at least one of the files I have the parent offset doesn't occur until 0x60d. Is this an indicator of a corrupt file, or is there a certain algorithm to be taken into account to find the correct offset?

Thanks!
Abe

Alin Constantin said...

@Abe: Sorry, I don't remember much about the VSS file format structure, and I don't have much time lately to figure it out again.
If you suspect something is wrong with that file, run the analyze tool on the database and see if it makes any changes to it. If it doesn't, the file is ok.
You can also use "analyze -pss" on the file to see how it parses some of the data structures.