-
Notifications
You must be signed in to change notification settings - Fork 9
GetAllFilesRecursivelyFromPath: Speed it up with custom extension #2403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e10465a
to
7d13220
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I think also the recursion in ResolveAlias can be ditched for Windows as shortcuts to shortcuts can not be created by normal means.
-
Also the line
folders = folders + foldersFromAlias
should use AddListItem -
Generally I also think that these filesystem targetting function should not use recursion because that couples stack size to file system depth, that will fail.
-
Also the function could use a rewrite, such that we directly convert the returned file lists from IndexedFile/IndexedDir to waves and work with waves, also return a text wave.
This also preventsallFiles
growing over 2 GB in size.
The function could also use an optional extension where duplicates are removed at the end.
|
This complicates things as you then need to check that foldersFromAlias is not empty as adding an empty element is something different than adding an empty string. But this is also moot as I'm moving to text waves. |
@t-b I had some more thoughts: IndexedFile is traversing all files anyway in a folder. So if we change the extension option of GetAllFilesRecursivelyFromPath to a regex option. Then we change the call of IndexedFile to not use any file extension filter such that it returns all files in the list. And then we apply a GrepList to that list with the regex. Advantage: For cases like the AB where we want to have .pxp and .uxp we only need one call and one file traversal. I think the overhead handling a longer list string is ok. |
Added. |
7d13220
to
adeb219
Compare
Some numbers: Local drives: Fast Samsung SSD Regular HDD: With the new function: *******************************************************************************************
Function: MIES_Utilities_File.ipf GetAllFilesRecursivelyFromPath; Percent total 99%
*******************************************************************************************
[00] |Function/WAVE GetAllFilesRecursivelyFromPath(string pathName, [string regex, variable resolveAliases])
[00] |
[00] | string files, subFoldersList, subFilesList, cdf, workPath
[00] | variable err
[00] | variable numFolders, numFiles, needsAliasResolving, i, needsFiltering
[00] |
[00] | if(ParamIsDefault(regex))
[00] | regex = ".*"
[00] | else
[00] | ASSERT(IsValidRegexp(regex), "Expected a valid regex")
[00] | needsFiltering = 1
[00] | endif
[00] |
[00] | if(ParamIsDefault(resolveAliases))
[00] | resolveAliases = 0
[00] | else
[00] | resolveAliases = !!resolveAliases
[00] | endif
[00] |
[00] | Make/FREE/N=(MINIMUM_WAVE_SIZE_LARGE)/T resultFiles
[00] | SetNumberInWaveNote(resultFiles, NOTE_INDEX, 0)
[00] |
[00] | Make/FREE/N=(MINIMUM_WAVE_SIZE)/T folders
[00] | SetNumberInWaveNote(folders, NOTE_INDEX, 0)
[00] |
[00] | PathInfo $pathName
[00] | ASSERT(V_flag, "Given symbolic path does not exist")
[00] | Make/T/FREE startFolder = {S_path}
[00] | numFolders = ConcatenateWavesWithNoteIndex(folders, startFolder)
[00] |
[00] | workPath = GetUniqueSymbolicPath()
[00] |
[00] | AssertOnAndClearRTError()
[00]* | for(i = 0; i < numFolders; i += 1)
[00] |
[00]* | cdf = folders[i]
[68]******* | NewPath/Q/O/Z $workPath, cdf
[00] |
[02]* | subFoldersList = IndexedDir($workPath, -1, 1, FILE_LIST_SEP); err = GetRTError(1)
[00]* | WAVE/T subFolders = ListToTextWave(subFoldersList, FILE_LIST_SEP)
[00] | // add trailing colon
[00]* | Multithread subFolders[] += ":"
[00] |
[00] | ConcatenateWavesWithNoteIndex(folders, subFolders)
[00] |
[29]*** | subFilesList = IndexedFile($workPath, -1, "????", "????", FILE_LIST_SEP); err = GetRTError(1)
[00]* | WAVE/T subFiles = ListToTextWave(subFilesList, FILE_LIST_SEP)
[00] | // make them absolute
[00]* | Multithread subFiles[] = cdf + subFiles[p]
[00] |
[00] | if(resolveAliases)
[00] | [WAVE/T filesResolved, WAVE/T foldersResolved] = GetAllFilesAliasHelper(subFiles)
[00] | else
[00]* | WAVE/T filesResolved = subFiles
[00] | WAVE/ZZ/T foldersResolved
[00] | endif
[00] |
[00]* | numFiles = ConcatenateWavesWithNoteIndex(resultFiles, filesResolved)
[00] | numFolders = ConcatenateWavesWithNoteIndex(folders, foldersResolved)
[00] | endfor
[00] |
[00] | if(!numFiles)
[00] | return $""
[00] | endif
[00] |
[00] | Redimension/N=(numFiles) resultFiles
[00] | Note/K resultFiles
[00] |
[00] | if(needsFiltering)
[00] | WAVE/Z/T resultFilesFiltered = GrepTextWave(resultFiles, regex)
[00] | else
[00] | WAVE/T resultFilesFiltered = resultFiles
[00] | endif
[00] |
[00] | return resultFilesFiltered
[00] |End
Some last bench numbers: With Igor Pro on SSD: •test() The same folder with Total Commander gathering files and sizes: 3.3 s That is roughly a factor of 125x faster. |
Jeep, the profiling clearly shows all the work is spent in IP itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment.
9880c9b
to
30b61e7
Compare
@MichaelHuth I'm assigning you already although CI is not yet finished. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs commit compilation fix.
Rest LGTM.
@t-b when attempting to add a folder, I'm getting:
|
@timjarsky Oops, I'll fix that. |
bfb875d
to
bd592d9
Compare
@timjarsky Fixed. |
There is no reason we can't support it. We have to special case it though.
As we are now using DeleteFile/DeleteFolder we also need to check that the misc settings entry is correct so that these call work.
When we get a filename with a symbolic path for an existing file, we need to return the full path as well. Use S_Path from GetFileFolderInfo for that.
This is not needed.
We used to always perform alias/shortcut handling even when we only look for certain file suffixes. But we only have to try following these if they can be part of the returned list. So for the common case of Windows OS and a extension which does not match .lnk we can skip the alias resolving. When we do have aliases we also first call GetFileFolderInfo and only if we have an alias ResolveAlias again, skipping another GetFileFolderInfo for the case when the pointed to file is not an alias. This allows to gather 100k files on a 1Gb/s network link in under 60s. ```igorpro Function Dostuff() Newpath/O testPath "Y:" beginfunctionProfiling(testTime = 50000) string result = getAllfilesRecursivelyFromPath("testPath", extension = ".mp3") endfunctionProfiling() printf "Number of files: %d\r", ItemsInList(result, "|") End ``` gives ``` Total time: 54.4173, Time in Function code: 54.3973 (100%) Top function percentages: Function MIES_Utilities_File.ipf GetAllFilesRecursivelyFromPath: 99% Annotated Top Functions: ******************************************************************************************* Function: MIES_Utilities_File.ipf MIES_Utilities_File.ipf#GetAllFilesRecursivelyFromPath; Percent total 99% ******************************************************************************************* [00] |Function/S GetAllFilesRecursivelyFromPath(string pathName, [string extension]) [00] | [00] | string fileOrPath, folders, subFolderPathName, fileName [00] | string files, allFilesList [00] | string allFiles = "" [00]* | string foldersFromAlias = "" [00] | variable err, isWindows [00] | [00] |#ifdef WINDOWS [00] | isWindows = 1 [00] |#endif [00] | [01]* | PathInfo $pathName [00] | ASSERT(V_flag, "Given symbolic path does not exist") [00] | [00] | if(ParamIsDefault(extension)) [00] | extension = "????" [00] | endif [00] | [00] | AssertOnAndClearRTError() [70]******* | allFilesList = IndexedFile($pathName, -1, extension, "????", FILE_LIST_SEP); err = GetRTError(1) [00] | [00]* | if(isWindows && cmpstr(extension, "????") && cmpstr(extension, ".lnk")) [00] | allFiles = AddPrefixToEachListItem(S_path, allFilesList, sep = FILE_LIST_SEP) [00] | else [00] | // try to resolve aliases/shortcuts when we can have them [00] | WAVE/T allFilesInDir = ListToTextWave(allFilesList, FILE_LIST_SEP) [00] | for(fileName : allFilesInDir) [00] | [00] | GetFileFolderInfo/P=$pathName/Q/Z fileName [00] | [00] | if(V_Flag != 0) [00] | // error querying file [00] | continue [00] | endif [00] | [00] | if(V_IsAliasShortcut) [00] | fileOrPath = ResolveAlias(fileName, pathName = pathName) [00] | [00] | // redo the check [00] | GetFileFolderInfo/P=$pathName/Q/Z fileOrPath [00] | [00] | if(V_Flag != 0) [00] | // error querying file/folder pointed to by alias [00] | continue [00] | endif [00] | endif [00] | [00] | if(V_isFile) [00] | allFiles = AddListItem(S_path, allFiles, FILE_LIST_SEP, Inf) [00] | elseif(V_isFolder) [00] | foldersFromAlias = AddListItem(S_path, foldersFromAlias, FILE_LIST_SEP, Inf) [00] | else [00] | ASSERT(0, "Unexpected file type") [00] | endif [00] | endfor [00] | endif [00] | [00] | AssertOnAndClearRTError() [04]* | folders = IndexedDir($pathName, -1, 1, FILE_LIST_SEP); err = GetRTError(1) [00] | folders = folders + foldersFromAlias [00]* | WAVE/T wFolders = ListToTextWave(folders, FILE_LIST_SEP) [00]* | for(folder : wFolders) [00] | [00] | subFolderPathName = GetUniqueSymbolicPath() [00] | [12]* | NewPath/Q/O $subFolderPathName, folder [00]* | files = GetAllFilesRecursivelyFromPath(subFolderPathName, extension = extension) [11]* | KillPath/Z $subFolderPathName [00] | [00]* | if(!isEmpty(files)) [01]* | allFiles = AddListItem(RemoveEnding(files, FILE_LIST_SEP), allFiles, FILE_LIST_SEP, Inf) [00] | endif [00] | endfor [00] | [00] | return allFiles [00] |End ``` Close #2402
We don't need to look at the minimum wave size when the requested index already fits. We now also ensure that we double the wave sizes so that we should end up always with a power of two when we started with one.
bd592d9
to
8e5aa1a
Compare
@timjarsky I think it might be beneficial to have this in the next release 2.9 as well. |
@t-b Sure. Here are the latest profiling results from adding folder 105 to the analysis browser
|
This does now not use recursion anymore, so the supported file system depth is independent of the IP stack depth. It also returns a text wave.
8e5aa1a
to
2436f40
Compare
We used to always perform alias/shortcut handling even when we only look for certain file suffixes.
But we only have to try following these if they can be part of the returned list.
So for the common case of Windows OS and a extension which does not match .lnk we can skip the alias resolving.
When we do have aliases we also first call GetFileFolderInfo and only if we have an alias ResolveAlias again, skipping another GetFileFolderInfo for the case when the pointed to file is not an alias.
This allows to gather 100k files on a 1Gb/s network link in under 60s.
gives
-> flaky test
Close #2402
Close #1733