Skip to content

GetAllFilesRecursivelyFromPath: Speed it up with custom extension #2403

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 1, 2025

Conversation

t-b
Copy link
Collaborator

@t-b t-b commented Apr 14, 2025

We used to always perform alias/shortcut handling even when we only look for certain file suffixes.

But we only have to try following these if they can be part of the returned list.

So for the common case of Windows OS and a extension which does not match .lnk we can skip the alias resolving.

When we do have aliases we also first call GetFileFolderInfo and only if we have an alias ResolveAlias again, skipping another GetFileFolderInfo for the case when the pointed to file is not an alias.

This allows to gather 100k files on a 1Gb/s network link in under 60s.

Function Dostuff()

	Newpath/O testPath "Y:"

	beginfunctionProfiling(testTime = 50000)
	string result = getAllfilesRecursivelyFromPath("testPath", extension = ".mp3")
	endfunctionProfiling()
	printf  "Number of files: %d\r", ItemsInList(result, "|")
End

gives

Total time: 54.4173, Time in Function code: 54.3973 (100%)
Top function percentages:
Function MIES_Utilities_File.ipf GetAllFilesRecursivelyFromPath: 99%

Annotated Top Functions:

*******************************************************************************************
Function: MIES_Utilities_File.ipf MIES_Utilities_File.ipf#GetAllFilesRecursivelyFromPath; Percent total 99%
*******************************************************************************************
[00]          	|Function/S GetAllFilesRecursivelyFromPath(string pathName, [string extension])
[00]          	|
[00]          	|	string fileOrPath, folders, subFolderPathName, fileName
[00]          	|	string files, allFilesList
[00]          	|	string allFiles         = ""
[00]*         	|	string foldersFromAlias = ""
[00]          	|	variable err, isWindows
[00]          	|
[00]          	|#ifdef WINDOWS
[00]          	|	isWindows = 1
[00]          	|#endif
[00]          	|
[01]*         	|	PathInfo $pathName
[00]          	|	ASSERT(V_flag, "Given symbolic path does not exist")
[00]          	|
[00]          	|	if(ParamIsDefault(extension))
[00]          	|		extension = "????"
[00]          	|	endif
[00]          	|
[00]          	|	AssertOnAndClearRTError()
[70]*******   	|	allFilesList = IndexedFile($pathName, -1, extension, "????", FILE_LIST_SEP); err = GetRTError(1)
[00]          	|
[00]*         	|	if(isWindows && cmpstr(extension, "????") && cmpstr(extension, ".lnk"))
[00]          	|		allFiles = AddPrefixToEachListItem(S_path, allFilesList, sep = FILE_LIST_SEP)
[00]          	|	else
[00]          	|		// try to resolve aliases/shortcuts when we can have them
[00]          	|		WAVE/T allFilesInDir = ListToTextWave(allFilesList, FILE_LIST_SEP)
[00]          	|		for(fileName : allFilesInDir)
[00]          	|
[00]          	|			GetFileFolderInfo/P=$pathName/Q/Z fileName
[00]          	|
[00]          	|			if(V_Flag != 0)
[00]          	|				// error querying file
[00]          	|				continue
[00]          	|			endif
[00]          	|
[00]          	|			if(V_IsAliasShortcut)
[00]          	|				fileOrPath = ResolveAlias(fileName, pathName = pathName)
[00]          	|
[00]          	|				// redo the check
[00]          	|				GetFileFolderInfo/P=$pathName/Q/Z fileOrPath
[00]          	|
[00]          	|				if(V_Flag != 0)
[00]          	|					// error querying file/folder pointed to by alias
[00]          	|					continue
[00]          	|				endif
[00]          	|			endif
[00]          	|
[00]          	|			if(V_isFile)
[00]          	|				allFiles = AddListItem(S_path, allFiles, FILE_LIST_SEP, Inf)
[00]          	|			elseif(V_isFolder)
[00]          	|				foldersFromAlias = AddListItem(S_path, foldersFromAlias, FILE_LIST_SEP, Inf)
[00]          	|			else
[00]          	|				ASSERT(0, "Unexpected file type")
[00]          	|			endif
[00]          	|		endfor
[00]          	|	endif
[00]          	|
[00]          	|	AssertOnAndClearRTError()
[04]*         	|	folders = IndexedDir($pathName, -1, 1, FILE_LIST_SEP); err = GetRTError(1)
[00]          	|	folders = folders + foldersFromAlias
[00]*         	|	WAVE/T wFolders = ListToTextWave(folders, FILE_LIST_SEP)
[00]*         	|	for(folder : wFolders)
[00]          	|
[00]          	|		subFolderPathName = GetUniqueSymbolicPath()
[00]          	|
[12]*         	|		NewPath/Q/O $subFolderPathName, folder
[00]*         	|		files = GetAllFilesRecursivelyFromPath(subFolderPathName, extension = extension)
[11]*         	|		KillPath/Z $subFolderPathName
[00]          	|
[00]*         	|		if(!isEmpty(files))
[01]*         	|			allFiles = AddListItem(RemoveEnding(files, FILE_LIST_SEP), allFiles, FILE_LIST_SEP, Inf)
[00]          	|		endif
[00]          	|	endfor
[00]          	|
[00]          	|	return allFiles
[00]          	|End
  • Review comments
  • Check Mac behaviour
  • Fix
 Test finished with no errors
  ? Free wave leak detected (leaked waves: 6) in "ZeroMQPublishingTests#CheckTPPublishing:period_now"
End of test "MIES with Basic"

-> flaky test

Close #2402
Close #1733

@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from e10465a to 7d13220 Compare April 14, 2025 18:10
@t-b t-b mentioned this pull request Apr 14, 2025
14 tasks
Copy link
Collaborator

@MichaelHuth MichaelHuth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I think also the recursion in ResolveAlias can be ditched for Windows as shortcuts to shortcuts can not be created by normal means.

  • Also the line
    folders = folders + foldersFromAlias
    should use AddListItem

  • Generally I also think that these filesystem targetting function should not use recursion because that couples stack size to file system depth, that will fail.

  • Also the function could use a rewrite, such that we directly convert the returned file lists from IndexedFile/IndexedDir to waves and work with waves, also return a text wave.
    This also prevents allFiles growing over 2 GB in size.

The function could also use an optional extension where duplicates are removed at the end.

@MichaelHuth MichaelHuth assigned t-b and unassigned MichaelHuth Apr 15, 2025
@t-b
Copy link
Collaborator Author

t-b commented Apr 15, 2025

  • Document that GetAllFilesRecursivelyFromPath does only resolve symbolic links when the requested extension is ???? or .lnk
  • Add test with GetAllFilesRecursivelyFromPath where a ".lnk" file points into the woods. Failed on Michael's machine.

@t-b
Copy link
Collaborator Author

t-b commented Apr 16, 2025

Also the line
folders = folders + foldersFromAlias
should use AddListItem

This complicates things as you then need to check that foldersFromAlias is not empty as adding an empty element is something different than adding an empty string. But this is also moot as I'm moving to text waves.

@MichaelHuth
Copy link
Collaborator

@t-b I had some more thoughts: IndexedFile is traversing all files anyway in a folder.

So if we change the extension option of GetAllFilesRecursivelyFromPath to a regex option. Then we change the call of IndexedFile to not use any file extension filter such that it returns all files in the list. And then we apply a GrepList to that list with the regex.

Advantage: For cases like the AB where we want to have .pxp and .uxp we only need one call and one file traversal.

I think the overhead handling a longer list string is ok.

@t-b
Copy link
Collaborator Author

t-b commented Apr 16, 2025

So if we change the extension option of GetAllFilesRecursivelyFromPath to a regex option. Then we change the call of IndexedFile to not use any file extension filter such that it returns all files in the list. And then we apply a GrepList to that list with the regex.

Added.

@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from 7d13220 to adeb219 Compare April 16, 2025 19:11
@t-b t-b assigned MichaelHuth and unassigned t-b Apr 16, 2025
@t-b t-b requested a review from MichaelHuth April 16, 2025 19:12
@MichaelHuth
Copy link
Collaborator

MichaelHuth commented Apr 17, 2025

Some numbers:

Local drives:

Fast Samsung SSD
•test()
Time: 1751
Number of files: 793393
Files per second 453.108

Regular HDD:
•test()
Time: 7630.9
Number of files: 2.80506e+06
Files per second 367.592

With the new function:

*******************************************************************************************
Function: MIES_Utilities_File.ipf GetAllFilesRecursivelyFromPath; Percent total 99%
*******************************************************************************************
[00]          	|Function/WAVE GetAllFilesRecursivelyFromPath(string pathName, [string regex, variable resolveAliases])
[00]          	|
[00]          	|	string files, subFoldersList, subFilesList, cdf, workPath
[00]          	|	variable err
[00]          	|	variable numFolders, numFiles, needsAliasResolving, i, needsFiltering
[00]          	|
[00]          	|	if(ParamIsDefault(regex))
[00]          	|		regex = ".*"
[00]          	|	else
[00]          	|		ASSERT(IsValidRegexp(regex), "Expected a valid regex")
[00]          	|		needsFiltering = 1
[00]          	|	endif
[00]          	|
[00]          	|	if(ParamIsDefault(resolveAliases))
[00]          	|		resolveAliases = 0
[00]          	|	else
[00]          	|		resolveAliases = !!resolveAliases
[00]          	|	endif
[00]          	|
[00]          	|	Make/FREE/N=(MINIMUM_WAVE_SIZE_LARGE)/T resultFiles
[00]          	|	SetNumberInWaveNote(resultFiles, NOTE_INDEX, 0)
[00]          	|
[00]          	|	Make/FREE/N=(MINIMUM_WAVE_SIZE)/T folders
[00]          	|	SetNumberInWaveNote(folders, NOTE_INDEX, 0)
[00]          	|
[00]          	|	PathInfo $pathName
[00]          	|	ASSERT(V_flag, "Given symbolic path does not exist")
[00]          	|	Make/T/FREE startFolder = {S_path}
[00]          	|	numFolders = ConcatenateWavesWithNoteIndex(folders, startFolder)
[00]          	|
[00]          	|	workPath = GetUniqueSymbolicPath()
[00]          	|
[00]          	|	AssertOnAndClearRTError()
[00]*         	|	for(i = 0; i < numFolders; i += 1)
[00]          	|
[00]*         	|		cdf = folders[i]
[68]*******   	|		NewPath/Q/O/Z $workPath, cdf
[00]          	|
[02]*         	|		subFoldersList = IndexedDir($workPath, -1, 1, FILE_LIST_SEP); err = GetRTError(1)
[00]*         	|		WAVE/T subFolders = ListToTextWave(subFoldersList, FILE_LIST_SEP)
[00]          	|		// add trailing colon
[00]*         	|		Multithread subFolders[] += ":"
[00]          	|
[00]          	|		ConcatenateWavesWithNoteIndex(folders, subFolders)
[00]          	|
[29]***       	|		subFilesList = IndexedFile($workPath, -1, "????", "????", FILE_LIST_SEP); err = GetRTError(1)
[00]*         	|		WAVE/T subFiles = ListToTextWave(subFilesList, FILE_LIST_SEP)
[00]          	|		// make them absolute
[00]*         	|		Multithread subFiles[] = cdf + subFiles[p]
[00]          	|
[00]          	|		if(resolveAliases)
[00]          	|			[WAVE/T filesResolved, WAVE/T foldersResolved] = GetAllFilesAliasHelper(subFiles)
[00]          	|		else
[00]*         	|			WAVE/T filesResolved = subFiles
[00]          	|			WAVE/ZZ/T foldersResolved
[00]          	|		endif
[00]          	|
[00]*         	|		numFiles   = ConcatenateWavesWithNoteIndex(resultFiles, filesResolved)
[00]          	|		numFolders = ConcatenateWavesWithNoteIndex(folders, foldersResolved)
[00]          	|	endfor
[00]          	|
[00]          	|	if(!numFiles)
[00]          	|		return $""
[00]          	|	endif
[00]          	|
[00]          	|	Redimension/N=(numFiles) resultFiles
[00]          	|	Note/K resultFiles
[00]          	|
[00]          	|	if(needsFiltering)
[00]          	|		WAVE/Z/T resultFilesFiltered = GrepTextWave(resultFiles, regex)
[00]          	|	else
[00]          	|		WAVE/T resultFilesFiltered = resultFiles
[00]          	|	endif
[00]          	|
[00]          	|	return resultFilesFiltered
[00]          	|End

Some last bench numbers:

With Igor Pro on SSD:

•test()
Time: 415.755
Number of files: 328419
Files per second 789.9345141884879

The same folder with Total Commander gathering files and sizes: 3.3 s

That is roughly a factor of 125x faster.

@t-b
Copy link
Collaborator Author

t-b commented Apr 17, 2025

Jeep, the profiling clearly shows all the work is spent in IP itself.

Copy link
Collaborator

@MichaelHuth MichaelHuth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment.

@MichaelHuth MichaelHuth assigned t-b and unassigned MichaelHuth Apr 22, 2025
@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from 9880c9b to 30b61e7 Compare April 22, 2025 21:45
@t-b t-b assigned MichaelHuth and unassigned t-b Apr 22, 2025
@t-b t-b requested a review from MichaelHuth April 22, 2025 21:48
@t-b
Copy link
Collaborator Author

t-b commented Apr 22, 2025

@MichaelHuth I'm assigning you already although CI is not yet finished.

MichaelHuth
MichaelHuth previously approved these changes Apr 23, 2025
Copy link
Collaborator

@MichaelHuth MichaelHuth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs commit compilation fix.

Rest LGTM.

@timjarsky timjarsky removed their assignment Apr 24, 2025
@timjarsky
Copy link
Collaborator

@t-b when attempting to add a folder, I'm getting:

  !!! Threadsafe assertion FAILED !!!
  Message: "Invalid value"
  Please provide the following information if you contact the MIES developers:
  ################################
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Stacktrace:
  AB_OpenAnalysisBrowser(...)#L2772 [MIES_AnalysisBrowser.ipf]
PGC_SetAndActivateControl(...)#L220 [MIES_ProgrammaticGUIControl.ipf]
AB_ButtonProc_Refresh(...)#L2959 [MIES_AnalysisBrowser.ipf]
AB_AddExperimentEntries(...)#L2688 [MIES_AnalysisBrowser.ipf]
AB_AddFile(...)#L267 [MIES_AnalysisBrowser.ipf]
AB_LoadFile(...)#L401 [MIES_AnalysisBrowser.ipf]
AB_FillListWave(...)#L486 [MIES_AnalysisBrowser.ipf]
EnsureLargeEnoughWave(...)#L68 [MIES_Utilities_WaveHandling.ipf]
FindNextPower(...)#L112 [MIES_Utilities_Numeric.ipf]
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Time: 2025-04-23T19:32:36-07:00
  Experiment: Untitled ()
  Igor Pro version: 9.0.6.1 (56565)
  ################################

@t-b
Copy link
Collaborator Author

t-b commented Apr 24, 2025

@timjarsky Oops, I'll fix that.

@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from bfb875d to bd592d9 Compare April 25, 2025 18:33
@t-b
Copy link
Collaborator Author

t-b commented Apr 25, 2025

@timjarsky Fixed.

@t-b t-b assigned timjarsky and unassigned t-b Apr 25, 2025
t-b added 8 commits April 26, 2025 01:00
There is no reason we can't support it. We have to special case it though.
As we are now using DeleteFile/DeleteFolder we also need to check that the
misc settings entry is correct so that these call work.
When we get a filename with a symbolic path for an existing file, we need
to return the full path as well.

Use S_Path from GetFileFolderInfo for that.
This is not needed.
We used to always perform alias/shortcut handling even when we only look
for certain file suffixes.

But we only have to try following these if they can be part of the
returned list.

So for the common case of Windows OS and a extension which does not match
.lnk we can skip the alias resolving.

When we do have aliases we also first call GetFileFolderInfo and only if
we have an alias ResolveAlias again, skipping another GetFileFolderInfo
for the case when the pointed to file is not an alias.

This allows to gather 100k files on a 1Gb/s network link in under 60s.

```igorpro
Function Dostuff()

	Newpath/O testPath "Y:"

	beginfunctionProfiling(testTime = 50000)
	string result = getAllfilesRecursivelyFromPath("testPath", extension = ".mp3")
	endfunctionProfiling()
	printf  "Number of files: %d\r", ItemsInList(result, "|")
End
```

gives

```
Total time: 54.4173, Time in Function code: 54.3973 (100%)
Top function percentages:
Function MIES_Utilities_File.ipf GetAllFilesRecursivelyFromPath: 99%

Annotated Top Functions:

*******************************************************************************************
Function: MIES_Utilities_File.ipf MIES_Utilities_File.ipf#GetAllFilesRecursivelyFromPath; Percent total 99%
*******************************************************************************************
[00]          	|Function/S GetAllFilesRecursivelyFromPath(string pathName, [string extension])
[00]          	|
[00]          	|	string fileOrPath, folders, subFolderPathName, fileName
[00]          	|	string files, allFilesList
[00]          	|	string allFiles         = ""
[00]*         	|	string foldersFromAlias = ""
[00]          	|	variable err, isWindows
[00]          	|
[00]          	|#ifdef WINDOWS
[00]          	|	isWindows = 1
[00]          	|#endif
[00]          	|
[01]*         	|	PathInfo $pathName
[00]          	|	ASSERT(V_flag, "Given symbolic path does not exist")
[00]          	|
[00]          	|	if(ParamIsDefault(extension))
[00]          	|		extension = "????"
[00]          	|	endif
[00]          	|
[00]          	|	AssertOnAndClearRTError()
[70]*******   	|	allFilesList = IndexedFile($pathName, -1, extension, "????", FILE_LIST_SEP); err = GetRTError(1)
[00]          	|
[00]*         	|	if(isWindows && cmpstr(extension, "????") && cmpstr(extension, ".lnk"))
[00]          	|		allFiles = AddPrefixToEachListItem(S_path, allFilesList, sep = FILE_LIST_SEP)
[00]          	|	else
[00]          	|		// try to resolve aliases/shortcuts when we can have them
[00]          	|		WAVE/T allFilesInDir = ListToTextWave(allFilesList, FILE_LIST_SEP)
[00]          	|		for(fileName : allFilesInDir)
[00]          	|
[00]          	|			GetFileFolderInfo/P=$pathName/Q/Z fileName
[00]          	|
[00]          	|			if(V_Flag != 0)
[00]          	|				// error querying file
[00]          	|				continue
[00]          	|			endif
[00]          	|
[00]          	|			if(V_IsAliasShortcut)
[00]          	|				fileOrPath = ResolveAlias(fileName, pathName = pathName)
[00]          	|
[00]          	|				// redo the check
[00]          	|				GetFileFolderInfo/P=$pathName/Q/Z fileOrPath
[00]          	|
[00]          	|				if(V_Flag != 0)
[00]          	|					// error querying file/folder pointed to by alias
[00]          	|					continue
[00]          	|				endif
[00]          	|			endif
[00]          	|
[00]          	|			if(V_isFile)
[00]          	|				allFiles = AddListItem(S_path, allFiles, FILE_LIST_SEP, Inf)
[00]          	|			elseif(V_isFolder)
[00]          	|				foldersFromAlias = AddListItem(S_path, foldersFromAlias, FILE_LIST_SEP, Inf)
[00]          	|			else
[00]          	|				ASSERT(0, "Unexpected file type")
[00]          	|			endif
[00]          	|		endfor
[00]          	|	endif
[00]          	|
[00]          	|	AssertOnAndClearRTError()
[04]*         	|	folders = IndexedDir($pathName, -1, 1, FILE_LIST_SEP); err = GetRTError(1)
[00]          	|	folders = folders + foldersFromAlias
[00]*         	|	WAVE/T wFolders = ListToTextWave(folders, FILE_LIST_SEP)
[00]*         	|	for(folder : wFolders)
[00]          	|
[00]          	|		subFolderPathName = GetUniqueSymbolicPath()
[00]          	|
[12]*         	|		NewPath/Q/O $subFolderPathName, folder
[00]*         	|		files = GetAllFilesRecursivelyFromPath(subFolderPathName, extension = extension)
[11]*         	|		KillPath/Z $subFolderPathName
[00]          	|
[00]*         	|		if(!isEmpty(files))
[01]*         	|			allFiles = AddListItem(RemoveEnding(files, FILE_LIST_SEP), allFiles, FILE_LIST_SEP, Inf)
[00]          	|		endif
[00]          	|	endfor
[00]          	|
[00]          	|	return allFiles
[00]          	|End
```

Close #2402
We don't need to look at the minimum wave size when the requested index
already fits.

We now also ensure that we double the wave sizes so that we should end up
always with a power of two when we started with one.
@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from bd592d9 to 8e5aa1a Compare April 25, 2025 23:00
@t-b
Copy link
Collaborator Author

t-b commented Apr 25, 2025

@timjarsky I think it might be beneficial to have this in the next release 2.9 as well.

@timjarsky
Copy link
Collaborator

@t-b Sure. Here are the latest profiling results from adding folder 105 to the analysis browser

Total time: 506.613, Time in Function code: 453.053 (89.4%)
Top function percentages:
Function MIES_Utilities_File.ipf AskUserForExistingFolder: 84%

Annotated Top Functions:

*******************************************************************************************
Function: MIES_Utilities_File.ipf AskUserForExistingFolder; Percent total 84%
*******************************************************************************************
[00]          	|Function/S AskUserForExistingFolder(string baseFolder)
[00]          	|
[00]          	|	string symbPath, selectedFolder
[00]          	|
[00]*         	|	symbPath = GetUniqueSymbolicPath()
[00]          	|
[00]*         	|	NewPath/O/Q/Z $symbPath, baseFolder
[00]          	|	// preset next undirected NewPath/Open call using the contents of a
[00]          	|	// *symbolic* folder
[00]*         	|	PathInfo/S $symbPath
[00]          	|
[00]          	|	// let the user choose a folder, starts in $baseFolder if supplied
[84]********  	|	NewPath/O/Q/Z $symbPath
[00]*         	|	if(V_flag == -1)
[00]          	|		return ""
[00]          	|	endif
[00]*         	|	PathInfo $symbPath
[00]          	|	selectedFolder = S_path
[00]*         	|	KillPath/Z $symbPath
[00]          	|
[00]          	|	return selectedFolder
[00]          	|End

timjarsky
timjarsky previously approved these changes Apr 26, 2025
@t-b t-b mentioned this pull request Apr 28, 2025
2 tasks
@t-b t-b assigned t-b and unassigned timjarsky Apr 29, 2025
This does now not use recursion anymore, so the supported file system depth is
independent of the IP stack depth.

It also returns a text wave.
@t-b t-b force-pushed the bugfix/2403-speedup-get-all-files-recursively branch from 8e5aa1a to 2436f40 Compare April 30, 2025 09:56
@t-b t-b enabled auto-merge April 30, 2025 17:57
@t-b t-b disabled auto-merge May 1, 2025 10:18
@t-b t-b merged commit c918064 into main May 1, 2025
19 checks passed
@t-b t-b deleted the bugfix/2403-speedup-get-all-files-recursively branch May 1, 2025 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GetAllFilesRecursively: Make it faster over the network Revise GetAllFilesRecursivelyFromPath
3 participants