Fixed and Improved Gitlab Project Metadata in-memory cache #4727
@@ -1022,77 +1028,6 @@ func (s *Source) WithScanOptions(scanOptions *git.ScanOptions) {
	s.scanOptions = scanOptions
}
These funcs were moved to the end of the file.
e6ea089 to cfd505c
c7d7e66 to 582adbf
pkg/sources/gitlab/project_cache.go
Outdated
return &projectMetadataCache{
	cache: expirable.NewLRU[string, *project](
		15000, // upto 15000 entries
		nil,
		60*time.Minute, // time-based expiration - 1 hour
	),
I'm interested in the thought process behind these numbers. Are they based on our past experience with the rate at which we scan GitLab projects?
Is it possible for an entry to expire before we need to use it?
There isn’t any deep science behind these numbers; they’re rough, initial choices. Most organizations won’t hit the 15K limit at all. For organizations with more than 15K projects, I think we should be able to process roughly 15K repositories per hour. In practice, it’s very unlikely that a repository would be enumerated and not scanned within an hour. We need to start with baseline numbers, and if we run into issues, we can always adjust them based on observed behavior.
If someone has a strong alternative (though I don’t think we do) for how many repositories we should process per hour, we can start with that number instead.
I'm using an expirable LRU cache so that entries are automatically cleaned up after their TTL via lazy deletion and a background cleanup routine. Additionally, if the cache reaches the 15K entry limit, it evicts the least recently used entries by design, so new inserts are never blocked.
I see. Thanks for the explanation 👍
mustansir14
left a comment
I have some questions
rosecodym
left a comment
What happens if a cache entry does expire? It looks like the returned metadata will be incorrect. Did you consider logging an error or something so that we know if that happens?
pkg/sources/gitlab/gitlab.go
Outdated
// cache of repo URL to project info, used when generating metadata for chunks
repoToProjCache repoToProjectCache
*projectMetadataCache
Why did you choose to embed this instead of making it a normal field?
Ah, this was a mistake. I initially thought we’d need this in the git source, so I embedded it to use it directly with Source. That turned out not to be necessary, and I forgot to change it to a regular field.
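For readers less familiar with the distinction, a small sketch of what embedding changes. Only the type name `projectMetadataCache` comes from the PR; its `entries` field, the `get` method, and the `embeddedSource` and `fieldSource` types are hypothetical stand-ins.

```go
package main

import "fmt"

// projectMetadataCache here is a simplified stand-in backed by a plain map.
type projectMetadataCache struct {
	entries map[string]string
}

func (c *projectMetadataCache) get(key string) (string, bool) {
	v, ok := c.entries[key]
	return v, ok
}

// embeddedSource embeds the cache, so get is promoted into the method set
// of embeddedSource and becomes part of its surface.
type embeddedSource struct {
	*projectMetadataCache
}

// fieldSource keeps the cache behind a named field, so callers must go
// through that field explicitly.
type fieldSource struct {
	projectCache *projectMetadataCache
}

func main() {
	cache := &projectMetadataCache{entries: map[string]string{"repo": "widgets"}}
	e := embeddedSource{cache}
	f := fieldSource{projectCache: cache}

	v1, _ := e.get("repo")              // promoted method, called on the outer type
	v2, _ := f.projectCache.get("repo") // explicit access through the field
	fmt.Println(v1, v2)
}
```

Both forms behave the same here; the named field simply avoids widening the outer type's method set when nothing needs the promoted methods.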
184668f to af84b60
A cache entry is removed either when it expires after one hour or when the cache reaches the 15K limit and the entry is the least recently used. In both cases, when the entry is missing from the cache, the metadata will be empty, as it was previously, I believe. I have now added an error log for cache misses.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
	gitlabMetadata.ProjectName = project.name
	gitlabMetadata.ProjectOwner = project.owner
} else {
	ctx.Logger().Error(errors.New("failed to get repo metadata from cache"),
Error-level logging on cache miss looks too aggressive
I don’t think so. First, this is just a log; we’re not returning an error. Second, I believe we should log an error when we fail to populate the metadata. That would also help us identify whether we need to increase the cache limit or adjust the expiration time.
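A minimal sketch of the behavior being debated: on a hit the project fields are filled in, and on a miss the callback logs and returns metadata with those fields left empty. This is not the PR's code. It uses the standard `log` package rather than the project's context logger, a plain map in place of the LRU cache, and the `buildMetadata` helper and `misses` counter are hypothetical additions to show how misses could be tallied when tuning the limit or TTL.

```go
package main

import (
	"fmt"
	"log"
)

// project mirrors the cached fields mentioned in the review (name, owner, id).
type project struct {
	id    int
	name  string
	owner string
}

// gitlabMetadata is a simplified stand-in for the chunk metadata.
type gitlabMetadata struct {
	ProjectId    int
	ProjectName  string
	ProjectOwner string
}

// buildMetadata fills the project fields on a cache hit. On a miss it logs,
// increments the miss counter, and returns metadata with those fields empty.
func buildMetadata(cache map[string]*project, repoURL string, misses *int) gitlabMetadata {
	var md gitlabMetadata
	p, ok := cache[repoURL]
	if !ok {
		*misses++
		log.Printf("failed to get repo metadata from cache: repo=%s", repoURL)
		return md
	}
	md.ProjectId, md.ProjectName, md.ProjectOwner = p.id, p.name, p.owner
	return md
}

func main() {
	cache := map[string]*project{
		"https://gitlab.example.com/acme/widgets": {id: 42, name: "widgets", owner: "acme"},
	}
	misses := 0
	hit := buildMetadata(cache, "https://gitlab.example.com/acme/widgets", &misses)
	miss := buildMetadata(cache, "https://gitlab.example.com/acme/gone", &misses)
	fmt.Println(hit.ProjectName, miss.ProjectName == "", misses)
}
```

The scan itself continues either way; the log line is the only signal that a chunk was emitted without project metadata.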
pkg/sources/gitlab/project_cache.go
Outdated
	cache: expirable.NewLRU[string, *project](
		15000, // upto 15000 entries
		nil,
		60*time.Minute, // time-based expiration - 1 hour
If a cache entry expires mid-scan (60-min TTL), what will happen? Will the metadata callback silently produce chunks without ProjectId, ProjectName, and ProjectOwner?
eef1122 to 74b33cd
If the LRU cache doesn’t fit our use case well and we continue to see many chunks without metadata, my next idea is to implement our own simple.cache which is a wrapper of
Description:
Earlier, we used a map to temporarily store GitLab project metadata. While maps work well for small datasets, they don’t scale efficiently for larger ones. There was also a bug in the caching logic: when storing entries, we used the GitLab HTTPURLToRepo field as the cache key, but when retrieving entries, we used the normalized URL. As a result, cache lookups almost never succeeded, and the cache kept growing without being effectively used.
With this fix, we’ve replaced the map with an LRU cache, which is better suited for this use case. The cache holds up to 15,000 entries, each with a one-hour TTL: expired entries are removed automatically, and once the cache is full, the least recently used entries are evicted, keeping memory usage under control. We also consistently use the normalized URL for both setting and fetching cache entries.
I also added some comments to improve the readability :)
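The key mismatch described above can be reproduced in a few lines. This is a sketch under stated assumptions: `normalizeRepoURL` is a hypothetical stand-in for the project's real URL normalization (here it lowercases and strips a trailing slash and `.git` suffix), the example URL is made up, and a plain map stands in for the cache.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRepoURL is a hypothetical normalization used only for this demo.
func normalizeRepoURL(raw string) string {
	u := strings.ToLower(strings.TrimSpace(raw))
	u = strings.TrimSuffix(u, "/")
	return strings.TrimSuffix(u, ".git")
}

type project struct {
	id    int
	name  string
	owner string
}

func main() {
	// Example value in the shape of an HTTPURLToRepo field (assumed).
	raw := "https://gitlab.example.com/Acme/Widgets.git"
	p := &project{id: 42, name: "Widgets", owner: "Acme"}

	// Buggy pattern: store under the raw URL, look up under the normalized one.
	buggy := map[string]*project{raw: p}
	_, hitBuggy := buggy[normalizeRepoURL(raw)]

	// Fixed pattern: normalize on both the write and the read path.
	fixed := map[string]*project{normalizeRepoURL(raw): p}
	_, hitFixed := fixed[normalizeRepoURL(raw)]

	fmt.Println(hitBuggy, hitFixed) // false true
}
```

Whenever normalization changes the string at all, the buggy pattern misses on every lookup while the stored entries keep accumulating, which matches the growth described above.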
Checklist:
Tests passing (make test-community)?
Lint passing (make lint, this requires golangci-lint)?