Skip to content

Conversation

@mvorisek
Copy link
Contributor

@mvorisek mvorisek commented Feb 12, 2026

part of #19802

assert manual steps described in #7502 and the latest #19796 update using CI

@mvorisek mvorisek requested a review from TimWolla as a code owner February 12, 2026 09:43
@mvorisek
Copy link
Contributor Author

@youkidearitai @alexdowad please review

@alexdowad
Copy link
Contributor

Dear @mvorisek, is the idea here that instead of manually updating the mbstring Unicode data and commit the generated header files, we will not commit the generated files any more, and instead automatically generate them in CI?

Personally, I like doing the updates manually, because it allows the mbstring maintainers to know when things change, add unit tests to ensure that new Unicode characters are handled correctly, etc.

@mvorisek
Copy link
Contributor Author

mvorisek commented Feb 12, 2026

Dear @mvorisek, is the idea here that instead of manually updating the mbstring Unicode data and commit the generated header files, we will not commit the generated files any more, and instead automatically generate them in CI?

The upstream version is defined in https://github.com/mvorisek/php-src/blob/47535c101ad9c84275f82e7445dcf1494a3d729c/.github/scripts/download-bundled/unicode-character-database.sh#L8 . This is part of #19802 and this PR is not about updating the files automatically/using CI, it is about asserting the bundled files match 1:1 the upstream and no mistakes were done when updating them manually.

@youkidearitai
Copy link
Contributor

Unicode version up is one time per year (Unicode 18.0 is maybe 2026). So I think downloads ucd.zip everyday is excesses.

@mvorisek
Copy link
Contributor Author

Unicode version up is one time per year (Unicode 18.0 is maybe 2026). So I think downloads ucd.zip everyday is excesses.

The job step takes about 60 seconds. Is that really a problem compared to all other CI usage even nighly only?

Also this is not something that can be easily changed in order to maintain the workflow file as short and unified as possible.

@youkidearitai
Copy link
Contributor

ucd.zip is depends on unicode.org. The PHP team is not maintenance that resource.
I think seems this CI is not important.

@mvorisek
Copy link
Contributor Author

ucd.zip is depends on unicode.org. The PHP team is not maintenance that resource. I think seems this CI is not important.

See #20800 (comment) for example. The ucd.zip is not maintained by PHP team, but the bundled files calculated from that are, contributors should not modify the bundled files, so this CI is to detect that - on PR/push and periodically using CI.

@alexdowad
Copy link
Contributor

@mvorisek Does this mean that when a new version of Unicode is released, and the mbstring maintainers haven't yet updated (which includes checking what the changes are, adding new unit tests as appropriate, etc.) every CI run for every new PR will start failing?

@youkidearitai
Copy link
Contributor

From supply chain points, If the ucd.zip is broken when unicode.org is hacked, The ucd.zip can include malware.
So I think the ucd.zip should be include this repository.
If you go that far.

@mvorisek
Copy link
Contributor Author

@mvorisek Does this mean that when a new version of Unicode is released, and the mbstring maintainers haven't yet updated (which includes checking what the changes are, adding new unit tests as appropriate, etc.) every CI run for every new PR will start failing?

No. The version is defined/fixed in https://github.com/mvorisek/php-src/blob/47535c101ad9c84275f82e7445dcf1494a3d729c/.github/scripts/download-bundled/unicode-character-database.sh#L8 [1].

When a new upstream version is tagged, you can take time and propose a PR like before [2] and update the reference version [1]. The CI would check then your changes.

[2] The new .github/scripts/download-bundled/unicode-character-database.sh can be run localy and help you with that.

From supply chain points, If the ucd.zip is broken when unicode.org is hacked, The ucd.zip can include malware. So I think the ucd.zip should be include this repository. If you go that far.

Not necessary. The ucd.zip is used as a source to check the bundled files against, so in this case, CI would fail, but no changes would be pushed into this repo by the CI.

@youkidearitai
Copy link
Contributor

No, necessary. ucd.zip is zip file. Attacker free to use CI computing resources from malicious attacker.
Even if attacker not pollution this repository(git commit, git push and etc...) , CI computing resource may be stepping stone from this way.

@mvorisek
Copy link
Contributor Author

This is not a problem, a) CI has timeout, b) it is hosted at official unicode.org domain, c) job timeouts happen time by time already and this repo is still working :)

@alexdowad
Copy link
Contributor

@mvorisek Thanks for patiently answering our questions.

This is something new to me and I would like to ask for a bit more time to look at it. After doing so, I might have some more questions.

Thanks very much for your hard work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants